Figure 1. A fixed classifier-free guidance scale fails to balance semantic fidelity and structural preservation. AdaCFG adapts the guidance trajectory to each input, producing robust edits across diverse target domains.
Text-guided image-to-image translation aims to edit a source image according to a textual prompt while preserving its structure. However, existing approaches often rely on a fixed classifier-free guidance scale and a single prompt input, leading to unstable results and a poor balance between semantic fidelity and structural preservation.
In this work, we propose a unified framework that improves both the controllability and stability of text-driven diffusion editing without requiring fine-tuning or paired training data. Our method introduces two key components: (1) an adaptive guidance scheduler that dynamically modulates the classifier-free guidance scale over timesteps based on the input image and prompt, and (2) a prompt ensemble mechanism that generates and ranks multiple semantically aligned prompt variants to mitigate prompt sensitivity. Together, these components form a plug-and-play framework that significantly improves editing consistency and visual quality. Extensive experiments on NuScenes, AFHQ, and CelebA-HQ demonstrate that our method consistently outperforms existing approaches across diverse scenarios.
A fixed classifier-free guidance (CFG) scale cannot adapt to the varying difficulty of different editing tasks: it often causes either content distortion or insufficient semantic changes. AdaCFG instead predicts an input-specific guidance trajectory over diffusion timesteps, and robustly handles prompt phrasing via an LLM-driven prompt ensemble.
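To make the contrast between a fixed scale and a per-timestep trajectory concrete, the following is a minimal sketch of the standard classifier-free guidance combination applied with a different scale at each step. The linear schedule is a hand-written placeholder assumed for illustration only; it is not AdaCFG's learned scheduler, which predicts the trajectory from the input image and prompt. The function names `cfg_combine` and `linear_schedule` are hypothetical.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                scale: float) -> List[float]:
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_schedule(num_steps: int, start: float, end: float) -> List[float]:
    """Placeholder trajectory (an assumption, not the paper's scheduler):
    linearly interpolate the guidance scale from `start` to `end`."""
    if num_steps == 1:
        return [start]
    return [start + (end - start) * i / (num_steps - 1)
            for i in range(num_steps)]


# Toy noise predictions standing in for a diffusion model's outputs.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

# A fixed scale applies the same value at every step ...
fixed = [cfg_combine(eps_uncond, eps_cond, 7.5) for _ in range(4)]

# ... whereas a trajectory supplies one scale per timestep.
trajectory = linear_schedule(num_steps=4, start=9.0, end=3.0)
adaptive = [cfg_combine(eps_uncond, eps_cond, w) for w in trajectory]
```

With a scale of 1.0 the combination reduces to the conditional prediction, and larger scales push further along the direction from the unconditional to the conditional prediction, which is why a single value may be too strong at some steps and too weak at others.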
Figure 2. Overall pipeline of AdaCFG. The adaptive guidance scheduler predicts an input-specific guidance trajectory and combines it with a prompt ensemble to steer a pre-trained diffusion editor.
Figure 3. Prompt ensemble. The LLM expands a single target prompt into multiple semantically consistent variants; outputs are aggregated to mitigate prompt sensitivity.
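The ranking step of a prompt ensemble can be sketched as follows. This is an illustration under stated assumptions: the variants are hard-coded rather than produced by an LLM, and token-overlap (Jaccard) similarity is used as a stand-in for whatever semantic-alignment score the actual method uses, which is not specified here. The names `jaccard` and `rank_variants` are hypothetical.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a semantic score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_variants(target: str,
                  variants: List[str],
                  top_k: int) -> List[Tuple[str, float]]:
    """Score each variant against the target prompt and keep the best top_k."""
    scored = [(v, jaccard(target, v)) for v in variants]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


target_prompt = "a snowy street at night"
# Hard-coded examples; in practice these would come from an LLM (assumption).
candidate_variants = [
    "a sunny beach",
    "a street covered in snow at night",
    "a snowy street at night in winter",
]

selected = rank_variants(target_prompt, candidate_variants, top_k=2)
for prompt, score in selected:
    print(f"{score:.2f}  {prompt}")
```

Variants that drift away from the target prompt (here, "a sunny beach") fall to the bottom of the ranking and are dropped before any outputs are aggregated.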
Figure 4. Qualitative comparison on NuScenes. AdaCFG produces more faithful edits while better preserving scene structure than fixed-CFG baselines.
Figure 5. Qualitative comparison on AFHQ and CelebA-HQ. Our method generalizes beyond driving scenes and remains effective on animal and human faces.
Table 1. Quantitative comparisons on NuScenes, AFHQ, and CelebA-HQ. We report CLIP↑ and DINO↑ to assess semantic alignment and structural preservation, respectively, and Align↑†, a unified GPT-4o-based metric following HQ-Edit, for overall edit quality.
| Method | NuScenes CLIP↑ | NuScenes DINO↑ | NuScenes Align↑† | AFHQ CLIP↑ | AFHQ DINO↑ | AFHQ Align↑† | CelebA-HQ CLIP↑ | CelebA-HQ DINO↑ | CelebA-HQ Align↑† |
|---|---|---|---|---|---|---|---|---|---|
| S-CFG [38] | 18.4 | 33.1 | 14.4 | 20.7 | 51.7 | 32.4 | 19.2 | 61.8 | 19.4 |
| CFG++ [37] | 15.2 | 18.6 | 19.8 | 21.2 | 18.6 | 50.8 | 19.0 | 39.1 | 35.7 |
| SDEdit [9] | 21.6 | 46.3 | 38.5 | 20.7 | 71.5 | 63.7 | 18.7 | 78.0 | 70.6 |
| CycleDiff [16] | 22.4 | 55.5 | 69.3 | 23.9 | 44.8 | 85.7 | 23.9 | 59.0 | 80.2 |
| HQ-Edit [24] | 21.9 | 53.9 | 57.8 | 22.6 | 57.8 | 65.8 | 20.2 | 60.1 | 67.8 |
| P2P+NTI [8, 18] | 23.2 | 72.0 | 40.4 | 25.5 | 19.9 | 75.5 | 24.5 | 32.0 | 53.4 |
| IP-Adapter [17] | 20.4 | 31.9 | 44.5 | 22.9 | 34.5 | 61.3 | 21.2 | 38.7 | 72.3 |
| P2P-zero [15] | 20.3 | 57.1 | 54.2 | 19.0 | 76.4 | 43.6 | 18.5 | 72.1 | 51.5 |
| I-P2P [11] | 19.9 | 79.3 | 76.1 | 24.3 | 65.8 | 72.2 | 21.5 | 88.8 | 76.7 |
| + AdaCFG | 22.1 | 78.2 | 86.2 | 26.5 | 45.8 | 83.6 | 23.4 | 86.7 | 84.7 |
| gain | +2.2 | −1.1 | +10.1 | +2.2 | −20.0 | +11.4 | +1.9 | −2.1 | +8.0 |
| PnP [7] | 25.0 | 51.0 | 85.5 | 24.5 | 63.8 | 79.3 | 23.8 | 71.9 | 85.8 |
| + AdaCFG | 24.8 | 59.5 | 92.4 | 26.4 | 53.2 | 90.7 | 24.8 | 74.2 | 87.9 |
| gain | −0.2 | +8.5 | +6.9 | +1.9 | −10.6 | +11.4 | +1.0 | +2.3 | +2.1 |
† Align is a holistic edit-quality score computed with GPT-4o, based on the source image, target prompt, and edited output (following the HQ-Edit protocol).
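The "gain" rows in Table 1 are the element-wise differences between each AdaCFG-augmented method and its baseline. A short sketch that recomputes them from the values in the table (the variable and function names are illustrative):

```python
from typing import Dict, List

# Column order: (CLIP, DINO, Align) for NuScenes, AFHQ, CelebA-HQ.
baseline: Dict[str, List[float]] = {
    "I-P2P": [19.9, 79.3, 76.1, 24.3, 65.8, 72.2, 21.5, 88.8, 76.7],
    "PnP":   [25.0, 51.0, 85.5, 24.5, 63.8, 79.3, 23.8, 71.9, 85.8],
}
with_adacfg: Dict[str, List[float]] = {
    "I-P2P": [22.1, 78.2, 86.2, 26.5, 45.8, 83.6, 23.4, 86.7, 84.7],
    "PnP":   [24.8, 59.5, 92.4, 26.4, 53.2, 90.7, 24.8, 74.2, 87.9],
}


def gains(base: List[float], augmented: List[float]) -> List[float]:
    """Per-column improvement, rounded to one decimal as in the table."""
    return [round(a - b, 1) for a, b in zip(augmented, base)]


for method in baseline:
    row = gains(baseline[method], with_adacfg[method])
    print(method, " ".join(f"{g:+.1f}" for g in row))
```

Recomputing the rows this way shows that Align improves in all six comparisons, while DINO decreases on AFHQ for both baselines (by 20.0 and 10.6 points).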
Figure 6. Trade-off analysis on balancing semantic alignment (CLIP↑) and structural preservation (DINO↑). AdaCFG-augmented variants (red stars) consistently occupy the top-right region, indicating superior balance.
@article{son2026adacfg,
author = {Son, Bongguk and Jeon, Sangryul},
journal = {IEEE Access},
title = {Adaptive Classifier-Free Guidance for Robust Image-to-Image Translation},
year = {2026},
volume = {14},
pages = {23556--23576},
doi = {10.1109/ACCESS.2026.3655782}
}