Adaptive Classifier-Free Guidance for
Robust Image-to-Image Translation

Bongguk Son1, Sangryul Jeon1,†
1School of Computer Science and Engineering, Pusan National University
† Corresponding author
IEEE Access, Vol. 14, 2026

Figure 1. A fixed classifier-free guidance scale fails to balance semantic fidelity and structural preservation. AdaCFG adapts the guidance trajectory to each input, producing robust edits across diverse target domains.

Abstract

Text-guided image-to-image translation aims to edit a source image according to a textual prompt while preserving its structure. However, existing approaches often rely on a fixed classifier-free guidance scale and a single prompt input, leading to unstable results and a poor balance between semantic fidelity and structural preservation.

In this work, we propose a unified framework that improves both the controllability and stability of text-driven diffusion editing without requiring fine-tuning or paired training data. Our method introduces two key components: (1) an adaptive guidance scheduler that dynamically modulates the classifier-free guidance scale over timesteps based on the input image and prompt, and (2) a prompt ensemble mechanism that generates and ranks multiple semantically aligned prompt variants to mitigate prompt sensitivity. Together, these components form a plug-and-play framework that significantly improves editing consistency and visual quality. Extensive experiments on NuScenes, AFHQ, and CelebA-HQ demonstrate that our method consistently outperforms existing approaches across diverse scenarios.

Method

A fixed classifier-free guidance (CFG) scale cannot adapt to the varying difficulty of different editing tasks: it often causes either content distortion or insufficient semantic changes. AdaCFG instead predicts an input-specific guidance trajectory over diffusion timesteps, and robustly handles prompt phrasing via an LLM-driven prompt ensemble.
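For reference, classifier-free guidance in its standard form combines the unconditional and conditional noise predictions as eps_hat = eps_uncond + w * (eps_cond - eps_uncond), where w is the guidance scale. The sketch below applies that rule to plain Python lists. The function name `cfg_combine` and the toy values are assumptions for illustration, not code from the paper.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                scale: float) -> List[float]:
    """Standard classifier-free guidance on flattened noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past the conditional
    prediction (stronger prompt influence).
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy example: the same predictions under a weak and a strong fixed scale.
weak = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=1.0)
strong = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

With a fixed scale, the same value of `scale` is used at every denoising step and for every input, which is the limitation described above.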

Figure 2. Overall pipeline of AdaCFG. The adaptive guidance scheduler predicts an input-specific guidance trajectory and combines it with a prompt ensemble to steer a pre-trained diffusion editor.

  • Adaptive Guidance Scheduler. From the source image and prompt, we predict an initial guidance scale and a velocity term. A monotonically decreasing schedule injects strong semantic influence in early steps (shaping global content) and gradually weakens it in later steps (preserving fine structural details).
  • Prompt Ensemble. We use a Large Language Model to generate semantically consistent prompt variants and aggregate their outputs, removing the need for manual prompt engineering.
  • Extensive Evaluation. We evaluate on NuScenes (driving), AFHQ (animal faces), and CelebA-HQ (human faces) and report consistent gains over state-of-the-art baselines in both objective metrics and a user study.
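The scheduler bullet above can be illustrated with a minimal sketch. This page does not give the exact parameterization, so the linear decay, the `w_min` floor, and the names `guidance_schedule`, `w0`, and `velocity` are all assumptions. The only properties taken from the text are that the trajectory starts from a predicted initial scale, is governed by a velocity term, and decreases monotonically over the sampling steps.

```python
from typing import List


def guidance_schedule(w0: float, velocity: float, num_steps: int,
                      w_min: float = 1.0) -> List[float]:
    """Illustrative monotonically non-increasing guidance trajectory.

    Step 0 is the first (noisiest) sampling step and receives the
    initial scale w0; each later step lowers the scale by `velocity`,
    clamped at `w_min`. The linear form is an assumption.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if velocity < 0:
        raise ValueError("velocity must be non-negative for a decreasing schedule")
    return [max(w_min, w0 - velocity * k) for k in range(num_steps)]


# Hypothetical values; in AdaCFG, w0 and the velocity term are predicted
# from the source image and prompt rather than set by hand.
schedule = guidance_schedule(w0=7.5, velocity=0.5, num_steps=4)
```

In a sampling loop, the value at index `k` would replace the fixed guidance scale at step `k`, so early steps receive strong prompt influence and late steps receive weak influence.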

Figure 3. Prompt ensemble. The LLM expands a single target prompt into multiple semantically consistent variants; outputs are aggregated to mitigate prompt sensitivity.
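A rough sketch of the rank-then-aggregate idea follows. How AdaCFG scores prompt variants and how it aggregates their outputs are not specified on this page, so the scoring callback, the top-k selection, and the uniform averaging are assumptions. The helper names `rank_prompts` and `average_vectors` are hypothetical, and the LLM call that produces the variants is left out.

```python
from typing import Callable, List, Sequence


def rank_prompts(variants: Sequence[str],
                 score_fn: Callable[[str], float],
                 top_k: int) -> List[str]:
    """Keep the top_k variants under a caller-supplied score (higher is better)."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    return sorted(variants, key=score_fn, reverse=True)[:top_k]


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Uniformly average per-prompt outputs (e.g. flattened predictions)."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy usage: string length stands in for a real relevance score.
variants = ["street", "a snowy street", "a snowy street at night"]
selected = rank_prompts(variants, score_fn=len, top_k=2)
aggregated = average_vectors([[1.0, 2.0], [3.0, 4.0]])
```

In practice the score would come from a semantic similarity measure between each variant and the target prompt, and the averaged vectors would be the per-variant outputs of the diffusion editor.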

Qualitative Results

Driving Scenes (NuScenes)

Figure 4. Qualitative comparison on NuScenes. AdaCFG produces more faithful edits while better preserving scene structure than fixed-CFG baselines.

Face Domains (AFHQ & CelebA-HQ)

Figure 5. Qualitative comparison on AFHQ and CelebA-HQ. Our method generalizes beyond driving scenes and remains effective on animal and human faces.

Quantitative Results

Table 1. Quantitative comparisons on NuScenes, AFHQ, and CelebA-HQ. We report CLIP↑ and DINO↑ to assess semantic alignment and structural preservation, respectively, and Align↑†, a unified GPT-4o-based metric following HQ-Edit, for overall edit quality. Rows marked "gain" give the change from adding AdaCFG to the baseline directly above.

Method             NuScenes                 AFHQ                     CelebA-HQ
                   CLIP↑   DINO↑   Align↑†  CLIP↑   DINO↑   Align↑†  CLIP↑   DINO↑   Align↑†
S-CFG [38]         18.4    33.1    14.4     20.7    51.7    32.4     19.2    61.8    19.4
CFG++ [37]         15.2    18.6    19.8     21.2    18.6    50.8     19.0    39.1    35.7
SDEdit [9]         21.6    46.3    38.5     20.7    71.5    63.7     18.7    78.0    70.6
CycleDiff [16]     22.4    55.5    69.3     23.9    44.8    85.7     23.9    59.0    80.2
HQ-Edit [24]       21.9    53.9    57.8     22.6    57.8    65.8     20.2    60.1    67.8
P2P+NTI [8, 18]    23.2    72.0    40.4     25.5    19.9    75.5     24.5    32.0    53.4
IP-Adapter [17]    20.4    31.9    44.5     22.9    34.5    61.3     21.2    38.7    72.3
P2P-zero [15]      20.3    57.1    54.2     19.0    76.4    43.6     18.5    72.1    51.5
I-P2P [11]         19.9    79.3    76.1     24.3    65.8    72.2     21.5    88.8    76.7
  + AdaCFG         22.1    78.2    86.2     26.5    45.8    83.6     23.4    86.7    84.7
  gain             +2.2    −1.1    +10.1    +2.2    −20.0   +11.4    +1.9    −2.1    +8.0
PnP [7]            25.0    51.0    85.5     24.5    63.8    79.3     23.8    71.9    85.8
  + AdaCFG         24.8    59.5    92.4     26.4    53.2    90.7     24.8    74.2    87.9
  gain             −0.2    +8.5    +6.9     +1.9    −10.6   +11.4    +1.0    +2.3    +2.1

† Align is a holistic edit-quality score computed with GPT-4o, based on the source image, target prompt, and edited output (following the HQ-Edit protocol).
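The "gain" rows in Table 1 are the difference between each AdaCFG-augmented score and its baseline. The short check below recomputes them for I-P2P from the values in the table. The dictionary layout and the helper name `gains` are illustrative choices, not part of the paper's code.

```python
from typing import Dict, Tuple

Scores = Dict[str, Tuple[float, float, float]]

# Scores copied from Table 1, ordered (CLIP, DINO, Align) per dataset.
I_P2P: Scores = {
    "NuScenes": (19.9, 79.3, 76.1),
    "AFHQ": (24.3, 65.8, 72.2),
    "CelebA-HQ": (21.5, 88.8, 76.7),
}
I_P2P_ADACFG: Scores = {
    "NuScenes": (22.1, 78.2, 86.2),
    "AFHQ": (26.5, 45.8, 83.6),
    "CelebA-HQ": (23.4, 86.7, 84.7),
}


def gains(baseline: Scores, augmented: Scores) -> Scores:
    """Per-dataset (CLIP, DINO, Align) deltas, rounded to one decimal."""
    return {
        name: tuple(round(a - b, 1) for a, b in zip(augmented[name], baseline[name]))
        for name in baseline
    }


for dataset, delta in gains(I_P2P, I_P2P_ADACFG).items():
    print(dataset, delta)
```

Read this way, the deltas make the trade-off explicit: for I-P2P, Align rises on all three datasets, while DINO drops, most visibly on AFHQ.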

Figure 6. Trade-off analysis on balancing semantic alignment (CLIP↑) and structural preservation (DINO↑). AdaCFG-augmented variants (red stars) consistently occupy the top-right region, indicating superior balance.

Citation

@article{son2026adacfg,
  author  = {Son, Bongguk and Jeon, Sangryul},
  journal = {IEEE Access},
  title   = {Adaptive Classifier-Free Guidance for Robust Image-to-Image Translation},
  year    = {2026},
  volume  = {14},
  pages   = {23556--23576},
  doi     = {10.1109/ACCESS.2026.3655782}
}