Adaptive Classifier-Free Guidance for Robust Image-to-Image Translation

Figure 1. A fixed classifier-free guidance scale fails to balance semantic fidelity and structural preservation. AdaCFG adapts the guidance trajectory to each input, producing robust edits across diverse target domains.

Abstract

Text-guided image-to-image translation aims to edit a source image according to a textual prompt while preserving its structure. However, existing approaches often rely on a fixed classifier-free guidance scale and a single prompt input, leading to unstable results and a poor balance between semantic fidelity and structural preservation.

In this work, we propose a unified framework that improves both the controllability and stability of text-driven diffusion editing without requiring fine-tuning or paired training data. Our method introduces two key components: (1) an adaptive guidance scheduler that dynamically modulates the classifier-free guidance scale over timesteps based on the input image and prompt, and (2) a prompt ensemble mechanism that generates and ranks multiple semantically aligned prompt variants to mitigate prompt sensitivity. Together, these components form a plug-and-play framework that significantly improves editing consistency and visual quality. Extensive experiments on NuScenes, AFHQ, and CelebA-HQ demonstrate that our method consistently outperforms existing approaches across diverse scenarios.

Method

A fixed classifier-free guidance (CFG) scale cannot adapt to the varying difficulty of different editing tasks: set too high, it distorts source content; set too low, it produces insufficient semantic change. AdaCFG instead predicts an input-specific guidance trajectory over diffusion timesteps and reduces sensitivity to prompt phrasing with a prompt ensemble driven by a large language model (LLM).
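For reference, the standard classifier-free guidance update that a fixed scale refers to can be written as below. The symbols are assumptions introduced here for illustration and do not appear on this page: z_t is the noisy latent at timestep t, c the text condition, ∅ the null condition, and w the guidance scale. A timestep-dependent variant simply replaces the constant w with a per-step value w_t; how AdaCFG predicts that trajectory is described only at a high level in this section.

```latex
% Standard CFG with a fixed scale w
\[
\hat{\epsilon}_\theta(z_t, c)
  = \epsilon_\theta(z_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \bigr)
\]

% Timestep-dependent variant: the constant w becomes a trajectory w_t
\[
\hat{\epsilon}_\theta(z_t, c)
  = \epsilon_\theta(z_t, \varnothing)
  + w_t \,\bigl( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \bigr)
\]
```

With w = 1 the update reduces to the plain conditional prediction; larger values push the sample further toward the prompt at the cost of fidelity to the source.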

Figure 2. Overall pipeline of AdaCFG. The adaptive guidance scheduler predicts an input-specific guidance trajectory and combines it with a prompt ensemble to steer a pre-trained diffusion editor.

Adaptive Guidance Scheduler. From the source image and prompt, we predict an initial guidance scale and a velocity term. A monotonically decreasing schedule injects strong semantic influence in early steps (shaping global content) and gradually weakens it in later steps (preserving fine structural details).
Prompt Ensemble. We use a Large Language Model to generate semantically consistent prompt variants and aggregate their outputs, removing the need for manual prompt engineering.
Extensive Evaluation. Added to I-P2P and PnP, AdaCFG raises Align in all six settings and CLIP in five of six on NuScenes (driving), AFHQ (animal faces), and CelebA-HQ (human faces), while DINO changes are mixed (Table 1). The evaluation covers both objective metrics and a user study.
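The scheduler description above (an initial guidance scale plus a velocity term, decreasing over steps) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear decay, the `w_min` floor, and the names `guidance_schedule`, `w0`, and `velocity` are all assumptions, and in AdaCFG the two values would be predicted from the source image and prompt rather than passed in by hand.

```python
from typing import List


def guidance_schedule(
    w0: float, velocity: float, num_steps: int, w_min: float = 1.0
) -> List[float]:
    """Return one guidance scale per denoising step.

    Step 0 is the first (noisiest) denoising step. The linear decay and the
    w_min floor are illustrative assumptions, not taken from the paper.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if velocity < 0:
        raise ValueError("velocity must be non-negative so the scale never increases")
    # Strong guidance early (global content), weaker guidance late (fine structure).
    return [max(w_min, w0 - velocity * k) for k in range(num_steps)]


# Hypothetical values standing in for the predicted (w0, velocity) pair.
schedule = guidance_schedule(w0=9.0, velocity=0.5, num_steps=10)
```

Because of the floor, the schedule is non-increasing rather than strictly decreasing once it reaches `w_min`. Each value would be used as the per-step guidance scale in place of a single fixed one.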

Figure 3. Prompt ensemble. The LLM expands a single target prompt into multiple semantically consistent variants; outputs are aggregated to mitigate prompt sensitivity.
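One way to picture the rank-and-aggregate step of the prompt ensemble is the sketch below. Everything in it is hypothetical: the page does not say how variants are scored or what exactly is aggregated, so the scoring function, the softmax weighting, the `temperature` parameter, and the representation of per-variant outputs as plain lists of floats are assumptions chosen for illustration. The LLM call that generates the variants is omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple


def rank_prompts(
    variants: Sequence[str], score_fn: Callable[[str], float], top_k: int
) -> List[Tuple[str, float]]:
    """Score every prompt variant and keep the top_k highest-scoring ones."""
    scored = [(p, score_fn(p)) for p in variants]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def softmax_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn scores into weights that sum to 1 (numerically stable softmax)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of equally sized per-variant output vectors."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per vector")
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Toy data: in practice the variants would come from an LLM and the scores
# from some image-text alignment measure (both are assumptions here).
variants = [
    "a snowy street at night",
    "street covered in snow, nighttime",
    "winter night road",
]
toy_scores = {variants[0]: 0.9, variants[1]: 0.7, variants[2]: 0.2}
ranked = rank_prompts(variants, lambda p: toy_scores[p], top_k=2)
weights = softmax_weights([score for _, score in ranked], temperature=0.1)

# Stand-ins for whatever each variant produces (e.g. a model prediction).
outputs = {variants[0]: [1.0, 0.0], variants[1]: [0.0, 1.0]}
blended = aggregate([outputs[p] for p, _ in ranked], weights)
```

Lowering `temperature` concentrates the weight on the best-ranked variant, while raising it moves toward a uniform average.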

Qualitative Results

Driving Scenes (NuScenes)

Figure 4. Qualitative comparison on NuScenes. AdaCFG produces more faithful edits while better preserving scene structure than fixed-CFG baselines.

Face Domains (AFHQ & CelebA-HQ)

Figure 5. Qualitative comparison on AFHQ and CelebA-HQ. Our method generalizes beyond driving scenes and remains effective on animal and human faces.

Quantitative Results

Table 1. Quantitative comparisons on NuScenes, AFHQ, and CelebA-HQ. We report CLIP↑ and DINO↑ to assess semantic alignment and structural preservation, respectively, and Align↑† (a unified GPT-4o-based metric following HQ-Edit) for overall edit quality. Rows labeled "gain" give the change relative to the base method directly above "+ AdaCFG".

Method	NuScenes			AFHQ			CelebA-HQ
Method	CLIP↑	DINO↑	Align↑†	CLIP↑	DINO↑	Align↑†	CLIP↑	DINO↑	Align↑†
S-CFG [38]	18.4	33.1	14.4	20.7	51.7	32.4	19.2	61.8	19.4
CFG++ [37]	15.2	18.6	19.8	21.2	18.6	50.8	19.0	39.1	35.7
SDEdit [9]	21.6	46.3	38.5	20.7	71.5	63.7	18.7	78.0	70.6
CycleDiff [16]	22.4	55.5	69.3	23.9	44.8	85.7	23.9	59.0	80.2
HQ-Edit [24]	21.9	53.9	57.8	22.6	57.8	65.8	20.2	60.1	67.8
P2P+NTI [8, 18]	23.2	72.0	40.4	25.5	19.9	75.5	24.5	32.0	53.4
IP-Adapter [17]	20.4	31.9	44.5	22.9	34.5	61.3	21.2	38.7	72.3
P2P-zero [15]	20.3	57.1	54.2	19.0	76.4	43.6	18.5	72.1	51.5
I-P2P [11]	19.9	79.3	76.1	24.3	65.8	72.2	21.5	88.8	76.7
+ AdaCFG	22.1	78.2	86.2	26.5	45.8	83.6	23.4	86.7	84.7
gain	+2.2	−1.1	+10.1	+2.2	−20.0	+11.4	+1.9	−2.1	+8.0
PnP [7]	25.0	51.0	85.5	24.5	63.8	79.3	23.8	71.9	85.8
+ AdaCFG	24.8	59.5	92.4	26.4	53.2	90.7	24.8	74.2	87.9
gain	−0.2	+8.5	+6.9	+1.9	−10.6	+11.4	+1.0	+2.3	+2.1

† Align is a holistic edit-quality score computed with GPT-4o, based on the source image, target prompt, and edited output (following the HQ-Edit protocol).
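The "gain" rows in Table 1 can be reproduced directly from the reported scores. The short check below copies the I-P2P and PnP rows and their "+ AdaCFG" counterparts from the table and recomputes the differences; the variable and function names are introduced here for illustration only.

```python
from typing import Dict, List

# Column order: NuScenes (CLIP, DINO, Align), AFHQ (...), CelebA-HQ (...).
base: Dict[str, List[float]] = {
    "I-P2P": [19.9, 79.3, 76.1, 24.3, 65.8, 72.2, 21.5, 88.8, 76.7],
    "PnP": [25.0, 51.0, 85.5, 24.5, 63.8, 79.3, 23.8, 71.9, 85.8],
}
with_adacfg: Dict[str, List[float]] = {
    "I-P2P": [22.1, 78.2, 86.2, 26.5, 45.8, 83.6, 23.4, 86.7, 84.7],
    "PnP": [24.8, 59.5, 92.4, 26.4, 53.2, 90.7, 24.8, 74.2, 87.9],
}


def gains(before: List[float], after: List[float]) -> List[float]:
    """Per-column difference, rounded to one decimal like the table."""
    return [round(a - b, 1) for a, b in zip(after, before)]


ip2p_gain = gains(base["I-P2P"], with_adacfg["I-P2P"])
pnp_gain = gains(base["PnP"], with_adacfg["PnP"])

# Align is every third column starting at index 2; CLIP starts at index 0.
align_up = sum(g > 0 for g in ip2p_gain[2::3] + pnp_gain[2::3])
clip_up = sum(g > 0 for g in ip2p_gain[0::3] + pnp_gain[0::3])
```

The recomputed values match the table: Align improves in all six dataset and backbone combinations, CLIP in five of six, and DINO drops most noticeably on AFHQ.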

Figure 6. Trade-off analysis on balancing semantic alignment (CLIP↑) and structural preservation (DINO↑). AdaCFG-augmented variants (red stars) consistently occupy the top-right region, indicating superior balance.

Citation

@article{son2026adacfg,
  author  = {Son, Bongguk and Jeon, Sangryul},
  journal = {IEEE Access},
  title   = {Adaptive Classifier-Free Guidance for Robust Image-to-Image Translation},
  year    = {2026},
  volume  = {14},
  pages   = {23556--23576},
  doi     = {10.1109/ACCESS.2026.3655782}
}