Information

  • Title: Diffusion Models already have a Semantic Latent Space (ICLR 2023)

  • Reference

  • Author: Sehwan Park

  • Last updated on Nov. 18, 2023

Diffusion Models already have a Semantic Latent Space#

Abstract#

Diffusion model์€ ๋งŽ์€ domain์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ generative process๋ฅผ controlํ•˜๋Š” semantic latent space๊ฐ€ ๋ถ€์กฑํ•˜๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” diffusion model์†์—์„œ semantic latent space๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ธฐ ์œ„ํ•œ asymmetric reverse process(asyrp)๋ฅผ ์ œ์•ˆํ•˜๊ณ  h-space๋ผ๊ณ  ๋ช…์นญํ•œ semantic latent space์˜ ์ข‹์€ ํŠน์„ฑ(homogeneity, linearity, robustness, consistency across timesteps)๋“ค์„ ๋ณด์—ฌ์ค€๋‹ค. ์ถ”๊ฐ€์ ์œผ๋กœ editing strength์™€ quality deficiency๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์‚ผ๊ณ  ๋” ์ข‹์€ image-image translation์„ ์œ„ํ•œ Generative Process Design์„ ์†Œ๊ฐœํ•œ๋‹ค.

1. Introduction#

Asyrp_1

Fig. 431 Manipulation approaches for diffusion models#

(a) Image guidance๋Š” unconditionalํ•œ latent variable์— guiding image์˜ latent variable์„ ํ•ฉ์น˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ latent variable์„ ๋‘˜ ๋‹ค ์ด์šฉํ•˜๋ฉด์„œ ๋ช…ํ™•ํ•˜๊ฒŒ controlํ•˜๊ธฐ๊ฐ€ ์‰ฝ์ง€ ์•Š๋‹ค.

(b) Classifier guidance๋Š” diffusion model์— classifier๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ generative process๋ฅผ ๊ฑฐ์น˜๋Š” ๋™์•ˆ latent variable์ด ์–ด๋–ค class์ธ์ง€ ๋ถ„๋ฅ˜ํ•˜๊ณ  target class์— ๊ฐ€๊นŒ์›Œ์ง€๋„๋ก score๋ฅผ ๋ถ€์—ฌํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ž‘๋™ํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ latent variable๋“ค์— ๋Œ€ํ•ด classify๋ฅผ ์‹คํ–‰ํ•ด์•ผ ํ•˜๊ธฐ์— pretrained model์„ ์‚ฌ์šฉํ•˜๊ธฐ๊ฐ€ ํž˜๋“ค์–ด ์ง์ ‘ ํ•™์Šต์„ ์‹œ์ผœ์•ผ ํ•˜๊ธฐ์— ์‹œ๊ฐ„์ ์œผ๋กœ, ๋น„์šฉ์ ์œผ๋กœ ๋ถ€๋‹ด์ด ๋œ๋‹ค.

(c) DiffusionCLIP achieves editing by fine-tuning the entire diffusion model with a CLIP loss.

(d) Diffusion Models already have a Semantic Latent Space discovers, inside a frozen diffusion model, a semantic latent space with very good properties for editing the attributes of the original image, and names it h-space. h-space exhibits a variety of nice properties. The paper also designs and proposes a new generative process for versatile editing and quality boosting. h-space is the first semantic latent space discovered in a frozen pretrained diffusion model.

2. Background#

2.1 Denoising Diffusion Probabilistic Model (DDPM)#

DDPM์—์„œ๋Š” ์ž„์˜์˜ time step t๋กœ ๋ถ€ํ„ฐ noise๊ฐ€ ๊ปด์žˆ๋Š” image xt์˜ ฯตt๊ฐ€ ์–ผ๋งŒํผ์ธ์ง€ ์˜ˆ์ธกํ•œ๋‹ค. ์˜ˆ์ธกํ•œ ฯตt๋ฅผ ์ด์šฉํ•˜์—ฌ noise๊ฐ€ ์ผ๋ถ€ ์ œ๊ฑฐ๋œ ์ด์ „ step์˜ mean(ฮผฮธ(xt))์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๊ณ  variance(โˆ‘ฮธ(xt))๋Š” constantํ•œ ๊ฐ’์œผ๋กœ ๊ณ ์ •์‹œํ‚จ๋‹ค. DDPM์—์„œ ์ œ์‹œํ•œ forward process์™€ reverse process๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. DDPM์—์„œ์˜ ฯƒt2=ฮฒt์ด๋‹ค.

$$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{\alpha_t}\,x_{t-1},\, (1-\alpha_t)I\big)$$

$$p_\theta(x_{t-1}|x_t) := \mathcal{N}\big(\mu_\theta(x_t),\, \Sigma_\theta(x_t)\big)$$

$$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\,\epsilon_t^\theta(x_t)\right) + \sigma_t z_t$$

2.2 Denoising Diffusion Implicit Model (DDIM)#

DDIM์—์„œ๋Š” non-Markovian process๋ฅผ ์ด์šฉํ•ด ๋˜ ๋‹ค๋ฅธ ๊ด€์ ์˜ reverse process๋ฅผ ์ œ์‹œํ•˜์˜€๊ณ , DDPM๊ณผ DDIM ๋ชจ๋‘ generalํ•˜๊ฒŒ ์ ์šฉ๋˜๋Š” Diffusion process์— ๋Œ€ํ•œ ์‹์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ฯƒt=ฮท(1โˆ’ฮฑtโˆ’1)/(1โˆ’ฮฑt)1โˆ’ฮฑt/ฮฑtโˆ’1์ด๋‹ค.

ฮท=1์ธ ๊ฒฝ์šฐ DDPM์ด ๋˜๊ณ  stochasticํ•ด์ง€๋ฉฐ, ฮท=0์ธ ๊ฒฝ์šฐ DDIM์ด ๋˜๊ณ  deterministicํ•ด์ง„๋‹ค.

qฯƒ(xtโˆ’1|xt,x0)=N(ฮฑtโˆ’1x0+1โˆ’ฮฑtโˆ’1โˆ’ฯƒt2โ‹…xtโˆ’ฮฑtx01โˆ’ฮฑt,ฯƒt2I)
xtโˆ’1=ฮฑtโˆ’1(xtโˆ’1โˆ’ฮฑtฯตtฮธ(xt)ฮฑt)โŸpredicted x0+1โˆ’ฮฑtโˆ’1โˆ’ฯƒt2โ‹…ฯตtฮธ(xt)โŸdirection pointing to xt+ฯƒtzt

2.3 Image Manipulation with CLIP#

CLIP์€ Image Encoder์™€ Text Encoder๋ฅผ ์ด์šฉํ•˜์—ฌ image์™€ text๊ฐ„์˜ embedding์„ ํ•™์Šตํ•œ๋‹ค. ํŽธ์ง‘๋œ ์ด๋ฏธ์ง€์™€ ๋Œ€์ƒ ์„ค๋ช… ๊ฐ„์˜ cosine distance๋ฅผ ์ง์ ‘ ์ตœ์†Œํ™”ํ•˜๋Š” ๋Œ€์‹  cosine distance๋ฅผ ์‚ฌ์šฉํ•œ directional loss๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ mode collapse์—†์ด ๊ท ์ผํ•œ editing์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

ฮ”T=ET(ytarget)โˆ’ET(ysource)
ฮ”I=EI(xedit)โˆ’EI(xsource)

Ldirection(xedit,ytarget;xsource,ysource):=1โˆ’ฮ”Iโ‹…ฮ”Tโˆฅฮ”Iโˆฅโˆฅฮ”Tโˆฅ
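A minimal sketch of this directional loss, assuming the CLIP image/text embeddings have already been computed (the function name and argument names are illustrative):

```python
import torch.nn.functional as F

def directional_loss(img_src_emb, img_edit_emb, txt_src_emb, txt_tgt_emb):
    """1 - cosine similarity between the image-edit direction and the text direction."""
    delta_i = img_edit_emb - img_src_emb   # ΔI = E_I(x_edit) - E_I(x_source)
    delta_t = txt_tgt_emb - txt_src_emb    # ΔT = E_T(y_target) - E_T(y_source)
    return 1.0 - F.cosine_similarity(delta_i, delta_t, dim=-1).mean()
```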

3. Discovering Semantic Latent Space In Diffusion Models#

Editiing์„ ํ•˜๋Š” ๊ณผ์ •์—์„œ naive approach๋ฅผ ํ†ตํ•ด์„œ๋Š” editing์ด ์ž˜ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š๋Š”๋‹ค. ์ด chapter์—์„œ๋Š” ์™œ ์ž˜ ์ด๋ฃจ์–ด์ง€์ง€ ์•Š๋Š”์ง€์— ๋Œ€ํ•œ ์„ค๋ช…์„ ํ•˜๊ณ  ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ์ƒˆ๋กœ์šด controllableํ•œ ํ•œ reverse process์ธ Asymmetric Reverse Process(Asyrp)๋ฅผ ์ œ์•ˆํ•œ๋‹ค.

DDIM์—์„œ xtโˆ’1์— ๋Œ€ํ•œ ์ˆ˜์‹์„ ์„ค๋ช…ํ•˜์˜€๋Š”๋ฐ ์ด chapter๋ถ€ํ„ฐ๋Š” โ€œpredicted x0โ€๋ถ€๋ถ„์„ Pt(ฯตtฮธ(xt)) ์ฆ‰ Pt๋ผ๊ณ  ์„ค์ •ํ•˜๊ณ , โ€œdirection pointing to xtโ€๋ถ€๋ถ„์„ Dt(ฯตtฮธ(xt)) ์ฆ‰ Dt๋ผ๊ณ  ์„ค์ •ํ•˜์˜€๋‹ค.

Pt๋Š” latent variable๋กœ ๋ถ€ํ„ฐ x0๋ฅผ ์˜ˆ์ธกํ•˜๋Š” reverse process์™€ ๊ฐ™์€ ์—ญํ• ์„ ๋‹ด๋‹นํ•˜๊ณ  Dt๋Š” ๋‹ค์‹œ noise๋ฅผ ์ถ”๊ฐ€ํ•ด latent variable๋กœ ๋Œ์•„๊ฐ€๊ธฐ์— forward process์™€ ๊ฐ™์€ ์—ญํ• ์„ ๋‹ด๋‹นํ•œ๋‹ค.

xtโˆ’1=ฮฑtโˆ’1(xtโˆ’1โˆ’ฮฑtฯตtฮธ(xt)ฮฑt)โŸPt(ฯตtฮธ(xt))+1โˆ’ฮฑtโˆ’1โˆ’ฯƒt2โ‹…ฯตtฮธ(xt)โŸDt(ฯตtฮธ(xt))+ฯƒtzt
xtโˆ’1=ฮฑtโˆ’1Pt(ฯตtฮธ(xt))+Dt(ฯตtฮธ(xt))+ฯƒtzt

3.1 Problem#

xT๋กœ ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ image x0๋ฅผ given text prompts์— ๋งž๊ฒŒ manipulate์‹œํ‚ค๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ 2.3์—์„œ ์†Œ๊ฐœํ•œ Ldirection์„ optimizeํ•˜๋„๋ก xT๋ฅผ updateํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์€ distorted images๋ฅผ ์ƒ์„ฑํ•˜๊ฑฐ๋‚˜ ๋ถ€์ •ํ™•ํ•œ manipulation์„ ํ•œ๋‹ค๊ณ  ํ•œ๋‹ค.

์ด์— ๋Œ€ํ•œ ๋Œ€์•ˆ์œผ๋กœ, ๋ชจ๋“  sampling step์—์„œ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ manipulateํ•˜๋„๋ก ฯตtฮธ๋ฅผ shiftํ•ด์ฃผ๋Š” ๋ฐฉ๋ฒ•์ด ์ œ์‹œ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์€ x0๋ฅผ ์™„์ „ํžˆ manipulateํ•˜์ง€ ๋ชปํ•œ๋‹ค. ์™œ๋ƒํ•˜๋ฉด Pt์™€ Dt์—์„œ ๋‘˜๋‹ค shifted๋œ ฯต~tฮธ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ์— cancel out๋˜์–ด ๊ฒฐ๊ตญ latent variable์—์„œ๋Š” ๊ธฐ์กด๊ณผ ๋‹ค๋ฆ„์ด ์—†๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ž์„ธํ•œ ์ฆ๋ช…์€ Proof of Theroem์„ ๋ณด๋ฉด ๋œ๋‹ค.

Proof of Theorem)

Define $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$. Since DDIM is used, σ_t = 0, so √(1−α_{t−1}−σ_t²) = √(1−α_{t−1}) and the σ_t z_t term drops out below.

$$\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\,\mathbf{P}_t(\tilde{\epsilon}_t^\theta(x_t)) + \mathbf{D}_t(\tilde{\epsilon}_t^\theta(x_t)) + \sigma_t z_t$$

$$= \sqrt{\alpha_{t-1}}\underbrace{\left(\frac{x_t - \sqrt{1-\alpha_t}\,(\epsilon_t^\theta(x_t)+\Delta\epsilon_t)}{\sqrt{\alpha_t}}\right)}_{\mathbf{P}_t(\tilde{\epsilon}_t^\theta)} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot(\epsilon_t^\theta(x_t)+\Delta\epsilon_t)}_{\mathbf{D}_t(\tilde{\epsilon}_t^\theta)} + \sigma_t z_t$$

$$= \sqrt{\alpha_{t-1}}\,\mathbf{P}_t(\epsilon_t^\theta(x_t)) + \mathbf{D}_t(\epsilon_t^\theta(x_t)) - \sqrt{\alpha_{t-1}}\,\frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\,\Delta\epsilon_t + \sqrt{1-\alpha_{t-1}}\,\Delta\epsilon_t$$

$\sqrt{\alpha_{t-1}}\,\mathbf{P}_t(\epsilon_t^\theta(x_t)) + \mathbf{D}_t(\epsilon_t^\theta(x_t))$ is exactly the DDIM expression for $x_{t-1}$, so collecting only the $\Delta\epsilon_t$ terms gives the following.

$$= x_{t-1} + \left(-\frac{\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}} + \sqrt{1-\alpha_{t-1}}\right)\cdot\Delta\epsilon_t$$

$$= x_{t-1} + \left(-\frac{\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}} + \frac{\sqrt{1-\prod_{s=1}^{t-1}(1-\beta_s)}\,\sqrt{1-\beta_t}}{\sqrt{1-\beta_t}}\right)\cdot\Delta\epsilon_t$$

Pulling $\sqrt{1-\prod_{s=1}^{t-1}(1-\beta_s)}\,\sqrt{1-\beta_t}$ under a single square root, the inside simplifies to $1-\alpha_t-\beta_t$, so the expression becomes:

$$= x_{t-1} + \left(\frac{\sqrt{1-\alpha_t-\beta_t}-\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}}\right)\cdot\Delta\epsilon_t$$

$$\therefore\; \Delta x_t = \tilde{x}_{t-1} - x_{t-1} = \frac{\sqrt{1-\alpha_t-\beta_t}-\sqrt{1-\alpha_t}}{\sqrt{1-\beta_t}}\cdot\Delta\epsilon_t$$

shifted epsilon์„ ์‚ฌ์šฉํ•œ ๊ฒฐ๊ณผ์ด๋‹ค. ๋ถ„์ž๋ฅผ ๋ณด๋ฉด ฮฒt๋Š” ๋งค์šฐ ์ž‘๊ธฐ์— ๊ฑฐ์˜ 0์— ์ˆ˜๋ ดํ•˜๊ธฐ์— ๊ฒฐ๊ตญ ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜ ์—†์Œ์„ ๋ณด์ธ๋‹ค.
์ฆ‰ ฯต-space์—์„œ์˜ manipulation ํšจ๊ณผ๋Š” ๋งค์šฐ ์ข‹์ง€ ์•Š์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Asyrp_2

Fig. 432 No Manipulation Effect with shifted epsilon#

3.2 Asymmetric Reverse Process (Asyrp)#

chapter 3.1์—์„œ ฯต-space์—์„œ์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ €์ž๋“ค์€ Asyrp๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋ฆ„ ๊ทธ๋Œ€๋กœ ๋น„๋Œ€์นญ์ ์ธ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ x0๋ฅผ ์˜ˆ์ธกํ•˜๋Š” Pt์—์„œ๋Š” shifted epsilon์„ ์‚ฌ์šฉํ•˜๊ณ , latent variable๋กœ ๋Œ์•„๊ฐ€๋Š” Dt์—์„œ๋Š” non-shifted epsilon์„ ์‚ฌ์šฉํ•ด์„œ ์ „์ฒด์ ์ธ ๋ณ€ํ™”๋ฅผ ์ค€๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ฆ‰, Pt๋งŒmodifyํ•˜๊ณ  Dt๋Š” ์œ ์ง€ํ•œ๋‹ค. Asyrp๋ฅผ ์‹์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

xtโˆ’1=ฮฑtโˆ’1Pt(ฯต~tฮธ(xt))+Dt(ฯตtฮธ(xt))

Loss์‹ ๋˜ํ•œ chapter 2.3์—์„œ ์ œ์‹œํ•œ Ldirection์„ ์‚ฌ์šฉํ•˜์—ฌ ์žฌ๊ตฌ์„ฑํ•˜์˜€๋‹ค. modify๋ฅผ ํ•˜์ง€ ์•Š์€ Ptsource์™€ modifiy๋ฅผ ํ•œ Ptedit์„ ์‚ฌ์šฉํ•œ๋‹ค. Loss์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

L(t)=ฮปCLIP(Ptedit,yref;Ptsource,ysource)+ฮปrecon|Pteditโˆ’Ptsource|

์ „์ฒด์ ์ธ reverse process๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค๊ณ„๊ฐ€ ๋˜์—ˆ๋‹ค. ์ด์ œ shifted epsilon์ธ ฯต~tฮธ(xt)๋ฅผ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์–ป์„ ๊ฒƒ์ธ์ง€์— ๋Œ€ํ•œ ์„ค๊ณ„๊ฐ€ ํ•„์š”ํ•˜๋‹ค. ์ €์ž๋“ค์€ ๊ธฐ์กด์˜ ฯต-space์—์„œ ๋ณ€ํ™”๋ฅผ ์ฃผ๋Š” ๊ฒƒ๋ณด๋‹ค ํ›จ์”ฌ ๋” ์ข‹์€ result๋ฅผ ๋ณด์ด๊ณ , nice properties๋ฅผ ๊ฐ€์ง€๋Š” h-space์—์„œ ๋ณ€ํ™”๋ฅผ ์ฃผ๋Š” ๊ฒƒ์„ ์ œ์•ˆํ•œ๋‹ค.

3.3 h-space#

ฯตtฮธ๋Š” diffusion models์˜ backbone์ธ U-Net์—์„œ ๋„์ถœ๋œ๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” Image manipulation์„ ์œ„ํ•ด ฯตtฮธ๋ฅผ controlํ•˜๋Š” space๋ฅผ U-Net์˜ bottleneck ์ฆ‰, ๊ฐ€์žฅ ๊นŠ์€ feature map์ธ ht๋กœ ์ •ํ•˜์˜€๋‹ค. ์ด๋ฅผ h-space๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค. h-space๋Š” ฯต-space๋ณด๋‹ค ๋” ์ž‘์€ spatial resolutions์„ ๊ฐ€์ง€๊ณ  high-level semantic๋ฅผ ๊ฐ€์ง„๋‹ค. ๋˜ํ•œ ฯต-space์—์„œ๋Š” ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์—†๋Š” ๋งค์šฐ niceํ•œ ํŠน์„ฑ๋“ค์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

Asyrp_3

Fig. 433 U-Net structure and h-space#

h-space์˜ ํฌ๊ธฐ๋Š” 82ร—512์ด๊ณ  ฯต-space์˜ ํฌ๊ธฐ๋Š” 2562ร—3์œผ๋กœ h-space์—์„œ์˜ control์ด ๋” ์ง€๋ฐฐ์ ์ด๊ณ  robustํ•จ์„ ์ถ”์ธกํ•  ์ˆ˜ ์žˆ๋‹ค(์‹ค์ œ ์‹คํ—˜์ ์œผ๋กœ ์ฆ๋ช…์„ ํ•จ). h-space๋Š” skip-connection์˜ ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š์œผ๋ฉฐ ๊ฐ€์žฅ ์••์ถ•๋œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ณต๊ฐ„์ด๋ฉฐ image๋ฅผ controlํ•˜๋Š”๋ฐ์— ์žˆ์–ด ๋งค์šฐ ์ข‹์€ ํŠน์„ฑ๋“ค์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์‹ค์ œ ์ €์ž๋“ค์€ h-space๋ฅผ ์ง€์ •ํ•˜๊ธฐ ์œ„ํ•ด U-Net์˜ ๋ชจ๋“  feature map์„ h-space๋กœ ์„ค์ •ํ•ด๋‘๊ณ  ์‹คํ—˜์„ ํ•ด๋ณด์•˜๋Š”๋ฐ ์œ„์˜ ๊ทธ๋ฆผ์„ ๊ธฐ์ค€์œผ๋กœ 8th layer์ด์ „์˜ feature map์„ h-space๋กœ ์ง€์ •ํ•œ ๊ฒฝ์šฐ์—๋Š” manipulaton์ด ์ ๊ฒŒ ์ด๋ฃจ์–ด์กŒ๊ณ , 8th layer ์ดํ›„์˜ feature map์„ h-space๋กœ ์ง€์ •ํ•œ ๊ฒฝ์šฐ์—๋Š” ๋„ˆ๋ฌด ๊ณผํ•œ manipulation์ด ์ด๋ฃจ์–ด์ง€๊ฑฐ๋‚˜ ์•„์˜ˆ distorted image๊ฐ€ ์ƒ์„ฑ๋˜์—ˆ๋‹ค. h-space๋งŒ์˜ ํŠน์„ฑ์€ chapter5์—์„œ ์„ค๋ช…ํ•œ๋‹ค.

3.4 Implicit Neural Directions#

Asyrp_4

Fig. 434 Illustration of f(t)#

ฮ”ht๊ฐ€ image๋ฅผ manipulatingํ•˜๋Š”๋ฐ ์„ฑ๊ณตํ–ˆ์Œ์—๋„, ์ˆ˜๋งŽ์€ timestep์—์„œ ๋งค๋ฒˆ optimizingํ•˜๊ธฐ๋ž€ ์‰ฝ์ง€ ์•Š๋‹ค. ๋Œ€์‹ ์— ๋…ผ๋ฌธ์—์„œ๋Š” ht๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ฮ”h๋ฅผ ์ถœ๋ ฅํ•ด์ฃผ๋Š” ์ž‘์€ neural network์ธ f(t)๋ฅผ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. f(t)๋Š” ฮ”ht๋ฅผ ๋งค๋ฒˆ ๋ชจ๋“  timestep์—์„œ optimizingํ•ด์ค˜์•ผ ํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋น„ํ•ด ์‹œ๊ฐ„๋„ ๋น ๋ฅด๊ณ  setting๊ฐ’๋“ค์— ๋Œ€ํ•ด robustํ•˜๋‹ค. ๋˜ํ•œ ์ฃผ์–ด์ง„ timestep๊ณผ bottleneck feature์ธ ht์— ๋Œ€ํ•ด ฮ”ht๋ฅผ ์ถœ๋ ฅํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํ•™์Šตํ•˜๊ธฐ์— unseen timestep๊ณผ bottleneck feature์— ๋Œ€ํ•ด์„œ๋„ ์ผ๋ฐ˜ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์ด๋Š” acceleratedํ•œ ๊ณผ์ •์—์„œ๋„ ํฐ ํšจ๊ณผ๋ฅผ ๋ณธ๋‹ค. training scheme์ด ์–ด๋–ป๋“  ๊ฐ„์— ๊ฒฐ๊ตญ ๋ถ€์—ฌํ•˜๋Š” โˆ‘ฮ”ht๋งŒ ๋ณด์กด๋œ๋‹ค๋ฉด, ์–ด๋– ํ•œ length๋ฅผ ์„ค๊ณ„ํ•ด๋„ ๋น„์Šทํ•œ manipulationํšจ๊ณผ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

h-space์—์„œ epsilon์„ controlํ•ด์„œ asyrp ์ด์šฉํ•˜๋Š” ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์ดํ•ด๋ฅผ ์œ„ํ•ด ฯต-space์™€ h-space์—์„œ์˜ shifted epsilon ฯต~tฮธ(xt)์„ ๋น„๊ตํ•˜์˜€๋‹ค.

  • ฯต-space์—์„œ์˜ shifted epsilon

    ฯต~tฮธ(xt)=ฯตtฮธ(xt)+ฮ”ฯตt

  • h-space์—์„œ์˜ shifted epsilon

    ฯต~tฮธ(xt)=ฯตtฮธ(xt|ฮ”ht)

xtโˆ’1=ฮฑtโˆ’1Pt(ฯตtฮธ(xt|ฮ”ht))+Dt(ฯตtฮธ(xt))
Asyrp_5

Fig. 435 Asymmetric Reverse Process#

4. Generative Process Design#

Asyrp_6

Fig. 436 Intuition for choosing the intervals for editing and quality boosting#

Perception Prioritized Training of Diffusion Models (Choi et al.) suggests that diffusion models generate high-level context in the early stage and imperceptible fine details in the later stage. Accordingly, this paper proposes a new generative process design that splits the timesteps into an editing process, which performs editing in the early stage, and a quality boosting interval, which handles the imperceptible fine details in the later stage.

4.1 Editing Process With Asyrp#

Editing Process์—์„œ๋Š” high-level context๊ฐ€ generate๋˜์–ด์•ผ ํ•˜๋ฏ€๋กœ ์ „์ฒด timestep[0,T]์—์„œ Editing Process๋ฅผ ์œ„ํ•œ editing interval์„ [T, tedit]์œผ๋กœ ์„ค์ •ํ•˜์˜€๋‹ค. tedit์˜ ์‹œ์ ์„ ๊ฒฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด LPIPS ์ธก์ •์ง€ํ‘œ๋ฅผ ์ด์šฉํ•œ๋‹ค. LPIPS(x,Pt)๋Š” t์‹œ์ ์—์„œ ์˜ˆ์ธกํ•œ x0์™€ target์ด ๋˜๋Š” original image๊ฐ„์˜ perceptual distance๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ๋”ฐ๋ผ์„œ LPIPS๋ฅผ ๋‚จ์€ reverse process์„ ํ†ตํ•ด editing ํ•ด์•ผ ํ•  ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ์ธก์ •ํ•˜๋Š” ์ง€ํ‘œ๋ผ๊ณ  ๋ณผ ์ˆ˜๋„ ์žˆ๋‹ค. ์ฒซ step T์˜ LPIPS๋กœ ๋ถ€ํ„ฐ tedit์‹œ์ ์—์„œ์˜ LPIPS ์ฐจ์ด๋Š” Editing Process์—์„œ ์–ผ๋งŒํผ์˜ perceptual change๋ฅผ ์ฃผ์—ˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ด ๊ฐ’์„ editing strength(ฯตt)๋ผ๊ณ  ์ •์˜ํ•œ๋‹ค.

ฮพt=LPIPS(x,PT)โˆ’LPIPS(x,Pt)

Editing interval์ด ์ž‘์œผ๋ฉด ฮพt๊ฐ€ ์ž‘์•„์ง€๋ฉฐ ๋ณ€ํ™”๊ฐ€ ๋งŽ์ด ์ผ์–ด๋‚˜์ง€ ์•Š๊ณ  ๋ฐ˜๋ฉด, Editing interval์ด ํฌ๋ฉด ฮพt๊ฐ€ ์ปค์ง€๊ณ  ๋ณ€ํ™”๊ฐ€ ๋งŽ์ด ์ผ์–ด๋‚œ๋‹ค. ๋”ฐ๋ผ์„œ ์ถฉ๋ถ„ํ•œ ๋ณ€ํ™”๋ฅผ ์ค„ ์ˆ˜ ์žˆ๋Š” ํ•œ์—์„œ ๊ฐ€์žฅ ์ตœ์†Œ์˜ Editing interval์„ ์ฐพ๋Š” ๊ฒƒ์ด tedit์„ ๊ฒฐ์ •ํ•˜๋Š” ์ตœ๊ณ ์˜ ๋ฐฉ๋ฒ•์ด๋‹ค. ์ €์ž๋“ค์€ ์‹คํ—˜์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด LPIPS(x,Pt) = 0.33์ธ t์‹œ์ ์„ tedit์œผ๋กœ ๊ฒฐ์ •ํ•˜์˜€๋‹ค.

Asyrp_7

Fig. 437 Results based on various LPIPS(x,Ptedit)#

Asyrp_8

Fig. 438 Importance of choosing proper tedit#

๋ช‡๋ช‡ ํŠน์„ฑ๋“ค์€ ๋‹ค๋ฅธ ํŠน์„ฑ๋“ค์— ๋น„ํ•ด visual change๋ฅผ ๋งŽ์ด ํ•„์š”๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๋„ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด source image์— ๋Œ€ํ•ด smileํ•œ attribute๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ๋ณด๋‹ค pixar style์˜ attribute์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋” ๋งŽ์€ visual change๋ฅผ ํ•„์š”๋กœ ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ์—๋Š” Editing interval์„ ๋” ๊ธธ๊ฒŒ ์„ค์ •ํ•ด์•ผ ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐ์—๋Š” LPIPS(x,Pt) = 0.33 - ฮด๋ฅผ ๋งŒ์กฑํ•˜๋Š” t๋ฅผ tedit์œผ๋กœ ์„ค์ •ํ•œ๋‹ค. ์ด ๋•Œ, ฮด=0.33d(ET(ysource),ET(ytarget))์ด๋‹ค. ET๋Š” CLIP text embedding์„ ์ง„ํ–‰ํ•˜๋Š” Text Encoder๋ฅผ ์˜๋ฏธํ•˜๋ฉฐ, d๋Š” cosine distance๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ์„ ํ†ตํ•ด ๋” ๋งŽ์€ visual change๋ฅผ ์š”๊ตฌํ•˜๋Š” attributes์— ๋Œ€ํ•ด์„œ๋Š” tedit์ด ๋” ์ž‘์Œ(Editing Interval์ด ๊น€)์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

Asyrp_9

Fig. 439 Flexible tedit based on the amount of visual changes.#

4.2 Quality Boosting With Stochastic Noise Injection#

DDIM์€ ฮท=0์œผ๋กœ ์„ค์ •ํ•˜๋ฉฐ stochasticity๋ฅผ ์ œ๊ฑฐํ•˜์—ฌ ๊ฑฐ์˜ ์™„๋ฒฝํ•œ inversion์„ ๊ฐ€๋Šฅ์ผ€ ํ•˜์˜€๋‹ค. Elucidating the design space of diffusionbased generative models(Karras et al.)์—์„œ๋Š” stochasticity๊ฐ€ image quality๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค๊ณ  ์ฆ๋ช…ํ•˜์˜€๋‹ค. ์ด์— ๋”ฐ๋ผ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Generative Process์— stochastic noise๋ฅผ ์ฃผ์ž…ํ•˜๋Š” quality boosting ๋‹จ๊ณ„๋ฅผ ์„ค์ •ํ•˜๊ณ  boosting interval์€ [tboost, 0]์ด๋‹ค.

Boosting Interval์— ๋”ฐ๋ผ image quality๋ฅผ controlํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, Boosting Interval์ด ๊ธธ๊ฒŒ๋˜๋ฉด, Quality๋Š” ์ฆ๊ฐ€ํ•˜์ง€๋งŒ Interval๋™์•ˆ ๊ณ„์†ํ•ด์„œ stochastic noise๋ฅผ ์ฃผ์ž…ํ•ด์•ผ ํ•˜๊ธฐ์— content๊ฐ€ ๋ณ€ํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜๋„ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ถฉ๋ถ„ํ•œ quality boosting์„ ๋‹ฌ์„ฑํ•˜๋ฉด์„œ๋„ content์— ์ตœ์†Œํ•œ์˜ ๋ณ€ํ™”๋งŒ์„ ์ค„ ์ˆ˜ ์žˆ๋„๋ก tboost๋ฅผ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ์ €์ž๋“ค์€ image์— ๊ปด์žˆ๋Š” noise๋ฅผ quality boosting์„ ํ†ตํ•ด ํ•ด๊ฒฐํ•ด์•ผ ํ•  ๋ถ€๋ถ„์œผ๋กœ ๋ณด์•˜์œผ๋ฉฐ target์ด ๋˜๋Š” original image๋กœ ๋ถ€ํ„ฐ t์‹œ์ ์˜ image xt์— ์–ผ๋งŒํผ์˜ noise๊ฐ€ ๊ปด์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ์ง€ํ‘œ๋กœ quality deficiency ฮณt๋ฅผ ์ด์šฉํ•œ๋‹ค.

ฮณt=LPIPS(x,xt)

์—ฌ๊ธฐ์„œ๋Š” editing strength์™€๋Š” ๋‹ค๋ฅด๊ฒŒ time step์— ๋”ฐ๋ผ ์˜ˆ์ธกํ•œ x0์ธ Pt๊ฐ€ ์•„๋‹Œ latent variable xt๋ฅผ ์ด์šฉํ•œ๋‹ค. ์ €์ž๋“ค์€ noise๋ฅผ ํŒ๋‹จํ•˜๋Š”๋ฐ์— ์žˆ์–ด์„œ semantics๋ณด๋‹ค๋Š” actual image๋ฅผ ๊ณ ๋ คํ–ˆ๊ธฐ์— ์œ„์™€ ๊ฐ™์ด ์„ค์ •ํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค. ์ €์ž๋“ค์€ ์‹คํ—˜์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด ฮณt = 1.2์ธ t์‹œ์ ์„ tboost๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.

Asyrp_10

Fig. 440 Results based on various ฮณtboost#

Asyrp_11

Fig. 441 Quality comparison based on the presence of quality boosting#

4.3 Overall Process of Image Editing#

Generalํ•œ Diffusion model์—์„œ์˜ Generative Process๋ฅผ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

xtโˆ’1=ฮฑtโˆ’1Pt(ฯตtฮธ)+Dt(ฯตtฮธ)+ฯƒtzt(where,ฯƒt=ฮท(1โˆ’ฮฑtโˆ’1)/(1โˆ’ฮฑt)1โˆ’ฮฑt/ฮฑtโˆ’1)

ฮท = 0์ธ ๊ฒฝ์šฐ์—๋Š” DDIM์ด ๋˜๋ฉฐ, stochastic noise๋ฅผ ๋”ํ•˜๋Š” ๋ถ€๋ถ„์ด ์‚ฌ๋ผ์ ธ deterministicํ•ด์ง„๋‹ค. ฮท = 1์ธ ๊ฒฝ์šฐ์—๋Š” DDPM์ด ๋˜๋ฉฐ, stochasticํ•œ ํŠน์„ฑ์ด ์žˆ๋‹ค. Asyrp(Assymetric Reverse Process)์—์„œ๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ DDIM์„ ์‚ฌ์šฉํ•˜๋ฉฐ Pt์—์„œ h-space๋ฅผ ํ†ตํ•ด control๋œ ฯตtฮธ(xt|ft)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. Diffusion Models already have a Semantic Latent Space์—์„œ ์ œ์‹œํ•œ Generative Process๋ฅผ ์ „์ฒด์ ์œผ๋กœ ์ •๋ฆฌํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

Asyrp_12

Fig. 442 Quality comparison based on the presence of quality boosting#

์ฒ˜์Œ๋ถ€ํ„ฐ tedit์‹œ์ ๊นŒ์ง€๋Š” Asyrp๋ฅผ ์ด์šฉํ•ด Editing Process๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. ์ด ํ›„ DDIM ๋ฐฉ์‹์„ ํ†ตํ•ด Denoising์„ ์ง„ํ–‰ํ•˜๋‹ค๊ฐ€ tboost์‹œ์ ๋ถ€ํ„ฐ ๋๋‚  ๋•Œ๊นŒ์ง€ stochastic noise๋ฅผ ์ฃผ์ž…ํ•˜๋Š” DDPM ๋ฐฉ์‹์„ ์ด์šฉํ•ด Quality boosting์„ ์ง„ํ–‰ํ•œ๋‹ค.

Asyrp_13

Fig. 443 Overview of Generative Process#

5. Experiments#

The authors use DDPM++ (Song et al., 2020b) (Meng et al., 2021) trained on CelebA-HQ (Karras et al., 2018) and LSUN-bedroom/-church (Yu et al., 2015); iDDPM (Nichol & Dhariwal, 2021) trained on AFHQ-dog (Choi et al., 2020); and ADM with P2-weighting (Dhariwal & Nichol, 2021) (Choi et al., 2022) trained on METFACES (Karras et al., 2020). All models use pretrained checkpoints and are kept frozen.

5.1 Versatility of h-space with Asyrp#

Asyrp_14

Fig. 444 Editing results of Asyrp on various datasets#

์œ„์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด, ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ attribute๋“ค์˜ ํŠน์„ฑ์„ ์ž˜ ๋ฐ˜์˜ํ•ด์„œ image๋ฅผ manipulateํ–ˆ๋‹ค๋Š” ์ ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์‹ฌ์ง€์–ด {department, factory, temple} attribute์€ training data์— ํฌํ•จ์ด ๋˜์–ด์žˆ์ง€ ์•Š์•˜์Œ์—๋„ ์„ฑ๋Šฅ์ด ์ž˜ ๋‚˜์˜จ ์ ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. model์„ fine tuningํ•˜์ง€ ์•Š๊ณ  inferenceํ•˜๋Š” ๊ณผ์ •์—์„œ h-space๋ฅผ ํ†ตํ•ด epsilon์„ controlํ•˜๊ณ  Asyrp๋ฅผ ์ด์šฉํ•ด ์„ฑ๋Šฅ์„ ๋ƒˆ๋‹ค๋Š” ์ ์ด ๊ฐ€์žฅ ํฐ ์žฅ์ ์ด๋‹ค.

5.2 Quantitative Comparison#

Asyrp model์˜ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค๋ฅธ model๋“ค๊ณผ ๋น„๊ตํ•˜๋Š” ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋Š”๋ฐ diffusion model ์ „์ฒด๋ฅผ fine-tuningํ•˜์—ฌ image์„ editingํ•˜๋Š” DiffsionCLIP model๊ณผ ๋น„๊ตํ•˜์˜€๋‹ค. Asyrp์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹์Œ์„ ํ™•์ธ ํ•  ์ˆ˜ ์žˆ๋‹ค.

Asyrp_15

Fig. 445 Asyrp vs DiffusionCLIP on both CelebA-HQ seen-domain attributes and unseen-domain attributes#

5.3 Analysis on h-space#

  1. Homogeneity

    Asyrp_16

    Fig. 446 Homogeneity of h-space#

    ์œ„์˜ ๊ทธ๋ฆผ์˜ (a)๋Š” Real image์— smiling attribute์„ ์ถ”๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ํ™”๋œ ฮ”ht์™€ ฮ”ฯตt๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๊ฐ™์€ ๊ฐ’์„ ๋‹ค๋ฅธ Real image์— ์ ์šฉ์‹œ์ผฐ์„ ๋•Œ์˜ ๊ฒฐ๊ณผ๋ฅผ (b)์— ๋‚˜ํƒ€๋‚ด์—ˆ๋Š”๋ฐ, ฮ”ht๋ฅผ ์ ์šฉํ•œ๊ฒฝ์šฐ smiling face๋กœ ์ž˜ ๋ฐ”๋€Œ๋Š” ๋ฐ˜๋ฉด, ฮ”ฯตt์„ ์ ์šฉํ•œ ๊ฒฝ์šฐ์—๋Š” image distortion์ด ๋ฐœ์ƒํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

  2. Linearity

    Asyrp_17

    Fig. 447 Linearity of h-space - Linear Scaling#

    ฮ”h๋ฅผ linearly scaling์„ ํ•˜๋Š” ๊ฒƒ์€ editing์„ ํ•˜๋Š”๋ฐ์— ์žˆ์–ด visual attribute change์˜ ์–‘์— ๋ฐ˜์˜๋œ๋‹ค. ์ฆ‰, ฮ”h๋ฅผ ร—1, ร—2, ร—3๋ฐฐ /dots ํ•จ์— ๋”ฐ๋ผ result image์—์„œ ๋ฐ˜์˜๋˜๋Š” attribute๋˜ํ•œ ์ด์— ๋งž๊ฒŒ ๋ณ€ํ™”ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์œ„์˜ ๊ทธ๋ฆผ์—์„œ ํ‘œํ˜„๋˜์–ด ์žˆ๋“ฏ์ด negative scaling์— ๋Œ€ํ•ด์„œ๋Š” training์„ ํ•˜์ง€ ์•Š์•˜์Œ์—๋„ ์ž˜ ์ ์šฉ ๋œ๋‹ค๋Š” ์ ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

    Asyrp_17

    Fig. 448 Linearity of h-space - Linear Combination#

    ์„œ๋กœ ๋‹ค๋ฅธ attributes์— ๋Œ€ํ•œ ฮ”h๋ฅผ ํ•ฉ์ณ์„œ ๋ถ€์—ฌ๋ฅผ ํ–ˆ์„ ๊ฒฝ์šฐ์—๋„ ๊ฐ๊ฐ์˜ attribute๋“ค์ด image์— ์ž˜ ๋ฐ˜์˜์ด ๋œ๋‹ค๋Š” ์ ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

  3. Robustness

    Asyrp_17

    Fig. 449 Robustness of h-space#

    ์œ„์˜ ๊ทธ๋ฆผ์€ h-space์™€ ฯตโˆ’space์—์„œ random noise๋ฅผ ์ฃผ์ž…ํ–ˆ์„ ๋•Œ์˜ ๊ฒฐ๊ณผ๋ฅผ ๋น„๊ตํ•œ ๊ฒƒ์ด๋‹ค. h-space์˜ ๊ฒฝ์šฐ์—๋Š” random noise๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ์–ด๋„ image์— ํฐ ๋ณ€ํ™”๊ฐ€ ์—†์œผ๋ฉฐ ๋งŽ์€ noise๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ์„ ๊ฒฝ์šฐ์—๋„ image distortion์€ ๊ฑฐ์˜ ์—†๊ณ  semantic change๋งŒ ๋ฐœ์ƒํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ฯตโˆ’space์˜ ๊ฒฝ์šฐ์—๋Š” random noise๊ฐ€ ์ถ”๊ฐ€๋œ ๊ฒฝ์šฐ image distortion์ด ์‹ฌํ•˜๊ฒŒ ๋ฐœ์ƒํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด h-space๊ฐ€ ์–ผ๋งˆ๋‚˜ robustnessํ•œ์ง€ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

  4. Consistency across time steps

    Asyrp_17

    Fig. 450 Consistency across times steps of h-space#

    h-space์˜ homogeneousํ•œ ์„ฑ์งˆ์„ ํ†ตํ•ด ๊ฐ™์€ attribute์— ๋Œ€ํ•œ ฮ”h๋ฅผ ๋‹ค๋ฅธ image์— ์ ์šฉ์‹œ์ผฐ์„ ๋•Œ์—๋„ ์ž˜ ๋ฐ˜์˜์ด ๋Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์ €์ž๋“ค์€ ฮ”ht๋“ค์— ๋Œ€ํ•œ ํ‰๊ท ์ธ ฮ”htmean์„ ์ ์šฉ์‹œ์ผฐ์„ ๊ฒฝ์šฐ์—๋„ result๊ฐ€ ๊ฑฐ์˜ ๋น„์Šทํ•จ์„ ๋ณด์ธ๋‹ค. Chapter4์—์„œ ์ œ์‹œํ•œ Generative Process๋ฅผ ๋น„์ถ”์–ด ๋ณด์•˜์„ ๋•Œ, ฮ”ht๋Š” Editing Process์—์„œ๋งŒ ์ ์šฉ์„ ์‹œํ‚จ๋‹ค. ์ด ๊ฒฝ์šฐ, ์ ์šฉํ•˜๋Š” ฮ”ht๋ฅผ ฮ”htglobal์ด๋ผ๊ณ  ์นญํ•˜๋ฉฐ, ์ ์šฉํ•˜๋Š” ฮ”ht๊ฐ€ interval๋™์•ˆ ๊ฐ™์€ ํฌ๊ธฐ ๋งŒํผ ์ ์šฉ๋œ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๊ฒฝ์šฐ, ฮ”hglobal=1Teโˆ‘t ฮ”htmean์ด๋ผ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋‹ค. ์ด ๊ฒฝ์šฐ์—๋„ ๊ฒฐ๊ณผ๋Š” ๋น„์Šทํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ฒฐ๊ตญ ์›ํ•˜๋Š” attribute์— ๋Œ€ํ•ด ์ฃผ์ž…ํ•ด์•ผ ํ•  ฮ”h์–‘๋งŒ ๊ฐ™๋‹ค๋ฉด, ์›ํ•˜๋Š” editing ํšจ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค. ๋น„๋ก ์ด ๋…ผ๋ฌธ์—์„œ๋Š” best quality manipulation์„ ์œ„ํ•ด ฮ”ht๋ฅผ ์‚ฌ์šฉํ•˜์˜€์ง€๋งŒ, ฮ”htmean๊ณผ ฮ”hglobal์— ๋Œ€ํ•ด ๋” ์—ฐ๊ตฌ๋ฅผ ํ•ด ๋ณผ ์—ฌ์ง€๊ฐ€ ์žˆ๋‹ค๊ณ  ํŒ๋‹จํ•œ๋‹ค.

6. Conclusion#

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Pretrained Diffusion models์—์„œ latent semantic space์ธ h-space๋ฅผ ๋ฐœ๊ฒฌํ–ˆ๊ณ  h-space์—์„œ์˜ Asyrp(Asymmetric Reverse Process)์™€ ์ƒˆ๋กญ๊ฒŒ ์ œ์•ˆํ•œ Reverse Process ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ์„ฑ๊ณต์ ์ธ image editing์„ ๊ฐ€๋Šฅ์ผ€ ํ•˜์˜€๋‹ค. Diffusion model์—์„œ์˜ semanticํ•œ latent space์— ๋Œ€ํ•œ ์ฒซ ์ œ์•ˆ์„ ํ•œ ๋…ผ๋ฌธ์ด๋‹ค. h-space๋Š” GAN์˜ latent space์™€ ์œ ์‚ฌํ•œ ํŠน์„ฑ์„ ๊ฐ–์ถ”๊ณ  ์žˆ๋‹ค. ๋Œ€ํ‘œ์ ์ธ h-space์˜ ํŠน์„ฑ์œผ๋กœ๋Š” Homogeneity, Linearity, Robustness, Consistency across timesteps์ด ์žˆ๋‹ค.