
AnimateDiff#

📌 Significance of the Paper
In this paper, we present AnimateDiff, a practical framework for animating personalized T2I models without requiring model-specific tuning.
AnimateDiff = public personalized T2I models + domain adapter & plug-and-play Motion Module + MotionLoRA

0. Abstract#

Advances in T2I diffusion models and personalization techniques such as DreamBooth and LoRA have made it possible to obtain high-quality images of desired content at a reasonable cost. However, adding motion to an existing personalized T2I model, i.e., making it generate animations, remains difficult. This paper proposes a practical framework that adds motion to existing high-quality image generation models without additional model-specific tuning. The core of the framework is a plug-and-play motion module: once trained, it can be merged with any image generation model. With the proposed training strategy, the motion module effectively learns motion priors from real-world videos. Once trained, the motion module can be attached to an image generation model to turn it into an animation generation model. The paper also proposes MotionLoRA, a lightweight fine-tuning method for AnimateDiff that lets the pre-trained motion module learn new motion patterns (e.g., shot types) at low cost. AnimateDiff and MotionLoRA were evaluated by attaching them to publicly available image generation models, showing that the approach produces natural animation clips while preserving image quality and motion diversity.

inference_pipeline

Fig. 601 inference pipeline#

  • Core Framework

    • public T2I models

      • personalized T2Is from the same base T2I (SD1.5)

        • can download finetuned T2I from civitai or hugging face

    • domain adapter

      • A LoRA-based domain adapter is added to the base T2I model to reduce the domain gap that can arise when training on a video dataset.

      • Here, the domain gap refers to motion blur, compression artifacts, watermarks, and the like that appear when each video frame is viewed as an individual image.

    • training strategy of a plug-and-play motion module

      • learns transferable motion priors from real-world videos through the proposed training strategy

      • Once trained, it can be combined with other T2I models and used as an animation generator.

    • MotionLoRA

      • adapt the pre-trained motion module to specific motion patterns

1. Introduction#

Advances in diffusion models that generate images from text prompts (T2I diffusion models) have made it much easier for artists and amateurs to create visual content. To stimulate the creativity of existing T2I models, lightweight personalization methods such as DreamBooth and LoRA have been proposed. These methods enable customized fine-tuning with a small dataset on modest hardware, so users can adapt a base T2I model to a new domain or improve its visual quality at low cost. As a result, AI artists and amateur communities have published a large number of personalized models on platforms such as Civitai and Hugging Face. Although these models can generate images of remarkably good quality, they are limited to producing static images. Animation generation, on the other hand, is in greater demand in industries such as film and cartoons. This work aims to convert high-quality personalized T2I models directly into animation generators without fine-tuning, since the data collection and computing resources required for fine-tuning are a stumbling block for amateur users.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” AnimateDiff๋ฅผ ์ œ์•ˆํ•˜๋Š”๋ฐ ์ด๋Š” personalized T2I model์˜ ๋Šฅ๋ ฅ์„ ๋ณด์ „ํ•˜๋ฉด์„œ ์• ๋‹ˆ๋ฉ”์ด์…˜์„ ์ƒ์„ฑํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ํšจ๊ณผ์ ์ธ ํŒŒ์ดํ”„๋ผ์ธ์ด๋‹ค. AnimateDiff์˜ ํ•ต์‹ฌ์€ ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ์…‹(WebVid-10M)์œผ๋กœ๋ถ€ํ„ฐ ํƒ€๋‹นํ•œ motion ์ •๋ณด๋ฅผ plug-and-play motion module์ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด๋‹ค. motion module์˜ ํ•™์Šต์€ ์„ธ๊ฐ€์ง€ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.

  1. Fine-tune the domain adapter

    This module absorbs the visual distribution of the target video dataset (the image-quality gap, video watermarks, compression artifacts), so that the motion-related modules trained afterwards can concentrate on motion only.

  2. Train the new motion module

    A motion module for motion modeling is added to the base T2I model, inflated to accept video input, together with the domain adapter. While this module is trained, the domain adapter and the base model are frozen. This way the motion module learns motion in a general, transferable manner, enabling module-wise training (to get a different painting style, simply swap the base T2I + domain adapter).

  3. (optional) Train MotionLoRA

    MotionLoRA์˜ ๊ฒฝ์šฐ ํŠน์ • motion์„ ์ ์€ ์ˆ˜์˜ reference videos์™€ ํ•™์ŠตํšŸ์ˆ˜๋กœ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœํ•˜๋Š” ๋ชจ๋“ˆ์ด๋‹ค. ์ด๋ฆ„๊ณผ ๊ฐ™์ด Low-Rank Adaptation (LoRA) (Hu et al., 2021)๋ฅผ ์ด์šฉํ•˜๋Š”๋ฐ ์ƒˆ๋กœ์šด motion pattern์„ ์ ์€์ˆ˜(50๊ฐœ)์˜ reference video๋งŒ์œผ๋กœ ํ•™์Šต์‹œํ‚ฌ์ˆ˜ ์žˆ๋‹ค. ๋˜ํ•œ ์ฐจ์ง€ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ๋„ ์ ์–ด ์ถ”๊ฐ€ํ•™์Šต์ด๋‚˜ ๋ชจ๋ธ์„ ๊ณต์œ ,๋ฐฐํฌํ•˜๋Š”๋ฐ์—๋„ ์œ ๋ฆฌํ•˜๋‹ค.

training_pipeline

Fig. 602 training pipeline#

3. Preliminary#

3.1 Stable Diffusion#

Stable Diffusion (Rombach et al., 2022), the base T2I model used in our work

  • open-sourced, well-developed community, many high-quality personalized T2I models for eval

  • Performs the diffusion process in latent space, using a pre-trained encoder \(\mathcal E\) and decoder \(\mathcal D\)

  • ์ธ์ฝ”๋”ฉ๋œ ์ด๋ฏธ์ง€ \(z_0=\mathcal E(x_0)\) ์˜ ๊ฒฝ์šฐ ์•„๋ž˜์˜ forward diffusion ๊ณผ์ •์„ ํ†ตํ•ด \(z_t\) ๋ณ€ํ™˜๋จ

  • Forward diffusion for \(t=1,2,โ€ฆ,T\)

    \[ z_t=\sqrt{\bar \alpha_t}z_0+\sqrt{1-\bar\alpha_t}\epsilon,\quad \epsilon \sim \mathcal N(0,I) \tag{1} \]
    • pre-defined \(\bar\alpha_t\) determines the noise strength at step \(t\)

    • The denoising network \(\epsilon_\theta(\cdot)\) learns to reverse this process by predicting the added noise, encouraged by an MSE loss

  • MSE loss

    \[ \mathcal L=\Bbb E_{\mathcal E(x_0),y,\epsilon \sim \mathcal N(0,I),t}\big [\| \epsilon-\epsilon_\theta(z_t,t,\tau_\theta(y))\|_2^2\big] \tag{2} \]
    • \(y\) is the text prompt corresponding to \(x_0\)

    • \(\tau_\theta(\cdot)\) is a text encoder mapping the prompt to a vector sequence.

    • In SD, \(\epsilon_\theta(\cdot)\) is implemented as a UNet (four down blocks, one middle block, and four up blocks; ResNet layers, spatial self-attention, and cross-attention)
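
As a concrete illustration of the forward process in Eq. (1), here is a minimal NumPy sketch; the linear beta schedule and all tensor sizes are illustrative assumptions, not taken from the SD implementation.

```python
import numpy as np

# Sketch of Eq. (1): z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps.
# The linear beta schedule below is an assumption for illustration.

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)  # alpha_bar_t, decreasing in t

def forward_diffusion(z0, t, alpha_bar, rng):
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 64, 64))  # a toy latent code
zt, eps = forward_diffusion(z0, 500, alpha_bar, rng)
```

The MSE loss in Eq. (2) then trains a network to recover `eps` from `zt`, `t`, and the prompt embedding.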

3.2 Low-Rank Adaptation (LoRA)#

Low-Rank Adaptation (LoRA) (Hu et al., 2021), which helps understand the domain adapter (Sec. 4.1) and MotionLoRA (Sec. 4.3) in AnimateDiff

  • LoRA first appeared in language modeling, proposed as a way to fine-tune huge models quickly.

  • Instead of fine-tuning all of a model's parameters, LoRA adds pairs of rank-decomposition matrices and optimizes only the newly added weights.

  • Keeping the original parameters frozen also prevents the catastrophic forgetting (Kirkpatrick et al., 2017) that can occur during fine-tuning.

  • The new model weight with LoRA

    \[ \mathcal W'=\mathcal W+\Delta\mathcal W=\mathcal W+AB^T \tag{3} \]
    • \(A \in \Bbb R^{m\times r}\), \(B \in \Bbb R^{n\times r}\) are a pair of rank-decomposition matrices; the hyper-parameter \(r\) is referred to as the rank of the LoRA layers

    • LoRA is not restricted to attention layers, but in practice it is mainly applied there. It saves both cost and storage during fine-tuning.
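
Eq. (3) can be sketched in a few lines of NumPy; the dimensions and the zero initialization of \(B\) (so the adapted model starts out identical to the original) are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq. (3): W' = W + A B^T with rank r << min(m, n).
# Shapes are toy assumptions; only A and B would be trained.

m, n, r = 64, 32, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))         # frozen pre-trained weight
A = rng.standard_normal((m, r)) * 0.01  # trainable
B = np.zeros((n, r))                    # zero-init: W' == W before training

W_prime = W + A @ B.T
```

Because only `A` and `B` (with \(r(m+n)\) entries) are stored and updated, fine-tuning cost and checkpoint size shrink dramatically compared with the full \(m\times n\) weight.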

4. AnimateDiff#

➕ Architecture Overview
The core of the proposed model is to learn a transferable motion prior from video data. The trained motion module can then be applied directly to personalized T2I models.
In the figure on the left, the sky-blue model is the motion module and the green region is the optional MotionLoRA. AnimateDiff can be inserted into a T2I model to use it as an animation generator.
For this, AnimateDiff has three modules that need to be trained.

  • domain adapter - reduces the gap between the base T2I pre-training data and our video training data; used only during training.

  • motion module
    - the module that learns the motion prior

  • MotionLoRA(optional)
    - adapts the pre-trained motion module to new motion patterns (e.g., camera work)

inference_pipeline

Fig. 606 inference pipeline#

➕ Training Steps
Each module is trained separately; while one module is being trained, the rest of the network is frozen. The objective function used for training is almost the same as that of SD.

  • Training step 1. Domain Adapter

  • Training step 2. Motion Module

  • Training step 3. Optional MotionLoRA

training_pipeline

Fig. 607 training pipeline#

4.1 Alleviate Negative Effects from Training Data with Domain Adapter#

Video datasets are harder to collect than image datasets. Comparing the video dataset WebVid (Bain et al., 2021) with the image dataset LAION-Aesthetic (Schuhmann et al., 2022) also reveals a large quality gap.

๊ฐ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„์„ ๊ฐœ๋ณ„ ์ด๋ฏธ์ง€๋กœ ๋‹ค๋ฃจ๊ฒŒ ๋˜๋ฉด motion blur, compression artifacts, watermark๋“ฑ์„ ํฌํ•จํ•˜๊ณ  ์žˆ์„ ์ˆ˜๋„ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ T2I ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•  ๋•Œ ์‚ฌ์šฉํ•œ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์…‹์— ๋น„ํ•ด motion prior๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ ๋™์˜์ƒ ๋ฐ์ดํ„ฐ ์…‹์˜ ํ’ˆ์งˆ์€ ๋ฌด์‹œํ•  ์ˆ˜ ์—†์„ ๋งŒํผ์˜ ์ฐจ์ด๊ฐ€ ์žˆ๋‹ค. ์ด ๋•Œ๋ฌธ์— ์ง์ ‘์ ์œผ๋กœ ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•˜์—ฌ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ ๋ชจ๋ธ์„ ํ•™์Šตํ•  ๊ฒฝ์šฐ, ์ƒ์„ฑํ•œ ์• ๋‹ˆ๋ฉ”์ด์…˜์˜ ํ’ˆ์งˆ์ด ์ œํ•œ ๋  ์ˆ˜ ์žˆ๋‹ค.

๋™์˜์ƒ ๋ฐ์ดํ„ฐ์˜ ๋‚ฎ์€ ํ’ˆ์งˆ๋กœ ์ธํ•ด ํ•ด๋‹น ํŠน์„ฑ์„ motion module์ด ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ํ”ผํ•˜๊ณ  base T2I์˜ ์ง€์‹์„ ๋ณด์ „ํ•˜๊ธฐ ์œ„ํ•ด, ๋„คํŠธ์›Œํฌ๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ๊ฐ ๋„๋ฉ”์ธ(์˜์ƒ/์ด๋ฏธ์ง€)์˜ ์ •๋ณด์— ๋งž๊ฒŒ ํ”ผํŒ…ํ•˜๋Š” ๋ฐฉ์‹(domain adapter)์„ ์ œ์•ˆํ•œ๋‹ค. inference ์‹œ์—๋Š” domain adapter๋ฅผ ์ œ๊ฑฐํ•˜์˜€์œผ๋ฉฐ ์•ž์„œ ์–ธ๊ธ‰ํ•œ domain gap์— ์˜ํ•œ ๋ถ€์ •์  ์˜ํ–ฅ์„ ์ œ๊ฑฐํ•˜๋Š”๋ฐ ํšจ๊ณผ์ ์ด๋ผ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. domain adapter layer๋Š” LoRA๋ฅผ ํ™œ์šฉํ–ˆ์œผ๋ฉฐ, self-, cross-attention layer๋“ค์„ base T2I model์— Fig. 3๊ณผ ๊ฐ™์ด ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. ์•„๋ž˜ query projection์„ ์˜ˆ๋กœ ์‚ดํŽด๋ณด๋ฉด,

lora

Fig. 608 LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS#

\[ Q=\mathcal W^Qz+\text{AdapterLayer}(z)=\mathcal W^Qz+\alpha \cdot AB^Tz \tag{4} \]

\(Q\) is the query, \(z\) is an internal feature, and \(\alpha\) is a scalar that adjusts the influence of the domain adapter at inference time (default 1; to remove the adapter's effect entirely, set \(\alpha\) to 0). With the rest of the model's parameters frozen, only the domain adapter's parameters are optimized on static frames randomly sampled from the video dataset, using the objective function in Eq. (2). (At this point the model is still an image generator.)
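
A minimal NumPy sketch of the query projection in Eq. (4), with toy dimensions as an assumption; setting \(\alpha=0\) recovers the original projection, exactly as described above.

```python
import numpy as np

# Sketch of Eq. (4): Q = W^Q z + alpha * A B^T z.
# Dimensions d (feature size) and r (LoRA rank) are toy assumptions.

d, r = 8, 2
rng = np.random.default_rng(1)
W_q = rng.standard_normal((d, d))  # frozen base query projection
A = rng.standard_normal((d, r))    # domain adapter (LoRA) factors
B = rng.standard_normal((d, r))
z = rng.standard_normal(d)         # an internal feature

def query(z, alpha):
    # alpha scales the adapter's contribution at inference time
    return W_q @ z + alpha * (A @ B.T @ z)
```

With `alpha=0.0` the adapter contributes nothing, which mirrors how its effect is removed at inference.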

4.2 Learn Motion Priors with Motion Module#

Two steps are needed to model motion dynamics along a temporal axis in dimensions shared with the pre-trained T2I model.

  1. The 2D diffusion model must be inflated to handle 3D video data. (Network Inflation)

  2. A sub-module is needed to create an efficient flow of information along the temporal axis. (Sub-module Design)

Network Inflation

์‚ฌ์ „ํ•™์Šต๋œ T2I ๋ชจ๋ธ์˜ ์ด๋ฏธ์ง€ ๋ ˆ์ด์–ด๋Š” ๊ณ ํ’ˆ์งˆ์˜ ๊ทธ๋ฆผ ์‚ฌ์ „์ง€์‹(content prior)์„ ํฌ์ฐฉํ• ์ˆ˜ ์žˆ๋‹ค. ์ด ์ง€์‹์„ ํ™œ์šฉ(์œ ์ง€)ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋™์ผ ๋ชจ๋ธ๋กœ video๋ฅผ ๋‹ค๋ฃจ๊ณ ์ž ํ•  ๋•Œ๋Š” ๊ธฐ์กด ์ด๋ฏธ์ง€ ๋ ˆ์ด์–ด๋Š” ๋…๋ฆฝ์ ์œผ๋กœ ๋‚ด๋ฒ„๋ ค๋‘๊ณ , network๋ฅผ ํ™•์žฅ์‹œํ‚ค๋Š” ๋ฐฉํ–ฅ์ด ์„ ํ˜ธ๋œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๊ธฐ์กด ์—ฐ๊ตฌ (Ho et al., 2022b; Wu et al., 2023; Blattmann et al., 2023)๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ, 5d tensor \(x\in \Bbb R^{b\times c \times f\times h\times w}\) ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๋„๋ก ๋ชจ๋ธ์„ ์ˆ˜์ •ํ–ˆ๋‹ค. \(b\)๋Š” batch, \(f\)๋Š” frame์„ ๋œปํ•œ๋‹ค. ๋‚ด๋ถ€ feature map์ด ์ด๋ฏธ์ง€ ๋ ˆ์ด์–ด๋ฅผ ์ง€๋‚˜๊ฐˆ๋•Œ๋Š” ์‹œ๊ฐ„ ์ถ•์„ ์˜๋ฏธํ•˜๋Š” \(f\)๋Š” \(b\)์ถ•์œผ๋กœ reshaping์„ ํ†ตํ•ด ๋ฌด์‹œํ•œ๋‹ค.

(5D tensor → 4D tensor \(x \in \Bbb R^{bf\times c \times h\times w}\) → (existing image layer) → 4D tensor → 5D tensor)

This allows each frame to be processed independently, like an individual image. The newly added motion module, by contrast, ignores the spatial axes (\(h,w\)) by reshaping them away. (5D tensor → 3D tensor \(x \in \Bbb R^{bhw\times c \times f}\) → (motion module) → 3D tensor → 5D tensor)
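
The two reshaping patterns can be sketched with NumPy as follows (the tensor sizes are toy assumptions):

```python
import numpy as np

# Image layers fold the frame axis f into the batch axis b;
# the motion module folds the spatial axes h, w into the batch axis instead.

b, c, f, h, w = 2, 4, 16, 8, 8
x = np.zeros((b, c, f, h, w))  # toy 5D video feature map

# (b, c, f, h, w) -> (b*f, c, h, w): each frame processed as an image
x_img = x.transpose(0, 2, 1, 3, 4).reshape(b * f, c, h, w)

# (b, c, f, h, w) -> (b*h*w, c, f): each spatial location processed over time
x_mot = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, c, f)
```

The inverse transposes and reshapes restore the 5D layout after each layer, so image layers and motion modules can be interleaved freely.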

Module Design

Recent video generation studies have explored various approaches to temporal modeling. AnimateDiff designs its motion module by adopting the Transformer architecture with small modifications so that it operates along the temporal axis (hereafter, the temporal Transformer). Experiments showed that this architecture is well suited to modeling motion priors. As Fig. 3 shows, the temporal Transformer consists of several self-attention blocks operating along the temporal axis, and sinusoidal position encodings are used to represent the temporal position of each frame in the animation. As mentioned above, the input size of the motion module is adjusted by reshaping the feature map (\(x \in \Bbb R^{bhw\times c \times f}\)). Unrolled along the temporal axis, the feature map can be treated as a vector sequence of length \(f\): \(z_1, ...,z_f;\; z_i \in \Bbb R^{(b\times h\times w)\times c}\). Passing these vectors through a self-attention block gives:

\[ z_{\text{out}}=\text{Attention}(Q,K,V)=\text{Softmax}(QK^T/\sqrt{c})\cdot V \tag{5} \]

where \(Q=W^Qz, K=W^Kz, V=W^Vz\) are three separate projections. Through the attention mechanism, information extracted from other frames can be incorporated when generating the current frame. As a result, rather than generating each frame independently, AnimateDiff, the T2I model inflated with the motion module, learns to capture how visual content changes over time and uses these motion dynamics to produce an animation clip. The sinusoidal position encoding before the self-attention block must not be forgotten: the motion module itself has no notion of frame order.

์ถ”๊ฐ€์ ์ธ ๋ชจ๋“ˆ์„ ๋„ฃ์Œ์œผ๋กœ ์ธํ•ด ๋ฐœ์ƒํ• ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ๋“ค์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด temporal Transformer์˜ ๋ ˆ์ด์–ด์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” 0์œผ๋กœ ์ดˆ๊ธฐํ™” ํ•˜์˜€์œผ๋ฉฐ residual connection์„ ์ถ”๊ฐ€ํ•˜์—ฌ ํ›ˆ๋ จ ์‹œ์ž‘์‹œ์— motion module์ด identity mapping์œผ๋กœ ๋™์ž‘ํ•˜๋„๋ก ํ–ˆ๋‹ค.

4.3 Adapt to New Motion Patterns with MotionLoRA#

์ „๋ฐ˜์ ์ธ motion ์ง€์‹์„ motion module์ด ์‚ฌ์ „ํ•™์Šตํ•˜๋”๋ผ๋„ ์ƒˆ๋กœ์šด ๋™์ž‘ ํŒจํ„ด์— ๋Œ€ํ•œ ์ ์šฉ์— ๋Œ€ํ•œ ๋ฌธ์ œ๋Š” ๋ฐœ์ƒํ•œ๋‹ค. ex. zooming, panning, rolling.

๋†’์€ ์‚ฌ์ „ํ•™์Šต์„ ์œ„ํ•œ ๋น„์šฉ์„ ๊ฐ๋‹นํ•  ์ˆ˜ ์—†์–ด motion module์„ ํŠน์ • ์•ก์…˜์— ๋งž์ถฐ ํŠœ๋‹ํ•˜๊ณ ์ž ํ•˜๋Š” ์‚ฌ์šฉ์ž๋ฅผ ์œ„ํ•ด ์ ์€ ์ฐธ๊ณ  ๋น„๋””์˜ค(reference video)๋‚˜ ์ ์€ ํ›ˆ๋ จ ํšŸ์ˆ˜๋กœ๋„ ํšจ์œจ์ ์œผ๋กœ ๋ชจ๋ธ์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ์ด๋ฅผ ์œ„ํ•ด AnimateDiff์— MotionLoRA๋ฅผ ๋งˆ์ง€๋ง‰์œผ๋กœ ์ ์šฉํ–ˆ๋‹ค. Motion Module์˜ ๊ตฌ์กฐ์™€ ์ œํ•œ๋œ ์ฐธ๊ณ  ๋น„๋””์˜ค๋ฅผ ๊ณ ๋ คํ•˜์—ฌ, self-attn layers์— LoRA layers๋ฅผ inflated model์— ์ถ”๊ฐ€ํ•˜์—ฌ motion personalization์„ ์œ„ํ•œ ํšจ์œจ์ ์ธ ํŒŒ์ธํŠœ๋‹ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค.

๋ช‡ ์ข…์˜ ์ดฌ์˜ ๋ฐฉ์‹์œผ๋กœ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ rule-based augmentation์„ ํ†ตํ•ด reference videos๋ฅผ ์–ป์—ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด zooming ๋น„๋””์˜ค๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ์‹œ๊ฐ„์— ๋”ฐ๋ผ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„์„ ์ ์ฐจ ์ค„์ด๊ฑฐ๋‚˜(zoom-in) ๋Š˜๋ ค๊ฐ€๋ฉฐ(zoom-out) augmentation์„ ์ง„ํ–‰ํ–ˆ๋‹ค. AnimateDiff์˜ MotionLoRA๋Š” 20~50๊ฐœ ์ •๋„์˜ ์ ์€ ์ฐธ๊ณ  ๋น„๋””์˜ค, 2000๋ฒˆ์˜ ํ›ˆ๋ จํšŸ์ˆ˜๋กœ ํŒŒ์ธํŠœ๋‹ํ–ˆ์„๋•Œ๋„ ๊ดœ์ฐฎ์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์˜€๋‹ค. low-rank property๋กœ ์ธํ•ด MotionLoRA ๋˜ํ•œ composition capability๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ํ•™์Šต๋œ MotionLoRA ๋ชจ๋ธ ๊ฐ๊ฐ์ด inference time์ƒ์˜ motion effect๋ฅผ ์œตํ•ฉํ•˜๊ธฐ์œ„ํ•ด ํ˜‘๋ ฅ(combine)ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋งํ•œ๋‹ค.

4.4 AnimateDiff in Practice#

Training#

Fig. 3์„ ๋ณด๋ฉด AnimateDiff์—๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋ชจ๋“ˆ์ด 3๊ฐœ ์žˆ๋‹ค. ๊ฐ ๋ชจ๋“ˆ์˜ objective๋Š” ์•ฝ๊ฐ„์”ฉ ๋‹ค๋ฅด๋‹ค. domain adapter๋Š” SD์˜ MSE loss์ธ Eq. 2 objective function์„ ํ†ตํ•ด ํ•™์Šตํ•œ๋‹ค. ์• ๋‹ˆ๋ฉ”์ด์…˜์„ ๋งŒ๋“œ๋Š” ์—ญํ• ์„ ํ•˜๋Š” motion module๊ณผ motion LoRA์˜ ๊ฒฝ์šฐ video data์— ๋Œ€ํ•œ ์ฐจ์›์„ ๋” ๋งŽ์ด ์ˆ˜์šฉํ•˜๊ธฐ ์œ„ํ•ด ์•ฝ๊ฐ„ ์ˆ˜์ •๋œ objective๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. video data batch ( \(x_0^{1:f}\in \Bbb R^{b\times c \times f \times h \times w}\))๋Š” ์‚ฌ์ „ํ•™์Šต๋œ SD์˜ auto-encoder๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ ํ”„๋ ˆ์ž„ ๋ณ„๋กœ latent code \(z_0^{1:f}\)๋กœ ์ธ์ฝ”๋”ฉ๋œ๋‹ค. ์ด latent code๋Š” Eq. 1 ๊ณผ ๊ฐ™์ด ์ •์˜๋œ diffusion schedule์— ๋”ฐ๋ผ ๋…ธ์ด์ฆˆ๊ฐ€ ์ถ”๊ฐ€(forward process)๋œ๋‹ค.

\[ z_t^{1:f}=\sqrt{\bar \alpha_t}z_0^{1:f}+\sqrt{1-\bar\alpha_t}\epsilon^{1:f} \tag{6} \]

๋ชจ๋ธ์˜ ์ž…๋ ฅ์€ ๋…ธ์ด์ฆˆ๊ฐ€ ์ถ”๊ฐ€๋œ latent codes์™€ ์ด ์Œ์ด๋˜๋Š” text prompts์ด๋ฉฐ, ๋ชจ๋ธ์€ forward process์—์„œ ์ถ”๊ฐ€๋œ ๋…ธ์ด์ฆˆ๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค. AnimateDiff์˜ motion module์„ ์œ„ํ•œ ์ตœ์ข… training objective๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

\[ \mathcal L=\Bbb E_{\mathcal E(x_0^{1:f}),y,\epsilon^{1:f}\sim\mathcal N(0,I),t}\Big[\|\epsilon^{1:f}-\epsilon_\theta(z_t^{1:f},t,\tau_\theta(y))\|^2_2\Big] \tag{7} \]

๊ฐ ๋ชจ๋“ˆ๋“ค(domain adapter, motion module, MotionLoRA)์„ ํ•™์Šตํ• ๋•Œ, ํ•™์Šต ํƒ€๊ฒŸ์„ ์ œ์™ธํ•œ ์˜์—ญ์€ freeze ์‹œํ‚จ๋’ค ํ•™์Šตํ–ˆ๋‹ค.

Inference#

At inference time, the personalized T2I model is inflated as described above, and the motion module and the (optional) MotionLoRA are added to generate animations.

domain adapter์˜ ๊ฒฝ์šฐ inference์‹œ ๊ทธ๋ƒฅ ๋ฐฐ์ œํ•˜์ง€ ์•Š๊ณ  personalized T2I model์— injectionํ•˜์˜€์œผ๋ฉฐ domain adapter์˜ ์˜ํ–ฅ๋ ฅ์€ Eq. 4์˜ \(\alpha\)๋ฅผ ์ด์šฉํ•ด ์กฐ์ ˆํ–ˆ๋‹ค. Sec 5.3์˜ Ablation study์—์„œ \(\alpha\)์˜ ๊ฐ’์— ๋”ฐ๋ฅธ ๊ฒฐ๊ณผ์˜ ์ฐจ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ animation frames์€ reverse diffusion process์™€ ์ด๋ฅผ ํ†ตํ•ด ์–ป์€ latent codes๋ฅผ ๋””์ฝ”๋”ฉ ํ•จ์œผ๋กœ์จ ์–ป์„์ˆ˜ ์žˆ๋‹ค.

5. Experiments#

The experiments apply AnimateDiff to SD 1.5, and the motion module is trained on the WebVid-10M dataset. (See the supplementary material for details.)

5.1 Qualitative Results#

experiments_1

Fig. 609 qualitative results#

5.2 Quantitative Comparison#

experiments_2

Fig. 610 quantitative results#

  • User Study

    Individual rankings were collected for three metrics: text, domain, and smoothness. The Average User Ranking (AUR) is used as a preference metric in which a higher score indicates higher quality.

  • CLIP metric

    An evaluation metric that uses the CLIP model, trained jointly on image-text pairs as mentioned in related papers. A pre-trained CLIP model is used to compute CLIP scores between generated frames and references.

    +) The CLIP score is computed as the cosine similarity between vectors produced by the CLIP encoders.

    • Text

      • cosine similarity between each frame embedding and the given text embedding

    • Domain

      • since no ground-truth animation exists, the CLIP score is computed between the reference image and the generated frames.

    • Smooth

      • ์—ฐ์†๋œ ํ”„๋ ˆ์ž„ ์Œ์˜ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ์˜ ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„

5.3 Ablation Study#

Domain Adapter#

effect_of_domain_adapter

Fig. 611 Setting the scaler \(\alpha\) to 0 is equivalent to removing the effect of the domain adapter. The images above are the first frames of the animation clips generated by the model.#

When the effect of the domain adapter is removed, the overall image quality looks higher. This is because the domain adapter has learned characteristics of the video dataset such as watermarks and motion blur; in other words, it shows that the domain adapter helped the overall training process.

Motion module design#

AnimateDiff์˜ temporal Transformer๊ตฌ์กฐ์™€ ์ „์ฒด convolution์ธ ๊ตฌ์กฐ์˜ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ–ˆ๋‹ค. ๋‘ ๋ฐฉ์‹ ๋ชจ๋‘ ๋น„๋””์˜ค ์ƒ์„ฑ ๋ถ„์•ผ์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋œ๋‹ค.

temporal Transformer์˜ temporal attention๋ถ€๋ถ„์„ 1D temporal convolution์œผ๋กœ ๊ต์ฒดํ•˜์—ฌ ๋‘ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์œ ์‚ฌํ•˜๊ฒŒ ๋†“์—ฌ์žˆ์Œ์„ ํ™•์ธํ–ˆ๋‹ค. convolution motion module์€ ๋ชจ๋“  ํ”„๋ ˆ์ž„์„ ๋™์ผํ•˜๊ฒŒ ๋†“์•˜์ง€๋งŒ Transformer ๊ตฌ์กฐ์™€ ๋น„๊ตํ•˜์—ฌ ์›€์ง์ž„์„ ์ œ๋Œ€๋กœ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ–ˆ๋‹ค.

Efficiency of MotionLoRA#

MotionLoRA's efficiency was examined in terms of parameter efficiency and data efficiency, by training several MotionLoRA models while varying the number of parameters and the amount of data.

experiments-4

Fig. 612 Efficiency of MotionLoRA#

  • Parameter efficiency

    • Important both for efficient model training and for model distribution.

    • AnimateDiff can produce decent animations even with relatively few parameters. The experiment in the figure tests the ability to newly learn a zoom-in camera motion.

  • Data efficiency

    • Important for practical deployment, since reference videos for a specific motion pattern are hard to collect.

    • The target motion could still be learned with little data, but with extremely few reference videos (N=5) the quality of the generated animations dropped sharply.

5.4 Controllable Generation#

experiments_5

Fig. 613 Controllability of AnimateDiff#

By learning visual content and motion priors separately, AnimateDiff can control existing content. To verify this property, AnimateDiff was combined with ControlNet so that video generation could be controlled with depth.

Compared with recent video editing studies, which obtain refined latent sequences through DDIM inversion and use them for video generation, AnimateDiff generates animations from randomly sampled noise.

6. Conclusion#

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ์„ ์œ„ํ•œ practical pipeline์ธ AnimateDiff๋ฅผ ์ œ์•ˆํ•œ๋‹ค. AnimateDiff๋ฅผ ํ†ตํ•ด personalized text-to-image model์„ ๋ฐ”๋กœ ์• ๋‹ˆ๋ฉ”์ด์…˜ ์ƒ์„ฑ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์„ธ๊ฐ€์ง€ module์„ ๋””์ž์ธํ•˜์˜€์œผ๋ฉฐ ์ด๋ฅผ ํ†ตํ•ด AnimateDiff๋Š” motion prior๋ฅผ ํ•™์Šตํ•˜๊ณ , visual quality๋ฅผ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, MotionLoRA๋ฅผ ํ†ตํ•ด ๊ฐ€๋ฒผ์šด finetuning์„ ํ†ตํ•ด ์›ํ•˜๋Š” motion์œผ๋กœ ์• ๋‹ˆ๋ฉ”์ด์…˜์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค.

motion module์€ ํ•œ๋ฒˆ ํ•™์Šต๋˜๋ฉด ๋‹ค๋ฅธ ์ด๋ฏธ์ง€๋ฅผ animate์‹œํ‚ค๊ณ ์ž ํ• ๋•Œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‹ค์–‘ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ํ†ตํ•ด AnimateDiff์™€ MotionLoRA์˜ ํšจ์œจ์„ฑ๊ณผ ์ƒ์„ฑ๋Šฅ๋ ฅ์„ ๊ฒ€์ฆํ–ˆ๋‹ค. ๋˜ content-controllability์ธก๋ฉด์—์„œ๋„ ์ถ”๊ฐ€์ ์ธ ํ•™์Šต์—†์ด ๋ณธ ๋…ผ๋ฌธ์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์˜€๋‹ค.

As an efficient baseline that can animate images with a preferred painting style, character motion, and camera work, AnimateDiff holds great potential for a wide range of applications.

7. Hands-on#

Click the images below to view the GIFs.

hands_on_1

Fig. 614 side-view-photo-of-17-year-old-girl-in-a-japanese-school
An image generated with GPT was used as the input
#

hands_on_2

Fig. 615 side-view-photo-of-man-in-black-padded-jumper
A photo taken by the author was used as the input
The ethnicity of the person in the input photo was not preserved, presumably due to an imbalance in the training dataset
#

hands_on_3

Fig. 616 image-of-a-man-with-blonde-hair-and-blue-eyes
An image generated with GPT was used as the input
#

📌 Impressions after the hands-on

  • It is unclear whether WebVid-10M is a dataset well suited to animation.

  • It is a pity that a wider variety of metrics was not used for evaluation.

  • To generate a specific animation clip, practically only MotionLoRA needs to be trained, which makes the method convenient to use.

  • Reproduction is very easy.

  • The personalized T2I model is arguably the most important part of making good use of AnimateDiff, but finding a pretrained T2I model in the desired style is difficult. When the painting style does not match well, abrupt changes frequently appear at the beginning of the animation clip.