Stable Diffusion
Framework: Text-to-image Generator
Input text: "A cat in the snow"
1. Text Encoder
2. Generation Model: produces an "intermediate product", a compressed version of the image
3. Decoder: turns the compressed version into the final image
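The three-component framework can be sketched as a composition of three functions. This is a minimal illustration only: `text_to_image` and the `toy_*` stand-ins are hypothetical names invented here, and the toy bodies do no real encoding, generation, or decoding. They only show how data flows from component 1 to component 3.

```python
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]


def text_to_image(
    prompt: str,
    text_encoder: Callable[[str], Vector],
    generation_model: Callable[[Vector], Vector],
    decoder: Callable[[Vector], Image],
) -> Image:
    """Chain the three framework components in order."""
    text_embedding = text_encoder(prompt)          # 1. Text Encoder
    compressed = generation_model(text_embedding)  # 2. Generation Model ("intermediate product")
    return decoder(compressed)                     # 3. Decoder


# Hypothetical toy stand-ins, for illustration only.
def toy_text_encoder(prompt: str) -> Vector:
    # One number per word (scaled word length); a real encoder returns learned embeddings.
    return [len(word) / 10.0 for word in prompt.split()]


def toy_generation_model(text_embedding: Vector) -> Vector:
    # Returns a fixed-size "compressed image" of 4 values derived from the text embedding.
    mean = sum(text_embedding) / len(text_embedding)
    return [mean, mean / 2, mean / 4, mean / 8]


def toy_decoder(compressed: Vector) -> Image:
    # Expands each compressed value into a row of 4 pixels, giving a 4x4 "image".
    return [[value] * 4 for value in compressed]


image = text_to_image(
    "A cat in the snow", toy_text_encoder, toy_generation_model, toy_decoder
)
```

Because each component is passed in as a function, any one of them can be swapped out without touching the other two, which is the point of describing the systems on the following slides with the same three numbers.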
Stable Diffusion https://arxiv.org/abs/2112.10752
[Figure: architecture diagram from the paper, with the three framework components marked 1, 2, 3. Labelled parts: Pixel Space, Latent Space, Diffusion Process, Denoising U-Net, Conditioning (Semantic Map, Text, Representations, Images). Legend: denoising step, cross-attention, switch, skip connection, concat.]
DALL-E series https://arxiv.org/abs/2102.12092 https://arxiv.org/abs/2204.06125
[Figure: diagram with the three framework components marked 1, 2, 3. Labelled parts: CLIP objective, text encoder, img encoder, prior, decoder. Example caption: "a corgi playing a flame throwing trumpet". The generation model is labelled Autoregressive or Diffusion.]
Imagen https://imagen.research.google/ https://arxiv.org/abs/2205.11487
[Figure: Imagen pipeline, with the three framework components marked 1, 2, 3.]
Text: "A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck."
1. Frozen Text Encoder: produces the Text Embedding
2. Text-to-Image Diffusion Model: produces a 64×64 image
3. Super-Resolution Diffusion Model: 64×64 to 256×256 image, then a second Super-Resolution Diffusion Model: 256×256 to 1024×1024 image
Framework: Text-to-image Generator
Input text: "A cat in the snow"
1. Text Encoder
2. Generation Model
3. Decoder
[Figure from the Imagen paper, https://arxiv.org/abs/2205.11487: two plots of FID-10K (vertical axis) against CLIP Score (horizontal axis). (a) Impact of encoder size: T5-Small, T5-Large, T5-XL, T5-XXL. (b) Impact of U-Net size: 300M, 500M, 1B, 2B.]
Fréchet Inception Distance (FID) https://arxiv.org/abs/1706.08500
[Figure: images are fed to a CNN (with a softmax output); red points: real images, blue points: generated images.]
FID = Fréchet distance between the two Gaussians
Smaller is better.
A lot of samples are needed.
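For reference, the Fréchet distance between two Gaussians has a closed form. The notation below is an assumption of this note, not the slide's: $\mu_r, \Sigma_r$ are the mean and covariance of the CNN features of the real images, and $\mu_g, \Sigma_g$ are those of the generated images.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Both terms vanish when the two Gaussians coincide, which is why smaller is better. Estimating a full covariance matrix over high-dimensional features is also what makes a large number of samples necessary.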
Contrastive Language-Image Pre-Training (CLIP) https://arxiv.org/abs/2103.00020
Trained on 400 million image-text pairs.
Text ("A cat in the snow") goes through the Text Encoder, the image through the Image Encoder.
Matching text and image: the two outputs should be close.
Non-matching text and image (e.g. "A dog is running." with an image it does not describe): the two outputs should be far.
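The "close / far" idea can be sketched as a symmetric contrastive loss over a batch of text and image embeddings, where the i-th text and the i-th image form a matching pair. This is a simplified pure-Python illustration under stated assumptions, not the CLIP training code: the function names are hypothetical, the embeddings are hand-written vectors rather than encoder outputs, and the temperature of 0.07 is fixed here for simplicity.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(text_embs: List[Vector], image_embs: List[Vector]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity between text i and image j."""
    texts = [normalize(v) for v in text_embs]
    images = [normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, im)) for im in images] for t in texts]


def cross_entropy(logits: List[float], target: int) -> float:
    """Softmax cross-entropy for one row, computed stably via log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def contrastive_loss(text_embs: List[Vector], image_embs: List[Vector],
                     temperature: float = 0.07) -> float:
    """Average of the text-to-image and image-to-text losses; pair i matches pair i."""
    sims = similarity_matrix(text_embs, image_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    image_to_text = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return 0.5 * (text_to_image + image_to_text)


# Hypothetical 2-D embeddings: text 0 ~ "A cat in the snow", text 1 ~ "A dog is running."
texts = [[1.0, 0.0], [0.0, 1.0]]
matched_images = [[0.9, 0.1], [0.1, 0.9]]     # image i agrees with text i
mismatched_images = [[0.1, 0.9], [0.9, 0.1]]  # images swapped

loss_matched = contrastive_loss(texts, matched_images)
loss_mismatched = contrastive_loss(texts, mismatched_images)
```

Minimizing this loss raises the similarity of matching pairs (the diagonal of the matrix) and lowers it for all other combinations, so `loss_matched` comes out smaller than `loss_mismatched`.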
Framework: Text-to-image Generator
Input text: "A cat in the snow"
1. Text Encoder
2. Generation Model
3. Decoder
The Decoder can be trained without labelled data.
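One way to see why no labels are needed: if the intermediate product is assumed to be a small, low-resolution image, training pairs for the Decoder can be built from images alone, by shrinking each image and asking the Decoder to restore the original. The sketch below only builds such pairs. The function names are hypothetical, and block averaging is just one possible way to downsample.

```python
from typing import List, Tuple

Image = List[List[float]]


def downsample(image: Image, factor: int = 2) -> Image:
    """Shrink an image by averaging each factor x factor block of pixels."""
    height, width = len(image), len(image[0])
    small = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def make_decoder_training_pairs(images: List[Image], factor: int = 2) -> List[Tuple[Image, Image]]:
    """Each pair is (decoder input, decoder target); no captions or labels are used."""
    return [(downsample(img, factor), img) for img in images]


unlabelled = [[
    [0.0, 2.0, 4.0, 6.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
]]
pairs = make_decoder_training_pairs(unlabelled)
```

Here the 4×4 image yields a 2×2 input, and the original image itself serves as the training target.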