MACHINE LEARNING BERKELEY
Vision Transformers (ViTs)
By: ML@B Edu Team
Motivation
● Transformers work well for text → what happens if we use them on images?
● Transformers have some nice properties that could be useful for computer vision
○ ex. scalability, global receptive fields
Recall: Transformer Architecture
Start with text string
1. → text tokens
2. → text embedding vectors (via embedding dictionary)
3. → text/position embedding vectors
4. → stacked transformer layers (self-attention + normalization + residual connections + MLP blocks)
5. → CLS token
6. → attach classification head and do prediction, etc.
Commonly trained with a self-supervised objective (ex. next token prediction)
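Steps 1 to 3 can be sketched in a few lines of plain Python. This is only a toy illustration: the vocabulary, the embedding width `d_model`, the `max_len` limit, and the random initialization are all assumptions made up for the example, and prepending `[CLS]` follows the BERT-style convention rather than anything specific to this deck.

```python
import random

# Hypothetical toy vocabulary; real models learn much larger subword vocabularies.
vocab = {"[CLS]": 0, "transformers": 1, "work": 2, "well": 3, "[UNK]": 4}
d_model = 4   # embedding width, kept tiny for readability
max_len = 8   # longest sequence the position table can handle

random.seed(0)
# Embedding dictionary: one vector per token id (randomly initialized here).
token_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]
# Position embeddings: one vector per position in the sequence.
pos_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(max_len)]

def embed(text):
    # Step 1: text string -> token ids, with a [CLS] token prepended.
    tokens = ["[CLS]"] + text.lower().split()
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Steps 2-3: look up each token vector and add its position vector.
    vectors = [
        [t + p for t, p in zip(token_emb[i], pos_emb[pos])]
        for pos, i in enumerate(ids)
    ]
    return ids, vectors

ids, vectors = embed("Transformers work well")
print(ids)                            # [0, 1, 2, 3]
print(len(vectors), len(vectors[0]))  # 4 4
```

The lookup in `token_emb[i]` only works because text maps to a finite set of ids; the transformer layers in step 4 would then consume `vectors`.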
Problem!
Start with text string
1. → text tokens
2. → text embedding vectors (via embedding dictionary)
3. → text/position embedding vectors
4. → stacked transformer layers (self-attention + normalization + residual connections + MLP blocks)
5. → CLS token
6. → attach classification head and do prediction, etc.
Commonly trained with a self-supervised objective (ex. next token prediction)
Naive Solution (imageGPT)
Paper: “Generative Pretraining from Pixels”
● Pixels are kinda discrete: just treat each color value like a separate word in your vocabulary!
○ Each pixel is commonly represented by a 24 bit value (integers in the range [0, 255] for each of the 3 color channels)
○ Vocab size of 2^24 = 16,777,216!
● Who needs that many colors anyway?
○ Use a 9 bit representation (integers in the range [0, 7] for each of the 3 color channels)
○ Vocab size of 2^9 = 512
● Read pixels in raster order (row by row, from left to right) to get the input sequence
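A minimal sketch of this tokenization, assuming the simple per-channel scheme described on the slide (3 bits per channel, packed into one id). The function names `quantize_pixel` and `image_to_sequence` are made up for the example, and the palette used in the paper itself may be constructed differently.

```python
def quantize_pixel(r, g, b, bits_per_channel=3):
    # Keep only the top bits of each 8-bit channel (values 0..7 for 3 bits).
    shift = 8 - bits_per_channel
    rq, gq, bq = r >> shift, g >> shift, b >> shift
    # Pack the three channel values into one token id in [0, 2**(3 * bits)).
    return (rq << (2 * bits_per_channel)) | (gq << bits_per_channel) | bq

def image_to_sequence(image):
    # image: list of rows, each row a list of (r, g, b) tuples.
    # Raster order: row by row, left to right within each row.
    return [quantize_pixel(*px) for row in image for px in row]

print(2 ** 24)  # 16777216 possible 24-bit colors
print(2 ** 9)   # 512 possible 9-bit colors

# Hypothetical 2 x 2 image: black, white, red, blue.
image = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
print(image_to_sequence(image))  # [0, 511, 448, 7]
```

The resulting list of ids plays the same role as the text tokens in the transformer pipeline: each id indexes a row of a 512-entry embedding dictionary.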
Naive Solution (imageGPT)
● Another problem: time complexity :(
○ Recall: self-attention in transformers is O(n^2) w.r.t. input length n
○ AND input length is O(s^2) w.r.t. the side length s of the image, so the cost grows as O(s^4)
○ 256 x 256 image => 65,536 pixels
○ For reference, BERT only has a max length of 512 tokens
● Solution: just use smaller images lmao
○ Max size of 64 x 64
● Trained on a similar objective to language models (next pixel prediction instead of next token prediction)
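The arithmetic behind this can be checked directly. The sketch below only counts query-key pairs in full self-attention (one token per pixel) and ignores constant factors, model width, and number of layers; `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(side):
    n = side * side    # sequence length: one token per pixel
    return n, n * n    # query-key pairs in full self-attention

for side in (32, 64, 256):
    n, pairs = attention_pairs(side)
    print(side, n, pairs)
# 32 1024 1048576
# 64 4096 16777216
# 256 65536 4294967296

bert_pairs = 512 * 512                          # BERT's max length of 512 tokens
print(attention_pairs(256)[1] // bert_pairs)    # 16384
```

So a single 256 x 256 image needs roughly 16,000 times as many attention pairs as a full-length BERT input, which is why shrinking the image is the easy way out.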
The good
● Nice image representations
● SOTA on semi-supervised classification
○ Task: classification with limited labeled samples
○ Model: linear classifier on iGPT representations
○ Competitive results with a naive method + lots of compute
● Nice image generations
○ Effective at modeling visual information
Results table (linear probe accuracy):
Evaluation                Model           Accuracy
CIFAR-10 Linear Probe     ResNet-152      94.0
CIFAR-10 Linear Probe     SimCLR          95.3
CIFAR-10 Linear Probe     iGPT-L 32x32    96.3
CIFAR-100 Linear Probe    ResNet-152      78.0
CIFAR-100 Linear Probe    SimCLR          80.2
CIFAR-100 Linear Probe    iGPT-L 32x32    82.8
The bad
● “We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL, a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web.”
● “iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo model can be trained in roughly 70 V100-days”
○ For reference, MoCo is another self-supervised model, but it has a ResNet backbone that can handle a 224 x 224 image resolution
● All that for only a 64 x 64 resolution!
So… why?
● Mostly a proof of concept
● Paradigm of transformers + m a s s i v e self-supervised pre-training, but applied to a new domain
○ A general method for learning representations
○ Same method, new modes