Aside
● There are ways to make transformers more efficient (architecture-wise)
● BUT recall: a major appeal of using transformers is that they scale well relative to compute
● Transformer architectures are supposed to be simple: self-attention is just huge matrix multiplications (sketched below)
  ○ huge matrix multiplications are good for parallelization
  ○ want to keep the architecture as simple as possible
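
As a rough illustration of the "just matrix multiplications" point, here is a minimal NumPy sketch of single-head self-attention; the shapes and names (d_model, W_q, etc.) are illustrative assumptions, not something taken from the slides or a specific library.

```python
# Minimal sketch (illustrative, not the slides' code): single-head
# self-attention written as plain matrix multiplications in NumPy.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_*: (d_model, d_head) projections."""
    Q = X @ W_q                                   # queries  - one matmul
    K = X @ W_k                                   # keys     - one matmul
    V = X @ W_v                                   # values   - one matmul
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # (seq_len, seq_len) matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum - one more matmul

# Example: 8 tokens, model width 16, head width 4 (arbitrary illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)            # shape (8, 4)
```

Aside from the element-wise softmax, every step above is a dense matmul, which is the kind of operation GPUs/TPUs parallelize well; that is the sense in which keeping the architecture simple helps it scale with compute.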