Various Kinds of Attention
Hung-yi Lee 李宏毅
Prerequisite
[Machine Learning 2021] Self-attention Mechanism (Part 1): https://youtu.be/hYdO9CscNes
[Machine Learning 2021] Self-attention Mechanism (Part 2): https://youtu.be/gmsMY5kc-zw
To Learn More …
Efficient Transformers: A Survey (https://arxiv.org/abs/2009.06732)
Long Range Arena: A Benchmark for Efficient Transformers (https://arxiv.org/abs/2011.04006)
How to make self-attention efficient?
For a sequence of length N, the attention matrix between queries and keys is N × N.
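For reference, here is a minimal NumPy sketch of plain self-attention (the function and variable names are mine, not from the lecture). It materialises the N × N attention matrix explicitly, which is exactly the cost the techniques below try to avoid.

```python
import numpy as np

def full_self_attention(Q, K, V):
    """Q, K, V: arrays of shape (N, d). Returns an (N, d) output."""
    d = Q.shape[-1]
    # The attention matrix A is (N, N): every query attends to every key,
    # so memory and compute grow quadratically with sequence length N.
    scores = Q @ K.T / np.sqrt(d)                      # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)              # row-wise softmax
    return A @ V                                       # (N, d)

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(full_self_attention(Q, K, V).shape)   # (1024, 64); A itself was 1024 x 1024
```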
Notice
• Self-attention is only one module in a larger network.
• Self-attention dominates the computation when N is large (see the back-of-the-envelope estimate below).
• Efficient self-attention is usually developed for image processing, where a 256 × 256 image gives N = 256 × 256.
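A rough size estimate for that image case (the single-head, 32-bit-float storage assumption is mine, not from the slides):

```python
# Back-of-the-envelope count for N = 256 * 256, assuming one attention head
# whose N x N matrix is stored in 32-bit floats.
N = 256 * 256                       # 65,536 pixels treated as tokens
entries = N * N                     # 4,294,967,296 attention weights
print(entries * 4 / 2**30, "GiB")   # 16.0 GiB for a single attention matrix
```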
Skip Some Calculations with Human Knowledge
Can we fill in some values with human knowledge?
Local Attention / Truncated Attention
Calculate attention weights only between a query and its nearby keys; the remaining entries of the attention matrix are set to 0.
Similar to CNN.
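A minimal masking sketch of local attention (the window half-width w and the helper names are illustrative). Masked cells receive -inf before the softmax, which has the same effect as setting those attention weights to 0:

```python
import numpy as np

def local_attention_mask(N, w):
    # Keep only the cells with |i - j| <= w, i.e. a band around the diagonal.
    idx = np.arange(N)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def masked_attention(Q, K, V, mask):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)           # crossed-out cells -> weight 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

# With w = 1 each token attends to itself and its two neighbours, much like a
# convolution with kernel size 3. (This sketch still builds the full matrix;
# an efficient implementation would skip the masked dot products entirely.)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(local_attention_mask(8, w=1).astype(int))
print(masked_attention(Q, K, V, local_attention_mask(8, w=1)).shape)   # (8, 4)
```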
Stride Attention
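A sketch of the stride pattern. Whether each query keeps only the keys exactly one stride away or every multiple of the stride is a design choice; this sketch keeps every multiple, and the stride value 2 is illustrative:

```python
import numpy as np

def stride_attention_mask(N, stride):
    # Keep the diagonal plus every position whose distance is a multiple of `stride`.
    idx = np.arange(N)
    dist = np.abs(idx[:, None] - idx[None, :])
    return dist % stride == 0

print(stride_attention_mask(8, stride=2).astype(int))
```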
Global Attention
Add special tokens to the original sequence (a special token is "the village chief among tokens").
• A special token attends to every token → it collects global information.
• A special token is attended by every token → every token can get the global information from it.
• There is no attention between non-special tokens.
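A masking sketch of this pattern, assuming g special tokens placed at the front of the sequence (both g and the placement are illustrative choices, not from the slides):

```python
import numpy as np

def global_attention_mask(N, g):
    mask = np.zeros((N, N), dtype=bool)
    mask[:g, :] = True    # special tokens attend to every token
    mask[:, :g] = True    # every token attends to the special tokens
    return mask           # everything else stays False: no attention between non-special tokens

print(global_attention_mask(8, g=2).astype(int))
```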
Many Different Choices …
Different heads use different patterns ("only kids make choices..."; a multi-head model can take them all).
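A tiny sketch of that idea: each head gets its own Boolean pattern. The sizes and the pattern-to-head assignment below are illustrative:

```python
import numpy as np

# One Boolean pattern per head: local for head 0, stride for head 1, global for head 2.
N = 8
idx = np.arange(N)
dist = np.abs(idx[:, None] - idx[None, :])

local_mask  = dist <= 1                     # head 0: window of +/- 1 around the diagonal
stride_mask = dist % 3 == 0                 # head 1: stride 3
global_mask = np.zeros((N, N), dtype=bool)  # head 2: position 0 acts as the special token
global_mask[0, :] = True
global_mask[:, 0] = True

for h, m in enumerate([local_mask, stride_mask, global_mask]):
    print(f"head {h} pattern:\n{m.astype(int)}")
```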