
Natural Language Processing with Deep Learning
Language Model & Distributed Representation (5)
Chen Li  cli@xjtu.edu.cn
Xi'an Jiaotong University, 2023

Outline
1. Self-attention
2. Transformer
3. Pre-training LM


Self-Attention

• $y_t = f(x_t, A, B)$, where $A$ and $B$ are another sequence (matrix).
• If we take $A$ (key) $= B$ (value) $= X$ (query), then it is called self-attention.
• That is, each $x_t$ is compared with all the original words, and $y_t$ is computed from them at last!
• Completely outside the traditional RNN or CNN framework.
• Faster, and can directly get global information!
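As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this idea. It assumes the similarity function $f$ is the scaled dot product used by the Transformer; all sizes and variable names are illustrative. Setting the keys and values equal to the query sequence $X$ turns ordinary attention into self-attention.

```python
import numpy as np

def attention(query, keys, values):
    """y_t = f(x_t, A, B): attend from each query position over (keys, values).

    query:  (n, d) matrix X in the slides
    keys:   (m, d) matrix A
    values: (m, d) matrix B
    """
    d = query.shape[-1]
    # Similarity between every query and every key
    # (scaled dot product, one common choice of f).
    scores = query @ keys.T / np.sqrt(d)            # (n, m)
    # Normalize the scores into weights with softmax over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output y_t is a weighted sum of the values.
    return weights @ values                          # (n, d)

# Self-attention: take A (key) = B (value) = X (query).
X = np.random.randn(5, 8)    # 5 words, 8-dim embeddings (toy sizes)
Y = attention(X, X, X)       # every y_t sees every word at once
print(Y.shape)               # (5, 8)
```

Because every query position attends to every key position in one matrix product, there is no sequential recurrence, which is why self-attention can be computed in parallel and sees global context directly.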

Self-Attention

[Figure: attention as querying a Source of (Key1, Value1) … (Key4, Value4) pairs: a Query is matched against the keys, and the corresponding values are combined into the Attention Value.]

Self-Attention

Calculation process:
• Step 1: calculate the similarity $F(Q, K)$ between the query and each key to get the scores $s_1, \dots, s_4$.
• Step 2: apply SoftMax to the scores to get the normalized weights.
• Step 3: take the weighted sum of the values with these weights to obtain the final Attention Value.

[Figure: the three steps applied to a Source of Key1–Key4 and Value1–Value4.]
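The three steps map directly onto code. Below is a small sketch for a single query over the four (key, value) pairs from the figure; the plain dot product stands in for $F(Q, K)$ (one common choice), and all tensors are toy random data.

```python
import numpy as np

# Toy Source with four (key, value) pairs, matching the slide's
# Key1..Key4 / Value1..Value4; sizes are illustrative.
keys   = np.random.randn(4, 8)
values = np.random.randn(4, 8)
query  = np.random.randn(8)

# Step 1: similarity F(Q, K) between the query and each key -> scores s1..s4.
scores = keys @ query                  # shape (4,)

# Step 2: SoftMax turns the scores into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Step 3: weighted sum of the values gives the Attention Value.
attention_value = weights @ values     # shape (8,)
print(weights, attention_value.shape)
```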