Advanced Artificial Intelligence Lecture 7: Recurrent Neural Network
Outline ▪ Recurrent Neural Network ▪ Vanilla RNNs ▪ Some RNN Variants ▪ Backpropagation through time ▪ Gradient Vanishing / Exploding ▪ Long Short-term Memory ▪ LSTM Neuron ▪ Multiple-layer LSTM ▪ Backpropagation through time in LSTM ▪ Time-Series Prediction
Vanilla RNNs ▪ Sequential data
So far, we have assumed that the data points (x, y) in a dataset are i.i.d. (independent and identically distributed).
This assumption does not hold in many applications.
Sequential data: data points come in order, and successive points may be dependent, e.g.,
Letters in a word
Words in a sentence/document
Phonemes in a spoken word utterance
Page clicks in a Web session
Frames in a video, etc.
Vanilla RNNs ▪ Sequence Modeling
How to model sequential data? Recurrent neural networks (vanilla RNNs):
c(t) depends on x(1), ··· , x(t)
Output a(L,t) depends on the hidden activations (bias terms omitted):
a(k,t) = act(z(k,t)) = act(U(k) a(k,t-1) + W(k) a(k-1,t)), with a(0,t) = x(t)
a(·,t) summarizes x(t), ··· , x(1); earlier points are less important
Source of slide: https://www.youtube.com/watch?v=2btuy_-Fw3c&list=PLlPcwHqLqJDkVO0zHMqswX1jA9Xw7OSOK
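To make the recurrence concrete, here is a minimal NumPy sketch of a single-layer vanilla RNN (k = 1); the sizes, the tanh activation, and the softmax output layer are illustrative assumptions, and bias terms are omitted as on the slide.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed sizes for illustration
n_in, n_hidden, n_out, T = 4, 8, 3, 5
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights (time direction)
W = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input-to-hidden weights (layer direction)
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden-to-output weights

x = rng.normal(size=(T, n_in))   # a toy input sequence x(1), ..., x(T)
a = np.zeros(n_hidden)           # initial hidden activation a(1,0)

outputs = []
for t in range(T):
    # a(1,t) = act(U a(1,t-1) + W x(t)), with act = tanh here
    a = np.tanh(U @ a + W @ x[t])
    # c(t) depends on x(1), ..., x(t) only through a(1,t)
    outputs.append(softmax(V @ a))
```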
Vanilla RNNs ▪ Sequence Modeling
a(k,t) = act(z(k,t)) = act(U(k) a(k,t-1) + W(k) a(k-1,t))
Weights are shared across time instances (W(k))
Assumes that the “transition functions” are time invariant (U(k))
Our goal is to learn U(k) and W(k) for k = 1, ··· , L
Source of slide: https://www.youtube.com/watch?v=2btuy_-Fw3c&list=PLlPcwHqLqJDkVO0zHMqswX1jA9Xw7OSOK
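To make the weight sharing explicit, the sketch below (illustrative, assuming tanh activations and two stacked layers) creates U(k) and W(k) once per layer and reuses them at every time step t, so the number of parameters is independent of the sequence length T.

```python
import numpy as np

def rnn_forward(x, U, W):
    """Forward pass of an L-layer vanilla RNN (biases omitted).

    x: array of shape (T, n_in); U[k], W[k]: weights of layer k+1,
    shared across all time steps t = 1, ..., T.
    """
    T = x.shape[0]
    L = len(U)
    # a[k] holds the previous activation of layer k+1, i.e., a(k+1, t-1)
    a = [np.zeros(U[k].shape[0]) for k in range(L)]
    top = []
    for t in range(T):
        below = x[t]                      # a(0,t) = x(t)
        for k in range(L):
            # a(k,t) = tanh(U(k) a(k,t-1) + W(k) a(k-1,t))
            a[k] = np.tanh(U[k] @ a[k] + W[k] @ below)
            below = a[k]
        top.append(a[-1])                 # a(L,t), fed to the output layer
    return np.stack(top)

# Toy sizes (assumptions): 2 layers, 4-dim input, 8-dim hidden states
rng = np.random.default_rng(1)
sizes = [4, 8, 8]
U = [rng.normal(scale=0.1, size=(sizes[k + 1], sizes[k + 1])) for k in range(2)]
W = [rng.normal(scale=0.1, size=(sizes[k + 1], sizes[k])) for k in range(2)]
h = rnn_forward(rng.normal(size=(5, 4)), U, W)   # shape (5, 8): a(L,t) for t = 1..5
```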
Vanilla RNNs ▪ RNNs have Memory
The computational graph of an RNN can be unfolded in time
(Figure: the recurrent graph is unfolded into a chain over x(t-1), x(t), x(t+1) with outputs c(t-1), c(t), c(t+1))
Black squares denote memory access (reading the activation stored at the previous time step)
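Unfolding written out by hand for T = 3 (an illustrative sketch with assumed sizes): each line reuses the same U and W and reads the activation stored at the previous step, which is the "memory" the black squares indicate.

```python
import numpy as np

# Toy sizes (assumptions for illustration); biases omitted
rng = np.random.default_rng(2)
U = rng.normal(scale=0.1, size=(8, 8))   # recurrent weights, reused at every step
W = rng.normal(scale=0.1, size=(8, 4))   # input weights, reused at every step
x = rng.normal(size=(3, 4))              # x(1), x(2), x(3)
a0 = np.zeros(8)                         # initial activation a(1,0)

# The recurrent loop, unfolded by hand for T = 3: a plain feedforward graph
# in which every step reuses U and W and reads the stored previous activation.
a1 = np.tanh(U @ a0 + W @ x[0])   # reads the stored "memory" a0
a2 = np.tanh(U @ a1 + W @ x[1])   # reads the stored "memory" a1
a3 = np.tanh(U @ a2 + W @ x[2])   # reads the stored "memory" a2
```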
Vanilla RNNs ▪ Example Application ▪ Slot-Filling (Spoken Language Understanding)
"I would like to arrive Shenyang on November 2nd." (spoken to a ticket booking system)
Slots to be filled:
Destination: Shenyang
Time of arrival: November 2nd
Source of slide: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
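One way to view the task: each word of the utterance is tagged with the slot it fills (or with "other"). A hypothetical labeled example, written as Python data, might look like this; the tag names are assumptions for illustration.

```python
# Hypothetical slot-filling training pair: one slot tag per word
words = ["I", "would", "like", "to", "arrive", "Shenyang", "on", "November", "2nd"]
tags  = ["other", "other", "other", "other", "other",
         "destination", "other", "time_of_arrival", "time_of_arrival"]
assert len(words) == len(tags)
```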
Vanilla RNNs ▪ Example Application ▪ Slot-Filling (Spoken Language Understanding)
Solving slot filling by a feedforward network?
Input: a word, e.g., "Shenyang" (each word is represented as a vector)
(Figure: a feedforward network mapping the word vector x to an output vector y)
Source of slide: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
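A minimal sketch of this feedforward approach, assuming a one-hidden-layer network with a softmax over slot labels (all sizes and names are illustrative): the network sees one word vector at a time, with no access to the surrounding words.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed sizes: 5-dim word vector, 8 hidden units, 3 slot labels
rng = np.random.default_rng(3)
W1 = rng.normal(scale=0.1, size=(8, 5))
W2 = rng.normal(scale=0.1, size=(3, 8))

x = rng.normal(size=5)                     # an assumed 5-dimensional word vector
y = softmax(W2 @ np.maximum(0.0, W1 @ x))  # probability of each slot label
```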
Vanilla RNNs ▪ Example Application ▪ 1-of-N encoding
How to represent each word as a vector?
1-of-N encoding: the vector is lexicon-sized; each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0
lexicon = {apple, bag, cat, dog, elephant}
apple = [1 0 0 0 0]
bag = [0 1 0 0 0]
cat = [0 0 1 0 0]
dog = [0 0 0 1 0]
elephant = [0 0 0 0 1]
Source of slide: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
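A small sketch of 1-of-N encoding over the lexicon above (the helper name is illustrative):

```python
import numpy as np

LEXICON = ["apple", "bag", "cat", "dog", "elephant"]

def one_of_n(word, lexicon=LEXICON):
    """Return the 1-of-N (one-hot) vector for `word`; the vector is lexicon-sized."""
    v = np.zeros(len(lexicon))
    v[lexicon.index(word)] = 1.0
    return v

print(one_of_n("cat"))   # [0. 0. 1. 0. 0.]
```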
Vanilla RNNs ▪ Example Application ▪ Beyond 1-of-N encoding
Dimension for "Other": add an extra "other" dimension to the lexicon vector; words outside the lexicon (e.g., w = "Gandalf", w = "Sauron") map to it, e.g., [apple bag cat dog elephant "other"] = [0 0 0 0 0 1]
Word hashing: represent a word by its letter trigrams (a 26 X 26 X 26-dimensional vector); e.g., w = "apple" sets the dimensions a-p-p, p-p-l, and p-l-e to 1
Source of slide: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
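A sketch of both ideas, assuming lowercase alphabetic words; the helper names and lexicon are illustrative, not from the slide.

```python
import numpy as np
import string

LEXICON = ["apple", "bag", "cat", "dog", "elephant"]

def one_of_n_with_other(word, lexicon=LEXICON):
    """1-of-N vector with one extra 'other' dimension for out-of-lexicon words."""
    v = np.zeros(len(lexicon) + 1)
    v[lexicon.index(word) if word in lexicon else -1] = 1.0
    return v

def word_hash(word):
    """Letter-trigram word hashing into a 26*26*26-dimensional vector."""
    v = np.zeros(26 ** 3)
    idx = {c: i for i, c in enumerate(string.ascii_lowercase)}
    for a, b, c in zip(word, word[1:], word[2:]):
        v[idx[a] * 26 * 26 + idx[b] * 26 + idx[c]] = 1.0
    return v

print(one_of_n_with_other("Gandalf".lower()))   # [0. 0. 0. 0. 0. 1.]
print(int(word_hash("apple").sum()))            # 3 trigrams: a-p-p, p-p-l, p-l-e
```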