Learning to Act Optimally: Reinforcement Learning
Brian C. Williams
16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
Learns to play Backgammon.
States:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
⇒ Currently roughly equal to the best human player.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Q values
• Q learning
• Summary
• Appendix: Nondeterministic MDPs

Reinforcement Learning Problem
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π: S → A that maximizes life reward
  r_0 + γ r_1 + γ^2 r_2 + …
from any start state. Discount: 0 ≤ γ < 1.

[Diagram: the agent observes state s_t and reward r_t from the environment and executes action a_t, producing the trajectory s_0 —a_0/r_0→ s_1 —a_1/r_1→ s_2 —a_2/r_2→ s_3 …]

Note:
• Unsupervised learning
• Delayed reward
• Model not known

Goal: Learn to choose actions that maximize life reward r_0 + γ r_1 + γ^2 r_2 + …
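To make the life-reward expression concrete, here is a minimal Python sketch (not from the slides) that sums a finite, observed reward sequence with discount γ:

```python
# Minimal sketch: discounted life reward r_0 + gamma*r_1 + gamma^2*r_2 + ...
# for a finite observed reward sequence (illustrative only).

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the observed rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a trajectory that reaches a +100 reward on the third step.
print(discounted_return([0, 0, 100], gamma=0.9))  # 0.9**2 * 100 ≈ 81
```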
How About Learning Value Functions?
1. Have agent learn the optimal value function V^π*, denoted V*.
2. Agent selects the optimal action by one-step lookahead on V*:
   π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

Problem:
• Works well if the agent knows the environment model:
  δ: S × A → S
  r: S × A → ℝ
• With no model, the agent can’t choose an action from V*.

Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

Key idea:
• Define a function like V* that encapsulates δ and r:
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
• Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
  π*(s) = argmax_a Q(s, a)
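As an illustration of why Q removes the need for δ and r, here is a small Python sketch; the states, actions, and Q values are hypothetical, not from the slides:

```python
# Hypothetical Q-table: once Q(s, a) is learned, pi*(s) = argmax_a Q(s, a)
# needs neither the transition model delta nor the reward function r.
Q = {
    ("s1", "left"): 72.0, ("s1", "right"): 90.0,
    ("s2", "left"): 81.0, ("s2", "right"): 100.0,
}

def greedy_policy(state, actions=("left", "right")):
    """pi*(s) = argmax_a Q(s, a), using only the learned table."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy("s1"))  # -> "right"
```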
How Do We Learn Q?
Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
• Need to eliminate V* in the update rule.
• Note Q and V* are closely related:
  V*(s) = max_a' Q(s, a')
• Substituting Q for V*:
  Q(s_t, a_t) = r(s_t, a_t) + γ max_a' Q(s', a'), where s' = δ(s_t, a_t)

Example – Q-Learning Update
γ = 0.9
[Grid diagram: the move a_right takes s_1 to s_2 with immediate reward 0; the arcs leaving s_2 carry Q values 63, 81, and 100 (R = 100); after the update, Q(s_1, a_right) = 90.]
Q(s_1, a_right) ← r(s_1, a_right) + γ max_a' Q(s_2, a')
               ← 0 + 0.9 max{63, 81, 100}
               ← 90

Note: if rewards are non-negative:
• For all s, a, n: Q_n(s, a) ≤ Q_{n+1}(s, a)
• For all s, a, n: 0 ≤ Q_n(s, a) ≤ Q(s, a)
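The single update above can be checked directly; a tiny sketch using the example's numbers:

```python
# Reproduces the worked update: Q(s1, a_right) <- r + gamma * max_a' Q(s2, a').
gamma = 0.9
r = 0                        # immediate reward for moving right
successor_q = [63, 81, 100]  # Q(s2, a') values from the example grid
print(r + gamma * max(successor_q))  # 90.0
```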
Q-Learning Iterations
Starts at top left corner – moves clockwise around the perimeter.
Initially all values in the Q table are zero; γ = 0.8
Q(s, a) ← r + γ max_a' Q(s', a')

[Grid diagram: states s1–s6 arranged around the perimeter, with goal G; reward 10 for the transition into the goal, 0 elsewhere.]

Trip 1 around the perimeter:
  Q(s1,E) = 0, Q(s2,E) = 0, Q(s3,S) = 0,
  Q(s4,W) ← r + γ max_a' {Q(s5,loop)} = 10 + 0.8 × 0 = 10
Trip 2:
  Q(s1,E) = 0, Q(s2,E) = 0,
  Q(s3,S) ← r + γ max_a' {Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8,
  Q(s4,W) = 10
Trip 3:
  Q(s1,E) = 0,
  Q(s2,E) ← r + γ max_a' {Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4,
  Q(s3,S) = 8, Q(s4,W) = 10

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a, initialize table entry Q̂(s, a) ← 0
• Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ max_a' Q̂(s', a')
• s ← s'
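The crib sheet translates almost line-for-line into code. A minimal sketch, assuming a small deterministic environment object with reset()/step() methods and episodic termination (these interface details are assumptions, not part of the slides):

```python
import random

def q_learning(env, actions, gamma=0.8, episodes=1000):
    Q = {}                                    # table entries Q_hat(s, a), default 0
    for _ in range(episodes):
        s = env.reset()                       # observe current state s
        done = False
        while not done:
            a = random.choice(actions)        # select an action a and execute it
            r, s_next, done = env.step(s, a)  # immediate reward r and new state s'
            best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
            Q[(s, a)] = r + gamma * best_next # Q_hat(s,a) <- r + gamma max_a' Q_hat(s',a')
            s = s_next                        # s <- s'
    return Q
```

Picking actions at random is the simplest choice here; it sidesteps the exploration-vs-exploitation question raised on the Discussion slide.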
Discussion
• How should the learning agent use the intermediate Q values?
• Exploration vs. exploitation
• Scaling up in the size of the state space
• Function approximators (neural net instead of table)
• Reuse, use of macros

Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Q values
• Q learning
• Summary
• Appendix: Nondeterministic MDPs
Nondeterministic Example
R – Research, D – Development
[State diagram: four states – s1 Unemployed, s2 Industry, s3 Grad School, s4 Academia – connected by actions R and D; transitions are stochastic, with probabilities 0.1, 0.9, and 1.0 on the arcs.]

Nondeterministic Case
We redefine V, Q by taking expected values:
  V^π(s_t) = E[r_t + γ r_{t+1} + γ^2 r_{t+2} + …]
  V^π(s_t) = E[Σ_i γ^i r_{t+i}]
  Q(s_t, a_t) = E[r(s_t, a_t) + γ V*(δ(s_t, a_t))]
Nondeterministic Case
Alter the training rule to:
  Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_a' Q̂_{n-1}(s', a')]
where α_n = 1 / (1 + visits_n(s, a)) and s' = δ(s, a).
Can still prove convergence of Q̂ [Watkins and Dayan, 92].
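A minimal Python sketch of this decayed-learning-rate update; the table names and the per-pair visit counter are illustrative, not from the slides:

```python
from collections import defaultdict

Q = defaultdict(float)     # Q_hat table, entries default to 0
visits = defaultdict(int)  # visit counts per (s, a) pair

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])            # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```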
Ongoing Research
• Handling the case where the state is only partially observable
• Designing optimal exploration strategies
• Extending to continuous actions and states
• Learning and using δ: S × A → S
• Relationship to dynamic programming
• Multiple learners – multi-agent reinforcement learning