Partially Observable Markov Decision Processes

• Additional reading:
  – Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, Vol. 101, 1998.
  – J. Pineau, G. Gordon and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.

Issues

• Problem statement:
  – If we do not know the current state of the world, what should we do to act optimally?
• Inputs:
  – Model of states, actions, observations, transition and emission functions, reward function, initial distribution, and discount factor
• Outputs from different algorithms:
  – Complete/partial, exact/approximate value function
  – Complete/partial, exact/approximate finite state machine
• Choices:
  – Policy vs. plan
  – Exact vs. approximate
  – Representation
  – Discrete vs. continuous states, actions, observations
Graphical Models (Sondik, 1971)

• [Figure: two-slice graphical model. Hidden layer: states s1, s2 with rewards R1, R2; observable layer: actions a1, a2, observations z1, z2, and beliefs b1, b2. Transitions are governed by T(sj | ai, si) and observations by O(zj | si).]

Reliable Navigation

• Conventional trajectories may not be robust to localization error
• [Figure: map showing the estimated robot position, the true robot position, and the goal position.]
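The graphical model above fixes the ingredients a solver needs: states, actions, observations, the transition function T, the observation function O, rewards, an initial belief, and a discount factor. As a concrete reference, here is a minimal sketch of those components as NumPy arrays; the names and array layout are an assumed convention, not something taken from the slides.

```python
# Minimal sketch (assumed layout): a POMDP as a container of NumPy arrays.
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    T: np.ndarray    # T[a, s, s'] = p(s' | a, s): transition function
    O: np.ndarray    # O[s', z]    = p(z | s'):    observation / emission function
    R: np.ndarray    # R[s, a]     = immediate reward for action a in state s
    b0: np.ndarray   # b0[s]       = initial belief over the hidden states
    gamma: float     # discount factor
```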
Control Models

• Markov Decision Processes: the probabilistic perception model P(x) is collapsed to a single world state (argmax P(x)) before being handed to the control model
• Partially Observable Markov Decision Processes: the full distribution P(x) from the perception model is handed directly to the control model

Navigation as a POMDP

• The state is hidden from the controller
• The controller chooses actions based on probability distributions over the state space
• [Figure: controller interacting with a hidden state space through actions and observations.]
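Because the state is hidden, such a controller maintains a belief and updates it with a Bayes filter after every action and observation. A minimal sketch, assuming the array layout from the sketch above (T[a, s, s'] = p(s' | a, s), O[s, z] = p(z | s)); the two-state numbers used for illustration are from the example introduced later in these slides.

```python
# Sketch of the Bayes-filter belief update a POMDP controller runs:
#   b'(s') is proportional to p(z | s') * sum_s p(s' | a, s) * b(s)
import numpy as np

def belief_update(b, a, z, T, O):
    predicted = T[a].T @ b               # prediction: sum_s p(s' | a, s) b(s)
    unnormalized = O[:, z] * predicted   # correction: weight by p(z | s')
    return unnormalized / unnormalized.sum()

# Hypothetical usage with the two-state example from later slides:
T = np.array([[[0.3, 0.7], [0.6, 0.4]],   # T[a, s, s'] for actions a1, a2
              [[0.1, 0.9], [0.8, 0.2]]])
O = np.array([[0.9, 0.1], [0.5, 0.5]])    # O[s, z]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, z=0, T=T, O=O))   # belief after doing a1 and seeing z1
```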
              Fully Observable   Hidden State
  Passive     Markov Model       HMM
  Controlled  MDP                POMDP

• Stationary policy: best action is fixed
• Non-stationary policy: best action depends on time
• States can be discrete, continuous, or hybrid

Tradeoffs

• MDP
  + Tractable to solve
  + Relatively easy to specify
  – Assumes perfect knowledge of state
• POMDP
  + Treats all sources of uncertainty (action, sensing, environment) in a uniform framework
  + Allows for taking actions that gain information
  – Difficult to specify all the conditional probabilities
  – Hugely intractable to solve optimally

POMDP Advantages

• Models information gathering
• Computes the trade-off between:
  – Getting reward
  – Being uncertain ("Am I here, or over there?")
A Simple POMDP

• The state is hidden: the agent only has a belief p(s) over the states s1, s2, s3
• [Figure: bar plot of the belief p(s) over s1, s2, s3.]

POMDP Policies

• A POMDP policy maps the current belief p(s) to an optimal action
• The value function is defined over the belief space; the optimal action at the current belief is read off from it
• [Figure: value function over the belief space, with the current belief and its optimal action marked.]

Coastal Navigation

• [Figure: coastal navigation example — the robot trades a longer path for trajectories that keep it well localized.]
Example POMDP

• Two states: s1, s2
• Two actions: a1, a2
• Two observations: z1, z2
• Assume γ = .9
• Rewards: r(a1, s1) = 2, r(a2, s1) = 1, r(a1, s2) = 1, r(a2, s2) = 3
• Emission probabilities: p(z1 | s1) = .9, p(z2 | s1) = .1, p(z1 | s2) = .5, p(z2 | s2) = .5
• Transition probabilities: p(s1 | s1, a1) = .3, p(s2 | s1, a1) = .7, p(s1 | s2, a1) = .6, p(s2 | s2, a1) = .4, p(s1 | s1, a2) = .1, p(s2 | s1, a2) = .9, p(s1 | s2, a2) = .8, p(s2 | s2, a2) = .2

Example courtesy of Simmons and Veloso, 2002.

Exact Value Iteration

• Define a "policy tree": the root is labeled with an action, each outgoing edge with an observation, and each subtree is itself a policy tree of depth t − 1
• For a horizon of length t, there are |A|^((|Z|^t − 1) / (|Z| − 1)) possible policy trees
• Idea: compute the value of a tree of length t recursively from the trees of length t − 1, and evaluate a belief with V*_t(b) = max_{α ∈ V_t} b · α
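For reference, a sketch of this example written out as arrays (the layout T[a, s, s'], O[s, z], R[s, a] is an assumed convention, not from the slides), together with the policy-tree count used above.

```python
# The example POMDP as NumPy arrays, plus the count of depth-t policy trees.
import numpy as np

gamma = 0.9
T = np.array([[[0.3, 0.7],     # T[a, s, s'] = p(s' | a, s); action a1 from s1, s2
               [0.6, 0.4]],
              [[0.1, 0.9],     # action a2 from s1, s2
               [0.8, 0.2]]])
O = np.array([[0.9, 0.1],      # O[s, z] = p(z | s)
              [0.5, 0.5]])
R = np.array([[2.0, 1.0],      # R[s, a]: rows s1, s2; columns a1, a2
              [1.0, 3.0]])

def num_policy_trees(n_actions, n_obs, t):
    """|A| ** ((|Z|**t - 1) / (|Z| - 1)): every node of a depth-t tree picks an action."""
    nodes = (n_obs ** t - 1) // (n_obs - 1)
    return n_actions ** nodes

print(num_policy_trees(2, 2, 2))   # -> 8 policy trees at horizon t = 2
```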
Exact Value Iteration

• Compute the value of each policy tree over the space of beliefs b by taking an expectation
• Given tree p:  V_p(b) = E_b[V_p(s)] = Σ_{s ∈ S} b(s) V_p(s)
• The expectation is linear: the value function of a policy tree is therefore linear in the belief, and can be written as an "α-vector"
• Compute V_p(s) from the Bayes filter:
  V_p^t(s) = max_a [ R(s, a) + γ Σ_{s' ∈ S} p(s' | s, a) Σ_{z ∈ Z} p(z | s') V_{p_z}^{t−1}(s') ]
  where p_z is the depth-(t − 1) subtree followed after observing z

Exact Value Iteration

• Set V_{t=0} = 0
• V_{t=1}: two policy trees, p1 (take action a1) and p2 (take action a2)
  α_1 = [V_{p1}(s1), V_{p1}(s2)] = [R(s1, a1) + γ·0, R(s2, a1) + γ·0] = [R(s1, a1), R(s2, a1)]
  α_2 = [V_{p2}(s1), V_{p2}(s2)] = [R(s1, a2) + γ·0, R(s2, a2) + γ·0] = [R(s1, a2), R(s2, a2)]
• [Figure: the linear functions V_{p1}^{t=1} and V_{p2}^{t=1} plotted against p(s2); their upper envelope is V_1, i.e. V*_t(b) = max_{α ∈ V_t} b · α.]
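A minimal sketch of the t = 1 step for the example POMDP: because V_{t=0} = 0, the backup reduces to the immediate rewards, so each one-step tree contributes the α-vector of rewards for its action, and a belief is evaluated by the maximum dot product. The R[s, a] layout is the assumed convention from the earlier sketches.

```python
# t = 1 alpha-vectors and the piecewise-linear value V_1(b) = max_alpha b . alpha
import numpy as np

R = np.array([[2.0, 1.0],   # R[s, a] for the example: rows s1, s2; columns a1, a2
              [1.0, 3.0]])

alphas_t1 = [R[:, a] for a in range(2)]   # alpha_1 = [2, 1], alpha_2 = [1, 3]

def value(b, alphas):
    """V_t(b) = max over alpha-vectors of b . alpha."""
    return max(float(b @ alpha) for alpha in alphas)

b = np.array([0.4, 0.6])                  # a belief with p(s2) = 0.6
print(value(b, alphas_t1))                # -> 2.2, achieved by alpha_2 (action a2)
```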
Exact Value Iteration

• V_{t=2}: 8 possible policy trees p111, p112, ..., p222 — a root action (a1 or a2) and, for each observation z1, z2, a choice of p1 or p2 as the depth-1 subtree
• For example:
  α_111 = [ R(s1, a1) + γ Σ_{s' ∈ S} p(s' | s1, a1) Σ_{z ∈ Z} p(z | s') α_1(s'),
            R(s2, a1) + γ Σ_{s' ∈ S} p(s' | s2, a1) Σ_{z ∈ Z} p(z | s') α_1(s') ]
  and similarly for α_112, ..., α_222

Exact Value Iteration

• α_111 = [3.17, 2.44]      α_211 = [1.99, 4.62]
• α_112 = [3.773, 2.746]    α_212 = [2.791, 4.728]
• α_121 = [3.557, 2.314]    α_221 = [2.719, 4.152]
• α_122 = [4.16, 2.62]      α_222 = [3.52, 4.26]
• [Figure: the eight α-vectors plotted against p(s2) on a 0–5 scale; α_122, α_222 and α_212 form the upper envelope.]
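A sketch of the exact backup that generates these eight vectors: enumerate a root action and an assignment of a t = 1 tree to each observation branch, then apply the formula above. With the example's numbers (array layout as in the earlier sketches) it reproduces the values listed, e.g. α_111 = [3.17, 2.44] and α_222 = [3.52, 4.26].

```python
# Exact t = 2 backup: one alpha-vector per (root action, subtree-per-observation) choice.
import itertools
import numpy as np

gamma = 0.9
T = np.array([[[0.3, 0.7], [0.6, 0.4]],      # T[a, s, s'] = p(s' | a, s)
              [[0.1, 0.9], [0.8, 0.2]]])
O = np.array([[0.9, 0.1], [0.5, 0.5]])       # O[s, z] = p(z | s)
R = np.array([[2.0, 1.0], [1.0, 3.0]])       # R[s, a]
alphas_t1 = [R[:, 0], R[:, 1]]               # t = 1 alpha-vectors

def backup_all(alphas_prev):
    new_alphas = []
    n_states, n_obs = O.shape
    for a in range(T.shape[0]):
        # choose one previous tree to follow for each observation z
        for choice in itertools.product(range(len(alphas_prev)), repeat=n_obs):
            future = np.zeros(n_states)
            for s in range(n_states):
                for sp in range(n_states):
                    for z in range(n_obs):
                        future[s] += T[a, s, sp] * O[sp, z] * alphas_prev[choice[z]][sp]
            new_alphas.append(R[:, a] + gamma * future)
    return new_alphas

for alpha in backup_all(alphas_t1):          # prints alpha_111 through alpha_222
    print(np.round(alpha, 3))
```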
Pruning Alpha Vectors

• Test each α-vector for dominance with a linear program:
  maximize d
  such that: Σ_{s ∈ S} b(s) V_p(s) ≥ Σ_{s ∈ S} b(s) V_p~(s) + d   for all p~ ∈ V_t − {p}
  and: Σ_{s ∈ S} b(s) = 1
  and: b(s) ≥ 0 for all s ∈ S
• If the maximal d is not positive, no belief prefers p and its α-vector can be pruned
• After pruning, V_{t=2} = {[2.791, 4.728], [3.52, 4.26], [4.16, 2.62]}
• [Figure: the three surviving vectors α_122, α_212 and α_222 plotted over the belief space from s1 to s2.]

Execution

• Track the current belief; the optimal action is the root action of the α-vector that maximizes b · α
• [Figure: the surviving α-vectors over the belief space, with the current belief marked; at this belief the optimal action is a1.]
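A sketch of this dominance test as a small linear program, here using scipy.optimize.linprog (the solver choice is an assumption; the slides do not name one), together with the action-selection step used at execution time. For the example it keeps exactly the three vectors listed above.

```python
# Dominance test: keep an alpha-vector only if some belief gives it a strictly
# positive margin d over every other vector (linprog minimizes, so we negate d).
import numpy as np
from scipy.optimize import linprog

def dominated(alpha, others, n_states):
    if not others:
        return False
    c = np.zeros(n_states + 1)
    c[-1] = -1.0                                   # minimize -d  ==  maximize d
    # b . alpha >= b . alpha~ + d  for every other alpha~, i.e. (alpha~ - alpha) . b + d <= 0
    A_ub = np.array([np.append(other - alpha, 1.0) for other in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n_states), 0.0)])    # sum_s b(s) = 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_states + [(None, None)]            # b(s) >= 0, d free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1] <= 1e-9                       # max d <= 0: never strictly best

def prune(alphas):
    return [a for i, a in enumerate(alphas)
            if not dominated(a, [o for j, o in enumerate(alphas) if j != i], len(a))]

# Execution: act with the root action of the maximizing alpha-vector.
def best_action(b, alphas, root_actions):
    return root_actions[int(np.argmax([b @ a for a in alphas]))]
```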