Partially Observable Markov Decision Processes, Part II

● Additional reading:
– Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. Ph.D. Thesis, Brown University, Department of Computer Science, Providence, RI, 1998. (This lecture focuses mostly on Chapter 6.)

Issues

● Problem statement:
– If we do not know the current state of the world, what should we do to act optimally?
● Inputs:
– Model of states, actions, observations, transition and emission functions, reward function, initial distribution, and discount factor (a minimal container for these inputs is sketched below)
● Outputs from approximate algorithms:
– An approximate value function
● Outputs from heuristic algorithms:
– Policies that capture the kind of behaviour we want
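To make these inputs concrete, here is a minimal sketch of how the model might be packaged; the class name and array layouts are assumptions for illustration, not part of any standard API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDPModel:
    """Illustrative container for the POMDP inputs listed above."""
    states: list          # S
    actions: list         # A
    observations: list    # Z
    T: np.ndarray         # transition function, T[a, s, s'] = p(s' | s, a)
    O: np.ndarray         # emission function,  O[a, s', z] = p(z | s', a)
    R: np.ndarray         # reward function,    R[s, a]
    b0: np.ndarray        # initial distribution over S
    gamma: float          # discount factor
```

The approximate and heuristic methods that follow all operate on a model of this shape together with a belief vector over S.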
Point-based Value Iteration (Pineau et al., 2003)

● Fix a set of beliefs
● Maintain a single α-vector per belief
● Update each α-vector according to Bayes' filter (a code sketch of this backup appears below, after the voting heuristic):

α_{i,t}^{a,z}(s) = R(s,a) + γ Σ_{s'∈S} p(s'|a,s) p(z|s') α_{i,t-1}(s')
α_{i,t}^{a} = Σ_{z∈Z} argmax_{α_{i,t}^{a,z}} ( α_{i,t}^{a,z} · b_i )
α_{i,t} = argmax_{α_{i,t}^{a}} ( α_{i,t}^{a} · b_i )

[Figure: piecewise-linear value function over the belief simplex, with one α-vector maintained per belief point b1, b2, b3]

Maximum-Likelihood State Heuristic (Nourbakhsh, Powers and Birchfield, 1995)

● Assume the MDP policy π_MDP : S → A
● Given belief b:

π_MLS(b) = π_MDP( argmax_{s∈S} b(s) )

Voting Heuristic (Simmons & Koenig, 1996)

● Assume the MDP policy
● Given belief b:

π_AV(b) = argmax_{a∈A} Σ_{s∈S} b(s) I( π_MDP(s), a ),   where I(a_i, a_j) = 1 if a_i = a_j and 0 if a_i ≠ a_j
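Here is a minimal sketch of one point-based backup over a fixed belief set, assuming dense NumPy arrays T[a, s, s'] = p(s'|s,a), O[a, s', z] = p(z|s',a) and R[s, a] as in the container sketched earlier. Following Pineau et al. (2003), the reward term is added once per action rather than inside every per-observation term; the function name and array layout are illustrative.

```python
import numpy as np

def pbvi_backup(B, alphas, T, O, R, gamma):
    """One point-based value-iteration backup over a fixed belief set.

    B      : (n_beliefs, n_states) array of belief points b_i
    alphas : (n_vectors, n_states) alpha-vectors from the previous step
    T      : (n_actions, n_states, n_states), T[a, s, s'] = p(s'|s, a)
    O      : (n_actions, n_states, n_obs),    O[a, s', z] = p(z|s', a)
    R      : (n_states, n_actions) immediate rewards R(s, a)
    gamma  : discount factor
    Returns a new (n_beliefs, n_states) array: one alpha-vector per belief.
    """
    n_actions, _, n_obs = O.shape
    new_alphas = np.empty_like(B)

    for i, b in enumerate(B):
        best_val, best_alpha = -np.inf, None
        for a in range(n_actions):
            # alpha^{a,z}_j(s) = gamma * sum_{s'} p(s'|s,a) p(z|s',a) alpha_j(s')
            # proj has shape (n_obs, n_vectors, n_states)
            proj = gamma * np.einsum('sp,pz,jp->zjs', T[a], O[a], alphas)
            # alpha^a = R(., a) + sum_z argmax_j (alpha^{a,z}_j . b)
            alpha_a = R[:, a].copy()
            for z in range(n_obs):
                j_best = np.argmax(proj[z] @ b)
                alpha_a += proj[z, j_best]
            # keep the action whose backed-up vector is best at this belief
            val = alpha_a @ b
            if val > best_val:
                best_val, best_alpha = val, alpha_a
        new_alphas[i] = best_alpha
    return new_alphas
```

Repeating this backup over the fixed belief set for several iterations yields one α-vector per belief point, i.e. the approximate value function referred to above.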
Q-MDP (Cassandra, Kaelbling and Littman, 1994)

● Assume the optimal MDP Q-function: Q_MDP : S × A → ℜ
● Given belief b (sketched in code below, together with the entropy heuristic):

π_QMDP(b) = argmax_{a∈A} Σ_{s∈S} b(s) Q_MDP(s,a)

Entropy Heuristic (Cassandra, Kaelbling & Kurien, 1996)

● Given belief b:

π(b) = argmin_{a∈A} E_z[ H( p(s|a,z) ) ]   if H(b) > κ
π(b) = π_MDP( argmax_{s∈S} b(s) )          otherwise

Alternate Dual-Mode Heuristic

● Given belief b:

π(b) = argmax_{a∈A} [ H_κ(b) Σ_{s∈S} p(s) Q_CU(s,a) + (1 − H_κ(b)) Σ_{s∈S} p(s) Q_MDP(s,a) ]

Discretization

● Discretize belief space
● Convert to belief space MDP
● Solve using conventional DP
● Subject to the "Curse of Dimensionality"
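A minimal sketch of the Q-MDP rule and the entropy-based switch, assuming Q is a dense (n_states, n_actions) array of MDP Q-values; the information-gathering branch is left as a hypothetical stand-in argument, since evaluating argmin_a E_z[H(p(s|a,z))] requires the full transition and observation model.

```python
import numpy as np

def qmdp_action(b, Q):
    """Q-MDP: act greedily on the belief-averaged MDP Q-values,
    argmax_a sum_s b(s) Q(s, a)."""
    return int(np.argmax(b @ Q))            # Q: (n_states, n_actions)

def belief_entropy(b, eps=1e-12):
    """Shannon entropy H(b) of a belief vector."""
    return float(-np.sum(b * np.log(b + eps)))

def entropy_heuristic_action(b, Q, info_gathering_action, kappa):
    """Entropy heuristic (sketch): if the belief is too uncertain, take an
    information-gathering action; otherwise act as the MDP policy would
    in the most likely state."""
    if belief_entropy(b) > kappa:
        # stand-in for argmin_a E_z[ H(p(s|a, z)) ]
        return info_gathering_action
    return int(np.argmax(Q[int(np.argmax(b))]))
```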
Coastal Navigation

● Represent beliefs using b̃ = ⟨ argmax_{s∈S} b(s); H(b) ⟩

1. Discretize state-entropy space
2. Compute reward function and transition function
3. Solve belief state MDP

Model Parameters

● Reward function R(b̃):
– Back-project b̃ to the high-dimensional belief p(s)
– Compute expected reward from the belief (sketched in code below):

R(b̃) = E_b̃( R(s) ) = Σ_{s∈S} p(s) R(s)

[Figure: a belief distribution p(s) over states s1, s2, s3]
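As an illustration of this representation, here is a minimal sketch of the state-entropy compression and of the back-projected expected reward; the number of entropy bins, the normalization, and the function names are assumptions for illustration, not values from the lecture.

```python
import numpy as np

def compress_belief(b, n_entropy_bins=10):
    """Coastal-navigation style compression (sketch): represent a belief b
    by its most likely state and a discretized entropy value."""
    h = float(-np.sum(b * np.log(b + 1e-12)))            # H(b)
    h_max = max(np.log(len(b)), 1e-12)                   # maximum possible entropy
    h_bin = min(int(n_entropy_bins * h / h_max), n_entropy_bins - 1)
    return int(np.argmax(b)), h_bin                      # b~ = <argmax_s b(s); H(b)>

def expected_reward(p, R_state):
    """R(b~) = E[R(s)] = sum_s p(s) R(s), where p is the belief
    back-projected from the low-dimensional representation."""
    return float(p @ R_state)
```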
Model Parameters

● Use forward model (sketched in code below)

[Figure: forward model illustrated in the full-dimensional and low-dimensional belief spaces, contrasting a deterministic with a stochastic process; under an action a and observations z1, z2, a belief b̃_i maps to successor beliefs b̃_j, b̃_k, b̃_q]

T(b̃_i, a, b̃_j) ∝ p(z|s) b̃_i(s|a)   if b̃_j(s) = b̃_i(s|a,z)
                = 0                  otherwise
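A minimal sketch of how such a forward model can populate the low-dimensional transition function: apply the Bayes filter for every observation, project each posterior back onto the state-entropy grid (e.g. with a compression function like the one sketched earlier), and accumulate the observation probabilities. All function names and array layouts are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes filter: b'(s') ∝ p(z|s', a) * sum_s p(s'|s, a) b(s)."""
    pred = T[a].T @ b               # predicted belief after taking action a
    post = O[a][:, z] * pred        # weight by observation likelihood
    p_z = post.sum()                # p(z | b, a), the normalizer
    if p_z == 0.0:
        return None, 0.0            # observation z cannot occur from b under a
    return post / p_z, p_z

def low_dim_transitions(b, a, T, O, compress):
    """Forward model: project each observation outcome back onto the
    low-dimensional grid and accumulate T(b~, a, b~')."""
    trans = {}
    n_obs = O[a].shape[1]
    for z in range(n_obs):
        post, p_z = belief_update(b, a, z, T, O)
        if p_z > 0.0:
            key = compress(post)    # e.g. a (most-likely-state, entropy-bin) pair
            trans[key] = trans.get(key, 0.0) + p_z
    return trans
```

Note that although the Bayes filter is deterministic once (a, z) is fixed, the resulting low-dimensional belief process is stochastic because the observation z itself is random.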
What You Should Know

● Advantages and disadvantages to using POMDPs
● What an optimal policy looks like
● How to compute a policy
● Different heuristics and approximations: how they work and where they may fail