Eco514 — Game Theory
Lecture 12: Repeated Games (1)
Marciano Siniscalchi
October 26, 1999

Introduction

[By and large, I will follow OR, Chap. 8, so I will keep these notes to a minimum.]

The theory of repeated games is a double-edged sword. On one hand, it indicates how payoff profiles that are not consistent with Nash equilibrium in a simultaneous-move game might be achieved when the latter is played repeatedly, in a manner consistent with Nash or even subgame-perfect equilibrium. On the other hand, it shows that essentially every individually rational payoff profile can be thus achieved if the game is repeated indefinitely (a similar, but somewhat more restrictive, result holds for finitely repeated games). Thus, the theory has little predictive power.

[To make matters worse, the expression "repeated-game phenomenon" is often invoked to account for the occurrence of a non-equilibrium outcome in real-world strategic interactions. But, as usual, the theory refers to a clearly specified, highly stylized situation, which may or may not approximate the actual strategic interaction.]

OR emphasize the structure of the equilibria supporting these outcomes, rather than the set of attainable outcomes themselves, as the most interesting product of the theory.

Payoff Aggregation Criteria

Definition 1 Consider a normal-form game $G = (N, (A_i, u_i)_{i \in N})$. The $T$-repeated game ($T \le \infty$) induced by $G$ is the perfect-information game $\Gamma = (N, A, H, Z, P, (\succsim_i)_{i \in N})$ where:
(i) $A = \bigcup_{i \in N} A_i$ is the set of actions available to players in $G$;
(ii) $H$ is the set of sequences of length at most $T$ of elements of $\prod_{i \in N} A_i$;
(iii) $P(h) = N$;
(iv) $\succsim_i$ satisfies weak separability: for any sequence $(a^t)_{t=1}^T$ and profiles $a, a' \in \prod_{i \in N} A_i$, $u_i(a) \ge u_i(a')$ implies
\[ (a^1, \dots, a^{t-1}, a, a^{t+1}, \dots) \succsim_i (a^1, \dots, a^{t-1}, a', a^{t+1}, \dots). \]
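
A one-line illustration of what weak separability requires (my numbers, not OR's): in a Prisoner's Dilemma with $u_1(C,C) = 2 \ge 1 = u_1(D,D)$, condition (iv) forces
\[ (a^1, \dots, a^{t-1}, (C,C), a^{t+1}, \dots) \succsim_1 (a^1, \dots, a^{t-1}, (D,D), a^{t+1}, \dots) \]
for every date $t$ and every surrounding stream: improving the stage-game payoff at a single date, holding everything else fixed, can never hurt. All three criteria discussed below (discounting, limit-of-means, overtaking) satisfy this.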

The definition uses preference relations instead of utility functions. This is to accommodate payoff aggregation criteria which do not admit an expected-utility representation. I will only illustrate the key features of the two "special" criteria proposed in OR, and then focus on discounting. I will mostly discuss infinitely repeated games.

Discounting

In general, we assume that players share a common discount factor $\delta \in (0,1)$, and rank payoff streams according to the rule
\[ (u_i^t)_{t \ge 1} \succ_i (w_i^t)_{t \ge 1} \iff \sum_{t \ge 1} \delta^{t-1} (u_i^t - w_i^t) > 0. \]

Now, we want to be able to talk about payoff profiles of the stage game $G$ as being achieved in the repeated version of $G$. Thus, for instance, if the profile $a \in A$ is played in each repetition of the game, we want to say that the payoff profile $u(a)$ is attained. But, of course, this is not the discounted value of the stream $(u(a), u(a), \dots)$: rather, this value is $\frac{u(a)}{1-\delta}$.

Thus, one usually "measures" payoffs in a repeated game in "discounted units", i.e. in terms of the discounted value of a constant payoff stream $(1, 1, \dots)$, which is of course $\frac{1}{1-\delta}$. The bottom line is that, given the terminal history $(a^1, \dots, a^t, \dots)$, the corresponding payoff profile is taken to be $(1-\delta) \sum_{t \ge 1} \delta^{t-1} u(a^t)$.

Limit-of-Means

The key feature (OR: "the problem") of discounting is that periods are weighted differently. One might think that, in certain situations, this might be unrealistic: OR propose the example of nationalist struggles.

A criterion which treats periods symmetrically is the limit of means. Ideally, we would want the value of a stream $(u_i^t)_{t \ge 1}$ to Player $i$ to be $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T u_i^t$. However, this limit may fail to exist.¹ Thus, we consider the following rule:
\[ (u_i^t)_{t \ge 1} \succ_i (w_i^t)_{t \ge 1} \iff \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^T (u_i^t - w_i^t) > 0. \]

[Recall how $\liminf_n x_n$ is defined: let $y_n = \inf\{x_m : m \ge n\}$, and set $\liminf_n x_n = \lim_n y_n$. Thus, $\liminf_n x_n > 0$ iff, for some $\epsilon > 0$, $x_n > \epsilon$ for all but finitely many $n$'s.]

¹ Here's how to construct an example: call $V_i^T$ the average up to $T$. First, set $u_i^1 = 1$, so $V_i^1 = 1$. Now consider the sequence $(1,0,0)$: $V_i^3 = \frac{1}{3}$. Now extend it to $(1,0,0,1,1,1)$: $V_i^6 = \frac{2}{3}$. Now consider $(1,0,0,1,1,1,0,0,0,0,0,0)$: $V_i^{12} = \frac{1}{3}$. You see how you can construct a sequence such that the sequence of averages flips between $\frac{1}{3}$ and $\frac{2}{3}$.
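
The footnote's construction is easy to check numerically. Here is a minimal sketch (the doubling rule for block lengths is my extrapolation of the footnote's pattern):

    # Running averages oscillate between 1/3 and 2/3, so the limit
    # of means does not exist. Start from (1,0,0), then repeatedly
    # append a block as long as the current sequence, alternating
    # between 1s and 0s.
    seq, value = [1, 0, 0], 1
    for _ in range(6):
        seq += [value] * len(seq)
        value = 1 - value
    avg = 0.0
    for t, u in enumerate(seq, start=1):
        avg += (u - avg) / t                # incremental running mean
        if t in (3, 6, 12, 24, 48, 96, 192):
            print(t, round(avg, 4))         # 1/3, 2/3, 1/3, 2/3, ...

The lim inf of the averages is $\frac{1}{3}$ and the lim sup is $\frac{2}{3}$, which is precisely why the criterion above is stated in terms of $\liminf$.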

Now, under the limit-of-means criterion, the stream $(0, 0, 0, \dots, 0, 2, 2, 2, \dots)$ will always be preferred to $(1, 1, \dots)$, whereas, under discounting with low enough $\delta$, the reverse is true. Conversely, the limit-of-means criterion does not distinguish between $(-1, 1, 0, 0, \dots)$ and $(0, 0, \dots)$, whereas, for all $\delta \in (0,1)$, the latter stream is preferred to the former.

Note that, for any pair of streams, there exists a discount factor such that the ordering of those two streams according to the limit-of-means and discounting criteria is the same (but you need to pick a different $\delta$ for every pair of streams!).

Overtaking

The limit-of-means criterion completely disregards finite histories. Thus, for instance, it does not distinguish between $(-1, 2, 0, 0, \dots)$ and $(-1, 1, 0, 0, \dots)$. This might seem a bit extreme. Thus, consider the following variant:
\[ (u_i^t)_{t \ge 1} \succ_i (w_i^t)_{t \ge 1} \iff \liminf_{T \to \infty} \sum_{t=1}^T (u_i^t - w_i^t) > 0. \]
Then, according to the overtaking criterion, a stream is preferred to another if it eventually yields a higher cumulative payoff. In particular, $(-1, 2, 0, 0, \dots)$ is preferred to $(-1, 1, 0, 0, \dots)$.

In some sense, the overtaking criterion is "the best of both worlds": it treats all periods symmetrically, and it does give some weight to finite histories. Having said that, we shall forget about any criterion other than discounting!

Machines

Let me just remind you of the definition from OR. We do not really need machines in this lecture.

Definition 2 Fix a normal-form game $G$. A machine for Player $i \in N$ is a tuple $M_i = (Q_i, q_i^0, f_i, \tau_i)$, where:
(i) $Q_i$ is a finite set (whose elements should be thought of as labels);
(ii) $q_i^0 \in Q_i$ is the initial state of the machine;
(iii) $f_i : Q_i \to A_i$ is the action function: it specifies what Player $i$ does at each state; and
(iv) $\tau_i : Q_i \times A \to Q_i$ is the transition function: if action $a \in A$ is played and Player $i$'s machine state is $q_i \in Q_i$, then at the next stage Player $i$'s machine state will be $\tau_i(q_i, a)$.

Note that every machine defines a strategy, but not conversely. This is clear: a strategy is a sort of "hyper-machine" which has as states the set of non-terminal histories; but of course there are infinitely many such histories, so we cannot "shoehorn" strategies into our definition of machines.
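
As a concrete instance of Definition 2, here is a minimal sketch (my own illustration, not from OR) of a two-state machine implementing the grim-trigger strategy in a Prisoner's Dilemma with actions C and D; the class layout and the simulated opponent are assumptions made for the example:

    # A machine per Definition 2 (the states are the dict keys):
    # initial state q0, action function f : Q -> A_i, and transition
    # function tau : Q x A -> Q, where A is the set of action profiles.
    class Machine:
        def __init__(self, q0, f, tau):
            self.q, self.f, self.tau = q0, f, tau

        def act(self):
            return self.f[self.q]

        def update(self, profile):
            self.q = self.tau[(self.q, profile)]

    profiles = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]
    f = {'coop': 'C', 'punish': 'D'}
    tau = {}
    for p in profiles:
        tau[('coop', p)] = 'coop' if p == ('C', 'C') else 'punish'
        tau[('punish', p)] = 'punish'       # punishment is absorbing

    grim = Machine('coop', f, tau)
    for t, opp in enumerate(['C', 'C', 'D', 'C', 'C'], start=1):
        me = grim.act()
        print(t, me, opp)                   # C until the deviation at
        grim.update((me, opp))              # t = 3 is observed, then D

Note the contrast with the last paragraph above: this machine has two states, while the strategy it implements is formally a map defined on the infinite set of non-terminal histories.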

Strategies implemented by machines are Markovian; informally, they are "simple" strategies.

Nash Folk Theorem for the Discounting Criterion

Definition 3 Fix a normal-form game $G$. The minmax payoff for player $i$, $v_i$, is defined by
\[ v_i = \min_{a_{-i} \in A_{-i}} \max_{a_i \in A_i} u_i(a_i, a_{-i}). \]

Definition 4 Fix a normal-form game $G$. A payoff profile $u \in \mathbb{R}^N$ is feasible if $u = \sum_{a \in A} \frac{\beta_a}{\gamma} u(a)$ for some set of rational coefficients $(\gamma, (\beta_a)_{a \in A})$ with $\sum_{a \in A} \beta_a = \gamma$. It is enforceable (resp. strictly enforceable) if $u_i \ge v_i$ (resp. $u_i > v_i$) for all $i \in N$.

It should be clear that no payoff profile $w$ such that $w_i < v_i$ for some player $i \in N$ can be achieved in a Nash equilibrium of the infinite repetition of $G$.

Proposition 0.1 [OR 144.1] If $s$ is a Nash equilibrium of the infinitely repeated game $\Gamma$, then the resulting payoff profile is enforceable.

The proof is in OR. In particular, in a two-player game, if the machine $M_2 = (Q_2, q_2^0, f_2, \tau_2)$ implements a strategy for Player 2, then the machine $M_1 = (Q_2, q_2^0, f_1, \tau_2)$ with $f_1(q_2) \in r_1(f_2(q_2))$ (where $r_1$ is Player 1's best-reply correspondence) yields at least $v_1$ to Player 1. This is OR Exercise [144.2].

The main result in this lecture is known as the Nash Folk Theorem with Discounting.

Proposition 0.2 [OR 145.2] For any feasible, strictly enforceable payoff profile $w$, and for every $\epsilon > 0$, there exist $\bar{\delta} < 1$ and a feasible, strictly enforceable payoff profile $w'$ such that $|w - w'| < \epsilon$ and, for all $\delta > \bar{\delta}$, $w'$ is a Nash equilibrium payoff profile of $\Gamma$.

The intuition is easiest to grasp if $w = u(a)$ for some $a \in \prod_{i \in N} A_i$: for each player, we construct a strategy which conforms to $a$ as long as no other player has deviated, and plays an action which minmaxes the first deviator at every history following a deviation. For a high enough discount factor, the benefit from a single-stage deviation is outweighed by the fact that the deviator receives his minmax payoff forever after.

For arbitrary feasible payoffs, we must use cycles. If the payoff aggregation criterion were limit-of-means, we would be home free, because (1) the value of a finite stream in which each $u_i(a)$ is repeated exactly $\beta_a$ times is precisely $\sum_{a \in A} \frac{\beta_a}{\gamma} u_i(a) = w_i$; and (2) the value of an infinite repetition of the finite stream just defined is again $w_i$. This is OR, Theorem 144.3. However, for discounting we need to do some extra bookkeeping. The value of the finite stream defined in (1) is not exactly $w$, but it will be close to $w$ for high enough $\delta$: this will be our $w'$. The trick is then to measure time in terms of $\gamma$-cycles, redefining the discount factor appropriately.
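
As a sanity check on Definitions 3 and 4, here is a short sketch computing pure-action minmax payoffs in a two-player example (the payoff matrices are my own; the sketch reads Definition 3 with pure actions only):

    # v_i = min over a_{-i} of max over a_i of u_i(a_i, a_{-i}).
    # Rows are Player 1's actions, columns Player 2's: a Prisoner's
    # Dilemma with payoffs 2 (mutual C), 3/0 (unilateral D), 1 (mutual D).
    u1 = [[2, 0],
          [3, 1]]
    u2 = [[2, 3],
          [0, 1]]

    # Player 2 picks the column that makes Player 1's best row payoff
    # as small as possible, and symmetrically for Player 2.
    v1 = min(max(u1[r][c] for r in range(2)) for c in range(2))
    v2 = min(max(u2[r][c] for c in range(2)) for r in range(2))
    print(v1, v2)   # 1 1: mutual defection minmaxes both players

In this game the feasible payoff profiles are the rational convex combinations of $(2,2)$, $(3,0)$, $(0,3)$ and $(1,1)$, and the strictly enforceable ones are those with both coordinates above $1$; so, for instance, $w = (2,2)$ satisfies the hypotheses of Proposition 0.2.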

Proof: Write $\prod_{i \in N} A_i = \{a(1), \dots, a(K)\}$, where $K = |\prod_{i \in N} A_i|$. Let $\sigma$ be a sequence consisting of $\beta_{a(1)}$ repetitions of $a(1)$, followed by $\beta_{a(2)}$ repetitions of $a(2)$, ..., and finally by $\beta_{a(K)}$ repetitions of $a(K)$. Define
\[ w_i''(\delta) = \sum_{k=1}^K \sum_{\ell=1}^{\beta_{a(k)}} \delta^{\sum_{k'=1}^{k-1} \beta_{a(k')} + \ell - 1} u_i(a(k)), \]
i.e. the discounted value of the stream of payoffs for Player $i$ generated by the finite sequence $\sigma$.² Essentially, the $\ell$-th repetition of $u_i(a(1))$ is discounted by $\delta^{\ell-1}$; the $\ell$-th repetition of $u_i(a(2))$ is discounted by $\delta^{\beta_{a(1)} + \ell - 1}$, because the $u_i(a(2))$'s follow $\beta_{a(1)}$ repetitions of $u_i(a(1))$; and so on.

Now observe that the value of the payoff stream generated by a history $h = (\sigma, \sigma)$ is $w_i''(\delta) + \delta^{\gamma} w_i''(\delta)$; thus, the value of the payoff stream along the infinite sequence $z = (\sigma, \sigma, \dots)$ is $\frac{1-\delta}{1-\delta^{\gamma}} w_i''(\delta)$; recall that we measure values in terms of the discounted value of a constant stream $(1, 1, \dots)$, so we need to multiply the discounted sum by $(1-\delta)$. Denote the latter quantity by $w_i'(\delta)$.

Repeat this construction for every $i \in N$ to obtain a profile $w'(\delta)$. Clearly, for every $\epsilon > 0$, there exists $\delta^1$ such that $\delta > \delta^1$ implies that $|w - w'(\delta)| < \epsilon$.

We now construct a strategy $s_i$ for each player $i \in N$. To deter deviations, fix, for each player $j \in N$, a minmaxing action profile
\[ p_{-j} \in \arg\min_{a_{-j} \in A_{-j}} \max_{a_j \in A_j} u_j(a_j, a_{-j}). \]
For every $i \ne j$, denote by $p_{-j,i}$ the $i$-th component of $p_{-j}$.

Consider a player $i \in N$. For each finite history $h$ which consists of finitely many repetitions of $\sigma$, followed by an initial subsequence of $\sigma$, Player $i$ plays the action she is supposed to play next according to $\sigma$. These actions determine the equilibrium path.

Now consider a finite history $h$ which is off the equilibrium path. Then there exists a subhistory $h'$ of $h$ such that $h'$ is on the equilibrium path (i.e. players conform to $\sigma$) but there exists at least one player $j \ne i$ such that $s_j(h')$ is not according to $\sigma$ (if there are multiple deviators, choose the one with the lowest index). Then, at $h$, Player $i$ plays $p_{-j,i}$.

Finally, at any history $h$ at which Player $i$ must be punished (i.e. there exists a subhistory $h'$ of $h$ on the equilibrium path such that $i$ is the lowest-indexed deviator), choose $s_i(h) \in r_i(p_{-i})$.³

² We could significantly simplify the exponent of $\delta$, but I choose to be explicit so you see what is really going on.
³ Observe that, as soon as a first deviation occurs, the lowest-indexed deviator is punished at each continuation history; in particular, if some player later fails to punish the deviator, nothing happens: the prescribed strategy is still to minmax the original deviator.
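
The bookkeeping in this step is easy to verify numerically. A minimal sketch (the cycle data are made up: stage payoffs 2 and 0 with $\beta = (1, 2)$, so $\gamma = 3$ and the target average is $w_i = \frac{2}{3}$):

    # w''_i(d): discounted value of one gamma-cycle, as in the proof;
    # the exponent of d counts the periods already elapsed.
    payoffs = [2, 0]       # u_i(a(1)), u_i(a(2))
    betas   = [1, 2]       # beta_{a(1)}, beta_{a(2)}
    gamma   = sum(betas)

    def w2(d):
        total, t = 0.0, 0
        for u, b in zip(payoffs, betas):
            for _ in range(b):
                total += d**t * u
                t += 1
        return total

    for d in (0.5, 0.9, 0.99, 0.999):
        wprime = (1 - d) / (1 - d**gamma) * w2(d)   # normalized value
        print(d, round(wprime, 5))                  # -> 2/3 as d -> 1

This is exactly the $\delta^1$ step: given $\epsilon > 0$, any $\delta$ close enough to $1$ puts $w'(\delta)$ within $\epsilon$ of $w$.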

Finally, to see that no player has an incentive to deviate from $s$, fix any history $h$ on the equilibrium path. If Player $i$ deviates at $h$, she obtains $\max_{a_i \in A_i} u_i(a_i, s_{-i}(h)) + \frac{\delta}{1-\delta} v_i$, because her opponents will minmax her, and she will best-respond to $p_{-i}$.⁴ Clearly, there exists $\delta^2$ such that $\delta > \delta^2$ implies that the deviation is not profitable at $h$. And, since $h$ is on the equilibrium path, it follows that the deviation is not profitable ex-ante.

Thus, for $\delta > \bar{\delta} = \max(\delta^1, \delta^2)$, $w'(\delta)$ is a Nash equilibrium payoff profile, and $|w'(\delta) - w| < \epsilon$, as required. ∎

You can see that the specification of the strategies is rather awkward. Machines greatly simplify the task, as I will ask you to verify in the next problem set.

⁴ Note the role of the equilibrium assumption: in principle, Player $i$ could hope that a lower-indexed player $j$ also deviates at $h$, but in equilibrium this does not happen. Hence, she expects to be the only deviator at $h$.
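
To see the $\delta^2$ step in the simplest case (my numbers: the Prisoner's Dilemma used earlier, with $w = u(C,C) = (2,2)$, a best deviation payoff of $3$, and minmax payoff $v_i = 1$), a deviation at an on-path history is unprofitable iff
\[ 3 + \frac{\delta}{1-\delta} \cdot 1 \;\le\; \frac{2}{1-\delta} \iff 3(1-\delta) + \delta \le 2 \iff \delta \ge \frac{1}{2}, \]
so $\delta^2 = \frac{1}{2}$ works; and since a one-profile "cycle" gives $w'(\delta) = w$ exactly, $\delta^1$ imposes no further restriction here.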
