《博弈论》（英文版） lect 13 Repeated Games.pdf_大学文库

ote:OR actually allows for arbitrary state spaces. For most proofs, finite state spaces are enough. There is one exception, which I will note below Definition 2 Fix a normal-form game G. The minmax payoff for r player i, vi, is defined by t=mina-∈A,maxa∈A4t(a1,a-) We are not going to worry about implementing mixtures of action profiles in this lecture so I just remind you of the definition of enforceability Definition 3 Fix a normal-form game G. A payoff profile u E R is enforceable (resp strictly enforceable) if u;> v;(resp. ui>vi)for alliE N Perfect folk theorems The strategies in the proof of the Nash folk theorem( say, with discounting) call for indefinite punishment following a deviation from the equilibrium. This is fine in games such as the Prisoner's Dilemma, in which the minmax actions are actually an equilibrium(indeed, in PD, they are the only equilibrium! ) That is, the threat of indefinite punishment is indeed credible A D A2,31,6 D0.10,1 Figure 1: Indefinite Punishment is not credible However, consider the game in Figure 1. The Row player can hold the Column player down to his minimum payoff by choosing D, but D is strictly dominated for her. Hence, while (2, 3) can be enforced as a Nash equilibrium payoff profile in the infinitely repeated version of the game by assuming that players switch to(D, D) following a deviation, this would not work if we required subgame perfection, regardless of the payoff aggregation criterion A warm-up exercise: Perfect Folk Theorems for Limit-of-Means Thus, we must be smarter than that. a key intuition is that, after all, a deviator need not be punished forever, but only long enough to wipe out any gains from his deviation Formally, fix a game G=(N, (Ai, lilieN )and let M= maxienaea ui (a). Fix an outcome aof G; clearly, a one-time deviation by Player i cannot yield more than M-ui(a). Thus, if Player i deviates, we need only punish him so as to wipe out this payoff differential; clearly, how long this will be depends on the payoff aggregation criterion Intuitively, limit-of-means should make things very simple, because, following a deviation punishers face only a finite number of periods in which they must forego their equilibrium

Note: OR actually allows for arbitrary state spaces. For most proofs, finite state spaces are enough. There is one exception, which I will note below. Definition 2 Fix a normal-form game G. The minmax payoff for player i, vi , is defined by vi = mina−i∈A−i maxai∈Ai ui(ai , a−i). We are not going to worry about implementing mixtures of action profiles in this lecture, so I just remind you of the definition of enforceability: Definition 3 Fix a normal-form game G. A payoff profile u ∈ RN is enforceable (resp. strictly enforceable) if ui ≥ vi (resp. ui > vi) for all i ∈ N Perfect Folk Theorems The strategies in the proof of the Nash folk theorem (say, with discounting) call for indefinite punishment following a deviation from the equilibrium. This is fine in games such as the Prisoner’s Dilemma, in which the minmax actions are actually an equilibrium (indeed, in PD, they are the only equilibrium!). That is, the threat of indefinite punishment is indeed credible. A D A 2,3 1,6 D 0,1 0,1 Figure 1: Indefinite Punishment is not credible However, consider the game in Figure 1. The Row player can hold the Column player down to his minimum payoff by choosing D, but D is strictly dominated for her. Hence, while (2,3) can be enforced as a Nash equilibrium payoff profile in the infinitely repeated version of the game by assuming that players switch to (D,D) following a deviation, this would not work if we required subgame perfection, regardless of the payoff aggregation criterion. A warm-up exercise: Perfect Folk Theorems for Limit-of-Means Thus, we must be smarter than that. A key intuition is that, after all, a deviator need not be punished forever, but only long enough to wipe out any gains from his deviation. Formally, fix a game G = (N,(Ai , ui)i∈N ) and let M = maxi∈N,a∈A ui(a). Fix an outcome a ∗ of G; clearly, a one-time deviation by Player i cannot yield more than M −ui(a ∗ ). Thus, if Player i deviates, we need only punish him so as to wipe out this payoff differential; clearly, how long this will be depends on the payoff aggregation criterion. Intuitively, limit-of-means should make things very simple, because, following a deviation, punishers face only a finite number of periods in which they must forego their equilibrium 2

payoff stream in order to punish Player i. Under limit-of-means aggregation, finite periods do not matter, so punishers are indifferent between punishing and not punishing There is only one subtlety: of course, the same argument applies to Player i: no one-time deviation can be profitable for her! So, what exactly are we trying to deter? te The answer is that, although no finite deviation from the infinite repetition of a* is prof- itable for Player i, the following strategy could be. Suppose that, in the putative equilibrium we wish to construct, a deviation is followed by L rounds of punishment(you can think of L'=I if you wish). Thus, if Player i deviates once, she gets an extra payoff of (at most) M-ui(a), but then loses ui (a*)-vi utils in each of the subsequent L perioc Now suppose L is small, so that M-ui(a*>Lfu(a)-vi. For example, in the game of Figure 1, suppose that i is the Column player, that a*=(A, A), and that L= 1. Then a deviation yields 3 utils, whereas one round of punishment costs 2 utils. Then, Player i can dopt the following strategy: deviate, then play a best-response to p-i(in Figure 1, play D for L periods; then, as soon as play is supposed to return to a*, deviate and best-respond to the minmax profile, and so on. This is a profitable deviation. Observe that it is also a neat example of a game in which the one-deviation property holds Thus, we must choose L large enough so ViE N, M-u(a)< Lui(a)-vil In figure 1. it is enough to choose l To complete the argument, we must specify what happens if more than one player de- viates, or if somebody deviates from the punishment stage. As in the proof of the Nash Folk Theorem, multiple deviations lead to the lowest-index player being punished(they will not occur in equilibrium anyway, so players cannot hope to get away with deviating because somebody else will deviate, too Finally, if one or more punishers fail to punish, we simply disregard these further(un- profitable) deviations: again, in equilibrium they are not going to occur, so players cannot count on them to improve their predicament after a first deviation This Proposition 0.1 OR, Proposition 146.2 Fix a game G. Any feasible, strictly enforce- able payoff profile of G is a subgame-perfect equilibrium payoff profile of the limit-of-means infinitely repeated version of G Machines You will undoubtedly notice that describing these strategies verbally is awkward; doing so formally(as we have tried to do in the proof of the Nash Folk theorem) is even worse Machines can help. For instance, here is a set of machines that players can use to implement the strategies in the proof of Proposition 0.1: Player i uses machine M (Q, q, fi, T)(where Q, q, T are common to all players) defined as follows

payoff stream in order to punish Player i. Under limit-of-means aggregation, finite periods do not matter, so punishers are indifferent between punishing and not punishing. There is only one subtlety: of course, the same argument applies to Player i: no one-time deviation can be profitable for her ! So, what exactly are we trying to deter? The answer is that, although no finite deviation from the infinite repetition of a ∗ is profitable for Player i, the following strategy could be. Suppose that, in the putative equilibrium we wish to construct, a deviation is followed by L rounds of punishment (you can think of L 0 = 1 if you wish). Thus, if Player i deviates once, she gets an extra payoff of (at most) M − ui(a ∗ ), but then loses ui(a ∗ ) − vi utils in each of the subsequent L periods. Now suppose L is small, so that M − ui(a ∗ ) > L[ui(a ∗ ) − vi ]. For example, in the game of Figure 1, suppose that i is the Column player, that a ∗ = (A, A), and that L = 1. Then a deviation yields 3 utils, whereas one round of punishment costs 2 utils. Then, Player i can adopt the following strategy: deviate, then play a best-response to p−i (in Figure 1, play D) for L periods; then, as soon as play is supposed to return to a ∗ , deviate and best-respond to the minmax profile, and so on. This is a profitable deviation. [Observe that it is also a neat example of a game in which the one-deviation property holds!] Thus, we must choose L large enough so ∀i ∈ N, M − ui(a ∗ ) < L[ui(a ∗ ) − vi ] In Figure 1, it is enough to choose L 0 = 2. To complete the argument, we must specify what happens if more than one player deviates, or if somebody deviates from the punishment stage. As in the proof of the Nash Folk Theorem, multiple deviations lead to the lowest-index player being punished (they will not occur in equilibrium anyway, so players cannot hope to get away with deviating because somebody else will deviate, too). Finally, if one or more punishers fail to punish, we simply disregard these further (unprofitable) deviations: again, in equilibrium they are not going to occur, so players cannot count on them to improve their predicament after a first deviation. This proves: Proposition 0.1 [OR, Proposition 146.2] Fix a game G. Any feasible, strictly enforceable payoff profile of G is a subgame-perfect equilibrium payoff profile of the limit-of-means infinitely repeated version of G. Machines You will undoubtedly notice that describing these strategies verbally is awkward; doing so formally (as we have tried to do in the proof of the Nash Folk theorem) is even worse. Machines can help. For instance, here is a set of machines that players can use to implement the strategies in the proof of Proposition 0.1: Player i uses machine Mi = (Q, q0 , fi , τ ) (where Q, q0 , τ are common to all players) defined as follows. 3

First, Q=N, P(,t)). N is the normal state, in which a* is played. P(, t )is the state in which i is punished, and t more rounds of punishment are required Second, q=N. Third, fi(N)=a*, fi(P(, t))=p-ii if jti, and f, (P(, t)=ri(p-i This should be obvious Finally, T(, ) is such that we remain in N if nobody deviates, we switch from N to P(, L)if j is the lowest-index deviator, we always move from P(, t)to P(, t-1)if t+0 and we always move from P(, O) back to N Easy Discounting The strategy profile thus constructed is not a subgame-perfect equilibrium of the game in Figure 1 if payoffs are aggregated via discounting, for any discount factor. Suppose the Column player deviates: then the Row player is supposed to choose D for 2 periods, and hence receive a How payoff of 0 units. She will then receive 2 forever after. However, if she deviates to A, nothing happens: the Column player continues to choose D, so again the row player can choose A. Obviously, she prefers(1, 1, 2, 2, .. )to(0,0, 2, 2,...)! Thus, the key intuition is that we must somehow ensure that punishers are willing te punish. In principle, one could think of punishing punishers, but this may fail to work with discounting: essentially, second deviations might require longer punishment periods(because the burden of carrying out the first punishment lasts for L periods, and not just one), third deviations might require even longer punishments, and so on. This is certainly the case in the game of Figure 1 The alternative is to reward punishers. This leads to OR's Proposition 151.1 o I am only going to offer a few comments on the proof. The"conciliation"states C()serve reward punishers, in a way. However, this is subtle: we never go back to the Nash state C(O). What happens is, if i deviates and i punishes him, then after punishment we move to C(, which i prefers to C(i). Otherwise, we go to C(i) and stay there until somebody else deviates.In my opinion, this is also a punishment of sorts, but of course you are welcome to differ Also the remark about the first condition on d being sufficient has to do with the fact that, after L periods of punishment, we move to C(): since ui(a())<u(a())by assump- tion, this is actually a further punishment(which, er, actually reinforces my interpretation but never mind that). The point is that this punishment may not be enough, or may come 1 Again, let me repeat this because I first got it wrong in class(but then you guys spotted me!):whenever we are in state C(), we remain there if somebody deviates, and move to the state P(k, L)in which Player a deviation by Player j is the threat of further punishment; a() need not be a Nash equilibrium peps after k's punishment begins if k deviates from a(). Thus, what supports a() as a continuation equilibrium

First, Q = {N, P(j, t)}. N is the normal state, in which a ∗ is played. P(j, t) is the state in which j is punished, and t more rounds of punishment are required. Second, q 0 = N. Third, fi(N) = a ∗ i , fi(P(j, t)) = p−j,i if j 6= i, and fj (P(j, t)) = rj (p−j ). This should be obvious. Finally, τ (·, ·) is such that we remain in N if nobody deviates, we switch from N to P(j, L) if j is the lowest-index deviator, we always move from P(j, t) to P(j, t − 1) if t 6= 0, and we always move from P(j, 0) back to N. Easy! Discounting The strategy profile thus constructed is not a subgame-perfect equilibrium of the game in Figure 1 if payoffs are aggregated via discounting, for any discount factor. Suppose the Column player deviates: then the Row player is supposed to choose D for 2 periods, and hence receive a flow payoff of 0 units. She will then receive 2 forever after. However, if she deviates to A, nothing happens: the Column player continues to choose D, so again the Row player can choose A. Obviously, she prefers (1,1,2,2,. . .) to (0,0,2,2,. . .)! Thus, the key intuition is that we must somehow ensure that punishers are willing to punish. In principle, one could think of punishing punishers, but this may fail to work with discounting: essentially, second deviations might require longer punishment periods (because the burden of carrying out the first punishment lasts for L periods, and not just one), third deviations might require even longer punishments, and so on. This is certainly the case in the game of Figure 1. The alternative is to reward punishers. This leads to OR’s Proposition 151.1. I am only going to offer a few comments on the proof. The “conciliation” states C(j) serve to reward punishers, in a way. However, this is subtle: we never go back to the Nash state C(0). What happens is, if j deviates and i punishes him, then after punishment we move to C(j), which i prefers to C(i). Otherwise, we go to C(i) and stay there until somebody else deviates.1 In my opinion, this is also a punishment of sorts, but of course you are welcome to differ. Also: the remark about the first condition on δ being sufficient has to do with the fact that, after L periods of punishment, we move to C(j); since ui(a(j)) < ui(a(0)) by assumption, this is actually a further punishment (which, er, actually reinforces my interpretation, but never mind that). The point is that this punishment may not be enough, or may come 1Again, let me repeat this because I first got it wrong in class (but then you guys spotted me!): whenever we are in state C(j), we remain there if somebody deviates, and move to the state P(k, L) in which Player k’s punishment begins if k deviates from a(j). Thus, what supports a(j) as a continuation equilibrium after a deviation by Player j is the threat of further punishment; a(j) need not be a Nash equilibrium per se. 4