Lecture 5. Greedy algorithm Huffman Codes Huffman codes用于数据压缩一般可以达到 · Huffman code 压缩20%到90%的效果。 Matroid(拟阵) 什么是编码:编码实质上是用给定的一组 符号集的字符串到另外一组符号集的映 射。 例如用{0,1}字符串到所有英文字符{abc…} 的映射的8位码 清华大学們学院宋恒 清华大学软件学院宋 Prefix codes ·我们只考虑任一个编码字不是另一个编码 Fixed-length codeword 000 001 010 011 100 101 的前缀的编码。 ariable-length codeword 0 101 ·这样的编码叫 Prefix Codes Figure 163 A character-coding problem. A data 下页给出几种编码例子 file of 100,000 characters contains only the characters a-f, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300.000 bits. Using the variable-length code shown, the file can be 清华大学学院未恒 编码表示方式 Prefix codes are desirabl because they simplify decoding 利用树来表示,下图表示等长编码 示例:在上图非定长编码下 0001110101唯一代表字符串 aaaeab 清华大学轼字院宋恒 消华大学软件学院宋
1 清华大学软件学院宋斌恒 1 Lecture 5. Greedy Algorithm •Huffman code •Matroid(拟阵) 清华大学软件学院宋斌恒 2 Huffman Codes •Huffman codes用于数据压缩一般可以达到 压缩20%到90%的效果。 •什么是编码:编码实质上是用给定的一组 符号集的字符串到另外一组符号集的映 射。 •例如用{0,1}字符串到所有英文字符{a,b,c,… } 的映射的8位码 清华大学软件学院宋斌恒 3 Prefix codes •我们只考虑任一个编码字不是另一个编码 字的前缀的编码。 •这样的编码叫Prefix Codes。 •下页给出几种编码例子: a b c d e f Frequency (in thousands) 45 13 12 16 9 5 Fixed-length codeword 000 001 010 011 100 101 Variable-length codeword 0 101 100 111 1101 1100 Figure 16.3 A character-coding problem. A data file of 100,000 characters contains only the characters a-f, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300,000 bits. Using the variable-length code shown, the file can be encoded in 224,000 bits. 清华大学软件学院宋斌恒 5 Prefix codes are desirable because they simplify decoding. •示例:在上图非定长编码下,串 0001110101唯一代表字符串aaaeab 清华大学软件学院宋斌恒 6 编码表示方式 •利用树来表示,下图表示等长编码: 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14
Given a tree T corresponding to a prefix code, le matter to compute the number of bits An optimal code for a file is quired to encode a file. Let c denote the character always represented by a full in alphabet C, f()denote the frequency of c in the file, drc)the depth of c in the tree. The binary tree number of bits required to encode a file is thus wwhich we define as the cost of the tree T. 清华大学們学院宋恒 清华大学软件学院宋 Constructing a Huffman code Huffman invented a greedy algorithm that 9 constructs an optimal prefix code called a Huffman code c12D3 下面我们给出 Huffman code算法的伪代码 Figure 6. 4. Archs ear res laden to the ceding sehe ans its 清华大学软件学院宋 1. Huffman (C) 2.Q∈c∥ Q is a priority queue=newQ(C) 3. For ie 1 to n-1 (a)[15[e 3. Right(zjyeQ Extract-ing 4. Return Q Extract- Mino 清华大学轼字院宋恒 消华大学软件学院宋
2 清华大学软件学院宋斌恒 7 An optimal code for a file is always represented by a full binary tree. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 清华大学软件学院宋斌恒 8 Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. Let c denote the character in alphabet C, f(c) denote the frequency of c in the file, dT (c) the depth of c in the tree. The number of bits required to encode a file is thus cost . ( ) ( ) ( ) which we define as the of the tree T B T f c d c c C å T Î = Figure 16.4 Trees corresponding to the coding schemes in Figure 16.3. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its subtree. (a)The tree corresponding to the fixed-length code a=000,…,f=101.(b)The tree corresponding to the optimal prefix code a=0,b=101,…,f=1100. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 10 Constructing a Huffman code •Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman code. •下面我们给出Huffman code算法的伪代码 清华大学软件学院宋斌恒 11 1. Huffman(C) 1. nfl|C| 2. QflC //Q is a priority queue =new Q(C) 3. For ifl1 to n-1 1. Do allocate a new node z 2. Left[z]flxflQ.Extract-Min() 3. Right[z]flyflQ.Extract-Min() 4. f[z]flf[x]+f[y] 5. Q.insert(z) 4. Return Q.Extract-Min() 清华大学软件学院宋斌恒 12 f:5 e:9 c:12 b:13 d:16 a:45
t)[e2][13 a45 清华大学們学院宋恒 清华大学软件学院宋 c1a]Q L] 清华大学学院未恒 清华大学软件学院宋 最小频率者最优编码可等长 Let c be an alphabet in which each character c C has frequency fIc]. Let x and y be tw characters in C having the lowest frequencie Then there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only in the last bit 清华大学轼字院宋恒 消华大学软件学院宋
3 清华大学软件学院宋斌恒 13 14 c:12 b:13 f:5 e:9 d:16 a:45 清华大学软件学院宋斌恒 14 14 d:16 a:45 f:5 e:9 25 c:12 b:13 清华大学软件学院宋斌恒 15 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 16 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 17 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学软件学院宋斌恒 18 最小频率者最优编码可等长 •Let C be an alphabet in which each character c ∈ C has frequency f [c].Let x and y be two characters in C having the lowest frequencies. • Then there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only in the last bit
证明 ·设丁是任意一个最优编码对应的树,a,b是 深度最深的一对兄弟【为什么存在?】 然后交换x、a;y、b;【见下图】 ·T'是最优编码对应的树 Q“ 清华大学們学院宋恒 清华大学软件学院宋 上图不同树的成本之差 Huffman的贪心选择律 B(T)-B(T)=∑fc4(c)-∑f(c)dr(c) ·设C是给定的字符集,f是C中元素c的频 =fx}(x)+flka)-f[x41(x)-fl}1(a) 率。设x,y是C中频率最小的两个元素。令 C=C-{xy}+{z},其中z是一个新的元素 fd,(x)+falr(a-x], (a)-fad,(x) 即z不属于C。我们在C上更新f,使得 Ga-fd,(a)-d,(x) f2]=fx]y]其它值不变。则如果T是C上 的任意一个最优前缀编码对应的树,则把z > 用一棵由x和y为叶子点的树代替而生成的 新树T是C上的一个最优编码对应的树 清华大学学院未恒 清华大学软件学院宋 证明提示 Huffman编码小结 1.树T和T编码成本关系B(TB(T)=f×]+y] ·前缀码 2.反证法,假设T不是最优编码树→ Exists 最优码的特征 T such that BT)<B), which has x and 最小频率的优化码可最长化 y are the deepest siblings, Construct T" to represents a prefix code for ·贪婪选择律 C+T"is a better coding than t,a 贪婪算法 contradiction 清华大学轼字院宋恒 消华大学软件学院宋
4 清华大学软件学院宋斌恒 19 证明 •设T是任意一个最优编码对应的树,a,b是 深度最深的一对兄弟【为什么存在?】, 然后交换x、a;y、b;【见下图】 •T’’是最优编码对应的树 清华大学软件学院宋斌恒 20 y a b x y x b a a b x y 清华大学软件学院宋斌恒 21 上图不同树的成本之差 å å Î Î - = - c C T c C T B(T) B(T') f (c)d (c) f (c)d (c) ' f[x]d (x) f [a]d (a ) f [x]d (x) f[a]d (a ) = T + T - T ' - T ' ( f[a] f[x])(d (a ) d (x)) = - T - T f[x]d (x) f [a ]d (a ) f[x]d (a) f [a]d (x) = T + T - T - T ³ 0, 清华大学软件学院宋斌恒 22 Huffman的贪心选择律 •设C是给定的字符集,f[c]是C中元素c的频 率。设x,y是C中频率最小的两个元素。令 C’=C-{x,y}+{z},其中z是一个新的元素, 即z不属于C。我们在C’上更新f,使得 f[z]=f[x]+f[y],其它值不变。则如果T’是C’上 的任意一个最优前缀编码对应的树,则把z 用一棵由x和y为叶子点的树代替而生成的 新树T是C上的一个最优编码对应的树。 清华大学软件学院宋斌恒 23 证明提示 1. 树T和T’编码成本关系B(T)-B(T’)=f[x]+f[y] 2. 反证法,假设T不是最优编码树Ë Exists T’’such that B(T’’)<B(T), which has x and y are the deepest siblings, Ë Construct T’’’to represents a prefix code for C’ËT’’’is a better coding than T’, a contradiction 清华大学软件学院宋斌恒 24 Huffman编码小结 •前缀码 •最优码的特征 •最小频率的优化码可最长化 •贪婪选择律 •贪婪算法
Matroid(拟阵) 线性代数中的线性无关概念回顾 Theoretical Foundation for Greedy 组向量:a=10,0},b=1,1,0},c=0,20} orithms d={1,1,1},e={0.0.0}.f={0,1,0} application ·所有线性无关向量组构成的集合 l={{a}.{b},@每o}.t{ab}.{ac},{ad a.吾.鲁.…… 的特征:如果{ab}属于l则{a,{b}都属于 清华大学們学院宋恒 清华大学软件学院宋 Matroids 拟阵定义(续) Definition: A matroid is an ordered pair 1. I is a nonempty family of subsets of s, called the independent subsets of s, such that if B∈ I and AcB, then a∈l.Wesa atisfying the following 3 conditions: that I is hereditary if it satisfies this property. Note that empty set is necessarily a member 1. S is a finite nonempty set 2. If A in l. B in s, then there is some element A such that AU x∈l We say that M es the exchange 清华大学学院未恒 清华大学软件学院宋 Concepts detail Example 1. Linear algebra 1. I is a set whose elements are subset of Let S=(a, a2, a3,...,a be a set of n and I =all subsets of s). 2. Every element in I is a set whose elements are independent 1. Proof M=S, I) is a Matroid! 1. The element inI is independent subset of s 3. A subset of independent elements must 1. LetA E I and b is a subset of A, It is trivial that B is in 4. If one independent set is small than another, then it can be expands to the same size with elements in the other set 清华大学轼字院宋恒 消华大学软件学院宋
5 清华大学软件学院宋斌恒 25 Matroid(拟阵) •Theoretical Foundation for Greedy Algorithms. •Its application 清华大学软件学院宋斌恒 26 线性代数中的线性无关概念回顾 •一组向量: a={1,0,0}, b={1,1,0}, c={0,2,0}, d={1,1,1}, e={0,0,0}, f={0,1,0} •所有线性无关向量组 构成的集合 • l ={f,{a}, {b}, {c}, {d}, {f}, {a,b}, {a,c}, {a,d}, {a,f}, {}, {},… … } • l 的特征:如果{a,b}属于l 则{a}, {b}都属于l 清华大学软件学院宋斌恒 27 Matroids • Definition: A matroid is an ordered pair M = (S, l ) satisfying the following 3 conditions: 1. S is a finite nonempty set 清华大学软件学院宋斌恒 28 拟阵定义(续) 1. l is a nonempty family of subsets of S, called the independent subsets of S, such that if B Î l and A Í B, then A Î l . We say that l is hereditary if it satisfies this property. Note that empty set is necessarily a member of l . 2. If A in l , B in l , and |A|<|B|, then there is some element x Î B-A such that A U {x} Î l . We say that M satisfies the exchange property 清华大学软件学院宋斌恒 29 Concepts detail 1. l is a set whose elements are subset of S 2. Every element in l is a set whose elements are independent. 3. A subset of independent elements must be independent too. 4. If one independent set is small than another, then it can be expands to the same size with elements in the other set 清华大学软件学院宋斌恒 30 Example 1. Linear algebra • Let S={a1 , a2 , a3 , … , an } be a set of n independent vectors, and l ={all subsets of S}. 1. Proof M={S, l } is a Matroid! 1. The element in l is independent subset of S. 1. Let A Î l and B is a subset of A, It is trivial that B is in l . 2. If |A|<|B|, then there exists an element b in B-A such that A U {b} is an element of l. 1. Its trivial too!
Example 2. linear algebra Example 3. Graph Let s=a a)be a set of n vectors, and I =A is subsets of S: As s=all edges in an undirected graph G) elements are linearly independent y ·={Acs: A defines a sub-graph G=ⅣV,A Proof M=(S, I ]is a Matroid too such that Ga is acyclic, that is Ga is a forest] This proof is somewhat harder than the one for the last example.如何利用线性代 ·定理:M={s,1} is a matroid! 数的知识来证明? 清华大学們学院宋恒 清华大学软件学院宋 证明 Pigeon hole principle: There exists one tree T in G such that it intersects more than one 1. Hereditary: is easy to prove,证明什 whose vertices u and v belongs two trees in 2. Exchange property: needs some tricks! GA【保证下一步加入(uv)后无环】 1.设A和B是1的元素,且|A<B, then Ga and AUI(u, v)) is another subset of s, and it defines a subgraph such that it is acyclic GB are two forests of Graph G 2. Ga has less tree than in G 3. How many trees are there in GA? 清华大学学院未恒 清华大学软件学院宋 Properties of Matriod Illustration of the same size of maximum independent subset Theorem 16.6: All maximal independent subsets in a matroid have the same size The prove is trivial a graphic matroid for a connected, undirected graph G. Every independent subset of m. must be a free tree with exactly VF1 edges that connect all the vertices of g. such a tree is called a spanning tree How to do in linear algebra? 清华大学轼字院宋恒 消华大学软件学院宋
6 清华大学软件学院宋斌恒 31 Example 2. Linear algebra •Let S={a1 , a2 , a3 , … , an } be a set of n vectors, and l ={A is subsets of S: A’s elements are linearly independent }. Proof M={S, l } is a Matroid too! •This proof is somewhat harder than the one for the last example. 如何利用线性代 数的知识来证明? 清华大学软件学院宋斌恒 32 Example 3. Graph •S = {all edges in an undirected graph G} • l = {AÍS: A defines a sub-graph GA={V, A} such that GA is acyclic, that is GA is a forest} •M={S, l } •定理:M={S, l } is a Matroid! 清华大学软件学院宋斌恒 33 证明 1. Hereditary: is easy to prove,证明什 么? 2. Exchange property: needs some tricks! 1. 设A和B是l 的元素,且 |A|<|B|, then GA and GB are two forests of Graph G, 2. GB has less tree than in GA . 3. ? How many trees are there in GA? 清华大学软件学院宋斌恒 34 – Pigeon hole principle: There exists one tree T in GB such that it intersects more than one tree in GA .【 GB中树少】 – In tree T, there exists at least an edges e whose vertices u and v belongs two trees in GA【保证下一步加入(u,v)后无环 】 – A U {(u,v)} is another subset of S, and it defines a subgraph such that it is acyclic. 清华大学软件学院宋斌恒 35 Properties of Matriod •Theorem 16.6: All maximal independent subsets in a matroid have the same size. •The prove is trivial. 清华大学软件学院宋斌恒 36 Illustration of the same size of maximum independent subset Properties •A graphic matroid for a connected, undirected graph G. Every independent subset of MG must be a free tree with exactly |V|-1 edges that connect all the vertices of G. Such a tree is called a spanning tree of G. •How to do in linear algebra?
加权拟阵 加权拟阵的优化问题 Weight function: W: S)R* and Problem of maximum weighted W(A)=ZA W()for any A contains in S independent subset in a weighted matre ?What does this function means in a graph? is a typical problem of matroid and it can e solved by a greedy algorithm.【要求权 函数必须是正的】 清华大学們学院宋恒 清华大学软件学院宋 变种问题 问题实例:图的最小支撑树 There are also another type of problem Spanning tree:无向连通图的最大无圈子 called the problem of "minimum weight for 图,叫支撑树 the maximum size independent subset " in minimum spanning tree problem:求路径 a weighted matroid, it can be changed to 最短的支撑树。 the problem of the first type【没有正权限 制】.【如何做】 以上问题是一个变种问题 清华大学学院未恒 清华大学软件学院宋 The greedy algorithm Analysis of the complexit Greedy ( M, w) Ordering the elements 1.A∈ empty set 2. Sort S[M] into monotonically decreasing To decide the whether A U x is in I[M] order by weight w(x) O(n Ig n n f(n)) 3. For each x in S[M 1. If A U l[M]then AEAU(x] 4. Return A 清华大学轼字院宋恒 消华大学软件学院宋
7 清华大学软件学院宋斌恒 37 加权拟阵 •Weight function: w: S‡R+ and •w(A) = åxÎA w(x) for any A contains in S •?What does this function means in a graph? 清华大学软件学院宋斌恒 38 加权拟阵的优化问题 •Problem of maximum weighted independent subset in a weighted matroid is a typical problem of matroid and it can be solved by a greedy algorithm.【要求权 函数必须是正的】 清华大学软件学院宋斌恒 39 变种问题 •There are also another type of problem called the problem of “minimum weight for the maximum size independent subset”in a weighted matroid, it can be changed to the problem of the first type【没有正权限 制】.【如何做】 清华大学软件学院宋斌恒 40 问题实例:图的最小支撑树 •Spanning tree:无向连通图的最大无圈子 图,叫支撑树。 •minimum spanning tree problem:求路径 最短的支撑树。 •以上问题是一个变种问题 清华大学软件学院宋斌恒 41 The greedy algorithm • Greedy(M,w) 1. Aflempty set 2. Sort S[M] into monotonically decreasing order by weight w(x) 3. For each x in S[M] 1. If A U {x} Î l [M] then AflA U {x} 4. Return A 清华大学软件学院宋斌恒 42 Analysis of the complexity •Ordering the elements. •To decide the whether A U {x} is in l [M] •O(n lg n + n f(n))
贪心选择律 贪心选择律之证明 Let M=(S, I] be a weighted matroid, and S If no such x exists, then the only independent is sorted into monotonically decreasing subset of s is the empty set. Hence we ar order by weight Let b be an optimal solution, then it is not empty. Let x be the first independent element in S assume x is not in B(if it is in B then the if such x exists If x exists then there exists an optimal Claim: There is no elements in B whose weigh s bigger than w(x). To see this, any element y in solution a such that x is an element of a B can construct an independent subset y), with 【最大的独立元素一定在某一个最优解 清华大学們学院宋恒 消华大学软件学院宋作 切粘技术 扩展引理 Construct an A such that A is optimal and contains X: using cut and paste, or the Let M=(S, 1] be any matroid, if x is an exchange property of M, we can grow element of s that is an extension of A=x by adding elements from B till to A some independent subset A of S, then x has the same size with B is also an extension of empty set. B and a only have one different element W(Bw(A)=w(y-W(x)≤0 The proof is simple. Really? Claim: A is optimal too 清华大学学院未恒 清华大学软件学院宋 淘汰无情 最优子结构定理 If x is not an extension of empty set, then it is not an extension of any independent Let x be the first element of s chosen by greedy for the weighted matroid M=(s, 1), the subset a of s remaining problem of finding a maximum-weight ndependent subset containing x reduces to finding a maximum-weight independent subset of the weighted M=(S,1), where ·S= dy in:{xy}nl} =[B subset of S-x]: B U(x) in 1) and The weight function to W is the weight function of M restrict to w 清华大学轼字院宋恒 消华大学软件学院宋
8 清华大学软件学院宋斌恒 43 贪心选择律 •Let M={S, l } be a weighted matroid,and S is sorted into monotonically decreasing order by weight; • Let x be the first independent element in S if such x exists; • If x exists then there exists an optimal solution A such that x is an element of A • 【最大的独立元素一定在某一个最优解 中】 清华大学软件学院宋斌恒 44 贪心选择律之证明 • If no such x exists, then the only independent subset of S is the empty set. Hence we are done. Else, • Let B be an optimal solution,then it is not empty, assume x is not in B (if it is in B then the property is true), • Claim: There is no elements in B whose weight is bigger than w(x). To see this, any element y in B can construct an independent subset {y}, with assumption we get the claim 清华大学软件学院宋斌恒 45 切粘技术 •Construct an A such that A is optimal and contains x: using cut and paste, or the exchange property of M, we can grow A={x} by adding elements from B till to A has the same size with B •B and A only have one different element: •w(B)-w(A)=w(y)-w(x)£0 •Claim: A is optimal too! 清华大学软件学院宋斌恒 46 扩展引理 Let M={S, l } be any matroid, if x is an element of S that is an extension of some independent subset A of S, then x is also an extension of empty set. The proof is simple. Really? 清华大学软件学院宋斌恒 47 淘汰无情! •If x is not an extension of empty set, then it is not an extension of any independent subset A of S. 清华大学软件学院宋斌恒 48 最优子结构定理 • Let x be the first element of S chosen by greedy for the weighted matroid M= (S, l ), the remaining problem of finding a maximum-weight independent subset containing x reduces to finding a maximum-weight independent subset of the weighted M’= (S’, l ’), where • S’={y in S: {x,y} in l } • l ’={B subset of S-{x}: B U {x} in l } and • The weight function to W’is the weight function of M restrict to W’
加权拟阵贪心算法的正确性 A task-scheduling problem ·定理: The greedy algorithm on matroid is Unit-time task: a task need unit time to correct 证明: is just the combination of the last chedule of task set S: is a permutation of lemmas and corollaries S specifying the order in which these tasks to be perfo Scheduling unit-time task with deadlines and penalties for a single processor 清华大学們学院宋恒 清华大学软件学院宋 Input of the problem Convert the problem to a matroid A set of task S=(a, a2,..., an] Independent tasks: if there exists a A set of integer deadlines d, d2,..., d schedule for these tasks such that no A set of nonnegative weights of penalties tasks are late such that if task a is I =all subsets of independent tasks finished before its deadlines there will · S is the set of all tasks incur a penalty wi. ·M={S,l} is a matroid 清华大学学院未恒 清华大学软件学院宋 拟阵小结 拟阵定义 拟阵实例 拟阵的简单性质 加权拟阵 加权拟阵的优化问题及其变种 ·上述问题的最优子结构 ·上述问题的贪婪选择律 贪婪算法,及其在最小支撑树的应用 清华大学轼字院宋恒
9 清华大学软件学院宋斌恒 49 加权拟阵贪心算法的正确性 •定理:The greedy algorithm on matroid is correct •证明: is just the combination of the last lemmas and corollaries. 清华大学软件学院宋斌恒 50 A task-scheduling problem •Unit-time task: A task need unit time to complete •Schedule of task set S: is a permutation of S specifying the order in which these tasks are to be performed •Scheduling unit-time task with deadlines and penalties for a single processor: 清华大学软件学院宋斌恒 51 Input of the problem •A set of task S={a1 ,a2 ,… ,an } •A set of integer deadlines d1 ,d2 ,… ,dn . •A set of nonnegative weights of penalties: w1 ,w2 ,… ,wn , such that if task ai is not finished before its deadlines there will incur a penalty wi . 清华大学软件学院宋斌恒 52 Convert the problem to a matroid •Independent tasks:if there exists a schedule for these tasks such that no tasks are late. • l ={all subsets of independent tasks} • S is the set of all tasks • M={S, l } is a matroid 清华大学软件学院宋斌恒 53 拟阵小结 •拟阵定义 •拟阵实例 •拟阵的简单性质 •加权拟阵 •加权拟阵的优化问题及其变种 •上述问题的最优子结构 •上述问题的贪婪选择律 •贪婪算法,及其在最小支撑树的应用