Programming Pearls, Second Edition

by Jon Bentley. Addison-Wesley, Inc., 2000. ISBN 0-201-65788-0. 239 + xi pp. $24.95

This book is a collection of essays about a glamorous aspect of software: programming pearls whose origins lie beyond solid engineering, in the realm of insight and creativity. It provides a guide for both students and experienced programmers about how to design and create programs, and how to think about programming, and it is full of small case studies, real examples, and interesting exercises for learning how to program.

This web page contains samples from the whole work for you to investigate. For teachers, the links below lead to some of the central material suitable for classroom use. Steve McConnell describes the book as ``a celebration of design in the small''. Browse this site to sample it yourself.

What's new on this web site?

From The Book

Table of Contents
Preface
Part I: Preliminaries
  Column 1: Cracking the Oyster
  Column 2: Aha! Algorithms [Sketch]
  Column 4: Writing Correct Programs [Sketch]
  Column 5: A Small Matter of Programming [Sketch]
Part II: Performance
  Column 7: The Back of the Envelope
  Column 8: Algorithm Design Techniques [Sketch]
Part III: The Product
  Column 14: Heaps [Sketch]
  Column 15: Strings of Pearls
Epilog to the First Edition
Epilog to the Second Edition
Appendix 2: An Estimation Quiz
Appendix 3: Cost Models for Time and Space
Appendix 4: Rules for Code Tuning
Solutions for Column 1, Column 5, Column 7, Column 15
Index

About The Book

Why a Second Edition?
To Readers of the First Edition
About the First Edition
Errata

Supporting Material

Source Code
Web Sites Relevant to the Book
Animation of Sorting Algorithms
Tricks of the Trade
Teaching Material

Other Links

● Addison-Wesley Computer & Engineering Publishing Group
● Programming Pearls at Addison-Wesley
● Bookstores: Amazon.com, Barnes & Noble, Borders.com, Fatbrain.com, Quantum Books.

Copyright © 1999 Lucent Technologies. All rights reserved. Thu 19 Oct 2000
What's New on the Programming Pearls Web Site

November 2000. Column 15 is now on the site, complete with a new program for letter-level Markov text, and new examples of word frequencies, long repeated strings, and letter-level and word-level Markov text.

October 2000. The rules for code tuning from my 1982 book Writing Efficient Programs are now online, and so is a PowerPoint show on Cache-Conscious Algorithms and Data Structures.

August 2000. The errata list just keeps on growing. If you see errors, please send them in.

July 2000. Programming Pearls is often used for teaching undergraduates. This page describes how some of the topics in the book can be incorporated into college classrooms.

March 2000. A theme running through the book concerns the Tricks of the Trade. This page describes that topic and contains a PowerPoint show on the subject.

Copyright © 1999 Lucent Technologies. All rights reserved. Mon 6 Nov 2000
Strings of Pearls (Column 15 of Programming Pearls)

We are surrounded by strings. Strings of bits make integers and floating-point numbers. Strings of digits make telephone numbers, and strings of characters make words. Long strings of characters make web pages, and longer strings yet make books. Extremely long strings represented by the letters A, C, G and T are in geneticists' databases and deep inside the cells of many readers of this book.

Programs perform a dazzling variety of operations on such strings. They sort them, count them, search them, and analyze them to discern patterns. This column introduces those topics by examining a few classic problems on strings.

The Rest of the Column

These are the remaining sections in the column.

15.1 Words
15.2 Phrases
15.3 Generating Text
15.4 Principles
15.5 Problems
15.6 Further Reading

Related Content

The teaching material contains overhead transparencies based on Sections 15.2 and 15.3; the slides are available in both PostScript and Acrobat formats. The code for Column 15 contains implementations of the algorithms. The Solutions to Column 15 give answers for some of the Problems.

This column concentrates on programming techniques, and uses those techniques to build several programs for processing large text files. A few examples of the programs' output in the text illustrate the structure of English documents. This web site contains some additional fun examples, which may give further insight into the structure of the English language. Section 15.1 counts the words in one document; here are more examples of word frequency counts. Section 15.2 searches for large portions of duplicated text; here are more examples of long repeated strings. Section 15.3 describes randomly generated Markov text; these pages contain additional examples of Markov text, generated at the letter level and word level.
The web references describe several web sites devoted to related topics. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000
Words (Section 15.1 of Programming Pearls)

Our first problem is to produce a list of the words contained in a document. (Feed such a program a few hundred books, and you have a fine start at a word list for a dictionary.) But what exactly is a word? We'll use the trivial definition of a sequence of characters surrounded by white space, but this means that web pages will contain many ``words'' like ``<html>'', ``<body>'' and ``&nbsp;''. Problem 1 asks how you might avoid such problems.

Our first C++ program uses the sets and strings of the Standard Template Library, in a slight modification of the program in Solution 1.1:

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main(void)
{   set<string> S;
    set<string>::iterator j;
    string t;
    while (cin >> t)
        S.insert(t);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

The while loop reads the input and inserts each word into the set S (by the STL specification, duplicates are ignored). The for loop then iterates through the set, and writes the words in sorted order. This program is elegant and fairly efficient (more on that topic soon).

Our next problem is to count the number of times each word occurs in the document. Here are the 21 most common words in the King James Bible, sorted in decreasing numeric order and aligned in three columns to save space:

the   62053      shall   9756      they   6890
and   38546      he      9506      be     6672
of    34375      unto    8929      is     6595
to    13352      I       8699      with   5949
And   12734      his     8352      not    5840
that  12428      a       7940      all    5238
in    12154      for     7139      thou   4629

Almost eight percent of the 789,616 words in the text were the word ``the'' (as opposed to 16 percent of the words in this sentence). By our definition of word, ``and'' and ``And'' have two separate counts.
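As a hedged aside (my illustration, not the book's code): if you want case variants such as ``And'' and ``and'' to share a single entry, one simple approach is to lower-case each word before inserting it. A minimal sketch:

#include <cctype>
#include <iostream>
#include <set>
#include <string>
using namespace std;

// Return a lower-cased copy of s, so case variants collapse to one form.
string lowercase(string s)
{   for (string::size_type i = 0; i < s.size(); i++)
        s[i] = tolower((unsigned char) s[i]);
    return s;
}

int main(void)
{   set<string> S;
    string t;
    while (cin >> t)
        S.insert(lowercase(t));    // ``And'' and ``and'' now share one entry
    for (set<string>::iterator j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

The same lowercase call works unchanged in the counting programs that follow.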
[Click for more examples of word frequency counts.]

These counts were produced by the following C++ program, which uses the Standard Template Library map to associate an integer count with each string:

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main(void)
{   map<string, int> M;
    map<string, int>::iterator j;
    string t;
    while (cin >> t)
        M[t]++;
    for (j = M.begin(); j != M.end(); ++j)
        cout << j->first << " " << j->second << "\n";
    return 0;
}

The while statement inserts each word t into the map M and increments the associated counter (which the map initializes to zero when the word is first seen). The for statement iterates through the words in sorted order and prints each word (first) and its count (second).

This C++ code is straightforward, succinct and surprisingly fast. On my machine, it takes 7.6 seconds to process the Bible. About 2.4 seconds go to reading, 4.9 seconds to the insertions, and 0.3 seconds to writing the output.

We can reduce the processing time by building our own hash table, using nodes that contain a pointer to a word, a count of how often the word has been seen, and a pointer to the next node in the table. Here is the hash table after inserting the strings ``in'', ``the'' and ``in'', in the unlikely event that both strings hash to 1:

[Figure omitted: bin 1 points to a node for ``the'' with count 1, whose next pointer leads to a node for ``in'' with count 2; each node has word, count and next fields, and every other bin is empty.]

We'll implement the hash table with this C structure:

typedef struct node *nodeptr;
typedef struct node {
    char *word;
    int count;
    nodeptr next;
} node;

Even by our loose definition of ``word'', the Bible has only 29,131 distinct words. We'll follow the old lore of using a prime number near that for our hash table size, and the popular multiplier of 31:

#define NHASH 29989
#define MULT 31
nodeptr bin[NHASH];

Our hash function maps a string to a positive integer less than NHASH:

unsigned int hash(char *p)
{   unsigned int h = 0;
    for ( ; *p; p++)
        h = MULT * h + *p;
    return h % NHASH;
}

Using unsigned integers ensures that h remains positive.

The main function initializes every bin to NULL, reads the words and increments the count of each, then iterates through the hash table to write the (unsorted) words and counts:

int main(void)
{   int i;
    char buf[100];    /* input word buffer */
    nodeptr p;
    for (i = 0; i < NHASH; i++)
        bin[i] = NULL;
    while (scanf("%s", buf) != EOF)
        incword(buf);
    for (i = 0; i < NHASH; i++)
        for (p = bin[i]; p != NULL; p = p->next)
            printf("%s %d\n", p->word, p->count);
    return 0;
}

(To assemble these fragments into a complete program, add the headers stdio.h, stdlib.h and string.h, and a prototype for incword above main.)

The work is done by incword, which increments the count associated with the input word (and initializes it if it is not already there):

void incword(char *s)
{   unsigned int h = hash(s);
    nodeptr p;
    for (p = bin[h]; p != NULL; p = p->next)
        if (strcmp(s, p->word) == 0) {
            (p->count)++;
            return;
        }
    p = malloc(sizeof(node));
    p->count = 1;
    p->word = malloc(strlen(s)+1);
    strcpy(p->word, s);
    p->next = bin[h];
    bin[h] = p;
}

The for loop looks at every node with the same hash value. If the word is found, its count is incremented and the function returns. If the word is not found, the function makes a new node, allocates space and copies the string (experienced C programmers would use strdup for the task), and inserts the node at the front of the list.
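The timings in this column (above for the C++ program, below for the C program) came from the author's own measurements on a machine of the era. As a hedged aside, not the book's harness, here is one way such phase-by-phase timings could be collected today with std::chrono; the phase split and names are my own:

#include <chrono>
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

int main(void)
{   typedef chrono::steady_clock clk;

    clk::time_point t0 = clk::now();
    vector<string> words;                  // phase 1: read every word
    string t;
    while (cin >> t)
        words.push_back(t);

    clk::time_point t1 = clk::now();
    map<string, int> M;                    // phase 2: insertions only
    for (size_t i = 0; i < words.size(); i++)
        M[words[i]]++;

    clk::time_point t2 = clk::now();       // phase 3: write the output
    for (map<string, int>::iterator j = M.begin(); j != M.end(); ++j)
        cout << j->first << " " << j->second << "\n";

    clk::time_point t3 = clk::now();
    chrono::duration<double> rd = t1 - t0, ins = t2 - t1, wr = t3 - t2;
    cerr << "read " << rd.count() << "s, insert " << ins.count()
         << "s, write " << wr.count() << "s\n";
    return 0;
}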
This C program takes about 2.4 seconds to read its input (the same as the C++ version), but only 0.5 seconds for the insertions (down from 4.9) and only 0.06 seconds to write the output (down from 0.3). The complete run time is 3.0 seconds (down from 7.6), and the processing time is 0.55 seconds (down from 5.2). Our custom-made hash table (in 30 lines of C) is an order of magnitude faster than the maps from the C++ Standard Template Library.

This little exercise illustrates the two main ways to represent sets of words. Balanced search trees operate on strings as indivisible objects; these structures are used in most implementations of the STL's sets and maps. They always keep the elements in sorted order, so they can efficiently perform operations such as finding a predecessor or reporting the elements in order. Hashing, on the other hand, peeks inside the characters to compute a hash function, and then scatters keys across a big table. It is very fast on average, but it does not offer the worst-case performance guarantees of balanced trees, or support other operations involving order.
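A hedged modern footnote (not part of the book, which predates this library feature): C++ now ships a standard hash table, std::unordered_map, so the tradeoff described above can be tried without writing a chained table by hand. A minimal sketch:

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>
using namespace std;

int main(void)
{   map<string, int> treecount;             // balanced tree: keeps sorted order
    unordered_map<string, int> hashcount;   // hash table: faster on average
    string t;
    while (cin >> t)
    {   treecount[t]++;
        hashcount[t]++;
    }
    // The tree iterates in sorted order for free; the hash table iterates
    // in an unspecified order, so sorted output would need an extra sort.
    for (map<string, int>::iterator j = treecount.begin();
         j != treecount.end(); ++j)
        cout << j->first << " " << j->second << "\n";
    return 0;
}

One would expect unordered_map to land between the STL map and the hand-built C table: it narrows the gap, but a custom table can still win by exploiting knowledge such as the expected number of distinct words.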
Next: Section 15.2. Phrases.

Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Problems (Section 15.5 of Programming Pearls)

The Solutions to Column 15 give answers for some of these problems.

1. Throughout this column we have used the simple definition that words are separated by white space. Many real documents, such as those in HTML or RTF, contain formatting commands. How could you deal with such commands? Are there any other tasks that you might need to perform?

2. On a machine with ample main memory, how could you use the C++ STL sets or maps to solve the searching problem in Section 13.8? How much memory does it consume compared to McIlroy's structure?

3. How much speedup can you achieve by incorporating the special-purpose malloc of Solution 9.2 into the hashing program of Section 15.1?

4. When a hash table is large and the hash function scatters the data well, almost every list in the table has only a few elements. If either of these conditions is violated, though, the time spent searching down the list can be substantial. When a new string is not found in the hash table in Section 15.1, it is placed at the front of the list. To simulate hashing trouble, set NHASH to 1 and experiment with this and other list strategies, such as appending to the end of the list, or moving the most recently found element to the front of the list.

5. When we looked at the output of the word frequency programs in Section 15.1, it was most interesting to print the words in decreasing frequency. How would you modify the C and C++ programs to accomplish this task? How would you print only the M most common words, where M is a constant such as 10 or 1000? (One approach is sketched after this list.)

6. Given a new input string, how would you search a suffix array to find the longest match in the stored text? How would you build a GUI interface for this task?

7. Our program for finding duplicated strings was very fast for ``typical'' inputs, but it can be very slow (greater than quadratic) for some inputs. Time the program on such an input. Do such inputs ever arise in practice?

8. How would you modify the program for finding duplicated strings to find the longest string that occurs more than M times?

9. Given two input texts, find the longest string that occurs in both.

10. Show how to reduce the number of pointers in the duplication program by pointing only to suffixes that start on word boundaries. What effect does this have on the output produced by the program?
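A hedged sketch for Problem 5 (my illustration, not the book's solution): copy the map's (word, count) pairs into a vector, then use partial_sort to bring the M most frequent words to the front.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
using namespace std;

typedef pair<string, int> wordcount;

// Order by decreasing count; break ties alphabetically.
bool morefrequent(const wordcount &a, const wordcount &b)
{   if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

int main(void)
{   map<string, int> M;
    string t;
    while (cin >> t)
        M[t]++;
    vector<wordcount> v(M.begin(), M.end());
    size_t m = 10;                 // M, the number of top words to print
    if (m > v.size())
        m = v.size();
    partial_sort(v.begin(), v.begin() + m, v.end(), morefrequent);
    for (size_t i = 0; i < m; i++)
        cout << v[i].first << " " << v[i].second << "\n";
    return 0;
}

partial_sort takes roughly O(n log M) time for n distinct words, which beats sorting the whole vector when M is small, as in the 10 or 1000 the problem suggests.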