Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland, College Park

Draft of March 7, 2010

This is the manuscript of a book that is in preparation for Morgan & Claypool Synthesis Lectures on Human Language Technologies. Anticipated publication date is mid-2010. Comments and feedback are welcome!
Contents

Contents  ii

1  Introduction  1
   1.1  Computing in the Clouds  6
   1.2  Big Ideas  9
   1.3  Why is this different?  13
   1.4  What this book is not  15

2  MapReduce Basics  16
   2.1  Functional programming roots  18
   2.2  Mappers and reducers  19
   2.3  The Execution Framework  23
   2.4  Partitioners and combiners  25
   2.5  The distributed file system  26
   2.6  Hadoop Cluster Architecture  31
   2.7  Summary  33

3  MapReduce algorithm design  34
   3.1  Local Aggregation  36
   3.2  Pairs and Stripes  45
   3.3  Computing Relative Frequencies  49
   3.4  Secondary Sorting  54
   3.5  Relational Joins  55
   3.6  Summary  60

4  Inverted Indexing for Text Retrieval  62
   4.1  Inverted Indexes  63
   4.2  Inverted Indexing: Baseline Implementation  65
   4.3  Inverted Indexing: Revised Implementation  67
   4.4  Index Compression  69
        4.4.1  Byte-Aligned Codes  70
        4.4.2  Bit-Aligned Codes  71
        4.4.3  Postings Compression  73
   4.5  What about retrieval?  74
   4.6  Chapter Summary  75

5  Graph Algorithms  76
   5.1  Graph Representations  78
   5.2  Parallel Breadth-First Search  79
   5.3  PageRank  86
   5.4  Issues with Graph Processing  92
   5.5  Summary  93

6  EM Algorithms for Text Processing  95
   6.1  Expectation maximization  98
        6.1.1  Maximum likelihood estimation  98
        6.1.2  A latent variable marble game  100
        6.1.3  MLE with latent variables  101
        6.1.4  Expectation maximization  102
        6.1.5  An EM example  103
   6.2  Hidden Markov models  104
        6.2.1  Three questions for hidden Markov models  105
        6.2.2  The forward algorithm  107
        6.2.3  The Viterbi algorithm  108
        6.2.4  Parameter estimation for HMMs  110
        6.2.5  Forward-backward training: summary  115
   6.3  EM in MapReduce  116
        6.3.1  HMM training in MapReduce  117
   6.4  Case study: word alignment for statistical machine translation  118
        6.4.1  Statistical phrase-based translation  121
        6.4.2  Brief Digression: Language Modeling with MapReduce  124
        6.4.3  Word alignment  124
        6.4.4  Experiments  126
   6.5  EM-like algorithms  128
        6.5.1  Gradient-based optimization and log-linear models  129
   6.6  Summary  131

7  Closing Remarks  133
   7.1  Limitations of MapReduce  133
   7.2  Alternative Computing Paradigms  135
   7.3  MapReduce and Beyond  136
CHAPTER 1

Introduction

MapReduce [31] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, “big data” is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: “web-scale” processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players. Just think, how many companies do you know that start their pitch with “we’re going to harvest information on the web and...”?

Another strong area of growth is the analysis of user behavior data. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can’t cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly-held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend, etc. leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.
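To make the programming model itself a little more concrete before the full treatment in Chapter 2, here is a minimal, purely illustrative Python simulation of the canonical word-count computation. The function names and the in-memory grouping step are our own scaffolding, not the Hadoop API; a real job would run the mappers and reducers in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain

def mapper(doc_id, text):
    # Emit an intermediate (word, 1) pair for every token; the input key
    # (the document id) is simply ignored in word count.
    for word in text.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Sum the partial counts associated with each word.
    yield word, sum(counts)

def run_mapreduce(documents):
    # Simulate the shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(
            mapper(doc_id, text) for doc_id, text in documents.items()):
        groups[key].append(value)
    # Apply the reducer to each key and its list of values.
    return dict(chain.from_iterable(
        reducer(key, values) for key, values in sorted(groups.items())))

if __name__ == "__main__":
    docs = {1: "the quick brown fox", 2: "the lazy dog and the fox"}
    print(run_mapreduce(docs))  # {'and': 1, 'brown': 1, ..., 'the': 3}
```

The essential contract is visible even in this toy: mappers emit intermediate key-value pairs, the framework groups them by key, and reducers aggregate each group.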
How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day in 2004 [31] to processing 20 PB a day in 2008 [32]. In April 2009, a blog post1 was written about eBay’s two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed2 similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store. More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [63]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude); on the other hand, latency and bandwidth have improved relatively little in comparison (in the case of latency, perhaps 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×). Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly severe.

Moving beyond the commercial sphere, many have recognized the importance of data management in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [13]. For example:

• The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [12]. Today, the Large Hadron Collider (LHC) near Geneva is the world’s largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.3

• Astronomers have long recognized the importance of a “digital observatory” that would support the data needs of researchers across the globe—the Sloan Digital Sky Survey [102] is perhaps the most well known of these projects. Looking into the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [11].

• The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for further study. Given the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [74]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [99]. Scientists are predicting that, in the not-so-distant future, sequencing an individual’s genome will be no more complex than getting a blood test today—ushering in a new era of personalized medicine, where interventions can be specifically targeted for an individual.

1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
3 http://public.web.cern.ch/public/en/LHC/Computing-en.html
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate, explore, and mine massive datasets [50]—this has been hailed as the emerging “fourth paradigm” of science [51] (complementing theory, experiments, and simulations). In other areas of academia, particularly computer science, systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as “toy systems” with limited utility. Large data is a fact of today’s world and data-intensive processing is fast becoming a necessity, not merely a luxury or curiosity.

Although large data comes in a variety of forms, this book is primarily concerned with processing large amounts of text, but touches on other types of data as well (e.g., relational and graph data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. There are three components to this approach: data, representations of the data, and some method for capturing regularities in the data. The first is called corpora (singular, corpus) by NLP researchers and collections by those from the IR community. Aspects of the representations of the data are called features, which may be “superficial” and easy to extract, such as the words and sequences of words themselves, or “deep” and more difficult to extract, such as the grammatical relationship between words. Finally, algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application.

One common application, classification, is to sort text into categories. Examples include: Is this email spam or not spam? Is this word a part of an address? The first task is easy to understand, while the second task is an instance of what NLP researchers call named-entity detection, which is useful for local search and pinpointing locations on maps. Another common application is to rank texts according to some criteria—search is a good example, which involves ranking documents by relevance to the user’s query. Another example is to automatically situate texts along a scale of “happiness”, a task known as sentiment analysis, which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices.
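To ground the three components (data, features, algorithms) in something executable, here is a small illustrative sketch of the spam-classification example, assuming the scikit-learn library is available; the four hard-coded training sentences are invented stand-ins for a real labeled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Data: a (toy) labeled corpus; in practice this would be a large collection.
train_texts = ["win a free prize now", "cheap meds online",
               "meeting agenda for tomorrow", "lunch at noon?"]
train_labels = ["spam", "spam", "ham", "ham"]

# Features: superficial bag-of-words counts extracted from the raw text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Algorithm: a simple model (naive Bayes) that captures regularities
# in the data in terms of the extracted features.
model = MultinomialNB()
model.fit(X, train_labels)

# Classify a new message: prints the predicted label.
print(model.predict(vectorizer.transform(["claim your free prize"])))
```

Superficial bag-of-words counts are exactly the kind of “easy to extract” features described above; swapping in deeper features (say, grammatical relations between words) would require far more computation per document.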
There is a growing body of evidence, at least in text processing, that of the three components discussed above (data, features, algorithms), data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they often don’t scale.

Consider a simple task such as determining the correct usage of easily confusable words such as “than” and “then” in English. One can view this as a supervised machine learning problem: we can train a classifier to disambiguate between the options, and then apply the classifier to new instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that writers make correct choices (the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [8] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task as the specific example. They explored several classification algorithms (the exact ones aren’t important), and not surprisingly, found that more data led to better accuracy. Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter; all that matters is the amount of data you have. This led to an even more controversial conclusion, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data.

As another example, consider the problem of answering short, fact-based questions such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the user would then have to sort through, a question answering (QA) system would directly return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such questions based on pattern matching [18, 37, 64]. Suppose you wanted the answer to the above question. As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the web and look for what appears to its left. Or better yet, look through multiple instances of this phrase and tally up the words that appear to the left. This simple approach works surprisingly well, and has become known as the redundancy-based approach to question answering. It capitalizes on the insight that in a very large text collection (i.e., the web), answers to commonly-asked questions will be stated in obvious ways, such that pattern-matching techniques suffice to extract answers accurately.
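The redundancy-based idea is simple enough to sketch in a few lines of Python. The snippets below are invented stand-ins for search-engine results; a real system would issue the query against a web-scale collection and normalize the candidate answers.

```python
import re
from collections import Counter

# Invented stand-ins for snippets returned by a web search.
snippets = [
    "John Wilkes Booth shot Abraham Lincoln on April 14, 1865.",
    "The actor John Wilkes Booth shot Abraham Lincoln at Ford's Theatre.",
    "Some sources note that Booth shot Abraham Lincoln during the play.",
]

# Capture the word immediately to the left of the target phrase.
pattern = re.compile(r"(\w+)\s+shot Abraham Lincoln")

tally = Counter()
for snippet in snippets:
    for match in pattern.finditer(snippet):
        tally[match.group(1)] += 1   # tally the left-neighbor words

# The most frequent left neighbor is a good guess at the answer.
print(tally.most_common(1))   # [('Booth', 3)]
```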
Yet another example concerns smoothing in web-scale language models [16]. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words, estimated from a large corpus of texts. Language models are useful in a variety of applications, such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of the possible translations is the most fluent, as we will discuss in Section 6.4). Since there are infinitely many possible strings, and probabilities must be assigned to all of them, language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never have been seen at all, even with lots and lots of training data! Most modern language models make the Markov assumption: in an n-gram language model, the conditional probability of a word is given by the n − 1 previous words. Thus, by the chain rule, the probability of a sequence of words can be decomposed into the product of n-gram probabilities. Nevertheless, an enormous number of parameters must still be estimated from a training corpus: potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat every word on the web as the training corpus to estimate the n-gram probabilities from, most n-grams—in any language, even English—will never have been seen. To cope with this sparseness, researchers have developed a number of smoothing techniques [24, 73], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. Smoothing approaches vary in effectiveness, both in terms of intrinsic and application-specific metrics. In 2007, Brants et al. [16] described language models trained on up to two trillion words.4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [24] with another technique the authors affectionately referred to as “stupid backoff”.5 Not surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.

4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
5 As in, so stupid it couldn’t possibly work.
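To see why stupid backoff is so appealing at scale, here is a toy, in-memory sketch of the scoring rule; this is not the distributed implementation described by Brants et al., and all names and the miniature corpus are our own. The rule: score an n-gram by its relative frequency if it was observed, otherwise back off to a shorter context and multiply by a fixed penalty, for which 0.4 is the commonly cited value.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total, alpha=0.4):
    """Score a word given a context tuple: use the relative frequency of
    the longest observed n-gram, multiplying by alpha each time we back
    off to a shorter context; bottom out at the unigram frequency."""
    score = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return score * counts[ngram] / counts[context]
        context = context[1:]   # drop the leftmost context word
        score *= alpha          # fixed backoff penalty
    return score * counts[(word,)] / total

tokens = "the cat sat on the mat and the cat slept on the mat".split()
counts = ngram_counts(tokens)
print(stupid_backoff("sat", ("the", "cat"), counts, len(tokens)))    # observed trigram
print(stupid_backoff("slept", ("a", "cat"), counts, len(tokens)))    # backs off to the bigram
```

Because the score of each n-gram depends only on raw counts, the expensive part of training reduces to counting n-grams, a computation that parallelizes naturally in MapReduce.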
Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [45].6 Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic particles, human use of language is not constrained by succinct, universal “laws of grammar”. There are of course rules that govern the formation of words and sentences—for example, that verbs appear before objects in English, and that subjects and verbs must agree in number—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about a fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.7 The world, he would say, is the best description of itself. In the same way, the more observations we gather about language use, the more accurate a description we have about language itself. This, in turn, translates into more effective algorithms and systems.

So, in summary, why large data? In some ways, the first answer is similar to the reason people climb mountains: because they’re there. But the second answer is even more compelling. Data is the rising tide that lifts all boats—more data leads to better algorithms and systems for solving real-world problems. Now that we’ve addressed the why, let’s tackle the how. Let’s start with the obvious observation: data-intensive processing is beyond the capabilities of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce does, and the rest of this book is about the how.

1.1 COMPUTING IN THE CLOUDS

For better or for worse, MapReduce cannot be untangled from the broader discourse on cloud computing. True, there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run. In some ways, cloud computing is simply brilliant marketing. Before clouds, there were grids,8 and before grids, there were vector supercomputers, each having claimed to be the best thing since sliced bread.

So what exactly is cloud computing? This is one of those questions where ten experts will give eleven different answers; in fact, countless papers have been written simply to attempt to define the term (e.g., [4, 21, 105], just to name a few examples).

6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [108].
7 On Exactitude in Science [14].
8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these issues.