CHAPTER 7
Big Data Concepts and Tools

Learning Objectives for Chapter 7

▪ Learn what Big Data is and how it is changing the world of analytics
▪ Understand the motivation for and business drivers of Big Data analytics
▪ Become familiar with the wide range of enabling technologies for Big Data analytics
▪ Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics
▪ Compare and contrast the complementary uses of data warehousing and Big Data
▪ Become familiar with the vendors of Big Data tools and services
▪ Understand the need for and appreciate the capabilities of stream analytics
▪ Learn about the applications of stream analytics

CHAPTER OVERVIEW

Big Data, which means many things to many people, is not a new technological fad. It is a business priority that has the potential to profoundly change the competitive landscape in today’s globally integrated economy. In addition to providing innovative solutions to enduring business challenges, Big Data and analytics instigate new ways to transform processes, organizations, entire industries, and even society altogether. Yet extensive media coverage makes it hard to distinguish hype from reality. This chapter aims to provide comprehensive coverage of Big Data, its enabling technologies, and related analytics concepts to help understand the capabilities and limitations of this emerging technology. The chapter starts with a definition and related concepts of Big Data, followed by the technical details of the enabling technologies including Hadoop,
MapReduce, and NoSQL. After describing the data scientist as a fashionable new organizational role, we provide a comparative analysis between data warehousing and Big Data analytics. The last part of the chapter is dedicated to stream analytics, which is one of the most promising value propositions of Big Data analytics.

CHAPTER OUTLINE

7.1 Opening Vignette: Analyzing Customer Churn in a Telecom Company Using Big Data Methods
7.2 Definition of Big Data
7.3 Fundamentals of Big Data Analytics
7.4 Big Data Technologies
7.5 Big Data and Data Warehousing
7.6 Big Data Vendors and Platforms
7.7 Big Data and Stream Analytics
7.8 Applications of Stream Analytics

ANSWERS TO END OF SECTION REVIEW QUESTIONS

Section 7.1 Review Questions

1. What problem did customer service cancellation pose to AT’s business survival?

The company identified that it was losing an alarming number of customers and that many of these customer losses happened as a result of customer service interactions. If the company continued to lose customers at this rate, it would no longer be economically viable.

2. Identify and explain the technical hurdles presented by the nature and characteristics of AT’s data.

The company needed to analyze data from a variety of sources as well as a variety of data formats. Data was stored in text as well as audio. The data needed to be combined into a single location and format before analysis could occur.

3. What is sessionizing? Why was it necessary for AT to sessionize its data?
While not addressed directly in this case, sessionizing is aggregating customer interactions concerning a single issue across multiple different contact methods. In this case, sessionizing is important because it reflects the true number of issues being addressed, and it also provides information and context about the events that will need to be analyzed.
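For instructors who want to make the idea concrete, the following is a minimal Python sketch of sessionizing. The record layout, the 30-minute session-break threshold, and the sample events are illustrative assumptions, not details from the vignette.

# A minimal sketch of sessionizing: interactions by the same customer that
# occur within a 30-minute gap (an assumed threshold) are grouped into one
# session, regardless of contact channel.
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # assumed session-break threshold

def sessionize(events):
    """Group (customer_id, channel, timestamp) events into sessions."""
    sessions = []
    # Sort by customer, then time, so a simple linear scan can split sessions.
    for cust, chan, ts in sorted(events, key=lambda e: (e[0], e[2])):
        last = sessions[-1] if sessions else None
        if last and last["customer"] == cust and ts - last["end"] <= GAP:
            last["end"] = ts                # extend the current session
            last["channels"].add(chan)
        else:                               # new customer, or gap too large
            sessions.append({"customer": cust, "start": ts, "end": ts,
                             "channels": {chan}})
    return sessions

events = [
    ("C1", "phone", datetime(2017, 5, 1, 9, 0)),
    ("C1", "web",   datetime(2017, 5, 1, 9, 10)),   # same issue, new channel
    ("C1", "phone", datetime(2017, 5, 2, 14, 0)),   # a separate issue
]
for s in sessionize(events):
    print(s["customer"], sorted(s["channels"]), s["start"], "-", s["end"])

The three raw interactions collapse into two sessions, which is exactly why sessionizing reveals the true number of issues rather than the raw count of contacts.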
4. Research other studies where customer churn models have been employed. What types of variables were used in those studies? How is this vignette different?

Student insights will vary based on the research completed.

5. Besides Teradata Aster, identify other popular Big Data analytics platforms that could handle the analysis described in the preceding case.

Student insights will vary based on the research completed.

Section 7.2 Review Questions

1. Why is Big Data important? What has changed to put it in the center of the analytics world?

As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. The exponential growth, availability, and use of information, both structured and unstructured, bring Big Data to the center of the analytics world. Pushing the boundaries of data analytics uncovers new insights and opportunities for the use of Big Data.

2. How do you define Big Data? Why is it difficult to define?

Big Data means different things to people with different backgrounds and interests, which is one reason it is hard to define. Traditionally, the term “Big Data” has been used to describe the massive volumes of data analyzed by huge organizations such as Google or research science projects at NASA. Big Data includes both structured and unstructured data, and it comes from everywhere: data sources include Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, and detailed call records, to name just a few. Big Data is not just about volume, but also variety, velocity, veracity, and value proposition.

3. Out of the Vs that are used to define Big Data, in your opinion, which one is the most important? Why?

Although all of the Vs are important characteristics, value proposition is probably the most important for decision makers: “big” data contains (or has a greater potential to contain) more patterns and interesting anomalies than “small” data. Thus, by analyzing large and feature-rich data, organizations can gain greater business value that they may not have obtained otherwise. While users can detect the patterns in small data sets using simple statistical and machine-learning methods or ad hoc query and reporting tools, Big Data means “big” analytics. Big analytics means greater insight and better decisions, something that every organization needs nowadays. (Different students may have different answers.)

4. What do you think the future of Big Data will be like? Will it lose its popularity to something else? If so, what will it be?

Big Data could evolve at a rapid pace. The buzzword “Big Data” might change to something else, but the trend toward increased computing capabilities, analytics methodologies, and data management of high-volume heterogeneous information will continue. (Different students may have different answers.)

Section 7.3 Review Questions

1. What is Big Data analytics? How does it differ from regular analytics?

Big Data analytics is analytics applied to Big Data architectures. This is a new paradigm; in order to keep up with the computational needs of Big Data, a number of new and innovative analytics computational techniques and platforms have been developed. These techniques are collectively called high-performance computing, and include in-memory analytics, in-database analytics, grid computing, and appliances. They differ from regular analytics, which tends to focus on relational database technologies.

2. What are the critical success factors for Big Data analytics?

Critical factors include a clear business need, strong and committed sponsorship, alignment between the business and IT strategies, a fact-based decision culture, a strong data infrastructure, the right analytics tools, and personnel with advanced analytic skills.

3. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?

Traditional ways of capturing, storing, and analyzing data are not sufficient for Big Data. Major challenges are the vast volume of data, the need for data integration to combine data of different structures in a cost-effective manner, the need to process data quickly, data governance issues, skill availability, and solution costs.

4. What are the common business problems addressed by Big Data analytics?

Here is a list of problems that can be addressed using Big Data analytics:
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling, and up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities

Section 7.4 Review Questions

1. What are the common characteristics of emerging Big Data technologies?

They take advantage of commodity hardware to enable scale-out, parallel processing techniques; employ nonrelational data storage capabilities in order to process unstructured and semistructured data; and apply advanced analytics and data visualization technology to Big Data to convey insights to end users.

2. What is MapReduce? What does it do? How does it do it?

MapReduce is a programming model that allows the processing of large-scale data analysis problems to be distributed and parallelized. The MapReduce technique, popularized by Google, distributes the processing of very large multistructured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds, potentially thousands, of nodes in the cluster. The map function in MapReduce breaks a problem into sub-problems, which can each be processed by single nodes in parallel. The reduce function merges (sorts, organizes, aggregates) the results from each of these nodes into the final result.
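The division of labor between map and reduce can be illustrated with the classic word-count example. The following single-machine Python sketch only emulates the model; in a real MapReduce framework the map and reduce calls run in parallel across cluster nodes, and the grouping step is performed by the framework’s shuffle/sort phase.

# A minimal, single-machine illustration of the MapReduce model using the
# classic word-count example. In a real cluster, the framework would run
# map_fn() on many nodes in parallel and route each key to a reduce_fn() call.
from collections import defaultdict

def map_fn(document):
    """Map: break the problem into (key, value) sub-results."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share a key."""
    return (word, sum(counts))

documents = ["big data means big analytics",
             "big analytics means better decisions"]

# Shuffle/sort phase: group intermediate pairs by key (done by the framework
# in Hadoop; emulated here with a dictionary).
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

results = [reduce_fn(word, counts) for word, counts in sorted(groups.items())]
print(results)  # [('analytics', 2), ('better', 1), ('big', 3), ...]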
3. What is Hadoop? How does it work?

Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. It is designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel, typically commodity machines connected via the Internet. It utilizes the MapReduce framework to implement distributed parallelism. The file organization is implemented in the Hadoop Distributed File System (HDFS), which is adept at storing large volumes of unstructured and semistructured data. This is an alternative to the traditional tables/rows/columns structure of a relational database. Data is replicated across multiple nodes, allowing for fault tolerance in the system.

4. What are the main Hadoop components? What functions do they perform?

Major components of Hadoop are the HDFS, a Job Tracker operating on the master node, Name Nodes, Secondary Nodes, and Slave Nodes. The HDFS is the default storage layer in any given Hadoop cluster. A Name Node is a node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and whether any nodes have failed. Secondary Nodes are backup Name Nodes. The Job Tracker is the node of a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data. Slave Nodes store data and take direction to process it from the Job Tracker.

Querying for data in the distributed system is accomplished via MapReduce. The client query is handled in a Map job, which is submitted to the Job Tracker. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located, then submits the query to the relevant nodes, which operate in parallel. A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed. When each node completes its task, it stores its result. The client submits a Reduce job to the Job Tracker, which then collects and aggregates the results from each of the nodes.

5. What is NoSQL? How does it fit into the Big Data analytics picture?

NoSQL, also known as “Not Only SQL,” is a new style of database for processing large volumes of multi-structured data. Whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are mostly aimed at serving up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications. NoSQL databases trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability.
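The schema flexibility that distinguishes NoSQL document stores from relational tables is easy to illustrate. The following pure-Python sketch only mimics a document “collection”; the field names are invented, and a real document store such as MongoDB would add distribution, indexing, and a query language on top of the same idea.

# A minimal sketch of the document-store flavor of NoSQL, mimicked with plain
# Python structures. Unlike relational rows, the two "documents" below need
# not share a schema; the collection and field names are illustrative.
customers = [  # one "collection" of schema-flexible documents
    {"_id": 1, "name": "Alice",
     "calls": [{"date": "2017-05-01", "minutes": 12}]},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},  # different fields
]

def find(collection, predicate):
    """Tiny query helper: return documents matching an arbitrary predicate."""
    return [doc for doc in collection if predicate(doc)]

# Query: customers who have any recorded call activity.
print(find(customers, lambda d: "calls" in d))

The point is only that records need not fit one rigid schema; relaxing that constraint (and ACID guarantees) is what buys NoSQL systems their performance and scalability.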
Section 7.5 Review Questions

1. What are the challenges facing data warehousing and Big Data? Are we witnessing the end of the data warehousing era? Why or why not?

What has changed the landscape in recent years is the variety and complexity of data, which made data warehouses incapable of keeping up. It is not the volume of the structured data but the variety and the velocity that forced the world of IT to develop a new paradigm, which we now call “Big Data.” But this does not mean the end of data warehousing. Data warehousing and RDBMSs still bring many strengths that make them relevant for BI and that Big Data techniques do not currently provide.

2. What are the use cases for Big Data and Hadoop?

In terms of its use cases, Hadoop is differentiated two ways: first, as the repository and refinery of raw data, and second, as an active archive of historical data. Hadoop, with its distributed file system and flexibility of data formats (allowing both structured and unstructured data), is advantageous when working with information commonly found on the Web, including social media, multimedia, and text. Also, because it can handle such huge volumes of data (and because storage costs are minimized due to the distributed nature of the file system), historical (archive) data can be managed easily with this approach.

3. What are the use cases for data warehousing and RDBMS?

Three main use cases for data warehousing are performance, integration, and the availability of a wide variety of BI tools. The relational data warehouse approach is quite mature, and database vendors are constantly adding new index types, partitioning, statistics, and optimizer features. This enables complex queries to be done quickly, a must for any BI application. Data warehousing, and the ETL process, provide a robust mechanism for collecting, cleaning, and integrating data. And it is increasingly easy for end users to create reports, graphs, and visualizations of the data.

4. In what scenarios can Hadoop and RDBMS coexist?

There are several possible scenarios under which using a combination of Hadoop and relational DBMS-based data warehousing technologies makes sense. For example, you can use Hadoop for storing and archiving multi-structured data, with a connector to a relational DBMS that extracts required data from Hadoop for analysis by the relational DBMS. Hadoop can also be used to filter and transform multi-structured data for transporting to a data warehouse, and can also be used to analyze multi-structured data for publishing into the data warehouse environment. Combining SQL and MapReduce query functions enables data scientists to analyze both structured and unstructured data. Also, front-end query tools are available for both platforms.
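The second coexistence scenario (Hadoop filters and transforms multi-structured data; the relational DBMS analyzes the structured extract) can be sketched in miniature as follows. The raw records, the filtering rule, and the table layout are all invented for illustration; in a real deployment the transform step would run as a MapReduce job over HDFS and the load step would use a warehouse connector.

# A toy version of the "Hadoop filters and transforms, the RDBMS analyzes"
# scenario, with sqlite3 standing in for the data warehouse.
import json
import sqlite3

raw_records = [                      # stand-ins for multi-structured HDFS data
    '{"user": "u1", "action": "purchase", "amount": 40.0}',
    'malformed line that a transform job would discard',
    '{"user": "u2", "action": "browse"}',
    '{"user": "u1", "action": "purchase", "amount": 15.5}',
]

def transform(lines):
    """Keep only well-formed purchase events, shaped for a relational table."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                 # drop records that do not parse
        if rec.get("action") == "purchase":
            yield (rec["user"], rec["amount"])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (user TEXT, amount REAL)")
db.executemany("INSERT INTO purchases VALUES (?, ?)", transform(raw_records))

# Analysts now query the structured extract with plain SQL.
for row in db.execute("SELECT user, SUM(amount) FROM purchases GROUP BY user"):
    print(row)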
Section 7.6 Review Questions

1. What is special about the Big Data vendor landscape? Who are the big players?

The Big Data vendor landscape is developing very rapidly. It is in a special period of evolution where entrepreneurial startup firms bring innovative solutions to the marketplace. Cloudera is a market leader in the Hadoop space. MapR and Hortonworks are two other Hadoop startups. DataStax is an example of a NoSQL vendor. Informatica, Pervasive Software, Syncsort, and MicroStrategy are also players. Most of the growth in the industry is with Hadoop and NoSQL distributors and analytics providers. There is still very little in terms of Big Data application vendors. Meanwhile, the next-generation data warehouse market has experienced significant consolidation. Four leading vendors in this space (Netezza, Greenplum, Vertica, and Aster Data) were acquired by IBM, EMC, HP, and Teradata, respectively. Mega-vendors Oracle and IBM also play in the Big Data space, connecting and consolidating their products with Hadoop and NoSQL engines.

2. How do you think the Big Data vendor landscape will change in the near future? Why?

As the field matures, more and more traditional data vendors will incorporate Big Data into their architectures. We already saw something similar with the incorporation of XML data types and XPath processing engines in relational database engines. Also, the Big Data market will be increasingly cloud-based, and hosting services will include Big Data storage options along with the traditional MySQL and SQL Server options. Vendors providing Big Data applications and services, for example in the finance domain or for scientific purposes, will begin to proliferate. (Different students will have different answers.)

3. What is the role of visual analytics in the world of Big Data?

Visual analytics help organizations uncover trends, relationships, and anomalies by visually sifting through very large quantities of data. Many vendors are developing visual analytics offerings, which have traditionally applied to structured data warehouse environments (relational and multidimensional), for the Big Data space. To be successful, a visual analytics application must allow for the coexistence and integration of relational and multistructured data.

Section 7.7 Review Questions

1. What is a stream (in the Big Data world)?

A stream can be thought of as an unbounded flow or sequence of data elements, arriving continuously at high velocity. Streams often cannot be efficiently or effectively stored for subsequent processing; thus Big Data concerns about velocity (one of the six Vs) are especially prevalent when dealing with streams. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, web searches, and financial data.
2. What are the motivations for stream analytics?

In situations where data streams in rapidly and continuously, it is no longer feasible to “store everything” and analyze it later. Traditional analytics approaches that work with previously accumulated data (i.e., data at rest) often either arrive at the wrong decisions because of using too much out-of-context data, or they arrive at the correct decisions but too late to be of any use to the organization. Therefore, it is critical for a number of business situations to analyze the data soon after it is created and/or as soon as it is streamed into the analytics system.

3. What is stream analytics? How does it differ from regular analytics?

Stream analytics is the process of extracting actionable information from continuously flowing/streaming data. It is also sometimes called “data-in-motion analytics” or “real-time data analytics.” It differs from regular analytics in that it deals with high-velocity (and transient) data streams instead of more permanent data stores like databases, files, or web pages.

4. What is critical event processing? How does it relate to stream analytics?

Critical event processing is a method of capturing, tracking, and analyzing streams of data to detect events (out-of-normal happenings) of certain types that are worthy of the effort. It involves combining data from multiple sources to infer events or patterns of interest. An event may also be defined generically as a “change of state,” which may be detected as a measurement exceeding a predefined threshold of time, temperature, or some other value. This applies to stream analytics because the events are detected in real time, as the data flows in. (A minimal sketch of this idea, combined with data stream mining, follows the next answer.)

5. Define data stream mining. What additional challenges are posed by data stream mining?

Data stream mining is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. Processing data streams, as opposed to more permanent data stores, is a challenge. Traditional data mining techniques can process data recursively and repetitively because the data is permanent. By contrast, a data stream is a continuous flow of an ordered sequence of instances that can be read only once and must be processed immediately as it comes in.
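The following minimal Python sketch combines the two ideas above: a one-pass reader that keeps only a bounded sliding window (since the stream cannot be stored and re-read) and flags a critical event whenever a reading crosses a threshold. The sensor values, window size, and threshold are invented for illustration.

# One-pass stream processing with a bounded sliding window and simple
# threshold-based event detection. Each element is seen exactly once.
from collections import deque

WINDOW = 5          # keep only the last 5 readings (assumed window size)
THRESHOLD = 90.0    # assumed critical temperature threshold

def monitor(stream):
    window = deque(maxlen=WINDOW)   # old readings fall out automatically
    for reading in stream:          # a single pass; no second look possible
        window.append(reading)
        avg = sum(window) / len(window)
        if reading > THRESHOLD:     # critical event: change of state detected
            print(f"ALERT: {reading} exceeds {THRESHOLD} "
                  f"(window avg {avg:.1f})")

# Simulated sensor stream; in practice this would be an unbounded feed.
monitor([72.0, 75.5, 80.2, 95.3, 88.1, 91.0])

The bounded window is the key design constraint: memory use stays fixed no matter how long the stream runs, which is exactly what the one-pass nature of data stream mining demands.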
Section 7.8 Review Questions

1. What are the most fruitful industries for stream analytics?

Many industries can benefit from stream analytics. Some prominent examples include e-commerce, telecommunications, law enforcement, cyber security, the power industry, health sciences, and the government.

2. How can stream analytics be used in e-commerce?

Companies such as Amazon and eBay use stream analytics to analyze customer behavior in real time. Every page visit, every product looked at, every search conducted, and every click made is recorded and analyzed to maximize the value gained from a user’s visit. Behind the scenes, advanced analytics crunch the real-time data coming from our clicks, and the clicks of thousands of others, to “understand” what it is that we are interested in (in some cases, before even we know it) and make the most of that information through creative offerings. (A small clickstream sketch appears at the end of this section.)

3. In addition to what is listed in this section, can you think of other industries and/or application areas where stream analytics can be used?

Stream analytics could be of great benefit to any industry that faces an influx of relevant real-time data and needs to make quick decisions. One example is the news industry. By rapidly sifting through data streaming in, a news organization can recognize “newsworthy” themes (i.e., critical events). Another benefit would be for weather tracking in order to better predict tornadoes or other natural disasters. (Different students will have different answers.)

4. Compared to regular analytics, do you think stream analytics will have more (or less) use cases in the era of Big Data analytics? Why?

Stream analytics can be thought of as a subset of analytics in general, just like “regular” analytics. The question is, what does “regular” mean? Regular analytics may refer to traditional data warehousing approaches, which does constrain the types of data sources and hence the use cases. Or, “regular” may mean analytics on any type of permanently stored architecture (as opposed to transient streams). In this case, you have more use cases for “regular” (including Big Data) than in the previous definition. In either case, there will probably be plenty of times when “regular” use cases will continue to play a role, even in the era of Big Data analytics. (Different students will have different answers.)
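As referenced in the answer to question 2, here is a small hedged sketch of clickstream analysis: counting product views over only the most recent clicks, so that a “trending now” offer can react in real time. The click format, window size, and sample data are all invented for illustration.

# Real-time "trending products" over the most recent clicks only.
from collections import Counter, deque

RECENT = 4   # assumed window: react to the last 4 clicks only

def trending(clicks, top_n=2):
    window = deque(maxlen=RECENT)     # only recent behavior matters
    for user, product in clicks:      # one pass over the live click feed
        window.append(product)
        top = Counter(window).most_common(top_n)
        print(f"after {user} viewed {product}: trending = {top}")

trending([("u1", "phone"), ("u2", "phone"), ("u3", "tablet"),
          ("u4", "phone"), ("u5", "tablet")])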