Business Intelligence, Analytics, and Data Science: A Managerial Perspective, Teaching Resources (Instructor's Manual, 4th Edition of the original book), 02 Descriptive Analytics I: Nature of Data, Statistical Modeling, and Visualization



Copyright © 2018 Pearson Education, Inc.

new bacon,” “data is the new currency,” and “data is the king” are further stressing the renewed importance of data. But what type of data are we talking about? Obviously, not just any data. The “garbage in, garbage out” (GIGO) principle applies to today’s “Big Data” phenomenon more so than to any data definition that we have had in the past. To live up to its promise, its value proposition, and its ability to turn into insight, data has to be carefully created/identified, collected, integrated, cleaned, transformed, and properly contextualized for use in accurate and timely decision making.

Data is the main theme of this chapter. Accordingly, the chapter starts with a description of the nature of data: what it is, what different types and forms it can come in, and how it can be preprocessed and made ready for analytics. The first few sections of the chapter are dedicated to a deep yet necessary understanding and processing of data. The next few sections describe the statistical methods used to prepare data as input to produce both descriptive and inferential measures. Following the statistics sections are sections on reporting and visualization. A report is a communication artifact prepared with the specific intention of converting data into information and knowledge and relaying that information in an easily understandable/digestible format. Nowadays, these reports are more visually oriented, often using colors and graphical icons that collectively look like a dashboard to enhance the information content. Therefore, the latter part of the chapter is dedicated to subsections that present the design, implementation, and best practices for information visualization, storytelling, and information dashboards.

CHAPTER OUTLINE
2.1 Opening Vignette: SiriusXM Attracts and Engages a New Generation of Radio Consumers with Data-Driven Marketing
2.2 The Nature of Data
2.3 A Simple Taxonomy of Data
2.4 The Art and Science of Data Preprocessing
2.5 Statistical Modeling for Business Analytics
2.6 Regression Modeling for Inferential Statistics
2.7 Business Reporting
2.8 Data Visualization
2.9 Different Types of Charts and Graphs
2.10 The Emergence of Visual Analytics
2.11 Information Dashboards

ANSWERS TO END OF SECTION REVIEW QUESTIONS

Section 2.1 Review Questions

1. What does SiriusXM do? In what type of market does it conduct its business?
SiriusXM is a provider of satellite radio. It primarily provides its service in automobiles.

2. What were the challenges? Comment on both technology and data-related challenges.
The company had several challenges. The first was the changing demographics of car owners. As cars were sold on the secondary market, it became more difficult for the company to identify new potential customers. Additionally, the company had a technical challenge because of an acquisition. There was uncertainty about its ability to use all of the technology available through the acquisition.

3. What were the proposed solutions?
The company felt that it would be able to maintain a strategic advantage if it began working toward being a data-driven marketing company. This would allow it to target current and potential customers more precisely.

4. How did they implement the proposed solutions? Did they face any implementation challenges?
The company decided to bring all marketing work in-house. It determined that it was important to clean the data and manage it in a central repository. To do this, it partnered with Teradata. There were challenges with the implementation due to the variability in the data itself and the complexity of the task.

5. What were the results and benefits? Were they worth the effort/investment?
The company has been able to progress significantly in its goal of becoming a data-driven marketing organization. With the new systems in place, it is possible to move campaigns faster with better visibility.

6. Can you think of other companies facing similar challenges that can potentially benefit from similar data-driven marketing solutions?
Most companies that market directly to end users could use a similar approach to managing and leveraging data in their marketing activities.

2. What are the main categories of data? What types of data can we use for BI and analytics?
The main categories of data are structured data and unstructured data. Both of these types of data can be used for business intelligence and analytics, although it is easier and more expedient to use structured data.

3. Can we use the same data representation for all analytics models? Why, or why not?
No. Other data types, including textual, spatial, imagery, video, and voice, need to be converted into some form of categorical or numeric representation before they can be processed by analytics methods.

4. What is a 1-of-N data representation? Why and where is it used in analytics?
Nominal or ordinal variables are converted into numeric representations using some type of 1-of-N pseudo variables (e.g., a categorical variable with three unique values can be transformed into three pseudo variables with binary values, 1 or 0). This allows the variable to be used in predictive analytics methods that require numeric input.

Section 2.4 Review Questions

1. Why is the original/raw data not readily usable by analytics tasks?
It is often dirty, misaligned, overly complex, and inaccurate.

2. What are the main data preprocessing steps?
The main data preprocessing steps include data consolidation, data cleaning, data transformation, and data reduction.

3. What does it mean to clean/scrub the data? What activities are performed in this phase?
In this step, problematic values in the data set are identified and dealt with. The analyst will identify noisy values in the data and smooth them out, as well as address any missing values.

4. Why do we need data transformation? What are the commonly used data transformation tasks?
Data transformation is often needed to ensure that data is in a format in which it can be used for analysis. During data transformation the data is normalized and discretized, and new attributes are constructed.

5. Data reduction can be applied to rows (sampling) and/or columns (variable selection). Which is more challenging?

Data reduction as it applies to variable selection is more complex. This is because the variables to be studied must be selected and the others discarded. This is typically done by individuals who are experts in the field.

Section 2.5 Review Questions

1. What is the relationship between statistics and business analytics?
Statistics can be used as a part of business analytics, either to help generate reports or as a presentation format.

2. What are the main differences between descriptive and inferential statistics?
Descriptive statistics is all about describing the sample data on hand, and inferential statistics is about drawing inferences or conclusions about the characteristics of the population.

3. List and briefly define the central tendency measures of descriptive statistics.
Measures of centrality are the mathematical methods by which we estimate or describe the central positioning of a given variable of interest. A measure of central tendency is a single numerical value that aims to describe a set of data by identifying or estimating the central position within the data.
The arithmetic mean (or simply mean or average) is the sum of all the values/observations divided by the number of observations in the data set.
The median is the center value in a given data set. It is the number in the middle of a given set of data that has been arranged/sorted in order of magnitude (either ascending or descending).
The mode is the observation that occurs most frequently (the most frequent value in the data set).

4. List and briefly define the dispersion measures of descriptive statistics.
Measures of dispersion are the mathematical methods used to estimate or describe the degree of variation in a given variable of interest.
The range is the difference between the largest and the smallest values in a given data set (i.e., variable).
Variance is a method used to calculate the deviation of all data points in a given data set from the mean; it is based on the squared differences between each data point and the mean.
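The measures defined so far (mean, median, mode, range, and variance) can be illustrated with a minimal sketch using Python's standard statistics module. The sample values are hypothetical and chosen only for illustration, and the sketch assumes the sample variance (dividing by n - 1), which the manual does not specify.

```python
import statistics

# Hypothetical sample (illustrative values only, not from the textbook).
values = [12, 15, 15, 18, 21, 24, 35]

mean = statistics.mean(values)            # sum of the values / number of values
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequently occurring value
data_range = max(values) - min(values)    # largest value minus smallest value
# Assumption: sample variance (divides by n - 1); use statistics.pvariance
# for the population variance (divides by n).
variance = statistics.variance(values)

print(mean, median, mode, data_range, variance)
```

Note that the single large value (35) pulls the mean above the median, which is one reason both measures are usually reported together.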


The standard deviation is a measure of the spread of values within a set of data. The standard deviation is calculated by simply taking the square root of the variance.
Mean absolute deviation is calculated by measuring the absolute values of the differences between each data point and the mean and averaging them.
Quartiles help us identify spread within a subset of the data. A quartile is a quarter of the number of data points given in a data set. Quartiles are determined by first sorting the data and then splitting the sorted data into four disjoint smaller data sets.

5. What is a box-and-whiskers plot? What types of statistical information does it represent?
The box-and-whiskers plot is a graphical illustration of several descriptive statistics about a given data set. The box plot shows the centrality, the dispersion, and the minimum and maximum ranges.

6. What are the two most commonly used shape characteristics to describe a data distribution?
Skewness is a measure of asymmetry in a distribution of the data that portrays a unimodal structure (only one peak exists in the distribution of the data). Kurtosis is another measure used in characterizing the shape of a unimodal distribution; it is more interested in characterizing the peak/tall/skinny nature of the distribution.

Section 2.6 Review Questions

1. What is regression, and what statistical purpose does it serve?
Regression is a relatively simple statistical technique to model the dependence of a variable (response or output variable) on one (or more) explanatory (input) variables.

2. What are the commonalities and differences between regression and correlation?
Correlation makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate of the degree of association between the variables. On the other hand, regression attempts to describe the dependence of a response variable on one (or more) explanatory variables, where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. Also, although correlation is interested in the low-level relationships between two variables,


regression is concerned with the relationships between all explanatory variables and the response variable.

3. What is OLS? How does OLS determine the linear regression line?
The ordinary least squares (OLS) method aims to minimize the sum of squared residuals and leads to a mathematical expression for the estimated value of the regression line.

4. List and describe the main steps to follow in developing a linear regression model.
First, perform a quick assessment of the data through the use of a scatter plot and/or correlations. Next, perform model fitting by transforming the data into a more usable format and estimating any needed parameters. Third, assess the model by testing its assumptions and evaluating its fit. Finally, if the steps show that regression is warranted, deploy the model and calculate the regression.

5. What are the most commonly pronounced assumptions for linear regression?
The most commonly pronounced assumptions for linear regression include linearity, independence, normality, constant variance, and multicollinearity (i.e., that the explanatory variables are not highly correlated with one another).

6. What is logistic regression? How does it differ from linear regression?
Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. It differs from linear regression in one major point: its output (response variable) is a class as opposed to a numerical variable.

7. What is time series? What are the main forecasting techniques for time series data?
Time series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values.

Section 2.7 Review Questions

1. What is a report? What are reports used for?
A report is any communication artifact prepared with the specific intention of conveying information in a presentable form to whoever needs it, whenever and wherever they may need it. It is usually a document that contains information (usually derived from data and personal experiences) organized in a narrative, graphic, and/or tabular form, prepared periodically (recurring) or on an as-required (ad hoc) basis, referring to specific time periods, events, occurrences, or subjects.


2. What is a business report? What are the main characteristics of a good business report?
A business report is a written document that contains information regarding business matters. Business reporting (also called enterprise reporting) is an essential part of the larger drive toward improved managerial decision making and organizational knowledge management. The foundation of these reports is various sources of data coming from both inside and outside the organization. Creation of these reports involves ETL (extract, transform, and load) procedures in coordination with a data warehouse and then using one or more reporting tools. While reports can be distributed in print form or via e-mail, they are typically accessed via a corporate intranet. Primary characteristics of a good business report include clarity, brevity, completeness, and correctness.

3. Describe the cyclic process of management and comment on the role of business reports.
The cyclic process of management, as illustrated in Figure 2.1, involves these steps: data acquisition leads to information generation, which leads to decision making, which leads to business process management. Perhaps the most critical task in this cyclic process is the reporting (i.e., information generation): converting data from different sources into actionable information.

4. List and describe the three major categories of business reports.
There are a wide variety of business reports, which for managerial purposes can be grouped into three major categories: metric management reports, dashboard-type reports, and balanced scorecard-type reports. Metric management reports involve outcome-oriented metrics based on service level agreements and/or key performance indicators. Dashboard-type reports present a range of performance indicators on one page, with both static/predefined elements and customizable widgets and views. Balanced scorecard reports present an integrated view of a company’s health and include financial, customer, business process, and learning/growth perspectives.

5. What are the main components of a business reporting system?
A business reporting system includes several components. One is the online transaction processing system (ERP, POS, etc.) that records transactions. A second is a data supply that takes recorded events and transactions and delivers them to the reporting system. Next comes an ETL component that ensures quality and performs necessary transformations prior to loading the data into a data store. Then there is the data storage itself (such as a data warehouse). Business logic converts the data into the reporting outputs. Publication distributes or hosts the reports for end users. And finally, assurance provides a quality control check on the reports and their dissemination.
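The components listed in the answer above can be pictured as a small pipeline. The following is only a toy sketch under stated assumptions: the transaction records, the function names (etl, business_logic, publish, assure), and the in-memory list standing in for a data warehouse are all hypothetical and are not taken from the textbook or from any real reporting tool.

```python
from collections import defaultdict

# Hypothetical transactions, as an OLTP/POS system might record them.
transactions = [
    {"store": "North", "amount": "120.50"},
    {"store": "South", "amount": "80.00"},
    {"store": "North", "amount": None},   # dirty record: missing amount
    {"store": "South", "amount": "19.50"},
]

def etl(records):
    """ETL: drop records with missing amounts and convert amounts to float."""
    return [
        {"store": r["store"], "amount": float(r["amount"])}
        for r in records
        if r["amount"] is not None
    ]

def business_logic(rows):
    """Business logic: compute total sales per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)

def publish(totals):
    """Publication: render the report as plain-text lines."""
    return [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]

def assure(totals, rows):
    """Assurance: report totals must reconcile with the stored data."""
    return abs(sum(totals.values()) - sum(r["amount"] for r in rows)) < 1e-9

warehouse = etl(transactions)        # data storage (here, just a list)
report = business_logic(warehouse)
report_lines = publish(report)
report_ok = assure(report, warehouse)

print("\n".join(report_lines))
print("assurance check passed:", report_ok)
```

In a real system each stage would be a separate tool or service; the point of the sketch is only the order of responsibilities, with quality checks both before loading (ETL) and after report generation (assurance).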


Section 2.8 Review Questions

1. What is data visualization? Why is it needed?
Data visualization, perhaps more appropriately called “information visualization,” is the use of visual representations to explore, make sense of, and communicate data. It is closely related to the fields of information graphics, scientific visualization, and statistical graphics. What is portrayed in visualizations is the information (aggregations, summarizations, and contextualization) and not the data. Companies and individuals increasingly rely on data to make good decisions. Because data is so voluminous, there is a need for visual tools that help people understand it.

2. What are the historical roots of data visualization?
Predecessors to data visualization date back to the second century AD. Today’s most popular visual forms date back a few centuries. Geographical exploration, mathematics, and popularized history spurred the creation of early maps, graphs, and timelines as far back as the 1600s. The now familiar line and bar charts date back to the late 1700s. Charles Joseph Minard used visualizations to graphically portray the losses suffered by Napoleon’s army in the Russian campaign of 1812. The 1900s saw the rise of a more formal, empirical attitude toward visualization, which tended to focus on aspects such as color, value scales, and labeling. In the 2000s the Internet emerged as a new medium for visualization and added interactivity to previously static graphics.

3. Carefully analyze Charles Joseph Minard’s graphical portrayal of Napoleon’s march. Identify and comment on all of the information dimensions captured in this ancient diagram.
In this graphic Minard managed to simultaneously represent several data dimensions, including the size of the army, direction of movement, geographic locations, outside temperature, etc. He did this in an artistic and informative manner. The background of the image is a map depicting the location of battles. There is a thick lighter band that shows the size of Napoleon’s army at each position, and a dark lower one that depicts the retreat. A line at the bottom depicts temperatures at each position in time and space.

4. Who is Edward Tufte? Why do you think we should know about his work?
Edward Tufte is a statistician whose website chronicles many historical data visualizations, including Minard’s graphic of Napoleon’s defeat. His work can bring insights into how to follow best practices for information visualization.

5. What do you think is the “next big thing” in data visualization?
The future of data/information visualization is very hard to predict. We can only extrapolate from what has already been invented: more three-dimensional
