Business Intelligence, 4e (Sharda/Delen/Turban)
Chapter 4 Predictive Analytics I: Data Mining Process, Methods, and Algorithms
Copyright © 2018 Pearson Education, Inc.
1) In the opening case, police detectives used data mining to identify possible new areas of inquiry.
Answer: FALSE
Diff: 1 Page Ref: 190-191
2) The cost of data storage has plummeted recently, making data mining feasible for more firms.
Answer: TRUE
Diff: 2 Page Ref: 194
3) Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.
Answer: FALSE
Diff: 2 Page Ref: 193
4) If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining."
Answer: TRUE
Diff: 2 Page Ref: 196
5) The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit.
Answer: FALSE
Diff: 3 Page Ref: 194-195
6) Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system.
Answer: FALSE
Diff: 2 Page Ref: 197
7) Ratio data is a type of categorical data.
Answer: FALSE
Diff: 1 Page Ref: 202
8) Converting continuous valued numerical variables to ranges and categories is referred to as discretization.
Answer: TRUE
Diff: 2 Page Ref: 202
9) In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime.
Answer: FALSE
Diff: 1 Page Ref: 190-191
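Illustration for question 8: discretization can be sketched in a few lines of Python. This is a minimal example, not taken from the textbook; the function name `discretize` and the age bands are assumptions chosen only to show a continuous value being mapped to a labeled range.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a continuous value to the label of the range it falls in."""
    # edges are the cut points between consecutive ranges,
    # so there must be exactly one more label than edges.
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    # bisect_right counts how many cut points are <= value,
    # which is the index of the range containing the value.
    return labels[bisect_right(edges, value)]


# Hypothetical age bands, chosen only for illustration.
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["minor", "young adult", "middle-aged", "senior"]

ages = [12, 18, 34.5, 70]
print([discretize(a, AGE_EDGES, AGE_LABELS) for a in ages])
```

A value equal to a cut point (18 here) is placed in the range above it; that convention is a design choice of this sketch.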
10) In data mining, classification models help in prediction.
Answer: TRUE
Diff: 2 Page Ref: 215
11) Statistics and data mining both look for data sets that are as large as possible.
Answer: FALSE
Diff: 2 Page Ref: 216
12) Using data mining on data about imports and exports can help to detect tax avoidance and money laundering.
Answer: TRUE
Diff: 1 Page Ref: 206
13) In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals.
Answer: FALSE
Diff: 2 Page Ref: 209-210
14) During classification in data mining, a false positive is an occurrence classified as true by the algorithm while being false in reality.
Answer: TRUE
Diff: 2 Page Ref: 216
15) K-fold cross-validation is also called sliding estimation.
Answer: FALSE
Diff: 2 Page Ref: 218
16) When a problem has many attributes that impact the classification of different patterns, decision trees may be a useful approach.
Answer: TRUE
Diff: 2 Page Ref: 221
17) In the Dell case study, the largest issue was how to properly spend the online marketing budget.
Answer: FALSE
Diff: 2 Page Ref: 198-199
18) Market basket analysis is a useful and entertaining way to explain data mining to a technologically less savvy audience, but it has little business significance.
Answer: FALSE
Diff: 2 Page Ref: 227
19) Open-source data mining tools include applications such as IBM SPSS Modeler and Dell Statistica.
Answer: FALSE
Diff: 1 Page Ref: 231
20) Data that is collected, stored, and analyzed in data mining is often private and personal. There is no way to maintain individuals' privacy other than being very careful about physical data security.
Answer: FALSE
Diff: 2 Page Ref: 237
21) In the Influence Health case study, what was the goal of the system?
A) locating clinic patients
B) understanding follow-up care
C) decreasing operational costs
D) increasing service use
Answer: D
Diff: 3 Page Ref: 224
22) Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from
A) collecting data about customers and transactions.
B) developing a philosophy that is data analytics-centric.
C) analyzing the vast data amounts routinely collected.
D) asking the customers what they want.
Answer: C
Diff: 3 Page Ref: 193
23) All of the following statements about data mining are true EXCEPT
A) the process aspect means that data mining should be a one-step process to results.
B) the novel aspect means that previously unknown patterns are discovered.
C) the potentially useful aspect means that results should lead to some business benefit.
D) the valid aspect means that the discovered patterns should hold true on new data.
Answer: A
Diff: 3 Page Ref: 196
24) What is the main reason parallel processing is sometimes used for data mining?
A) because the hardware exists in most organizations, and it is available to use
B) because most of the algorithms used for data mining require it
C) because of the massive data amounts and search efforts involved
D) because any strategic application requires parallel processing
Answer: C
Diff: 3 Page Ref: 197
25) The data field "ethnic group" can be best described as
A) nominal data.
B) interval data.
C) ordinal data.
D) ratio data.
Answer: A
Diff: 2 Page Ref: 208
26) A data mining study is specific to addressing a well-defined business task, and different business tasks require
A) general organizational data.
B) general industry data.
C) general economic data.
D) different sets of data.
Answer: D
Diff: 2 Page Ref: 208
27) Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes?
A) associations
B) visualization
C) classification
D) clustering
Answer: C
Diff: 2 Page Ref: 200
28) Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features?
A) associations
B) visualization
C) classification
D) clustering
Answer: D
Diff: 2 Page Ref: 200
29) Clustering partitions a collection of things into segments whose members share
A) similar characteristics.
B) dissimilar characteristics.
C) similar collection methods.
D) dissimilar collection methods.
Answer: A
Diff: 2 Page Ref: 202
30) Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications?
A) insurance
B) retailing and logistics
C) customer relationship management
D) computer hardware and software
Answer: A
Diff: 2 Page Ref: 204
31) All of the following statements about data mining are true EXCEPT:
A) The term is relatively new.
B) Its techniques have their roots in traditional statistical analysis and artificial intelligence.
C) The ideas behind it are relatively new.
D) Intense, global competition makes its application more important.
Answer: C
Diff: 2 Page Ref: 194
32) Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings?
A) SEMMA
B) proprietary organizational methodologies
C) KDD Process
D) CRISP-DM
Answer: D
Diff: 2 Page Ref: 214
33) Prediction problems where the variables have numeric values are most accurately defined as
A) classifications.
B) regressions.
C) associations.
D) computations.
Answer: B
Diff: 3 Page Ref: 215
34) What does the robustness of a data mining method refer to?
A) its ability to predict the outcome of a previously unknown data set accurately
B) its speed of computation and computational costs in using the model
C) its ability to construct a prediction model efficiently given a large amount of data
D) its ability to overcome noisy data to make somewhat accurate predictions
Answer: D
Diff: 3 Page Ref: 216
35) What does the scalability of a data mining method refer to?
A) its ability to predict the outcome of a previously unknown data set accurately
B) its speed of computation and computational costs in using the model
C) its ability to construct a prediction model efficiently given a large amount of data
D) its ability to overcome noisy data to make somewhat accurate predictions
Answer: C
Diff: 3 Page Ref: 216
36) In estimating the accuracy of data mining (or other) classification models, the true positive rate is
A) the ratio of correctly classified positives divided by the total positive count.
B) the ratio of correctly classified negatives divided by the total negative count.
C) the ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified positives.
D) the ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified negatives.
Answer: A
Diff: 2 Page Ref: 216-217
37) In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as
A) association rule mining.
B) cluster analysis.
C) decision trees.
D) artificial neural networks.
Answer: A
Diff: 2 Page Ref: 227
38) Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by
A) asking data users to use the data ethically.
B) leaving in identifiers (e.g., name), but changing other variables.
C) removing identifiers such as names and social security numbers.
D) letting individuals in the data know their data is being accessed.
Answer: C
Diff: 3 Page Ref: 237
39) In the Target case study, why did Target send maternity ads to a teen?
A) Target's analytic model confused her with an older woman with a similar name.
B) Target was sending ads to all women in a particular neighborhood.
C) Target's analytic model suggested she was pregnant based on her buying habits.
D) Target was using a special promotion that targeted all teens in her geographical area.
Answer: C
Diff: 2 Page Ref: 238
40) Which of the following is a data mining myth?
A) Data mining is a multistep process that requires deliberate, proactive design and use.
B) Data mining requires a separate, dedicated database.
C) The current state-of-the-art is ready to go for almost any business.
D) Newer Web-based tools enable managers of all educational levels to do data mining.
Answer: B
Diff: 2 Page Ref: 239-240
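Illustration for questions 14 and 36: the true positive rate and the false positive count can be computed directly from paired actual and predicted labels. The sketch below is not from the textbook; the function names and the eight sample labels are assumptions made up for the example.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # predicted positive, really positive
        elif p and not a:
            fp += 1  # false positive: predicted positive, really negative
        elif not p and a:
            fn += 1  # false negative: a positive the model missed
        else:
            tn += 1  # predicted negative, really negative
    return tp, fp, tn, fn


def true_positive_rate(actual, predicted):
    """Correctly classified positives divided by the total positive count."""
    tp, _, _, fn = confusion_counts(actual, predicted)
    total_positives = tp + fn
    if total_positives == 0:
        raise ValueError("no actual positives in the data")
    return tp / total_positives


# Made-up labels: 1 = positive, 0 = negative.
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0]

print(confusion_counts(actual, predicted))
print(true_positive_rate(actual, predicted))
```

The total positive count is TP + FN because every actual positive is either caught (TP) or missed (FN) by the classifier.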
41) In the Influence Health case, the company was able to evaluate over ________ million records in only two days.
Answer: 195
Diff: 3 Page Ref: 225
42) There has been an increase in data mining to deal with global competition and customers' more sophisticated ________ and wants.
Answer: needs
Diff: 2 Page Ref: 194
43) Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging are all alternative names for ________.
Answer: data mining
Diff: 1 Page Ref: 196
44) Data are often buried deep within very large ________, which sometimes contain data from several years.
Answer: databases
Diff: 1 Page Ref: 196
45) ________ was proposed in the mid-1990s by a European consortium of companies to serve as a nonproprietary standard methodology for data mining.
Answer: CRISP-DM
Diff: 2 Page Ref: 207
46) In the Dell case study, engineers working closely with marketing used lean software development strategies and numerous technologies to create a highly scalable, singular ________.
Answer: data mart
Diff: 2 Page Ref: 199
47) Patterns have been manually ________ from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches.
Answer: extracted
Diff: 2 Page Ref: 200
48) While prediction is largely experience and opinion based, ________ is data and model based.
Answer: forecasting
Diff: 2 Page Ref: 200
49) Whereas ________ starts with a well-defined proposition and hypothesis, data mining starts with a loosely defined discovery statement.
Answer: statistics
Diff: 2 Page Ref: 203
50) Customer ________ management extends traditional marketing by creating one-on-one relationships with customers.
Answer: relationship
Diff: 2 Page Ref: 203
51) In the terrorist funding case study, an observed price ________ may be related to income tax avoidance/evasion, money laundering, or terrorist financing.
Answer: deviation
Diff: 3 Page Ref: 206
52) Data preparation, the third step in the CRISP-DM data mining process, is more commonly known as ________.
Answer: data preprocessing
Diff: 2 Page Ref: 208
53) The data mining in cancer research case study explains that data mining methods are capable of extracting patterns and ________ hidden deep in large and complex medical databases.
Answer: relationships
Diff: 3 Page Ref: 209-210
54) Fayyad et al. (1996) defined ________ in databases as a process of using data mining methods to find useful information and patterns in the data.
Answer: knowledge discovery
Diff: 2 Page Ref: 213
55) In ________, a classification method, the complete data set is randomly split into mutually exclusive subsets of approximately equal size and tested multiple times on each left-out subset, using the others as a training set.
Answer: k-fold cross-validation
Diff: 2 Page Ref: 218
56) The basic idea behind a(n) ________ is that it recursively divides a training set until each division consists entirely or primarily of examples from one class.
Answer: decision tree
Diff: 3 Page Ref: 221
57) As described in the Influence Health case study, customers are more often ________ services from a variety of healthcare service providers before selecting one.
Answer: comparing
Diff: 2 Page Ref: 224
58) Because of its successful application to retail business problems, association rule mining is commonly called ________.
Answer: market-basket analysis
Diff: 2 Page Ref: 227
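Illustration for question 55: the splitting scheme behind k-fold cross-validation can be sketched as follows. This is a minimal example under stated assumptions, not the textbook's procedure; the function name `k_fold_splits`, the fixed random seed, and the round-robin assignment of shuffled indices to folds are all choices made for this sketch. Training and scoring a model on each split is left out.

```python
import random


def k_fold_splits(n_records, k, seed=0):
    """Return k (train_indices, test_indices) pairs over range(n_records)."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(n_records))
    # Shuffle once so the folds are random but reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k mutually exclusive
    # folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Each fold is left out once; the remaining folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_splits(n_records=10, k=5)
for train, test in splits:
    print(sorted(test), "held out;", len(train), "records used for training")
```

Every record is used for testing exactly once and for training k - 1 times, which is what separates this scheme from a single holdout split.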
59) The ________ is the most commonly used algorithm to discover association rules. Given a set of itemsets, the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets.
Answer: Apriori algorithm
Diff: 2 Page Ref: 229
60) One way to accomplish privacy and protection of individuals' rights when data mining is by ________ of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual.
Answer: de-identification
Diff: 2 Page Ref: 237
61) List five reasons for the growing popularity of data mining in the business world.
Answer:
• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace
• General recognition of the untapped value hidden in large data sources
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse
• The exponential increase in data processing and storage technologies
• Significant reduction in the cost of hardware and software for data storage and processing
• Movement toward the demassification (conversion of information resources into nonphysical form) of business practices
Diff: 2 Page Ref: 194
62) List 3 common data mining myths and realities.
Answer:
1) Myth: Data mining provides instant, crystal-ball-like predictions. Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
2) Myth: Data mining is not yet viable for mainstream business applications. Reality: The current state of the art is ready to go for almost any business type and/or size.
3) Myth: Data mining requires a separate, dedicated database. Reality: Because of the advances in database technology, a dedicated database is not required.
4) Myth: Only those with advanced degrees can do data mining. Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
5) Myth: Data mining is only for large firms that have lots of customer data. Reality: If the data accurately reflect the business or its customers, any company can use data mining.
Diff: 2 Page Ref: 239
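Illustration for question 59: the level-by-level search that the Apriori algorithm performs for frequent itemsets can be sketched as below. This is a simplified teaching version, not the textbook's pseudocode and not an optimized implementation; the function name, the use of an absolute count for `min_support`, and the five sample baskets are assumptions. It stops at frequent itemsets and does not go on to derive rules or confidence values.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that appears anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are
        # frequent, since a superset of an infrequent set cannot be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Made-up market baskets for illustration.
baskets = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
    {"milk", "bread", "beer"},
]
result = apriori_frequent_itemsets(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With these baskets and a minimum support of 3, the only frequent pair is milk and bread; every other pair appears in just two baskets and is discarded before any three-item candidates are formed.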
63) List and briefly describe the six steps of the CRISP-DM data mining process.
Answer:
Step 1: Business Understanding - The key element of any data mining study is to know what the study is for. Answering such a question begins with a thorough understanding of the managerial need for new knowledge and an explicit specification of the business objective regarding the study to be conducted.
Step 2: Data Understanding - A data mining study is specific to addressing a well-defined business task, and different business tasks require different sets of data. Following the business understanding, the main activity of the data mining process is to identify the relevant data from many available databases.
Step 3: Data Preparation - The purpose of data preparation (more commonly called data preprocessing) is to take the data identified in the previous step and prepare it for analysis by data mining methods. Compared to the other steps in CRISP-DM, data preprocessing consumes the most time and effort; most believe that this step accounts for roughly 80 percent of the total time spent on a data mining project.
Step 4: Model Building - Here, various modeling techniques are selected and applied to an already prepared data set in order to address the specific business need. The model-building step also encompasses the assessment and comparative analysis of the various models built.
Step 5: Testing and Evaluation - In step 5, the developed models are assessed and evaluated for their accuracy and generality. This step assesses the degree to which the selected model (or models) meets the business objectives and, if so, to what extent (i.e., do more models need to be developed and assessed).
Step 6: Deployment - Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases, it is the customer, not the data analyst, who carries out the deployment steps.
Diff: 2 Page Ref: 207-212
64) Describe the role of the simple split in estimating the accuracy of classification models.
Answer: The simple split (or holdout or test sample estimation) partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer (model builder), and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing.
Diff: 2 Page Ref: 217
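The simple split described in the answer to question 64 can be illustrated with a short sketch. This is a minimal illustration using only the Python standard library, not a method prescribed by the textbook. The function names `simple_split` and `three_way_split` are hypothetical, and the 60/20/20 proportions for the three-way case are an assumption, since the answer above gives proportions only for the two-way split.

```python
import random


def simple_split(records, train_fraction=2 / 3, seed=0):
    """Partition records into mutually exclusive training and test sets.

    The two-thirds default follows the common convention stated in the
    answer; the seed parameter is an assumption added for repeatability.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, not the input
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def three_way_split(records, fractions=(0.6, 0.2, 0.2), seed=0):
    """Partition records into training, validation, and test sets.

    Used when the classifier is an artificial neural network. The 60/20/20
    proportions are an illustrative assumption, not taken from the source.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut1 = round(len(shuffled) * fractions[0])
    cut2 = cut1 + round(len(shuffled) * fractions[1])
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]


# Example with 30 hypothetical record identifiers.
train, test = simple_split(range(30))
print(len(train), len(test))
```

Because every record lands in exactly one subset, the classifier is never tested on data it was trained on, which is what makes the test-set accuracy a usable estimate.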