Copyright © 2018 Pearson Education, Inc.

CHAPTER 4
Predictive Analytics I: Data Mining Process, Methods, and Algorithms

Learning Objectives for Chapter 4
▪ Define data mining as an enabling technology for business analytics
▪ Understand the objectives and benefits of data mining
▪ Become familiar with the wide range of applications of data mining
▪ Learn the standardized data mining processes
▪ Learn different methods and algorithms of data mining
▪ Build awareness of the existing data mining software tools
▪ Understand the privacy issues, pitfalls, and myths of data mining

CHAPTER OVERVIEW
Generally speaking, data mining is a way to develop intelligence (i.e., actionable information or knowledge) from data that an organization collects, organizes, and stores. Organizations use a wide range of data mining techniques to gain a better understanding of their customers and their own operations and to solve complex organizational problems. In this chapter, we study data mining as an enabling technology for business analytics, learn about the standard processes of conducting data mining projects, understand and build expertise in the use of major data mining techniques, develop awareness of the existing software tools, and explore the privacy issues, common myths, and pitfalls that are often associated with data mining.
CHAPTER OUTLINE
4.1 Opening Vignette: Miami-Dade Police Department Is Using Predictive Analytics to Foresee and Fight Crime
4.2 Data Mining Concepts and Applications
4.3 Data Mining Applications
4.4 Data Mining Process
4.5 Data Mining Methods
4.6 Data Mining Software Tools
4.7 Data Mining Privacy Issues, Myths, and Blunders

ANSWERS TO END OF SECTION REVIEW QUESTIONS

Section 4.1 Review Questions

1. Why do law enforcement agencies and departments like Miami-Dade Police Department embrace advanced analytics and data mining?
Law enforcement agencies have embraced advanced analytics and data mining because these tools allow them to address many of the needs that they have in their departments. Specifically, the tools allow them to be more efficient in their use of money and resources by being more selective in the types of activities that they engage in. Additionally, data may be used to help them look at outstanding crimes and to find new avenues that may be explored.

2. What are the top challenges for law enforcement agencies and departments like Miami-Dade Police Department? Can you think of other challenges (not mentioned in this case) that can benefit from data mining?
The top challenges for many agencies revolve around being able to provide the best possible service within a limited budget. This means that agencies must be efficient in their use of time and resources while ensuring that their results are positive. These issues are believed to be consistent across many departments and jurisdictions. In addition, other areas may struggle with specific questions about the use of funds and the cost-benefit of different types of enforcement or possible prevention programs.
3. What are the sources of data that law enforcement agencies and departments like Miami-Dade Police Department use for their predictive modeling and data mining projects?
The majority of this data comes from reports and information captured in the normal course of their work. It may be helpful in the future to tailor this collection of information so that it more specifically benefits mining efforts.

4. What type of analytics do law enforcement agencies and departments like Miami-Dade Police Department use to fight crime?
A variety of types of analytic information can be used. This can include information from reports, as well as specific information about cases being worked by detectives.

5. What does “the big picture starts small” mean in this case? Explain.
In this case, the phrase means that large changes can begin with small ideas that may have initially seemed insignificant. Being able to see those small ideas, and to evaluate which of them have merit, is a possible benefit of this type of analytic work.

Section 4.2 Review Questions

1. Define data mining. Why are there many different names and definitions for data mining?
Data mining is the process through which previously unknown patterns in data are discovered. Another definition would be “a process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data.” This includes most types of automated data analysis. A third definition: Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions because it has been stretched beyond those limits by some software vendors to include most forms of data analysis, in order to increase sales using the popularity of data mining.

2. What recent factors have increased the popularity of data mining?
Following are some of the most pronounced reasons:
• More intense competition at the global scale driven by customers’ ever-changing needs and wants in an increasingly saturated marketplace.
• General recognition of the untapped value hidden in large data sources.
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse.
• The exponential increase in data processing and storage technologies.
• Significant reduction in the cost of hardware and software for data storage and processing.
• Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices.

3. Is data mining a new discipline? Explain.
Although the term data mining is relatively new, the ideas behind it are not. Many of the techniques used in data mining have their roots in traditional statistical analysis and artificial intelligence work done since the early part of the 1980s. New or increased use of data mining applications makes it seem like data mining is a new discipline.
In general, data mining seeks to identify four major types of patterns: associations, predictions, clusters, and sequential relationships. These types of patterns have been manually extracted from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches. As datasets have grown in size and complexity, direct manual data analysis has increasingly been augmented with indirect, automatic data processing tools that use sophisticated methodologies, methods, and algorithms. The manifestation of such evolution of automated and semiautomated means of processing large datasets is now commonly referred to as data mining.

4. What are some major data mining methods and algorithms?
Generally speaking, data mining tasks can be classified into three main categories: prediction, association, and clustering. Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes. Figure 4.3 (p. 157) shows a simple taxonomy for data mining tasks, along with the learning methods and popular algorithms for each of the data mining tasks.
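The difference between supervised and unsupervised training data can be illustrated with a minimal sketch. The records, attribute meanings, and function names below are hypothetical and chosen only for illustration; the nearest-neighbor and two-group clustering routines are simple stand-ins, not the specific algorithms shown in Figure 4.3.

```python
# Hypothetical toy data: each record holds descriptive attributes (income, debt).
descriptive = [(20, 9), (25, 8), (80, 2), (90, 1)]
# Supervised learning additionally needs the class attribute of each record.
class_attribute = ["default", "default", "repay", "repay"]


def squared_distance(a, b):
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_nearest_neighbor(train_x, train_y, new_record):
    """Supervised: give a new record the class of its closest training record."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: squared_distance(train_x[i], new_record),
    )
    return train_y[nearest]


def two_means(records, iterations=10):
    """Unsupervised: form two groups from the descriptive attributes alone."""
    centers = [records[0], records[-1]]  # naive starting centers (assumption)
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for r in records:
            d0 = squared_distance(r, centers[0])
            d1 = squared_distance(r, centers[1])
            groups[0 if d0 <= d1 else 1].append(r)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups


predicted = predict_nearest_neighbor(descriptive, class_attribute, (30, 7))
clusters = two_means(descriptive)
print(predicted)
print(clusters)
```

Note that `two_means` never reads `class_attribute`: it has to discover the grouping on its own, whereas the supervised routine cannot work without the labels.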
5. What are the key differences between the major data mining methods?
• Prediction: the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term that is commonly associated with prediction is forecasting. Even though many believe that these two terms are synonymous, there is a subtle but critical difference between the two. Whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting, respectively. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common representation of the act.
• Classification: analyzing the historical behavior of groups of entities with similar characteristics, to predict the future behavior of a new entity from its similarity to those groups.
• Clustering: finding groups of entities with similar characteristics.
• Association: establishing relationships among items that occur together.
• Sequence discovery: finding time-based associations.
• Visualization: presenting results obtained through one or more of the other methods.
• Regression: a statistical estimation technique based on fitting a curve defined by a mathematical equation of known type but unknown parameters to existing data.
• Forecasting: estimating a future data value based on past data values.

Section 4.3 Review Questions

1. What are the major application areas for data mining?
Applications are listed near the beginning of this section (pp. 160-161): CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports.

2. Identify at least five specific applications of data mining and list five common characteristics of these applications.
This question expands on the prior question by asking for common characteristics. Several such applications and their characteristics are listed on pp. 160-161.
3. What do you think is the most prominent application area for data mining? Why?
Students’ answers will differ depending on which of the applications (most likely banking, retailing and logistics, manufacturing and production, government, healthcare, medicine, or homeland security) they think is most in need of greater certainty. Their reasons for selection should relate to the application area’s need for better certainty and its ability to pay for the investments in data mining.

4. Can you think of other application areas for data mining not discussed in this section? Explain.
Students should be able to identify an area that can benefit from greater prediction or certainty. Answers will vary depending on their creativity.

Section 4.4 Review Questions

1. What are the major data mining processes?
Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

2. Why do you think the early phases (understanding of the business and understanding of the data) take the longest in data mining projects?
Students should explain that the early steps are the most unstructured phases because they involve learning. Those phases (learning/understanding) cannot be automated. Extra time and effort are needed upfront because any mistake in understanding the business or data will most likely result in a failed BI project.

3. List and briefly define the phases in the CRISP-DM process.
CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process has six steps: business understanding and data understanding, which are developed concurrently; data preparation for modeling; model building; testing and evaluation of the model results; and deployment of the models for regular use.

4. What are the main data preprocessing steps? Briefly describe each step and provide relevant examples.
Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions. Data preprocessing includes four main steps (listed in Table 4.1 on page 167):
• Data consolidation: access, collect, select, and filter data
• Data cleaning: handle missing data, reduce noise, fix errors
• Data transformation: normalize the data, aggregate data, construct new attributes
• Data reduction: reduce number of attributes and records; balance skewed data

5. How does CRISP-DM differ from SEMMA?
The main difference between CRISP-DM and SEMMA is that CRISP-DM takes a more comprehensive approach to data mining projects, one that includes understanding of the business and the relevant data, whereas SEMMA implicitly assumes that the data mining project’s goals and objectives, along with the appropriate data sources, have been identified and understood.

Section 4.5 Review Questions

1. Identify at least three of the main data mining methods.
• Classification learns patterns from past data (a set of information, such as traits, variables, and features, on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior.
• Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters.
• Association rule mining is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience. Association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases.

2. Give examples of situations in which classification would be an appropriate data mining technique. Give examples of situations in which regression would be an appropriate data mining technique.
Students’ answers will differ, but should be based on the following issues. Classification is for prediction that can be based on historical data and relationships, such as predicting the weather, product demand, or a student’s success in a university. If what is being predicted is a class label (e.g., “sunny,” “rainy,” or “cloudy”), the prediction problem is called a classification, whereas if it is a numeric value (e.g., a temperature such as 68°F), the prediction problem is called a regression.
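The weather example above can be made concrete with a minimal sketch. The history table, the two input features, and the k-nearest-neighbor approach are all assumptions made for illustration; the point is only that the same historical data yields a classification problem when the target is a label and a regression problem when the target is a number.

```python
from collections import Counter

# Hypothetical history: (humidity %, pressure hPa), observed label, temperature in °F.
history = [
    ((30, 1020), "sunny", 75.0),
    ((35, 1018), "sunny", 72.0),
    ((80, 1002), "rainy", 60.0),
    ((85, 1000), "rainy", 58.0),
    ((60, 1010), "cloudy", 66.0),
]


def nearest(features, k):
    """Return the k historical rows whose inputs are closest to `features`."""
    def distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return sorted(history, key=distance)[:k]


def classify(features, k=3):
    """Classification: predict a class label by majority vote of the neighbors."""
    labels = [label for _, label, _ in nearest(features, k)]
    return Counter(labels).most_common(1)[0][0]


def regress(features, k=3):
    """Regression: predict a numeric value as the mean of the neighbors' values."""
    temperatures = [temp for _, _, temp in nearest(features, k)]
    return sum(temperatures) / len(temperatures)


today = (32, 1019)
print(classify(today))  # a class label
print(regress(today))   # a numeric value
```

Both functions share the same neighbor search; only the way the neighbors' outcomes are combined (vote versus average) changes with the type of target.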
3. List and briefly define at least two classification techniques.
• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena.
• Statistical analysis. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other.
• Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category).
• Genetic algorithms. The use of the analogy of natural evolution to build directed search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.

4. What are some of the criteria for comparing and selecting the best classification technique?
• The amount and availability of historical data
• The types of data: categorical, interval, ratio, etc.
• What is being predicted: a class or a numeric value
• The purpose or objective

5. Briefly describe the general algorithm used in decision trees.
A general algorithm for building a decision tree is as follows:
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive (non-overlapping) subsets along the lines of the specific split and move to the branches.
   4. Repeat steps 2 and 3 for each and every leaf node until the stopping criteria are reached (e.g., the node is dominated by a single class label).

6. Define Gini index. What does it measure?
The Gini index and information gain (entropy) are two popular ways to determine branching choices in a decision tree. The Gini index measures the purity of a sample. If everything in a sample belongs to one class, the Gini index value is zero.

7. Give examples of situations in which cluster analysis would be an appropriate data mining technique.
Cluster algorithms are used when the data records do not have predefined class identifiers (i.e., it is not known to what class a particular record belongs).

8. What is the major difference between cluster analysis and classification?
Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained, they are able to classify future cases. Clustering, by contrast, partitions records into natural segments or clusters without using class labels.

9. What are some of the methods for cluster analysis?
The most commonly used clustering algorithms are k-means and self-organizing maps.

10. Give examples of situations in which association would be an appropriate data mining technique.
Association rule mining is appropriate to use when the objective is to discover two or more items (or events or concepts) that go together. Students’ answers will differ.

11. Give examples of situations in which association would be an appropriate data mining technique.
Examples include the following:
• Sales transactions
• Credit card transactions
• Banking services
• Insurance service products
• Telecommunication services
• Medical records

Section 4.6 Review Questions

1. What are the most popular commercial data mining tools?
Examples of these vendors include IBM (IBM SPSS Modeler), SAS (Enterprise Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and Megaputer (PolyAnalyst). Most of the more popular tools are developed by the largest statistical software companies (SPSS, SAS, and StatSoft).

2. Why do you think the most popular tools are developed by statistics companies?
Data mining techniques involve the use of statistical analysis and modeling. So it’s a natural extension of their business offerings.

3. What are the most popular free data mining tools? Why are they gaining overwhelming popularity (especially R)?
Probably the most popular free and open source data mining tool is Weka. Others include RapidMiner and Microsoft’s SQL Server. Their popularity continues to grow because of their availability, features, and user communities. R remains very popular as a default language because of its feature base supporting data manipulation.

4. What are the main differences between commercial and free data mining software tools?
The main difference between commercial tools, such as Enterprise Miner and Statistica, and free tools, such as Weka and RapidMiner, is computational efficiency. The same data mining task involving a rather large dataset may take a whole lot longer to complete with the free software, and in some cases it may not even be feasible (i.e., crashing due to the inefficient use of computer memory).

5. What would be your top five selection criteria for a data mining tool? Explain.
Students’ answers will differ. Criteria they are likely to mention include cost, user interface, ease of use, computational efficiency, hardware compatibility, type of business problem, vendor support, and vendor reputation.
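
Supplementary illustration for the decision tree questions in the preceding section (the Gini index as a purity measure, and the "select the best splitting attribute" step of the general algorithm). The sketch below is not taken from the textbook. It assumes the standard formulation of the Gini index, 1 minus the sum of squared class proportions, and it assumes that the best split is the attribute whose branches have the lowest size-weighted Gini value. The function names (`gini_index`, `weighted_gini`, `best_split`) and the toy attributes (`owns_home`, `student`) are hypothetical, chosen only for illustration.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a sample: 1 - sum of squared class proportions.

    Returns 0.0 when every label belongs to one class (a pure sample).
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(subsets):
    """Size-weighted average Gini index over the subsets produced by a split."""
    total = sum(len(subset) for subset in subsets)
    return sum(len(subset) / total * gini_index(subset) for subset in subsets)


def best_split(rows, labels):
    """Pick the categorical attribute whose split gives the purest branches.

    rows   -- list of dicts mapping attribute name to a categorical value
    labels -- class label for each row, in the same order
    Returns (best_attribute, scores), where scores maps each attribute to
    the weighted Gini index of the split on that attribute.
    """
    scores = {}
    for attribute in rows[0]:
        # Group the class labels by the value each row takes for this attribute,
        # which yields mutually exclusive subsets (one per branch).
        branches = {}
        for row, label in zip(rows, labels):
            branches.setdefault(row[attribute], []).append(label)
        scores[attribute] = weighted_gini(list(branches.values()))
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): 'owns_home' separates the classes perfectly,
# while 'student' leaves each branch evenly mixed.
rows = [
    {"owns_home": "yes", "student": "no"},
    {"owns_home": "yes", "student": "yes"},
    {"owns_home": "no", "student": "no"},
    {"owns_home": "no", "student": "yes"},
]
labels = ["good", "good", "bad", "bad"]

best_attribute, scores = best_split(rows, labels)
print(best_attribute, scores)
```

In this toy data the split on `owns_home` produces two pure branches (weighted Gini of zero), so it is preferred over `student`, whose branches each contain one record of each class. A full tree builder would repeat this selection on every resulting branch until a stopping criterion is met, as in step 4 of the general algorithm.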