Data Mining: Concepts and Techniques
Chapter 1
Richong Zhang
Office: New Main Building, G521
Email: zhangrc@act.buaa.edu.cn
These slides are based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei.
© 2012 Han, Kamber & Pei
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ A Multi-Dimensional View of Data Mining
◼ What Kinds of Data Can Be Mined?
◼ What Kinds of Patterns Can Be Mined?
◼ What Kinds of Technologies Are Used?
◼ What Kinds of Applications Are Targeted?
◼ Major Issues in Data Mining
◼ A Brief History of Data Mining and Data Mining Society
◼ Summary
Why Data Mining?
◼ The Explosive Growth of Data: from terabytes to petabytes
  ◼ Data collection and data availability
    ◼ Automated data collection tools, database systems, Web, computerized society
  ◼ Major sources of abundant data
    ◼ Business: Web, e-commerce, transactions, stocks, …
    ◼ Science: remote sensing, bioinformatics, scientific simulation, …
    ◼ Society and everyone: news, digital cameras, YouTube
◼ We are drowning in data, but starving for knowledge!
◼ “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Why Data Mining?
◼ Credit ratings/targeted marketing:
  ◼ Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  ◼ Identify likely responders to sales promotions
◼ Fraud detection:
  ◼ Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
◼ Customer relationship management:
  ◼ Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data mining helps extract such information.
Data mining
◼ Process of semi-automatically analyzing large databases to find patterns that are:
  ◼ valid: hold on new data with some certainty
  ◼ novel: non-obvious to the system
  ◼ useful: should be possible to act on the item
  ◼ understandable: humans should be able to interpret the pattern
◼ Also known as Knowledge Discovery in Databases (KDD)
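The "valid" criterion above can be illustrated with a small sketch. It assumes a pattern of the form "customers who buy X also buy Y" and measures how often it holds on the data it was found in versus on new data. The transactions, the function name `rule_confidence`, and the 0.6 threshold are all hypothetical choices for illustration, not part of the original slides.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    matching = [t for t in transactions if antecedent in t]
    if not matching:
        return 0.0
    return sum(consequent in t for t in matching) / len(matching)


# Hypothetical toy data: the pattern "bread -> butter" was found in `mined_on`.
mined_on = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
# Hypothetical transactions collected later, not used to find the pattern.
new_data = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"eggs"},
]

MIN_CONFIDENCE = 0.6  # assumed threshold for "some certainty"

conf_old = rule_confidence(mined_on, "bread", "butter")
conf_new = rule_confidence(new_data, "bread", "butter")

# The pattern counts as valid here only if it still holds on the new data.
is_valid = conf_new >= MIN_CONFIDENCE
print(f"confidence on mined data: {conf_old:.2f}, on new data: {conf_new:.2f}, valid: {is_valid}")
```

A pattern that scores well only on the data it was mined from would fail this check, which is the point of requiring validity on new data.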
What Is Data Mining?
◼ Data mining (knowledge discovery from data)
  ◼ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
◼ Alternative names
  ◼ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
◼ Watch out: Is everything “data mining”?
  ◼ Simple search and query processing
  ◼ (Deductive) expert systems
What Is (Not) Data Mining?
◼ What is data mining?
  ◼ Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly, … in the Boston area)
  ◼ Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest, Amazon.com)
◼ What is not data mining?
  ◼ Look up a phone number in a phone directory
  ◼ Query a Web search engine for information about “Amazon”
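The contrast above can be sketched in a few lines. The directory records, phone numbers, and function names below are hypothetical. The second function is also simplified: it is told which surname prefix to examine, whereas a real mining task would have to discover that prefix on its own.

```python
from collections import Counter

# Hypothetical phone directory: (surname, city, phone number).
directory = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("Smith", "Boston", "555-0103"),
    ("Smith", "Denver", "555-0104"),
    ("Garcia", "Denver", "555-0105"),
    ("O'Rourke", "Boston", "555-0106"),
    ("Nguyen", "Denver", "555-0107"),
]


def lookup_phone(records, name, city):
    """Not data mining: retrieve a fact that is stored explicitly."""
    for surname, record_city, phone in records:
        if surname == name and record_city == city:
            return phone
    return None


def prefix_share_by_city(records, prefix):
    """Closer to data mining: summarize a regularity that no single record states."""
    totals = Counter()
    hits = Counter()
    for surname, city, _phone in records:
        totals[city] += 1
        if surname.startswith(prefix):
            hits[city] += 1
    return {city: hits[city] / totals[city] for city in totals}


phone = lookup_phone(directory, "Smith", "Denver")
shares = prefix_share_by_city(directory, "O'")
print(phone)
print(shares)
```

The lookup returns one stored value. The share computation produces a statement about the whole data set (surnames starting with "O'" are concentrated in one city) that appears in no individual record.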
Applications
◼ Banking: loan/credit card approval
  ◼ predict good customers based on old customers
◼ Customer relationship management:
  ◼ identify those who are likely to leave for a competitor
◼ Targeted marketing:
  ◼ identify likely responders to promotions
◼ Fraud detection: telecommunications, financial transactions
  ◼ from an online stream of events, identify fraudulent events
◼ Manufacturing and production:
  ◼ automatically adjust knobs when process parameters change
Applications (continued)
◼ Medicine: disease outcome, effectiveness of treatments
  ◼ analyze patient disease history: find relationships between diseases
◼ Molecular/Pharmaceutical: identify new drugs
◼ Scientific data analysis:
  ◼ identify new galaxies by searching for subclusters
◼ Web site/store design and promotion:
  ◼ find affinity of visitors to pages and modify layout