Data Mining and Model Choice in Supervised Learning
Gilbert Saporta
Chaire de Statistique Appliquée & CEDRIC, CNAM, 292 rue Saint Martin, F-75003 Paris
gilbert.saporta@cnam.fr
http://cedric.cnam.fr/~saporta
Beijing, 2008
Outline
1. What is data mining?
2. Association rule discovery
3. Statistical models
4. Predictive modelling
5. A scoring case study
6. Discussion
1. What is data mining?
◼ Data mining is a new field at the frontiers of statistics and information technologies (database management, artificial intelligence, machine learning, etc.) which aims at discovering structures and patterns in large data sets.
1.1 Definitions
◼ U.M. Fayyad, G. Piatetsky-Shapiro: “Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”
◼ D.J. Hand: “I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets.”
◼ The metaphor of Data Mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered with specific tools.
◼ Data Mining is concerned with data that were collected for another purpose: it is a secondary analysis of databases that are collected Not Primarily For Analysis, but for the management of individual cases (Kardaun, T. Alanko, 1998).
◼ Data Mining is not concerned with efficient methods for collecting data, such as surveys and experimental designs (Hand, 2000).
What is new? Is it a revolution?
◼ The idea of discovering facts from data is as old as Statistics, which “is the science of learning from data” (J. Kettenring, former ASA president).
◼ In the 1960s: Exploratory Data Analysis (Tukey, Benzécri, ...). « Data analysis is a tool for extracting the diamond of truth from the mud of data. » (J.P. Benzécri, 1973)
1.2 Data Mining started from:
◼ An evolution of DBMS towards Decision Support Systems using a Data Warehouse.
◼ Storage of huge data sets: credit card transactions, phone calls, supermarket bills. Gigabytes and terabytes of data are collected automatically.
◼ Marketing operations: CRM (customer relationship management).
◼ Research in Artificial Intelligence, machine learning, and KDD (Knowledge Discovery in Data Bases).
1.3 Goals and tools
◼ Data Mining is a « secondary analysis » of data collected for another purpose (e.g. management).
◼ Data Mining aims at finding structures of two kinds: models and patterns.
◼ Patterns
  ◼ A pattern is a characteristic structure exhibited by a small number of points: for example, a small subgroup of customers with a high commercial value or, conversely, a high risk.
  ◼ Tools: cluster analysis; visualisation by dimension reduction (PCA, CA, etc.); association rules.
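As a rough illustration of how cluster analysis can surface such a small subgroup, here is a minimal sketch of one-dimensional k-means with two clusters. The spend figures, the function name `two_means_1d`, and the choice of k-means itself are assumptions made for this example; the slides do not prescribe a particular clustering method.

```python
# Hypothetical yearly spend per customer (illustrative values, not from the slides).
spend = [10, 12, 11, 9, 13, 10, 12, 11, 95, 102]


def two_means_1d(values, max_iter=100):
    """Split 1-D values into a low and a high cluster with k-means (k = 2)."""
    # Deterministic start: the two centres begin at the extreme values.
    low, high = float(min(values)), float(max(values))
    for _ in range(max_iter):
        # Assignment step: each value joins the nearest centre.
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        if not low_group or not high_group:
            break  # degenerate data (e.g. all values equal): nothing to split
        # Update step: each centre moves to the mean of its group.
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centres stopped moving: converged
        low, high = new_low, new_high
    return low_group, high_group


ordinary, high_value = two_means_1d(spend)
print("high-value subgroup:", high_value)
```

On these toy values the high cluster contains only two of the ten customers, which is the kind of small, characteristic subgroup the slide calls a pattern.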
Models
◼ Building models is a major activity for statisticians, econometricians, and other scientists. A model is a global summary of relationships between variables, which both helps to understand phenomena and allows predictions.
◼ DM is not concerned with the estimation and testing of prespecified models, but with discovering models through an algorithmic search process that explores linear and non-linear models, explicit or not: neural networks, decision trees, Support Vector Machines, logistic regression, graphical models, etc.
◼ In DM, models do not come from a theory but from data exploration.
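The idea of an algorithmic search over candidate models can be sketched as follows: fit several model families on one part of the data and keep the one that predicts best on a held-out part. The three candidates, the synthetic data, and the use of held-out mean squared error as the selection criterion are all assumptions chosen for this illustration, not a procedure taken from the slides.

```python
def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_nearest(xs, ys):
    """Non-linear, non-explicit model: 1-nearest-neighbour lookup."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic non-linear data (y = x^2), split into training and validation halves.
xs = list(range(20))
ys = [x * x for x in xs]
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

candidates = {
    "mean": fit_mean,
    "linear": fit_linear,
    "nearest_neighbour": fit_nearest,
}

# The search: fit every candidate, score it on unseen data, keep the best.
scores = {name: mse(fit(train_x, train_y), val_x, val_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

No candidate is privileged by theory here: the selected model is simply the one that the data, through the validation error, favour.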
Process or tools?
◼ DM often appears as a collection of tools, usually presented in one package, in such a way that several techniques may be compared on the same data set.
◼ But DM is a process, not only tools:
Data → (preprocessing) → Information → (analysis) → Knowledge