COMP 578 Data Warehousing & Data Mining
Keith C.C. Chan
Department of Computing
The Hong Kong Polytechnic University
Text and References
• Chan, K.C.C., Course Notes on Data Mining & Data Warehousing, Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, 2003.
• Inmon, W.H., Building the Data Warehouse, 2nd Edition, J. Wiley & Sons, New York, NY, 1996.
• Whitehorn, M., Business Intelligence: the IBM Solution: Datawarehousing and OLAP, Springer, London, 1999.
• Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA, 2001.
• Rud, O.P., Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management, J. Wiley, New York, NY, 2001.
• Groth, R., Data Mining: Building Competitive Advantage, Prentice Hall, Upper Saddle River, NJ, 1998.
• Kovalerchuk, B., Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer Academic, Boston, 2000.
• Berry, M.J.A., Mastering Data Mining: the Art and Science of Customer Relationship Management, Wiley, New York, NY, 2000.
• Berry, M.J.A., Data Mining Techniques for Marketing, Sales and Customer Support, Wiley, New York, NY, 1997.
• Mattison, R., Data Warehousing and Data Mining for Telecommunications, Artech House, Boston, 1997.
Course Outline (1)
• Data Mining
  – From data warehousing to data mining.
  – Data pre-processing and data mining life-cycle.
  – Association and sequence analysis; classification and clustering.
  – Fuzzy logic, neural networks, and genetic algorithms.
  – Mining complex data.
    • OLAP mining; spatial data mining; text mining; time-series data mining; web mining; visual data mining.
Course Outline (2)
• Data Warehousing
  – Introduction; basic concepts of data warehousing; data warehouse vs. operational DB; data warehouse and the industry.
  – Architecture and design; two-tier and three-tier architecture; star schema and snowflake schema; data capturing, replication, transformation and cleansing.
  – Data characteristics; metadata; static and dynamic data; derived data.
  – Data marts; OLAP; data mining; data warehouse administration.
Aims and Objectives
• The hype about data warehousing and data mining.
• Better understand tools by IBM, Microsoft, Oracle, SAS, SPSS.
• Job mobility and prospects.
• Projects and research thesis.
Data Warehousing and Industry
• One of the hottest topics in IS.
• Over 90% of larger companies either have a DW or are starting one.
• Warehousing is big business:
  – $2 billion in 1995
  – $3.5 billion in early 1997
  – $8 billion in 1998 [Metagroup]
  – Over $200 billion over the next 5 years
Data Warehousing and Industry (2)
• A 1996 study of 62 data warehousing projects showed:
  – An average return on investment of 321%, with an average payback period of 2.73 years.
• WalMart has the largest warehouse:
  – 900-CPU, 2,700-disk, 23 TB Teradata system
  – ~7 TB in warehouse
  – 40-50 GB per day
What is a Data Warehouse?
• Defined in many different ways, none of them rigorous:
  – A DB for decision support.
  – Maintained separately from an organization's operational database.
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
• Data warehousing:
  – The process of constructing and using data warehouses.
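As a rough illustration of two of the properties in Inmon's definition, time-variant and nonvolatile, the sketch below models a warehouse table as an append-only list of dated snapshots. The `Warehouse` class, its `load`, `as_of` and `history` methods, and the sample customer records are hypothetical names invented for this example; they do not come from any product covered in the course.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded row cannot be modified afterwards
class Snapshot:
    load_date: date      # time-variant: every row records when it was loaded
    customer_id: int
    city: str


class Warehouse:
    """Hypothetical append-only store: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, load_date, customer_id, city):
        # Nonvolatile: a change in the source becomes a new row, not an overwrite.
        self._rows.append(Snapshot(load_date, customer_id, city))

    def as_of(self, customer_id, when):
        """Return the latest snapshot of a customer on or before `when`, or None."""
        candidates = [r for r in self._rows
                      if r.customer_id == customer_id and r.load_date <= when]
        return max(candidates, key=lambda r: r.load_date, default=None)

    def history(self, customer_id):
        """Return all snapshots of a customer, oldest first."""
        return sorted((r for r in self._rows if r.customer_id == customer_id),
                      key=lambda r: r.load_date)


wh = Warehouse()
wh.load(date(2002, 1, 1), 1, "Kowloon")
wh.load(date(2003, 1, 1), 1, "Hung Hom")    # customer moved; the old row is kept

print(wh.as_of(1, date(2002, 6, 30)).city)  # Kowloon
print(wh.as_of(1, date(2003, 6, 30)).city)  # Hung Hom
print(len(wh.history(1)))                   # 2
```

An operational database would typically overwrite the customer's city in place. Keeping both rows is what lets the warehouse answer "where was this customer in mid-2002?".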
Why Data Warehousing?
• Advance of information technology.
• Data collected in huge amounts.
• Need to make good use of the data.
• Architecture and tools to:
  – Bring together scattered information from multiple sources to provide a consistent data source for decision support.
  – Support information processing by providing a solid platform of consolidated, historical data for analysis.
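The "bring together scattered information" point can be sketched as follows: two hypothetical operational sources record the same kind of sale with different field names, units and codes, and a small transformation maps both into one consistent format before analysis. The source layouts, field names and the fixed exchange rate are all assumptions made up for this illustration.

```python
# Two hypothetical operational sources with inconsistent conventions.
pos_sales = [  # point-of-sale system: amounts in HKD, gender coded "M"/"F"
    {"cust": 1, "amt_hkd": 780.0, "sex": "M"},
    {"cust": 2, "amt_hkd": 1560.0, "sex": "F"},
]
web_sales = [  # web shop: amounts in USD, gender spelled out
    {"customer_id": 1, "amount_usd": 50.0, "gender": "male"},
]

HKD_PER_USD = 7.8  # assumed fixed rate, for illustration only
GENDER_CODES = {"M": "M", "F": "F", "male": "M", "female": "F"}


def from_pos(row):
    """Map a point-of-sale record to the common warehouse format."""
    return {"customer_id": row["cust"],
            "amount_hkd": row["amt_hkd"],
            "gender": GENDER_CODES[row["sex"]],
            "source": "pos"}


def from_web(row):
    """Map a web-shop record to the common format, converting USD to HKD."""
    return {"customer_id": row["customer_id"],
            "amount_hkd": round(row["amount_usd"] * HKD_PER_USD, 2),
            "gender": GENDER_CODES[row["gender"]],
            "source": "web"}


# Consolidated data: every row now has the same fields, units and codes.
warehouse = [from_pos(r) for r in pos_sales] + [from_web(r) for r in web_sales]

# A decision-support query that neither source could answer on its own:
# total spend per customer across both channels.
totals = {}
for row in warehouse:
    totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount_hkd"]

print(totals)  # {1: 1170.0, 2: 1560.0}
```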
Why Data Mining?
• Data explosion problem:
  – Automated data collection tools and mature database technology.
  – Leading to tremendous amounts of data stored in databases, data warehouses and other information repositories.
• We are drowning in data, but starving for knowledge!