Big Data Analysis and Mining
Weixiong Rao (饶卫雄)
School of Software Engineering, Tongji University (同济大学软件学院)
2015 Fall
wxrao@tongji.edu.cn
*Some of the slides are from Dr. Jure Leskovec and Prof. Zachary G. Ives
2021/1/30

Traditional DAM
◼ Data sources: Oracle DB, SAP ERP, Salesforce CRM, flat files from legacy systems
◼ ETL (Extraction, Transformation, Loading) moves the raw data into the data warehouse
◼ Data warehouse: IBM DW product on very powerful servers
◼ DAM tools: OLAP analysis, reporting, data mining
[Figure: data warehouse architecture, (C) 2008 datawarehouse4u.info]

Big Data
◼ Typical large enterprise:
  ◆ 5,000-50,000 servers, terabytes of data, millions of transactions per day
◼ In contrast, many Internet companies:
  ◆ Millions of servers, petabytes of data
  ◆ Google:
      Lots and lots of Web pages
      Billions of Google queries per day
  ◆ Facebook:
      A billion Facebook users
      Billion+ Facebook pages
  ◆ Twitter:
      Hundreds of millions of Twitter accounts
      Hundreds of millions of Tweets per day

DAM solutions nowadays
◼ Google, Facebook, LinkedIn, eBay, Amazon... did not use the traditional data warehouse products for DAM.
◼ Why? CAP theorem
  ◆ Different assumptions lead to different solutions
◼ What?
  ◆ Massive parallelism
      Hadoop MapReduce paradigm
      UC Berkeley Shark/Spark
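
The MapReduce paradigm named on this slide can be illustrated with the classic word-count example. The following is only a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration. In a real cluster, the map and reduce calls would run in parallel on many machines and the framework would perform the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (illustrative, single process)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data mining", "big data analysis", "data"]
    print(word_count(docs))
```

Because each `map_phase` call sees only one document and each `reduce_phase` call sees only one key, both steps can be spread over many machines without coordination, which is where the massive parallelism comes from.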

What's DAM?
◼ Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making.
◼ Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes.

What's big DAM?
◼ Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  ◆ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
◼ Our course: how to do DAM in the big data context
  ◆ Data Mining ≈ Predictive Analytics ≈ Data Science ≈ Business Intelligence
  ◆ Big data mining ≈ Massive data analysis

Let's focus on big DAM - what matters when dealing with data?
[Figure labels: Challenges, Usage, Context, Streaming, Scalability, Collect, Data Modalities, Reason, Data Operators]

Let's focus on big DAM - cultures of data mining
◼ Data mining overlaps with:
  ◆ Databases: large-scale data, simple queries
  ◆ Machine learning: small data, complex models
  ◆ CS Theory: (randomized) algorithms
◼ Different cultures:
  ◆ To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
      Result is the query answer
  ◆ To an ML person, data mining is the inference of models
      Result is the parameters of the model
[Figure: diagram labeled Statistics, Machine Learning, Database, Data Mining]
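
To make the two cultures concrete, here is a small hedged sketch on a toy data set. The data values and the function names (`db_style_query`, `ml_style_fit`) are made up for illustration. The first function answers a query over the records, so its result is a query answer. The second fits a least-squares line, so its result is the parameters of a model.

```python
# Toy (x, y) records; the values are invented for this illustration.
records = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def db_style_query(rows, threshold):
    """DB view: count rows with x >= threshold and average their y values."""
    selected = [y for x, y in rows if x >= threshold]
    if not selected:
        return 0, None
    return len(selected), sum(selected) / len(selected)


def ml_style_fit(rows):
    """ML view: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    print("query answer:", db_style_query(records, 3.0))
    print("model parameters:", ml_style_fit(records))
```

The query answer describes the stored data only, while the fitted parameters can be used to predict y for an x that was never observed.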

Let's focus on big data mining
◼ This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
  ◆ Scalability (big data)
  ◆ Algorithms
  ◆ Computing architectures
  ◆ Automation for handling real big data
◼ The required background:
  ◆ Data structures and algorithm design
  ◆ Probability and linear algebra
  ◆ Operating systems
  ◆ Java program design

What will we learn?
◼ We will learn to mine different types of data:
  ◆ Data is high dimensional
  ◆ Data is a graph
  ◆ Data is infinite/never-ending
  ◆ Data is labeled
◼ We will learn to use different models of computation:
  ◆ MATLAB + Hadoop + Spark
  ◆ Streams and online algorithms
  ◆ Single machine in-memory
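
As one possible illustration of the "streams and online algorithms" model, the sketch below keeps a running mean and variance over a never-ending stream using Welford's online update. This particular algorithm is our choice of example and is not taken from the slides; the class name `RunningStats` is hypothetical. The key property is that each item is seen once and only a constant amount of state is stored.

```python
class RunningStats:
    """Running mean and population variance over a stream (Welford's update)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Consume one stream item in O(1) time and O(1) memory."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the items seen so far (0.0 if none)."""
        return self._m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
    print(stats.count, stats.mean, stats.variance)
```

Unlike a batch computation, this never needs the full data set in memory, which is what makes it suitable when the data is infinite.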