Big Data, Machine Learning and Statistics Professor Yongmiao Hong Cornell University July 8, 2020
Introduction to Statistics and Econometrics
CONTENTS
10.1 Introduction
10.2 Empirical Studies and Statistical Inference
10.3 Important Features of Big Data
10.4 Big Data Analysis and Statistics
10.5 Machine Learning and Statistics
10.6 Conclusion
Introduction
With the rapid development of internet and mobile internet technologies and their applications, the rise of Big data, together with machine learning (a main computer-based automatic analytic tool for Big data), has profound implications for statistical sciences.
Compared with traditional data, Big data often has an extraordinarily large volume, comes in structured, semi-structured and unstructured formats, and is often produced in real time or near real time.
- What is Big data?
- Has Big data altered the foundation of statistical sciences, such as sampling inference for a population, causal analysis, the sufficiency principle, data reduction, and prediction?
- What challenges and opportunities does Big data bring to the theory and practice of statistical modelling and inference?
- What is machine learning? What are the key differences between machine learning and statistical modelling? What is the relationship between machine learning and statistical inference?
- Machine learning is well known for often delivering accurate out-of-sample predictions, but it looks like a black box. Can statistics provide meaningful interpretations for machine learning methods?
- Can machine learning and statistics be synthesized, and if so, how will this affect the future development of statistical sciences?
Our analysis delivers the following main conclusions:
- Big data does not change the foundation of statistical sampling inference for a population, and many statistical ideas and methods, such as the sufficiency principle, data reduction, and causal inference, remain very useful for Big data analysis.
- Big data shakes the conventional practice of using statistical significance to decide which variables in a model are important.
- Big data poses new challenges to statistical modelling and inference, including to the basic assumptions of model uniqueness, correct model specification, and stationarity.
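One way to see why very large samples strain the significance-based practice is a small simulation. The sketch below is illustrative only and is not taken from the lecture: the data generating process, the effect size `tiny_effect`, the sample sizes, and the helper `slope_t_and_r2` are all assumptions. It relies on the standard fact that the standard error of an OLS slope shrinks at rate 1/sqrt(n), so even a practically negligible effect tends to become statistically significant once n is large enough.

```python
import math
import numpy as np

def slope_t_and_r2(x, y):
    """Simple regression of y on x: return the slope's t-statistic and R^2."""
    n = x.size
    xc = x - x.mean()
    yc = y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    resid = yc - slope * xc
    sigma2_hat = (resid @ resid) / (n - 2)   # residual variance estimate
    se = math.sqrt(sigma2_hat / (xc @ xc))   # standard error of the slope
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return slope / se, r2

rng = np.random.default_rng(1)
tiny_effect = 0.01   # hypothetical, practically negligible slope

results = {}
for n in (1_000, 1_000_000):
    x = rng.normal(size=n)
    y = tiny_effect * x + rng.normal(size=n)
    results[n] = slope_t_and_r2(x, y)
    t_stat, r2 = results[n]
    print(f"n={n:>9,d}  t={t_stat:7.2f}  R^2={r2:.6f}  "
          f"significant at 5%: {abs(t_stat) > 1.96}")
```

In the large sample the slope is typically judged significant even though the regressor explains almost none of the variation in `y`, which is why significance alone is a weak guide to a variable's importance when n is huge.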
- Machine learning, which arose with the availability of Big data, shares common ground with statistical inference, particularly in terms of sampling inference for a population. Like any statistical inference method, machine learning may suffer from sample bias.
- As an algorithm-based approach, machine learning is much more general and flexible than statistical parametric modelling, including in how it determines the set of important explanatory variables.
- Statistical nonparametric modelling can provide meaningful interpretations for some important machine learning algorithms, such as decision trees and artificial neural networks.
- The synthesis of machine learning and statistical inference is expected to open several new directions for statistical sciences.
Empirical Studies and Statistical Inference
- The basic idea of statistical inference is to assume that the system under study is a stochastic process governed by some probability law, and that the data observed in practice are realizations of this underlying system, which is then called a data generating process (DGP).
- The main objective of statistical analysis is to use the observed data to make inferences about the probability law of the DGP, and then to use it for various applications, such as explaining important empirical stylized facts, testing theories and hypotheses, forecasting future trends and changes, and evaluating programs and policies.
- In statistical modelling and inference, it is usually assumed that the probability law of the DGP can be adequately characterized by a unique mathematical model which links the dependent variable to a small set of explanatory variables or covariates.
- Often the mathematical model is assumed to have a known functional form that is subject to some low-dimensional unknown parameters. The main objective of statistical inference is then to use the observed data to estimate the unknown model parameters and to conduct hypothesis tests about them.
- A popular procedure in empirical studies is to use a prespecified significance level (say 5%), or equivalently a P-value, to judge whether an estimated parameter is statistically significant. If it is, the associated explanatory variable is considered an important factor and is retained in the model. If a statistically significant variable is not included in the model, it is called an omitted variable.
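The significance-based procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's own code: the simulated DGP (an intercept, one relevant regressor `x1`, and one irrelevant regressor `x2`), the sample size, and the helper `ols_t_test` are hypothetical, and the P-values use the large-sample normal approximation rather than the exact Student t distribution.

```python
import math
import numpy as np

def ols_t_test(X, y):
    """OLS estimates, standard errors, t-statistics and two-sided P-values.

    P-values use the standard normal approximation, which is reasonable
    when the sample size is large relative to the number of parameters.
    """
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = (resid @ resid) / (n - k)            # error variance estimate
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)    # classical OLS covariance
    se = np.sqrt(np.diag(cov_beta))
    t_stat = beta_hat / se
    # Two-sided normal P-value: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p_val = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stat])
    return beta_hat, se, t_stat, p_val

# Hypothetical DGP: y = 1 + 0.5 * x1 + 0 * x2 + error
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, se, t_stat, p_val = ols_t_test(X, y)
significant = p_val < 0.05   # the conventional 5% rule

for name, b, t, p, keep in zip(["const", "x1", "x2"], beta_hat, t_stat, p_val, significant):
    print(f"{name:>5}: estimate={b:6.3f}  t={t:6.2f}  P-value={p:.4f}  retain={keep}")
```

Under this rule a regressor is retained when its P-value falls below 0.05. Note that a truly irrelevant variable such as `x2` is still flagged as significant in roughly 5% of samples, which is the Type I error rate implied by the chosen level.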
- Commonly used examples of standard models include:
  - classical linear regression models;
  - probit or logit models in discrete choice analysis;
  - Cox's (1972) proportional hazards models in survival or duration analysis.
- The important inputs, the recorded data, are often observational in nature; that is, they are not produced by controlled experiments. This is usually the case in the social sciences and economics. Observed data typically have moderate sample sizes.