Practical Regression and Anova using R

Julian J. Faraway

July 2002
Preface

There are many books on regression and analysis of variance. These books expect different levels of preparedness and place different emphases on the material. This book is not introductory. It presumes some knowledge of basic statistical theory and practice. Students are expected to know the essentials of statistical inference such as estimation, hypothesis testing and confidence intervals. A basic knowledge of data analysis is presumed. Some linear algebra and calculus is also required.

The emphasis of this text is on the practice of regression and analysis of variance. The objective is to learn what methods are available and, more importantly, when they should be applied. Many examples are presented to clarify the use of the techniques and to demonstrate what conclusions can be made. There is relatively less emphasis on mathematical theory, partly because some prior knowledge is assumed and partly because the issues are better tackled elsewhere. Theory is important because it guides the approach we take. I take a wider view of statistical theory. It is not just the formal theorems. Qualitative statistical concepts are just as important in Statistics because these enable us to actually do it rather than just talk about it. These qualitative principles are harder to learn because they are difficult to state precisely, but they guide the successful experienced Statistician.

Data analysis cannot be learnt without actually doing it. This means using a statistical computing package. There is a wide choice of such packages. They are designed for different audiences and have different strengths and weaknesses. I have chosen to use R (Ihaka and Gentleman, 1996). Why do I use R? There are several reasons.

1. Versatility. R is also a programming language, so I am not limited by the procedures that are preprogrammed by a package. It is relatively easy to program new methods in R.

2. Interactivity. Data analysis is inherently interactive. Some older statistical packages were designed when computing was more expensive and batch processing of computations was the norm. Despite improvements in hardware, the old batch processing paradigm lives on in their use. R does one thing at a time, allowing us to make changes on the basis of what we see during the analysis.

3. R is based on S, from which the commercial package S-plus is derived. R itself is open-source software and may be freely redistributed. Linux, Macintosh, Windows and other UNIX versions are maintained and can be obtained from the R-project at www.r-project.org. R is mostly compatible with S-plus, meaning that S-plus could easily be used for the examples given in this book.

4. Popularity. SAS is the most common statistics package in general, but R or S is most popular with researchers in Statistics. A look at common Statistical journals confirms this popularity. R is also popular for quantitative applications in Finance.

The greatest disadvantage of R is that it is not so easy to learn. Some investment of effort is required before productivity gains will be realized. This book is not an introduction to R. There is a short introduction
in the Appendix, but readers are referred to the R-project web site at www.r-project.org where you can find introductory documentation and information about books on R. I have intentionally included in the text all the commands used to produce the output seen in this book. This means that you can reproduce these analyses and experiment with changes and variations before fully understanding R. The reader may choose to start working through this text before learning R and pick it up along the way.

The web site for this book is at www.stat.lsa.umich.edu/~faraway/book, where the data described in this book may be found. Updates will appear there also.

Thanks to the builders of R, without whom this book would not have been possible.
Contents

1 Introduction
  1.1 Before you start
    1.1.1 Formulation
    1.1.2 Data Collection
    1.1.3 Initial Data Analysis
  1.2 When to use Regression Analysis
  1.3 History

2 Estimation
  2.1 Example
  2.2 Linear Model
  2.3 Matrix Representation
  2.4 Estimating β
  2.5 Least squares estimation
  2.6 Examples of calculating β̂
  2.7 Why is β̂ a good estimate?
  2.8 Gauss-Markov Theorem
  2.9 Mean and Variance of β̂
  2.10 Estimating σ²
  2.11 Goodness of Fit
  2.12 Example

3 Inference
  3.1 Hypothesis tests to compare models
  3.2 Some Examples
    3.2.1 Test of all predictors
    3.2.2 Testing just one predictor
    3.2.3 Testing a pair of predictors
    3.2.4 Testing a subspace
  3.3 Concerns about Hypothesis Testing
  3.4 Confidence Intervals for β
  3.5 Confidence intervals for predictions
  3.6 Orthogonality
  3.7 Identifiability
  3.8 Summary
  3.9 What can go wrong?
    3.9.1 Source and quality of the data
    3.9.2 Error component
    3.9.3 Structural Component
  3.10 Interpreting Parameter Estimates

4 Errors in Predictors

5 Generalized Least Squares
  5.1 The general case
  5.2 Weighted Least Squares
  5.3 Iteratively Reweighted Least Squares

6 Testing for Lack of Fit
  6.1 σ² known
  6.2 σ² unknown

7 Diagnostics
  7.1 Residuals and Leverage
  7.2 Studentized Residuals
  7.3 An outlier test
  7.4 Influential Observations
  7.5 Residual Plots
  7.6 Non-Constant Variance
  7.7 Non-Linearity
  7.8 Assessing Normality
  7.9 Half-normal plots
  7.10 Correlated Errors

8 Transformation
  8.1 Transforming the response
  8.2 Transforming the predictors
    8.2.1 Broken Stick Regression
    8.2.2 Polynomials
  8.3 Regression Splines
  8.4 Modern Methods

9 Scale Changes, Principal Components and Collinearity
  9.1 Changes of Scale
  9.2 Principal Components
  9.3 Partial Least Squares
  9.4 Collinearity
  9.5 Ridge Regression

10 Variable Selection
  10.1 Hierarchical Models
  10.2 Stepwise Procedures
    10.2.1 Forward Selection
    10.2.2 Stepwise Regression
  10.3 Criterion-based procedures
  10.4 Summary

11 Statistical Strategy and Model Uncertainty
  11.1 Strategy
  11.2 Experiment
  11.3 Discussion

12 Chicago Insurance Redlining - a complete example

13 Robust and Resistant Regression

14 Missing Data

15 Analysis of Covariance
  15.1 A two-level example
  15.2 Coding qualitative predictors
  15.3 A Three-level example

16 ANOVA
  16.1 One-Way Anova
    16.1.1 The model
    16.1.2 Estimation and testing
    16.1.3 An example
    16.1.4 Diagnostics
    16.1.5 Multiple Comparisons
    16.1.6 Contrasts
    16.1.7 Scheffé's theorem for multiple comparisons
    16.1.8 Testing for homogeneity of variance
  16.2 Two-Way Anova
    16.2.1 One observation per cell
    16.2.2 More than one observation per cell
    16.2.3 Interpreting the interaction effect
    16.2.4 Replication
  16.3 Blocking designs
    16.3.1 Randomized Block design
    16.3.2 Relative advantage of RCBD over CRD
  16.4 Latin Squares
  16.5 Balanced Incomplete Block design
  16.6 Factorial experiments

A Recommended Books
  A.1 Books on R
  A.2 Books on Regression and Anova

B R functions and data
C Quick introduction to R
  C.1 Reading the data in
  C.2 Numerical Summaries
  C.3 Graphical Summaries
  C.4 Selecting subsets of the data
  C.5 Learning more about R
Chapter 1

Introduction

1.1 Before you start

Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake of inexperienced Statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis. Look before you leap!

1.1.1 Formulation

"The formulation of a problem is often more essential than its solution, which may be merely a matter of mathematical or experimental skill." - Albert Einstein

To formulate the problem correctly, you must

1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.

2. Understand the objective. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of "fishing expeditions" - if you look hard enough, you'll almost always find something, but that something may just be a coincidence.

3. Make sure you know what the client wants. Sometimes Statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.

4. Put the problem into statistical terms. This is a challenging step, and one where irreparable errors are sometimes made. Once the problem is translated into the language of Statistics, the solution is often routine. Difficulties with this step explain why Artificial Intelligence techniques have yet to make much impact in application to Statistics. Defining the problem is hard to program.

That a statistical method can read in and process the data is not enough. The results may be totally meaningless.
1.1.2 Data Collection

It is important to understand how the data were collected.

- Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.

- Is there non-response? The data you don't see may be just as important as the data you do see.

- Are there missing values? This is a common problem that is troublesome and time consuming to deal with.

- How are the data coded? In particular, how are the qualitative variables represented?

- What are the units of measurement? Sometimes data are collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs.

- Beware of data entry errors. This problem is all too common, almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.

1.1.3 Initial Data Analysis

This is a critical step that should always be performed. It looks simple but it is vital.

- Numerical summaries: means, standard deviations, five-number summaries, correlations.
- Graphical summaries:
  - One variable: boxplots, histograms, etc.
  - Two variables: scatterplots.
  - Many variables: interactive graphics.

Look for outliers, data-entry errors and skewed or unusual distributions. Are the data distributed as you expect?

Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. In this course, all the data will be ready to analyze, but you should realize that in practice this is rarely the case.

Let's look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg/(height in m)²), diabetes pedigree function, age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). The data may be obtained from the UCI Repository of machine learning databases at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Of course, before doing anything else, one should find out what the purpose of the study was and more about how the data were collected. But let's skip ahead to a look at the data:
> library(faraway)
> data(pima)
> pima
    pregnant glucose diastolic triceps insulin  bmi diabetes age test
1          6     148        72      35       0 33.6    0.627  50    1
2          1      85        66      29       0 26.6    0.351  31    0
3          8     183        64       0       0 23.3    0.672  32    1
... much deleted ...
768        1      93        70      31       0 30.4    0.315  23    0

The library(faraway) makes the data used in this book available, while data(pima) calls up this particular dataset. Simply typing the name of the data frame, pima, prints out the data. It is too long to show it all here. For a dataset of this size, one can just about visually skim over the data for anything out of place, but it is certainly easier to use more direct methods.

We start with some numerical summaries:

> summary(pima)
    pregnant         glucose       diastolic        triceps         insulin
 Min.   : 0.00   Min.   :  0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0
 1st Qu.: 1.00   1st Qu.: 99   1st Qu.: 62.0   1st Qu.: 0.0   1st Qu.:  0.0
 Median : 3.00   Median :117   Median : 72.0   Median :23.0   Median : 30.5
 Mean   : 3.85   Mean   :121   Mean   : 69.1   Mean   :20.5   Mean   : 79.8
 3rd Qu.: 6.00   3rd Qu.:140   3rd Qu.: 80.0   3rd Qu.:32.0   3rd Qu.:127.2
 Max.   :17.00   Max.   :199   Max.   :122.0   Max.   :99.0   Max.   :846.0
      bmi           diabetes           age            test
 Min.   : 0.0   Min.   :0.078   Min.   :21.0   Min.   :0.000
 1st Qu.:27.3   1st Qu.:0.244   1st Qu.:24.0   1st Qu.:0.000
 Median :32.0   Median :0.372   Median :29.0   Median :0.000
 Mean   :32.0   Mean   :0.472   Mean   :33.2   Mean   :0.349
 3rd Qu.:36.6   3rd Qu.:0.626   3rd Qu.:41.0   3rd Qu.:1.000
 Max.   :67.1   Max.   :2.420   Max.   :81.0   Max.   :1.000

The summary() command is a quick way to get the usual univariate summary information. At this stage, we are looking for anything unusual or unexpected, perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is worthwhile. Starting with pregnant, we see a maximum value of 17. This is large but perhaps not impossible. However, we then see that the next five variables have minimum values of zero. No blood pressure is not good for the health; something must be wrong. Let's look at the sorted values:

> sort(pima$diastolic)
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
 [37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 50
...etc...

We see that the first 35 values are zero. The description that comes with the data says nothing about it, but it seems likely that the zero has been used as a missing value code. For one reason or another, the researchers did not obtain the blood pressures of these 35 patients. In a real investigation, one would likely be able to question the researchers about what really happened. Nevertheless, this does illustrate the kind of misunderstanding that can easily occur.
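Given that zero appears to act as a missing-value code, one reasonable next step is to recode these zeros as NA and to treat the 0/1 test variable as categorical before any analysis. The following is a minimal sketch using only base R; treating zero as a missing-value code for these five clinical variables is an assumption based on the sorted values above, not something stated in the data description.

> # Assumption: zero codes a missing value for these clinical measurements
> # (a zero blood pressure or body mass index is not physically plausible)
> pima$glucose[pima$glucose == 0] <- NA
> pima$diastolic[pima$diastolic == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA
> # test is a 0/1 indicator, so store it as a factor rather than a number
> pima$test <- factor(pima$test)

After this recoding, summary(pima) reports NA counts for the affected variables instead of misleading zero minima, and subsequent plots and models will treat test as a qualitative variable.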