Introduction to Probability and Statistics Using R G. Jay Kerns First Edition
Contents

Preface vii
List of Figures xiii
List of Tables xv

1 An Introduction to Probability and Statistics 1
1.1 Probability 1
1.2 Statistics 1
Chapter Exercises 3

2 An Introduction to R 5
2.1 Downloading and Installing R 5
2.2 Communicating with R 6
2.3 Basic R Operations and Concepts 8
2.4 Getting Help 14
2.5 External Resources 15
2.6 Other Tips 16
Chapter Exercises 17

3 Data Description 19
3.1 Types of Data 19
3.2 Features of Data Distributions 33
3.3 Descriptive Statistics 35
3.4 Exploratory Data Analysis 40
3.5 Multivariate Data and Data Frames 45
3.6 Comparing Populations 47
Chapter Exercises 53

4 Probability 65
4.1 Sample Spaces 65
4.2 Events 70
4.3 Model Assignment 75
4.4 Properties of Probability 80
4.5 Counting Methods 84
4.6 Conditional Probability 89
4.7 Independent Events 95
4.8 Bayes’ Rule 98
4.9 Random Variables 102
Chapter Exercises 105

5 Discrete Distributions 107
5.1 Discrete Random Variables 107
5.2 The Discrete Uniform Distribution 110
5.3 The Binomial Distribution 111
5.4 Expectation and Moment Generating Functions 116
5.5 The Empirical Distribution 120
5.6 Other Discrete Distributions 123
5.7 Functions of Discrete Random Variables 130
Chapter Exercises 132

6 Continuous Distributions 137
6.1 Continuous Random Variables 137
6.2 The Continuous Uniform Distribution 142
6.3 The Normal Distribution 143
6.4 Functions of Continuous Random Variables 146
6.5 Other Continuous Distributions 150
Chapter Exercises 155

7 Multivariate Distributions 157
7.1 Joint and Marginal Probability Distributions 157
7.2 Joint and Marginal Expectation 163
7.3 Conditional Distributions 165
7.4 Independent Random Variables 167
7.5 Exchangeable Random Variables 170
7.6 The Bivariate Normal Distribution 170
7.7 Bivariate Transformations of Random Variables 172
7.8 Remarks for the Multivariate Case 175
7.9 The Multinomial Distribution 178
Chapter Exercises 180

8 Sampling Distributions 181
8.1 Simple Random Samples 182
8.2 Sampling from a Normal Distribution 182
8.3 The Central Limit Theorem 185
8.4 Sampling Distributions of Two-Sample Statistics 187
8.5 Simulated Sampling Distributions 189
Chapter Exercises 191

9 Estimation 193
9.1 Point Estimation 193
9.2 Confidence Intervals for Means 202
9.3 Confidence Intervals for Differences of Means 208
9.4 Confidence Intervals for Proportions 210
9.5 Confidence Intervals for Variances 212
9.6 Fitting Distributions 212
9.7 Sample Size and Margin of Error 212
9.8 Other Topics 214
Chapter Exercises 215

10 Hypothesis Testing 217
10.1 Introduction 217
10.2 Tests for Proportions 218
10.3 One Sample Tests for Means and Variances 224
10.4 Two-Sample Tests for Means and Variances 227
10.5 Other Hypothesis Tests 228
10.6 Analysis of Variance 229
10.7 Sample Size and Power 230
Chapter Exercises 232

11 Simple Linear Regression 235
11.1 Basic Philosophy 235
11.2 Estimation 239
11.3 Model Utility and Inference 248
11.4 Residual Analysis 252
11.5 Other Diagnostic Tools 259
Chapter Exercises 266

12 Multiple Linear Regression 267
12.1 The Multiple Linear Regression Model 267
12.2 Estimation and Prediction 270
12.3 Model Utility and Inference 277
12.4 Polynomial Regression 280
12.5 Interaction 283
12.6 Qualitative Explanatory Variables 286
12.7 Partial F Statistic 289
12.8 Residual Analysis and Diagnostic Tools 291
12.9 Additional Topics 292
Chapter Exercises 296

13 Resampling Methods 297
13.1 Introduction 297
13.2 Bootstrap Standard Errors 299
13.3 Bootstrap Confidence Intervals 303
13.4 Resampling in Hypothesis Tests 305
Chapter Exercises 309

14 Categorical Data Analysis 311

15 Nonparametric Statistics 313

16 Time Series 315

A R Session Information 317

B GNU Free Documentation License 319

C History 327

D Data 329
D.1 Data Structures 329
D.2 Importing Data 334
D.3 Creating New Data Sets 335
D.4 Editing Data 335
D.5 Exporting Data 336
D.6 Reshaping Data 337

E Mathematical Machinery 339
E.1 Set Algebra 339
E.2 Differential and Integral Calculus 340
E.3 Sequences and Series 343
E.4 The Gamma Function 345
E.5 Linear Algebra 345
E.6 Multivariable Calculus 347

F Writing Reports with R 349
F.1 What to Write 349
F.2 How to Write It with R 350
F.3 Formatting Tables 353
F.4 Other Formats 353

G Instructions for Instructors 355
G.1 Generating This Document 356
G.2 How to Use This Document 356
G.3 Ancillary Materials 357
G.4 Modifying This Document 357

H RcmdrTestDrive Story 359

Bibliography 363

Index 369

Preface

This book was expanded from lecture materials I use in a one semester upper-division undergraduate course entitled Probability and Statistics at Youngstown State University. Those lecture materials, in turn, were based on notes that I transcribed as a graduate student at Bowling Green State University. The course for which the materials were written is 50-50 Probability and Statistics, and the attendees include mathematics, engineering, and computer science majors (among others). The catalog prerequisites for the course are a full year of calculus.

The book can be subdivided into three basic parts. The first part includes the introductions and elementary descriptive statistics; I want the students to be knee-deep in data right out of the gate. The second part is the study of probability, which begins at the basics of sets and the equally likely model, journeys past discrete/continuous random variables, and continues through to multivariate distributions. The chapter on sampling distributions paves the way to the third part, which is inferential statistics. This last part includes point and interval estimation, hypothesis testing, and finishes with introductions to selected topics in applied statistics.

I usually only have time in one semester to cover a small subset of this book. I cover the material in Chapter 2 in a class period that is supplemented by a take-home assignment for the students. I spend a lot of time on Data Description, Probability, Discrete, and Continuous Distributions. I mention selected facts from Multivariate Distributions in passing, and discuss the meaty parts of Sampling Distributions before moving right along to Estimation (which is another chapter I dwell on considerably). Hypothesis Testing goes faster after all of the previous work, and by that time the end of the semester is in sight. I normally choose one or two final chapters (sometimes three) from the remaining to survey, and regret at the end that I did not have the chance to cover more.

In an attempt to be correct I have included material in this book which I would normally not mention during the course of a standard lecture. For instance, I normally do not highlight the intricacies of measure theory or integrability conditions when speaking to the class. Moreover, I often stray from the matrix approach to multiple linear regression because many of my students have not yet been formally trained in linear algebra. That being said, it is important to me for the students to hold something in their hands which acknowledges the world of mathematics and statistics beyond the classroom, and which may be useful to them for many semesters to come. It also mirrors my own experience as a student.

The vision for this document is a more or less self contained, essentially complete, correct, introductory textbook. There should be plenty of exercises for the student, with full solutions for some, and no solutions for others (so that the instructor may assign them for grading). By Sweave’s dynamic nature it is possible to write randomly generated exercises and I had planned to implement this idea already throughout the book. Alas, there are only 24 hours in a day. Look for more in future editions.

Seasoned readers will be able to detect my origins: Probability and Statistical Inference by Hogg and Tanis [44], Statistical Inference by Casella and Berger [13], and Theory of Point Estimation/Testing Statistical Hypotheses by Lehmann [59, 58]. I highly recommend each of
those books to every reader of this one. Some R books with “introductory” in the title that I recommend are Introductory Statistics with R by Dalgaard [19] and Using R for Introductory Statistics by Verzani [87]. Surely there are many, many other good introductory books about R, but frankly, I have tried to steer clear of them for the past year or so to avoid any undue influence on my own writing.

I would like to make special mention of two other books: Introduction to Statistical Thought by Michael Lavine [56] and Introduction to Probability by Grinstead and Snell [37]. Both of these books are free and are what ultimately convinced me to release IPSUR under a free license, too.

Please bear in mind that the title of this book is “Introduction to Probability and Statistics Using R”, and not “Introduction to R Using Probability and Statistics”, nor even “Introduction to Probability and Statistics and R Using Words”. The people at the party are Probability and Statistics; the handshake is R. There are several important topics about R which some individuals will feel are underdeveloped, glossed over, or wantonly omitted. Some will feel the same way about the probabilistic and/or statistical content. Still others will just want to learn R and skip all of the mathematics.

Despite any misgivings: here it is, warts and all. I humbly invite said individuals to take this book, with the GNU Free Documentation License (GNU-FDL) in hand, and make it better. In that spirit there are at least a few ways in my view in which this book could be improved.

Better data. The data analyzed in this book are almost entirely from the datasets package in base R, and here is why:

1. I made a conscious effort to minimize dependence on contributed packages,
2. The data are instantly available, already in the correct format, so we need not take time to manage them, and
3. The data are real.

I made no attempt to choose data sets that would be interesting to the students; rather, data were chosen for their potential to convey a statistical point. Many of the data sets are decades old or more (for instance, the data used to introduce simple linear regression are the speeds and stopping distances of cars in the 1920’s).

In a perfect world with infinite time I would research and contribute recent, real data in a context crafted to engage the students in every example. One day I hope to stumble over said time. In the meantime, I will add new data sets incrementally as time permits.

More proofs. I would like to include more proofs for the sake of completeness (I understand that some people would not consider more proofs to be improvement). Many proofs have been skipped entirely, and I am not aware of any rhyme or reason to the current omissions. I will add more when I get a chance.

More and better graphics: I have not used the ggplot2 package [90] because I do not know how to use it yet. It is on my to-do list.

More and better exercises: There are only a few exercises in the first edition simply because I have not had time to write more. I have toyed with the exams package [38] and I believe that it is a right way to move forward. As I learn more about what the package can do I would like to incorporate it into later editions of this book.
About This Document

IPSUR contains many interrelated parts: the Document, the Program, the Package, and the Ancillaries. In short, the Document is what you are reading right now. The Program provides an efficient means to modify the Document. The Package is an R package that houses the Program and the Document. Finally, the Ancillaries are extra materials that reside in the Package and were produced by the Program to supplement use of the Document. We briefly describe each of them in turn.

The Document

The Document is that which you are reading right now – IPSUR's raison d'être. There are transparent copies (nonproprietary text files) and opaque copies (everything else). See the GNU-FDL in Appendix B for more precise language and details.

IPSUR.tex is a transparent copy of the Document to be typeset with a LaTeX distribution such as MiKTeX or TeX Live. Any reader is free to modify the Document and release the modified version in accordance with the provisions of the GNU-FDL. Note that this file cannot be used to generate a randomized copy of the Document. Indeed, in its released form it is only capable of typesetting the exact version of IPSUR which you are currently reading. Furthermore, the .tex file is unable to generate any of the ancillary materials.

IPSUR-xxx.eps, IPSUR-xxx.pdf are the image files for every graph in the Document. These are needed when typesetting with LaTeX.

IPSUR.pdf is an opaque copy of the Document. This is the file that instructors would likely want to distribute to students.

IPSUR.dvi is another opaque copy of the Document in a different file format.

The Program

The Program includes IPSUR.lyx and its nephew IPSUR.Rnw; the purpose of each is to give individuals a way to quickly customize the Document for their particular purpose(s).

IPSUR.lyx is the source LyX file for the Program, released under the GNU General Public License (GNU GPL) Version 3. This file is opened, modified, and compiled with LyX, a sophisticated open-source document processor, and may be used (together with Sweave) to generate a randomized, modified copy of the Document with brand new data sets for some of the exercises and the solution manuals (in the Second Edition). Additionally, LyX can easily activate/deactivate entire blocks of the document, e.g. the proofs of the theorems, the student solutions to the exercises, or the instructor answers to the problems, so that the new author may choose which sections (s)he would like to include in the final Document (again, Second Edition). The IPSUR.lyx file is all that a person needs (in addition to a properly configured system – see Appendix G) to generate/compile/export to all of the other formats described above and below, which includes the ancillary materials IPSUR.Rdata and IPSUR.R.

IPSUR.Rnw is another form of the source code for the Program, also released under the GNU GPL Version 3. It was produced by exporting IPSUR.lyx into R/Sweave format (.Rnw).
This file may be processed with Sweave to generate a randomized copy of IPSUR.tex – a transparent copy of the Document – together with the ancillary materials IPSUR.Rdata and IPSUR.R. Please note, however, that IPSUR.Rnw is just a simple text file which does not support many of the extra features that LyX offers such as WYSIWYM editing, instantly (de)activating branches of the manuscript, and more.

The Package

There is a contributed package on CRAN, called IPSUR. The package affords many advantages, one being that it houses the Document in an easy-to-access medium. Indeed, a student can have the Document at his/her fingertips with only three commands:

    > install.packages("IPSUR")
    > library(IPSUR)
    > read(IPSUR)

Another advantage goes hand in hand with the Program's license; since IPSUR is free, the source code must be freely available to anyone who wants it. A package hosted on CRAN allows the author to obey the license by default.

A much more important advantage is that the excellent facilities at R-Forge are building and checking the package daily against patched and development versions of the absolute latest pre-release of R. If any problems surface then I will know about them within 24 hours.

And finally, suppose there is some sort of problem. The package structure makes it incredibly easy for me to distribute bug-fixes and corrected typographical errors. As an author I can make my corrections, upload them to the repository at R-Forge, and they will be reflected worldwide within hours. We aren't in Kansas anymore, Dorothy.

Ancillary Materials

These are extra materials that accompany IPSUR. They reside in the /etc subdirectory of the package source.

IPSUR.RData is a saved image of the R workspace at the completion of the Sweave processing of IPSUR. It can be loaded into memory with File ⊲ Load Workspace or with the command load("/path/to/IPSUR.Rdata"). Either method will make every single object in the file immediately available and in memory. In particular, the data BLANK from Exercise BLANK in Chapter BLANK on page BLANK will be loaded. Type BLANK at the command line (after loading IPSUR.RData) to see for yourself.

IPSUR.R is the exported R code from IPSUR.Rnw. With this script, literally every R command from the entirety of IPSUR can be resubmitted at the command line.

Notation

We use the notation x or stem.leaf notation to denote objects, functions, etc. The sequence "Statistics ⊲ Summaries ⊲ Active Dataset" means to click the Statistics menu item, next click the Summaries submenu item, and finally click Active Dataset.
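Returning briefly to the ancillary materials: the relationship between an .Rnw source, the typeset .tex file, the exported .R script, and the saved workspace can be tried on a toy scale. What follows is only a minimal sketch under stated assumptions, not the actual IPSUR build. The file demo.Rnw and every other file name below are hypothetical stand-ins, and the sketch uses only Sweave and Stangle from the utils package together with save and load from base R.

```r
# Hypothetical miniature stand-in for IPSUR.Rnw: a LaTeX document with one R chunk.
work_dir <- tempfile("sweave-demo-")
dir.create(work_dir)
rnw_file <- file.path(work_dir, "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "A sample mean computed from freshly generated data:",
  "<<sample-data>>=",
  "x <- rnorm(10)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw_file)

tex_file   <- file.path(work_dir, "demo.tex")    # plays the role of IPSUR.tex
r_file     <- file.path(work_dir, "demo.R")      # plays the role of IPSUR.R
image_file <- file.path(work_dir, "demo.RData")  # plays the role of IPSUR.RData

# Weave: evaluate the chunk and write a LaTeX file. Because the chunk draws
# random numbers, each run yields a differently "randomized" copy.
utils::Sweave(rnw_file, output = tex_file, quiet = TRUE)

# Tangle: export only the R code from the chunks into a script.
utils::Stangle(rnw_file, output = r_file, quiet = TRUE)

# Resubmit every command from the exported script in a separate environment,
# then save the resulting objects as a workspace image.
run_env <- new.env()
source(r_file, local = run_env)
save(list = ls(run_env), envir = run_env, file = image_file)

# Load the image into a fresh environment; load() returns the object names.
loaded_env <- new.env()
loaded_names <- load(image_file, envir = loaded_env)
print(loaded_names)
```

Loading into a separate environment, rather than the global workspace, is a design choice made here so that the sketch does not overwrite any objects already in the session; calling load with only a file path places the objects directly in the current workspace, as described above.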
Acknowledgements

This book would not have been possible without the firm mathematical and statistical foundation provided by the professors at Bowling Green State University, including Drs. Gábor Székely, Craig Zirbel, Arjun K. Gupta, Hanfeng Chen, Truc Nguyen, and James Albert. I would also like to thank Drs. Neal Carothers and Kit Chan.

I would also like to thank my colleagues at Youngstown State University for their support. In particular, I would like to thank Dr. G. Andy Chang for showing me what it means to be a statistician.

I would like to thank Richard Heiberger for his insightful comments and improvements to several points and displays in the manuscript.

Finally, and most importantly, I would like to thank my wife for her patience and understanding while I worked hours, days, months, and years on a free book. In retrospect, I can't believe I ever got away with it.