Lecture 3: Relationship Between Variables

Individuals and Variables

• Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.
• A variable is any characteristic of an individual. A variable can take different values for different individuals.
• The 1997 survey data set, for example, includes data about a sample of women of childbearing age.
• The women are the individuals described by the data set. For each individual, the data contain the values of variables such as date of birth, place of residence, and educational level.
• In practice, any set of data is accompanied by background information that helps us understand the data.
• The individuals described are the women. Each row records data on one individual. You will often see each row of data called a case. Each column contains the values of one variable for all the individuals.
• Most data sets follow this format: each row is an individual, and each column is a variable.
Measuring center: the mean

• A description of a distribution almost always includes a measure of its center or average. The most common measure of center is the ordinary arithmetic average, or mean.
• To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Or in more compact notation:

$$\bar{x} = \frac{1}{n} \sum x_i$$

Example: mean age at first marriage

• Q105: When were you married for the first time?

Statistics: age at first marriage
N (valid)      4134
N (missing)     872
Mean          21.04

Frequency table: age at first marriage

Age               Frequency   Percent   Valid Percent   Cumulative Percent
11                        2       0.0             0.0                  0.0
12                        2       0.0             0.0                  0.1
13                        4       0.1             0.1                  0.2
14                       13       0.3             0.3                  0.5
15                       38       0.8             0.9                  1.4
16                      111       2.2             2.7                  4.1
17                      185       3.7             4.5                  8.6
18                      309       6.2             7.5                 16.1
19                      528      10.5            12.8                 28.8
20                      604      12.1            14.6                 43.4
21                      609      12.2            14.7                 58.2
22                      589      11.8            14.2                 72.4
23                      435       8.7            10.5                 82.9
24                      307       6.1             7.4                 90.4
25                      191       3.8             4.6                 95.0
26                       90       1.8             2.2                 97.2
27                       58       1.2             1.4                 98.6
28                       27       0.5             0.7                 99.2
29                       18       0.4             0.4                 99.7
30                        3       0.1             0.1                 99.7
31                        7       0.1             0.2                 99.9
32                        1       0.0             0.0                 99.9
33                        2       0.0             0.0                100.0
34                        1       0.0             0.0                100.0
Total (valid)          4134      82.6           100.0
Missing (system)        872      17.4
Total                  5006     100.0
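The mean in the statistics table can be reproduced from the frequency table by weighting each age by its frequency. The sketch below is a minimal illustration, not the procedure the statistical package uses internally; the helper name `weighted_mean` is an assumption introduced here, and the ages are taken as coded (11, 12, ..., 34).

```python
# Frequencies of age at first marriage for single ages 11 to 34,
# copied from the frequency table above.
ages = list(range(11, 35))
freqs = [2, 2, 4, 13, 38, 111, 185, 309, 528, 604, 609, 589,
         435, 307, 191, 90, 58, 27, 18, 3, 7, 1, 2, 1]


def weighted_mean(values, weights):
    """Mean of `values` when values[i] occurs weights[i] times.

    Hypothetical helper for illustration: sum of value * frequency,
    divided by the total number of observations.
    """
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


n_valid = sum(freqs)                   # number of valid cases
mean_age = weighted_mean(ages, freqs)  # mean of the ages as coded

print(n_valid)             # 4134
print(round(mean_age, 2))  # 21.04
```

Summing value times frequency gives the same result as adding all 4134 individual ages one by one, which is why a frequency table is enough to recover the mean.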
[Figure: histogram of age at first marriage. Horizontal axis: age at first marriage, with bars centered at 11.5 through 34.5. Vertical axis: frequency, 0 to 700. Std. Dev = 2.71, Mean = 21.0, N = 4134.]

An important point

• Since a single age refers to a 12-month age range, accuracy in the calculations requires that the mid-point of the range be used to represent the average age of all members of the group. Use 11.5, 12.5, 13.5, ..., 34.5 instead of 11, 12, 13, ..., 34.

Measuring center: the median

• The median is the mid-point of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center observation in the ordered list. Find the location of the median by counting (n+1)/2 observations up (down) from the bottom (top) of the list.
3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. The location of the median is again (n+1)/2 from the bottom (top) of the list, which now falls halfway between the two center observations.
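The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name and the two small samples are hypothetical, chosen only to exercise the odd and even cases.

```python
def median_by_location(observations):
    """Median found with the (n + 1) / 2 location rule."""
    ordered = sorted(observations)   # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    location = (n + 1) / 2           # 1-based position in the ordered list
    if n % 2 == 1:
        # step 2: odd n, the location is a whole number
        return ordered[int(location) - 1]
    # step 3: even n, the location ends in .5, so average the center pair
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_sample = [7, 3, 5]        # hypothetical values, n = 3
even_sample = [8, 2, 6, 4]    # hypothetical values, n = 4

print(median_by_location(odd_sample))   # 5
print(median_by_location(even_sample))  # 5.0
```

Sorting inside the function means the caller does not need to order the data first, which is the step most easily forgotten when working by hand.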
Examples

22 25 34 35 41 41 46 46 46 47 49

There is an odd number of observations, so there is one center observation. This is the median. It is 41.
n = 11, location of the median = (11+1)/2 = 6

9 22 32 33 39 39 42 46 49 52

The count of observations n = 10 is even. There is no center observation, but there is a center pair. These are the two 39s. The median is the average of these two observations, which is 39.
n = 10, location of the median = (10+1)/2 = 5.5

The median age at first marriage

Statistics: age at first marriage
N (valid)      4134
N (missing)     872
Median        21.00

The formula for the median age at first marriage:

$$\text{Median} = l + \frac{\frac{N}{2} - F}{f} \times i$$

• l = lower limit of the age group containing the median
• N = total population
• F = cumulative frequency up to the age group containing the median
• f = frequency of the age group containing the median
• i = the size of the interval of the age group containing the median
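As a worked illustration, the formula can be applied to the frequency table of age at first marriage given earlier. The sketch assumes that each single age covers a one-year interval starting at the coded age (so l = 21 and i = 1 for the group containing the median); the function name `grouped_median` is hypothetical.

```python
# Frequencies of age at first marriage for single ages 11 to 34,
# copied from the frequency table earlier in the lecture.
ages = list(range(11, 35))
freqs = [2, 2, 4, 13, 38, 111, 185, 309, 528, 604, 609, 589,
         435, 307, 191, 90, 58, 27, 18, 3, 7, 1, 2, 1]


def grouped_median(lower_limits, frequencies, width):
    """Median = l + ((N / 2 - F) / f) * i for grouped data.

    Assumes every group has the same interval size `width` and that
    `lower_limits` is sorted in increasing order.
    """
    half = sum(frequencies) / 2          # N / 2
    cumulative = 0                       # F, built up group by group
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= half:
            # This group contains the median: interpolate inside it.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no observations")


median_age = grouped_median(ages, freqs, 1)
print(round(median_age, 2))  # 21.44
```

Here N/2 = 2067, F = 1796 (the cumulative frequency through age 20) and f = 609, so the median is 21 + 271/609. The interpolated value differs from the 21.00 in the statistics table because that table reports the middle observation as coded, without interpolating within the age group.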
• M = A + AM

[Figure: a line segment from A to B with the median M marked between the two end points.]

$$A + AM = l + \frac{\frac{N}{2} - F}{f} \times i$$

Comparing the mean and the median

4 8 12: Mean = (4+8+12)/3 = 8, Median = 8
4 8 120: Mean = (4+8+120)/3 = 44, Median = 8

The median, unlike the mean, is resistant.

• The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.

Measuring spread: the standard deviation

• Use the mean to measure center and the standard deviation to measure spread.
• The standard deviation measures spread by looking at how far the observations are from their mean.
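The comparison of 4, 8, 12 with 4, 8, 120 can be checked with Python's standard `statistics` module. The sketch also lists how far each observation lies from its mean, which is the quantity the standard deviation is built on; the variable names are assumptions made for this illustration.

```python
from statistics import mean, median

without_outlier = [4, 8, 12]
with_outlier = [4, 8, 120]    # the same data with 12 replaced by 120

for data in (without_outlier, with_outlier):
    center = mean(data)
    # Distance of each observation from the mean of its own data set
    deviations = [x - center for x in data]
    print(data, "mean:", center, "median:", median(data),
          "deviations:", deviations)
```

Replacing one observation moves the mean from 8 to 44 while the median stays at 8, and every deviation from the mean changes along with it.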
The standard deviation

• The variance of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations $x_1, x_2, \ldots, x_n$ is

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Or, more compactly,

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

The standard deviation is the square root of the variance:

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

Calculating the standard deviation

1792 1666 1362 1614 1460 1867 1439

The mean:

$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600$$

[Figure: the seven observations plotted on a number line, with the mean $\bar{x} = 1600$ marked. For x = 1439 the deviation is -161; for x = 1792 the deviation is 192.]
Observations    Deviations                  Squared deviations
$x_i$           $x_i - \bar{x}$             $(x_i - \bar{x})^2$
1792            1792 - 1600 = 192           192^2 = 36864
1666            1666 - 1600 = 66            66^2 = 4356
1362            1362 - 1600 = -238          (-238)^2 = 56644
1614            1614 - 1600 = 14            14^2 = 196
1460            1460 - 1600 = -140          (-140)^2 = 19600
1867            1867 - 1600 = 267           267^2 = 71289
1439            1439 - 1600 = -161          (-161)^2 = 25921
                sum = 0                     sum = 214870

• The variance is the sum of the squared deviations divided by one less than the number of observations:

$$s^2 = \frac{214870}{6} = 35811.67$$

• The standard deviation is the square root of the variance:

$$s = \sqrt{35811.67} = 189.24$$

Note that the "average" in the variance divides the sum by one fewer than the number of observations, that is, n-1 rather than n. The reason is that the deviations always sum to exactly 0, so that knowing n-1 of them determines the last one. Only n-1 of the squared deviations can vary freely, and we average by dividing the total by n-1. The number n-1 is called the degrees of freedom of the variance or standard deviation.
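The table and the two calculations above can be retraced step by step in Python. This is a minimal sketch using only the seven observations from the example; the variable names are assumptions introduced for illustration.

```python
from math import sqrt

observations = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(observations)
x_bar = sum(observations) / n                     # the mean
deviations = [x - x_bar for x in observations]    # x_i - x_bar
squared = [d ** 2 for d in deviations]            # (x_i - x_bar)^2

# Divide by n - 1 (the degrees of freedom), not by n.
variance = sum(squared) / (n - 1)
std_dev = sqrt(variance)

print(x_bar)               # 1600.0
print(sum(deviations))     # 0.0
print(sum(squared))        # 214870.0
print(round(variance, 2))  # 35811.67
print(round(std_dev, 2))   # 189.24
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n-1 divisor, so they give the same results for these data.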
Standard deviation

• s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As observations become more spread out about their mean, s gets larger.
• s has the same units of measurement as the original observations.
• Like the mean, s is not resistant. Strong skewness or a few outliers can greatly increase s.

Independent, Dependent and Control Variables

• Independent variables: also known as explanatory or predictor variables (i.e. they explain or predict the dependent variable) or covariates. Independent variables can be numeric or categorical data. The statistical procedures to be used to analyse the data will depend on whether the variables are numeric or categorical.
• Dependent variables: also called outcome or response variables (i.e. they are outcomes or responses to the independent variables). Most statistical procedures require the dependent variable to be either numeric or dichotomous. However, dependent variables with more than two categories are possible.
• Control variables: these are independent variables that have a known or expected relationship to the dependent variable. Therefore, we are not interested in examining their relationship to the dependent variable. Nonetheless, they still have to be included in the analysis because they also have a known or expected relationship to the other independent variables. We have to control for their effect when we examine the relationship of the other independent variables to the dependent variable. Hence they are called control variables. Control variables can be numeric or categorical.
• Age is often a control variable in demographic analysis. This is because many demographic events are influenced by a person's age, so that we usually know the nature of the relationship between age and various demographic events such as getting married, giving birth and dying.
• For example, we know that the number of children a woman has is related to her age; older women will have more children than younger women, all else being equal. However, other personal factors such as level of education can also be related to a person's age. For example, older people tend to have fewer years of schooling than younger people. Thus, if we want to examine the relationship between women's level of education and the number of children they have, we need to control for age in the analysis.
• Sex is also a common control variable in demographic analysis. If we know that males and females have different risks of experiencing a demographic event, we should control for sex in the data analysis.
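One simple way to control for age in the education and fertility example is to compare education levels within each age group separately (stratification). The sketch below shows the mechanics only. The records are invented for illustration and are not taken from the 1997 survey; the function name and the category labels are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical records: (age group, education level, number of children).
# These values are made up to show the mechanics, not survey results.
records = [
    ("20-29", "primary", 2), ("20-29", "primary", 3),
    ("20-29", "secondary", 1), ("20-29", "secondary", 2),
    ("40-49", "primary", 5), ("40-49", "primary", 6),
    ("40-49", "secondary", 3), ("40-49", "secondary", 4),
]


def mean_children_by_age_and_education(rows):
    """Mean number of children in each (age group, education) cell."""
    totals = defaultdict(lambda: [0, 0])   # cell -> [sum of children, count]
    for age_group, education, children in rows:
        cell = totals[(age_group, education)]
        cell[0] += children
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


stratified = mean_children_by_age_and_education(records)
for (age_group, education), average in sorted(stratified.items()):
    print(age_group, education, average)
```

Comparing education levels only within the same age group keeps age from being mixed into the comparison. Regression models that include age as an additional independent variable serve the same purpose when there are many control variables.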
Relationship Between Variables

• In survey data analysis, we are often interested in examining the relationship between two variables or the relationships between one or more independent variables and a dependent variable.
• The relationship between two variables, or between one or more independent variables and a dependent variable, may be one of causation or association. Usually, the specification of a relationship of causation is based on theory or hypothesis.
• In examining a relationship of causation, the objective is to see whether the independent variable(s) has an effect on the dependent variable. Sometimes, the direction of causation may be unclear, in which case we can test only for an association between the variables. In this case, the objective is to see whether changes in one or more (independent) variables result in a real change in the other (dependent) variable.

[Figure: diagram relating independent variables (explanatory and control variables) to dependent variables (response variables).]

• The type of relationship you specify to be analysed is determined by theoretical arguments. Therefore, you need to be sure of the theory on which you are basing your data analysis. This will guide you in formulating a theoretical (or conceptual or analytical) framework that will in turn determine the statistical model with which you analyse your data.