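University of Science and Technology of China: "Applied Statistical Methods" course teaching resources (lecture notes), Applied Statistical Methods Final Project Guide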

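    ggtitle("Boxplot of Medical Charges by Number of Children")

Compared with the other groups, people with 5 children have lower medical charges on average.

    # Create a new variable from bmi
    insurance$bmi30 <- ifelse(insurance$bmi >= 30, "yes", "no")
    # Charges by obesity status
    describeBy(insurance$charges, insurance$bmi30)
    ggplot(data = insurance, aes(bmi30, charges)) +
      geom_boxplot(fill = c(2:3)) +
      theme_classic() +
      ggtitle("Boxplot of Medical Charges by Obesity")

The idea behind the new variable bmi30 is that 30 is the BMI threshold for obesity, and obesity is known to play a large role in a person's health. As the plot shows, although the median medical charges of obese and non-obese people are the same, their mean charges differ by nearly USD 5,000.

    pairs.panels(insurance[c("age", "bmi", "children", "charges")])

Among the numeric variables, age has the highest correlation with charges. The plot also shows that no pair of numeric variables is highly correlated, so multicollinearity should not be a problem. One more thing to note is that the relationship between age and charges may not be truly linear.

Building the models

    # Fit a model on the variables of the original dataset
    ins_model <- lm(charges ~ age + sex + bmi + children + smoker + region, data = insurance)
    summary(ins_model)

The first model uses only the original variables in the dataset and has an R-squared of 0.7509, meaning that 75.09% of the variation in charges is explained by the independent variables we included. Apart from sex, every independent variable is a statistically significant predictor of medical charges (p-value below the 0.05 significance level).

    # Create a new variable: age squared
    insurance$age2 <- insurance$age^2
    # Second model
    ins_model2 <- lm(charges ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)
    summary(ins_model2)

The first step in this part is to create a new variable, age2, which is the square of age. As noted earlier, the relationship between age and charges may not be perfectly linear, so we add age2 to the model to capture that nonlinearity. Adding the derived variables improves the model markedly: the R-squared is now 0.8664, meaning that 86.64% of the variation is explained by the independent variables in the model. Compared with the previous model, the second model's adjusted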
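R-squared is also much better, which further supports our point.

Visualizing the linear regression model

We first look at how medical charges relate to a person's age and smoking status.

    attach(insurance)
    plot(age, charges, col = smoker)
    summary(charges[smoker == "no"])
    summary(charges[smoker == "yes"])

There is an interesting trend here: medical charges rise as people get older, which is expected. However, smokers have higher medical charges than non-smokers at every age, as inferred earlier. For comparison, we will fit a model that uses only age and smoking status. Smoker looks like the single most important variable for predicting medical charges.

    ins_model3 <- lm(charges ~ age + smoker, insurance)
    summary(ins_model3)

Using only age and smoker as independent variables gives a model with an R-squared of 72.14%, which is comparable to the first model built on all the original variables. In regression analysis we want a model that is accurate but also as simple as possible, so if I had to choose between the two, I would pick the third model over the first. The second model, however, is better than either of them, so it is the one we recommend.

    intercepts <- c(coef(ins_model3)["(Intercept)"],
                    coef(ins_model3)["(Intercept)"] + coef(ins_model3)["smokeryes"])
    lines.df <- data.frame(intercepts = intercepts,
                           slopes = rep(coef(ins_model3)["age"], 2),
                           smoker = levels(insurance$smoker))
    qplot(x = age, y = charges, color = smoker, data = insurance) +
      geom_abline(aes(intercept = intercepts, slope = slopes, color = smoker), data = lines.df) +
      theme_few() +
      scale_y_continuous(breaks = seq(0, 65000, 5000))

This plot visualizes the fitted regression model. It contains 2 lines, which means we have 2 regression equations with the same slope but different intercepts. The slope of both lines equals the coefficient of age (274.87). As for the intercepts, the smokers' intercept is 23,855.30 higher than the non-smokers'. In other words, holding age fixed, smokers' medical charges are about USD 24,000 higher on average. (Smoking is harmful to your health!)

The following part uses Shapley value regression. It is not required and is provided for reference only.

Variable Importance

    ins_model2_shapley <- calc.relimp(ins_model2, type = "lmg")
    ins_model2_shapley
    ins_model2_shapley$lmg

As we have concluded, the second model performs best, with the highest R-squared of the 3 models we have built. We will use it to derive the variable importance of our predictors. We will use a statistical method called Shapley value regression, a solution that originated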
from the game theory concept developed by Lloyd Shapley in the 1950s. Its aim is to allocate predictor importance fairly in regression analysis. Given n independent variables (IVs), we run linear regression models for every combination of these IVs against the dependent variable (DV) and record each model's R-squared. The importance of each IV is then its average marginal contribution to the total R-squared, obtained by decomposing the total R-squared across the IVs.

Say we have 2 IVs, A and B, and a dependent variable Y. We can build 3 models: 1) Y ~ A, 2) Y ~ B, 3) Y ~ A + B, each with its own R-squared. To get the Shapley value of A, we decompose the R-squared of the third model and derive A's marginal contribution, averaging its contribution when it enters after B and when it enters first:

$$\mathrm{ShapleyValue}(A) = \frac{\left[R^2(AB) - R^2(B)\right] + R^2(A)}{2}$$

We have used the calc.relimp() function from the relaimpo package to determine the Shapley values of our predictors.

    sum(ins_model2_shapley$lmg)

As we can see, the Shapley values of our attributes sum to the R-squared of our second regression model. As noted above, Shapley value regression is a variance decomposition method that computes the marginal contribution of each attribute.

    barplot(sort(ins_model2_shapley$lmg, decreasing = TRUE), col = c(2:10),
            main = "Relative Importance of Predictors",
            xlab = "Predictor Labels", ylab = "Shapley Value Regression", font.lab = 2)

The Shapley value of each attribute shows its marginal contribution to the overall R-squared (0.8664) of the second model. We can therefore conclude that, of the 86.64% of total variance explained by our model, a little over 60% is due to the attribute smoker. The results also confirm our earlier hypothesis that smoker is the single most important variable for predicting medical charges.
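The two-predictor decomposition described above can be checked numerically. The following is a minimal sketch in base R on synthetic data; the variables A, B, Y, their coefficients, and the helper r2() are illustrative assumptions and do not come from the insurance dataset.

```r
# Synthetic data: names and coefficients are illustrative assumptions,
# not values taken from the insurance dataset.
set.seed(42)
n <- 200
A <- rnorm(n)
B <- 0.5 * A + rnorm(n)        # B is deliberately correlated with A
Y <- 2 * A + 1 * B + rnorm(n)

# Hypothetical helper: R-squared of the linear model given by a formula.
r2 <- function(formula) summary(lm(formula))$r.squared

r2_A  <- r2(Y ~ A)
r2_B  <- r2(Y ~ B)
r2_AB <- r2(Y ~ A + B)

# Shapley value = marginal contribution averaged over both entry orders
# (entering after the other predictor, and entering first).
shapley_A <- ((r2_AB - r2_B) + r2_A) / 2
shapley_B <- ((r2_AB - r2_A) + r2_B) / 2

print(c(A = shapley_A, B = shapley_B))

# The two Shapley values add up to the R-squared of the full model.
print(all.equal(shapley_A + shapley_B, r2_AB))
```

With more than two predictors the same averaging runs over every ordering of the predictors, which is why a packaged implementation is normally preferred over writing it by hand.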
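You may also notice that smoker is followed by bmi30:smoker, age2, age, and bmi30, most of which are variables we derived rather than variables included in the original dataset. Glad we engineered those variables! :)

Summary

In this analysis we used Shapley value regression to identify the key drivers of medical charges. The method is particularly useful when multicollinearity is present, because it apportions the R-squared among the predictors (although multicollinearity is not an issue in this dataset). We also saw how much feature engineering can do to improve model accuracy. And remember: smoking is harmful to your health!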
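To illustrate the point about feature engineering, here is a small sketch that compares adjusted R-squared with and without a squared-age term. It runs on simulated data, not the insurance dataset: the data frame sim and every coefficient in it are assumptions chosen only so that the example is self-contained.

```r
# Simulated data: variable names mirror the insurance example, but every
# number here is an illustrative assumption, not a value from the real data.
set.seed(1)
n <- 500
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
sim$charges <- 2000 + 10 * sim$age^2 +
  20000 * (sim$smoker == "yes") + rnorm(n, sd = 2000)

# Fit the model without and with the engineered squared-age term.
fit_linear    <- lm(charges ~ age + smoker, data = sim)
sim$age2      <- sim$age^2
fit_quadratic <- lm(charges ~ age + age2 + smoker, data = sim)

# Adjusted R-squared penalizes extra predictors, so it is a fairer
# comparison than plain R-squared when models differ in size.
adj_r2 <- c(
  linear    = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quadratic)$adj.r.squared
)
print(adj_r2)

# Because the models are nested, an F-test is another way to compare them.
print(anova(fit_linear, fit_quadratic))
```

Because the simulated charges are built with a quadratic age effect, the second fit should come out ahead here. On real data the gain depends on how nonlinear the relationship actually is.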