
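University of Science and Technology of China: "Applied Statistical Methods" course teaching resources (study handout), Applied Statistical Methods course project guide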


The R-squared is also much better, which further supports our view.

Visualizing the linear regression model

Let us first look at how medical charges relate to a person's age and smoking status.

    attach(insurance)
    # col = smoker only works if smoker is stored as a factor
    plot(age, charges, col = smoker)
    summary(charges[smoker == "no"])
    summary(charges[smoker == "yes"])

We can see an interesting trend here: as people get older, their medical charges rise, which is expected. However, regardless of age, smokers have higher medical charges than non-smokers, as inferred earlier. Smoking status appears to be the single most important variable for predicting medical charges, so for comparison we will try a model that uses only age and smoking status.

    ins_model3 <- lm(charges ~ age + smoker, insurance)
    summary(ins_model3)

Using only age and smoker as independent variables, we built a model with an R-squared of 72.14%, which is comparable to our first model that used all of the original variables. In regression analysis we want a model that is accurate but also as simple as possible. So, if I had to choose between those two, I would choose the third model over the first. However, the second model is better than either of them, so we recommend adopting it.
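The trade-off between accuracy and simplicity described above can be sketched in a few lines of R. This is only an illustration on synthetic data: the data frame `sim`, the coefficients used to generate it, and the extra `noise` predictor are all assumptions invented for the example, not the insurance dataset or the handout's models.

```r
# Synthetic stand-in for the insurance data (NOT the real dataset).
set.seed(42)
n <- 500
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  noise  = rnorm(n)  # a predictor unrelated to charges by construction
)
# Assumed data-generating process, chosen only for illustration.
sim$charges <- 2000 + 250 * sim$age + 20000 * (sim$smoker == "yes") +
  rnorm(n, sd = 5000)

# A small model and a larger model that nests it.
small <- lm(charges ~ age + smoker, data = sim)
large <- lm(charges ~ age + smoker + noise, data = sim)

# Plain R-squared never decreases when a term is added;
# adjusted R-squared and AIC penalise the extra parameter.
fit_stats <- data.frame(
  model  = c("age + smoker", "age + smoker + noise"),
  r2     = c(summary(small)$r.squared,     summary(large)$r.squared),
  adj_r2 = c(summary(small)$adj.r.squared, summary(large)$adj.r.squared),
  aic    = c(AIC(small), AIC(large))
)
print(fit_stats)

# Partial F-test: does the extra term explain significantly more variance?
print(anova(small, large))
```

Because plain R-squared cannot go down when predictors are added, it is usually safer to compare models of different sizes with adjusted R-squared, AIC, or a partial F-test, as in the sketch.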
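    library(ggplot2)   # qplot(), geom_abline()
    library(ggthemes)  # theme_few()
    # Intercept for non-smokers, then for smokers (baseline plus the smokeryes coefficient)
    intercepts <- c(coef(ins_model3)["(Intercept)"],
                    coef(ins_model3)["(Intercept)"] + coef(ins_model3)["smokeryes"])
    lines.df <- data.frame(intercepts = intercepts,
                           slopes = rep(coef(ins_model3)["age"], 2),
                           smoker = levels(insurance$smoker))
    qplot(x = age, y = charges, color = smoker, data = insurance) +
      geom_abline(aes(intercept = intercepts, slope = slopes, color = smoker), data = lines.df) +
      theme_few() +
      scale_y_continuous(breaks = seq(0, 65000, 5000))

We now visualize the regression model we built. The plot has 2 lines, which means we have 2 regression equations with the same slope but different intercepts. The slope of both lines equals the coefficient of the variable age (274.87). As for the intercepts, the smoker intercept is 23,855.30 higher than the non-smoker intercept. This means that, on average, a smoker's medical charges are about 24,000 US dollars higher than those of a non-smoker of the same age. (Smoking is harmful to your health!)

The following part uses Shapley regression; it is not required and is provided for reference only.

Variable Importance

    library(relaimpo)  # calc.relimp()
    ins_model2_shapley <- calc.relimp(ins_model2, type = "lmg")
    ins_model2_shapley
    ins_model2_shapley$lmg

As we have concluded, the second model has the best performance, with the highest R-squared of the 3 models we have built. We will use it to derive the variable importance of our predictors. We will use a statistical method called Shapley value regression, which is a solution that originated ...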
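For reference, the "lmg" measure requested from calc.relimp above averages each predictor's contribution to R-squared over every order in which the predictors can enter the model. The following is a minimal sketch of that averaging for just two predictors, written in base R on synthetic data. The data frame `sim`, its generating coefficients, and the helper `r2` are assumptions made up for illustration; `ins_model2` itself is defined earlier in the handout and is not reproduced here.

```r
# Synthetic stand-in data (NOT the insurance dataset).
set.seed(1)
n <- 400
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
# Assumed data-generating process, chosen only for illustration.
sim$charges <- 2000 + 250 * sim$age + 20000 * (sim$smoker == "yes") +
  rnorm(n, sd = 5000)

# Hypothetical helper: R-squared of a linear model fitted to sim.
r2 <- function(formula) summary(lm(formula, data = sim))$r.squared

r2_age    <- r2(charges ~ age)
r2_smoker <- r2(charges ~ smoker)
r2_full   <- r2(charges ~ age + smoker)

# With two predictors there are two entry orders:
#   age first, then smoker  -> age adds r2_age,            smoker adds r2_full - r2_age
#   smoker first, then age  -> age adds r2_full - r2_smoker, smoker adds r2_smoker
# The LMG share of a predictor is its average contribution over both orders.
lmg <- c(
  age    = mean(c(r2_age,    r2_full - r2_smoker)),
  smoker = mean(c(r2_smoker, r2_full - r2_age))
)
print(lmg)

# The shares add up to the R-squared of the full model.
print(c(sum_of_shares = sum(lmg), full_r2 = r2_full))
```

With more than two predictors the same idea applies, but the average runs over all orderings, which is the part that a package such as relaimpo automates.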

