The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Please download worldData.xlsx to your working directory. The workbook contains two sheets: variables, which holds the variable descriptions, and data, which holds the data. Each observation is a country, and the variables are drawn from a variety of sources.
xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data"))
rownames(xr)<-xr$ISO3 # using as rowname the ISO 3-character code for each country
xr$prcpXtemp<-scale(xr$prcp*xr$temp) # values will be high when the climate is warm and wet; low when dry and cold
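To see what each variable measures, the variables sheet can be read in the same way; a minimal sketch (the object name vdesc is just an example):
vdesc<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="variables")) # the variable descriptions
head(vdesc) # glance at the first few descriptions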
We can extract some variables to build a model explaining the variation across countries in Healthy Life Expectancy at birth. This is the number of healthy years the average newborn is expected to live (the figure is from around 2012).
mhh<-c("pctUrb","simlang","pctYlow40","NetPrimEd","sanTot","watTot","CALORIE97","prcpXtemp","GCGDP","X2009.Overall","Christian")
dpv<-"HLEbirth"
xur<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(xur,xr))
##
## Call:
## lm(formula = xur, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6870 -3.0476 0.1886 3.0076 9.8266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.707440 5.548952 -0.848 0.39850
## pctUrb 0.132423 0.038673 3.424 0.00093 ***
## simlang 7.104192 2.851683 2.491 0.01456 *
## pctYlow40 0.478732 0.147753 3.240 0.00168 **
## NetPrimEd 0.067901 0.058044 1.170 0.24516
## sanTot 0.114436 0.037552 3.047 0.00303 **
## watTot 0.135329 0.059723 2.266 0.02585 *
## CALORIE97 0.003235 0.001699 1.904 0.06006 .
## prcpXtemp 1.704599 0.742838 2.295 0.02408 *
## GCGDP -0.317173 0.126362 -2.510 0.01386 *
## X2009.Overall 0.213336 0.071512 2.983 0.00367 **
## Christian -1.977590 1.563444 -1.265 0.20918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.878 on 90 degrees of freedom
## (89 observations deleted due to missingness)
## Multiple R-squared: 0.8523, Adjusted R-squared: 0.8343
## F-statistic: 47.22 on 11 and 90 DF, p-value: < 2.2e-16
The above is our unrestricted model. Next we produce the descriptive statistics table.
u<-data.frame(psych::describe(xr[,c(dpv,mhh)])) # we will output our descriptive statistics now
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file called "descrip.csv"
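The table below can also be rendered directly in an R Markdown report rather than going through a csv file; a minimal sketch, assuming the knitr package is installed:
knitr::kable(u,digits=4) # format the descriptive statistics as a table, rounded to 4 decimals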
| | n | mean | sd | min | max |
|---|---|---|---|---|---|
| HLEbirth | 189 | 57.5197 | 11.1386 | 28.5607 | 74.9935 |
| pctUrb | 190 | 55.2158 | 23.6784 | 11.0000 | 100.0000 |
| simlang | 191 | 0.7874 | 0.2160 | 0.2551 | 1.0000 |
| pctYlow40 | 125 | 16.8480 | 4.3125 | 6.0000 | 25.0000 |
| NetPrimEd | 183 | 85.0164 | 15.8001 | 22.0000 | 100.0000 |
| sanTot | 155 | 67.1935 | 30.1783 | 5.0000 | 100.0000 |
| watTot | 159 | 83.4591 | 18.3033 | 22.0000 | 100.0000 |
| CALORIE97 | 139 | 2,686.6475 | 513.3608 | 1,685.0000 | 3,699.0000 |
| prcpXtemp | 179 | 0.0000 | 1.0000 | -1.8454 | 2.8088 |
| GCGDP | 140 | 15.1893 | 5.6270 | 4.6000 | 32.0000 |
| X2009.Overall | 172 | 59.5581 | 10.5864 | 22.7000 | 87.1000 |
| Christian | 191 | 0.5753 | 0.3738 | 0.0000 | 0.9945 |
Next, we will identify the coefficients with a p-value above 0.10, and use an F-test to confirm that those variables may be dropped from the model.
library(car) # provides linearHypothesis() here and vif() later on
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
drpt<-intersect(rownames(j)[which(j[,4]>.1)],mhh) # get the names of the variables with p-values above 0.10
linearHypothesis(q,drpt) # perform the F-test
## Linear hypothesis test
##
## Hypothesis:
## NetPrimEd = 0
## Christian = 0
##
## Model 1: restricted model
## Model 2: HLEbirth ~ pctUrb + simlang + pctYlow40 + NetPrimEd + sanTot +
## watTot + CALORIE97 + prcpXtemp + GCGDP + X2009.Overall +
## Christian
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 2208.3
## 2 90 2141.3 2 67.078 1.4097 0.2496
The p-value of the F-test is above 0.05, so we cannot reject the null hypothesis that the true values of these coefficients are zero. We therefore drop the variables; the resulting restricted model is our final model.
mhh<-setdiff(mhh,drpt)
fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(fr,xr))
##
## Call:
## lm(formula = fr, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.023 -2.780 0.112 2.842 11.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.752275 5.145257 -0.535 0.593985
## pctUrb 0.121508 0.037929 3.204 0.001860 **
## simlang 7.690603 2.691687 2.857 0.005274 **
## pctYlow40 0.539948 0.137020 3.941 0.000157 ***
## sanTot 0.128905 0.033981 3.793 0.000264 ***
## watTot 0.157277 0.056965 2.761 0.006944 **
## CALORIE97 0.003503 0.001682 2.083 0.040036 *
## prcpXtemp 1.753336 0.717317 2.444 0.016399 *
## GCGDP -0.349862 0.120966 -2.892 0.004763 **
## X2009.Overall 0.193388 0.069378 2.787 0.006441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.875 on 93 degrees of freedom
## (88 observations deleted due to missingness)
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8344
## F-statistic: 58.1 on 9 and 93 DF, p-value: < 2.2e-16
Now that we have the final model, we can output our regression results table.
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
j<-data.frame(j,vif=c(NA,vif(q))) # adding the VIF as the last column of the table
write.csv(j,file="regres.csv") # this writes the object j to a csv-format file called "regres.csv"
| | Estimate | Std. Error | t value | Pr(>\|t\|) | vif |
|---|---|---|---|---|---|
| (Intercept) | -2.7523 | 5.1453 | -0.5349 | 0.5940 | NA |
| pctUrb | 0.1215 | 0.0379 | 3.2036 | 0.0019 | 2.8667 |
| simlang | 7.6906 | 2.6917 | 2.8572 | 0.0053 | 1.4669 |
| pctYlow40 | 0.5399 | 0.1370 | 3.9407 | 0.0002 | 1.5231 |
| sanTot | 0.1289 | 0.0340 | 3.7934 | 0.0003 | 4.3245 |
| watTot | 0.1573 | 0.0570 | 2.7609 | 0.0069 | 3.8197 |
| CALORIE97 | 0.0035 | 0.0017 | 2.0826 | 0.0400 | 2.8525 |
| prcpXtemp | 1.7533 | 0.7173 | 2.4443 | 0.0164 | 1.8877 |
| GCGDP | -0.3499 | 0.1210 | -2.8922 | 0.0048 | 1.6359 |
| X2009.Overall | 0.1934 | 0.0694 | 2.7875 | 0.0064 | 1.7612 |
You have used a t-statistic to judge whether an independent variable is significantly related to the dependent variable. Here we look at a related question: of all the significant independent variables, which exerts the greatest influence on the dependent variable? There are three main ways to answer this question.
The first approach uses standardized coefficients. We rescale our coefficients so they can be interpreted as the change in the dependent variable (measured in units of the standard deviation of the dependent variable) for a one standard deviation increase in the independent variable. The formula is simple: multiply each estimated coefficient by the ratio of the standard deviation of the independent variable to the standard deviation of the dependent variable.
\[\hat\beta_{j}^\prime=\hat\beta_{j} \frac{sd(x_j)}{sd(y)} \]
When calculating the standard deviations, one should use the same data that were used to estimate the coefficients. R automatically drops any observation that contains a missing value (this is called listwise deletion), so some of the observations in the original data set may not have been used in the regression. Fortunately, R stores the exact set of data used to estimate the coefficients in the lm object; this is q$model below.
#--standardized coefficients---
rr<-q$model # The data that were actually used in the regression
sdv<-apply(rr,2,sd) # calculate the standard deviation for each column
cf<-coef(q) # extract the coefficient estimates
stdcf<-cf[mhh]*sdv[mhh]/sdv[dpv] # use the names to make sure everything is matched up properly
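A quick way to inspect the result is to print the standardized coefficients sorted by absolute size (a small usage sketch):
round(stdcf[order(abs(stdcf),decreasing=TRUE)],4) # largest standardized effects first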
The second approach uses elasticities. We rescale our coefficients so they can be interpreted as the percent change in the dependent variable for a one percent increase in the independent variable.
\[\xi_{x_j}^y=\frac{\frac{\Delta y}{y}}{\frac{\Delta x_j}{x_j}}=\frac{\Delta y}{\Delta x_j}\frac{x_j}{y} = \hat\beta_j \frac{\bar x_j}{\bar y} \]
The rightmost equality holds for a linear function. It is customary to evaluate \(x_j\) and \(y\) at their means, but the elasticity of a linear function actually changes over the range of the data.
#--elasticities---
rr<-q$model
mnv<-apply(rr,2,mean)
cf<-coef(q)
elast<-cf[mhh]*mnv[mhh]/mnv[dpv]
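Because the elasticity of a linear model changes over the range of the data, it can also be checked at other evaluation points; a small sketch using the medians instead of the means (the choice of median is arbitrary):
mdv<-apply(rr,2,median) # medians of the variables actually used in the regression
elastMed<-cf[mhh]*mdv[mhh]/mdv[dpv] # the same formula, evaluated at the medians
round(cbind(at.mean=elast,at.median=elastMed),4) # compare the two evaluation points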
The third approach partitions \(R^2\). Since independent variables are almost always somewhat collinear, much of \(R^2\) cannot be assigned to a single variable. A popular algorithm for teasing apart these overlapping effects enters the independent variables into the model one at a time, recording the change in \(R^2\) at each step. The first variables to enter grab all of the overlapping effects, but by trying all possible orders of entry and averaging across those orders, one obtains a reasonable measure of each independent variable's contribution to the model (this is the lmg measure computed below).
#--partitioning R2--this is slow with more than 10 or so variables--
library(relaimpo)
r2p<-calc.relimp(q)$lmg
# --put all three measures in the same data.frame. Note that they do not completely agree--
j<-data.frame(elast,stdcf,r2p)
j$z<-rowMeans(scale(abs(j))) # quick and dirty way to get the aggregate influence
j<-data.frame(round(summary(q)$coefficients[rownames(j),c(1,4)],4),mnv=round(mnv[rownames(j)],4),j)
j<-j[order(j$z),]
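If this table is wanted for the written report, it can be exported in the same way as the earlier tables; a minimal sketch (the file name is just an example):
write.csv(round(j,4),file="influence.csv") # write the influence-measure table to a csv-format file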
The table below shows the estimated coefficients and p-values, the mean values of the independent variables, our three influence measures, and our quick-and-dirty aggregate influence measure. Clearly, the estimated coefficients cannot be used as a direct measure of influence: coefficients take on large values when the mean of the variable is small (e.g., simlang) and small values when the mean is large (e.g., CALORIE97).
| | Estimate | Pr(>\|t\|) | mnv | elast | stdcf | r2p | z |
|---|---|---|---|---|---|---|---|
| prcpXtemp | 1.7533 | 0.0164 | -0.1250 | -0.0039 | 0.1353 | 0.0202 | -1.3020 |
| GCGDP | -0.3499 | 0.0048 | 14.2990 | -0.0899 | -0.1491 | 0.0194 | -0.8163 |
| simlang | 7.6906 | 0.0053 | 0.7701 | 0.1064 | 0.1394 | 0.0561 | -0.5995 |
| pctYlow40 | 0.5399 | 0.0002 | 16.4757 | 0.1598 | 0.1960 | 0.0424 | -0.1001 |
| CALORIE97 | 0.0035 | 0.0400 | 2,604.5728 | 0.1639 | 0.1417 | 0.1262 | 0.0531 |
| X2009.Overall | 0.1934 | 0.0064 | 60.2699 | 0.2094 | 0.1491 | 0.0990 | 0.1717 |
| pctUrb | 0.1215 | 0.0019 | 51.1748 | 0.1117 | 0.2186 | 0.1207 | 0.1991 |
| watTot | 0.1573 | 0.0069 | 83.8641 | 0.2370 | 0.2174 | 0.1654 | 1.0275 |
| sanTot | 0.1289 | 0.0003 | 66.9515 | 0.1551 | 0.3179 | 0.1996 | 1.3665 |
OLS coefficients are derived so as to minimize the sum of squared errors. Because the errors are squared, outlier observations can have a large effect. It is often useful, particularly with small data sets, to see how the presence of a particular observation affects one's results. The seminal source for influence diagnostics is:
The plot below illustrates how an outlier (dark point at lower right) can affect the regression line: the green line was estimated with data that included the outlier; the red line, with data that excluded it.
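A similar demonstration can be generated with a few lines of simulated data; a minimal sketch (all of the numbers are arbitrary):
set.seed(1) # arbitrary seed, for reproducibility
x<-runif(30,0,10) # simulated independent variable
y<-2+0.8*x+rnorm(30) # well-behaved dependent variable
xo<-c(x,9.5) # add a single outlying observation at the lower right...
yo<-c(y,1) # ...with a response far below the trend
plot(xo,yo,pch=19) # scatter of all points, outlier included
abline(lm(yo~xo),col="green") # green line: estimated with the outlier included
abline(lm(y~x),col="red") # red line: estimated with the outlier excluded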
The dffits measure gives an overall sense of the influence of a particular observation: the ith observation is deleted, the parameters are re-estimated, and a new predicted value for the ith observation is calculated. dffits is the scaled change in the predicted value when the observation is included. Values larger (in absolute value) than \(2\sqrt{\frac{p}{n}}\) are worth thinking about (\(n=\) number of observations; \(p=\) number of estimated parameters).
Below are only those observations that were especially influential. It can often help in model building to think about the forces that make these places different: perhaps we left an important variable out of the model?
#--dffits: just look at the most influential---
o<-as.matrix(dffits(q))
cutoff<-2*(q$rank/length(q$residuals))^.5 # cutoff=2*sqrt(p/n)
length(zii<-rownames(o)[which(abs(o)>cutoff)])
## [1] 10
j<-data.frame(cntry=xr[zii,"country"],dft=o[zii,],rr[zii,])
| ISO3 | cntry | dft | HLEbirth | pctUrb | simlang | pctYlow40 | sanTot | watTot | CALORIE97 | prcpXtemp | GCGDP | X2009.Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BWA | Botswana | -1.5699 | 35.6932 | 59 | 0.8105 | 9 | 47 | 96 | 2,183 | -0.46693740 | 26.6 | 69.7 |
| SLE | Sierra Leone | -0.9241 | 28.5607 | 42 | 0.3338 | 17 | 11 | 53 | 2,035 | 1.64796571 | 11.7 | 47.8 |
| SWZ | Swaziland | -0.8899 | 34.1755 | 25 | 0.9731 | 13 | 50 | 60 | 2,483 | -0.02092433 | 21.9 | 59.1 |
| ZWE | Zimbabwe | -0.7210 | 33.5699 | 37 | 0.7643 | 13 | 46 | 81 | 2,145 | -0.11224243 | 18.0 | 22.7 |
| LSO | Lesotho | -0.7118 | 31.3823 | 19 | 1.0000 | 6 | 36 | 78 | 2,243 | -0.51255707 | 18.3 | 49.7 |
| MWI | Malawi | -0.6843 | 34.9106 | 18 | 0.7806 | 18 | 60 | 76 | 2,043 | 0.33412548 | 14.4 | 53.7 |
| MDG | Madagascar | 0.6255 | 48.5957 | 27 | 0.9585 | 13 | 12 | 47 | 2,021 | 0.81384149 | 7.2 | 62.2 |
| NER | Niger | 0.7735 | 35.5151 | 17 | 0.4106 | 10 | 7 | 42 | 2,097 | -0.53133366 | 13.3 | 53.8 |
| CHN | China | 0.7780 | 64.1457 | 42 | 0.8947 | 13 | 65 | 88 | 2,897 | -1.09896244 | 12.1 | 53.2 |
| BOL | Bolivia | 0.8110 | 54.4196 | 65 | 0.4279 | 7 | 43 | 86 | 2,174 | 0.25099103 | 14.2 | 53.6 |
The dfbeta measure is especially useful: it gives a scaled measure of how a particular regression coefficient has changed because of the inclusion of a particular observation (positive values mean that the coefficient is higher because of inclusion; negative values mean that it is lower). The recommended cutoff for especially influential observations is \(\frac{2}{\sqrt{n}}\).
#--dfbetas: just look at the most influential---
o<-data.frame(dfbetas(q))[,-1] #remove first column--for the intercept
mx<-NULL
nnx<-NROW(o)
for (i in names(o)){
z<-which(abs(o[,i])>2/nnx^.5) #cutoff=2/sqrt(n)
pz<-cbind(data.frame(dfbeta=round(o[z,i],4)),varb=i,ISO3=row.names(o)[z],xval=q$model[z,i],yval=q$model[z,dpv])
mx<-rbind(mx,pz[order(pz[,1]),])
}
mx<-merge(xr[,c("ISO3","country")],mx,by="ISO3")
mx<-mx[order(mx$varb,mx$dfbeta),]
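The resulting table can then be examined directly; for example (usage sketch):
head(mx,10) # first ten rows: grouped by coefficient, most negative changes first within each group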
Running these diagnostics allows one to identify observations that are out of sync with the rest of the data. This could be because of data entry errors, or it could be because the model omits a variable that exerts an important influence on that particular observation. Examining these diagnostics might therefore suggest new independent variables, or it may result in dropping one or more observations.
Homework: due in one week, two hours before class.
Download to your working directory the file worldData.xlsx. The workbook contains two worksheets: variables and data. The data consist of 92 variables, mostly having to do with economics and demographics. The observations are 191 countries.
Using these data, build a model explaining GDPpcPPP (GDP per capita at purchasing power parity, 2009). Follow the same modeling procedure that you followed in the previous homework.
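A minimal starting sketch, assuming the same workflow as above (the candidate variable list is yours to choose, so the lines that depend on it are left as comments; the names shown are placeholders):
xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data")) # read the data sheet
rownames(xr)<-xr$ISO3 # ISO 3-character codes as row names
dpv<-"GDPpcPPP" # the new dependent variable
# mhh<-c("var1","var2") # your candidate independent variables go here
# fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
# summary(q<-lm(fr,xr))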
I expect two documents from each group: 1) an MS Word document containing your four tables and discussions; 2) your R script.