1 Try downloading this webpage

The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.

2 Resources to learn R

3 Miscellaneous international data

Please download worldData.xlsx to your working directory. The xlsx file contains two sheets: one containing variable descriptions, labeled variables, and the other containing data, labeled data. Each observation is a country, and the variables are drawn from a variety of sources.

xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data"))
rownames(xr)<-xr$ISO3 # using as rowname the ISO 3-character code for each country
xr$prcpXtemp<-scale(xr$prcp*xr$temp) # values will be high when the climate is warm and wet; low when dry and cold
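
If you want to check what each variable measures, the descriptions are on the workbook's other sheet (labeled variables, as noted above). A minimal sketch for inspecting it:

vdesc<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="variables")) # read the sheet of variable descriptions
head(vdesc) # inspect the first few rows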

4 A model of healthy life expectancy at birth

We can extract some variables to build a model explaining the variation across countries in Healthy Life Expectancy at birth. This is the number of healthy years the average newborn is expected to live (the figure is from around 2012).

4.1 Unrestricted model

mhh<-c("pctUrb","simlang","pctYlow40","NetPrimEd","sanTot","watTot","CALORIE97","prcpXtemp","GCGDP","X2009.Overall","Christian")
dpv<-"HLEbirth"
xur<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(xur,xr))
## 
## Call:
## lm(formula = xur, data = xr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6870  -3.0476   0.1886   3.0076   9.8266 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.707440   5.548952  -0.848  0.39850    
## pctUrb         0.132423   0.038673   3.424  0.00093 ***
## simlang        7.104192   2.851683   2.491  0.01456 *  
## pctYlow40      0.478732   0.147753   3.240  0.00168 ** 
## NetPrimEd      0.067901   0.058044   1.170  0.24516    
## sanTot         0.114436   0.037552   3.047  0.00303 ** 
## watTot         0.135329   0.059723   2.266  0.02585 *  
## CALORIE97      0.003235   0.001699   1.904  0.06006 .  
## prcpXtemp      1.704599   0.742838   2.295  0.02408 *  
## GCGDP         -0.317173   0.126362  -2.510  0.01386 *  
## X2009.Overall  0.213336   0.071512   2.983  0.00367 ** 
## Christian     -1.977590   1.563444  -1.265  0.20918    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.878 on 90 degrees of freedom
##   (89 observations deleted due to missingness)
## Multiple R-squared:  0.8523, Adjusted R-squared:  0.8343 
## F-statistic: 47.22 on 11 and 90 DF,  p-value: < 2.2e-16
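
The output notes that 89 observations were deleted due to missingness: lm() silently drops any row with a missing value in any model variable. A quick sketch for seeing how much data each candidate variable costs you:

colSums(!is.na(xr[,c(dpv,mhh)])) # number of non-missing values for each model variable
sum(complete.cases(xr[,c(dpv,mhh)])) # number of complete cases actually available to lm()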

The above is our unrestricted model; next we create the descriptive statistics table for its variables.

u<-data.frame(psych::describe(xr[,c(dpv,mhh)])) # we will output our descriptive statistics now
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file called "descrip.csv"
Descriptive statistics
n mean sd min max
HLEbirth 189 57.5197 11.1386 28.5607 74.9935
pctUrb 190 55.2158 23.6784 11.0000 100.0000
simlang 191 0.7874 0.2160 0.2551 1.0000
pctYlow40 125 16.8480 4.3125 6.0000 25.0000
NetPrimEd 183 85.0164 15.8001 22.0000 100.0000
sanTot 155 67.1935 30.1783 5.0000 100.0000
watTot 159 83.4591 18.3033 22.0000 100.0000
CALORIE97 139 2,686.6475 513.3608 1,685.0000 3,699.0000
prcpXtemp 179 0.0000 1.0000 -1.8454 2.8088
GCGDP 140 15.1893 5.6270 4.6000 32.0000
X2009.Overall 172 59.5581 10.5864 22.7000 87.1000
Christian 191 0.5753 0.3738 0.0000 0.9945
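
The group homework below asks you to include variable descriptions in your own descriptive statistics table. One way to do that is to match descriptions from the variables sheet to the row names of u; a sketch in which the column names variable and description are hypothetical (check the sheet for the actual names):

vdesc<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="variables")) # sheet of variable descriptions
u$description<-vdesc$description[match(rownames(u),vdesc$variable)] # hypothetical column names: variable, description
write.csv(u,file="descrip.csv") # rewrite the csv with descriptions included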

4.2 F-test to identify irrelevant variables

Next, we will identify the coefficients with a p-value above 0.10, and use an F-test to check whether those variables can be jointly dropped from the model.

library(car) # linearHypothesis() and vif() come from the car package
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
drpt<-intersect(rownames(j)[which(j[,4]>.1)],mhh) # get the names of the variables with p-values above 0.10
linearHypothesis(q,drpt) # perform the F-test
## Linear hypothesis test
## 
## Hypothesis:
## NetPrimEd = 0
## Christian = 0
## 
## Model 1: restricted model
## Model 2: HLEbirth ~ pctUrb + simlang + pctYlow40 + NetPrimEd + sanTot + 
##     watTot + CALORIE97 + prcpXtemp + GCGDP + X2009.Overall + 
##     Christian
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     92 2208.3                           
## 2     90 2141.3  2    67.078 1.4097 0.2496

The p-value of the F-test is above 0.05, so we cannot reject the null hypothesis that the true values of these coefficients are zero. We therefore drop the variables, and the resulting restricted model becomes our final model.
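
The same joint test can be reproduced with base R's anova(), provided the restricted model is fit on the same complete cases that the unrestricted model used. A minimal sketch, reusing q and drpt from above:

frR<-reformulate(setdiff(mhh,drpt),response=dpv) # formula without the dropped variables
qR<-lm(frR,data=q$model) # q$model holds exactly the observations used in the unrestricted fit
anova(qR,q) # the F statistic and p-value should match linearHypothesis(q,drpt)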

4.3 Restricted model

mhh<-setdiff(mhh,drpt)
fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(fr,xr))
## 
## Call:
## lm(formula = fr, data = xr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.023  -2.780   0.112   2.842  11.399 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.752275   5.145257  -0.535 0.593985    
## pctUrb         0.121508   0.037929   3.204 0.001860 ** 
## simlang        7.690603   2.691687   2.857 0.005274 ** 
## pctYlow40      0.539948   0.137020   3.941 0.000157 ***
## sanTot         0.128905   0.033981   3.793 0.000264 ***
## watTot         0.157277   0.056965   2.761 0.006944 ** 
## CALORIE97      0.003503   0.001682   2.083 0.040036 *  
## prcpXtemp      1.753336   0.717317   2.444 0.016399 *  
## GCGDP         -0.349862   0.120966  -2.892 0.004763 ** 
## X2009.Overall  0.193388   0.069378   2.787 0.006441 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.875 on 93 degrees of freedom
##   (88 observations deleted due to missingness)
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.8344 
## F-statistic:  58.1 on 9 and 93 DF,  p-value: < 2.2e-16

Now that we have the final model, we can output our regression results table.

j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
j<-data.frame(j,vif=c(NA,vif(q))) # adding the VIF as the last column of the table
write.csv(j,file="regres.csv") # this writes the object j to a csv-format file called "regres.csv"
Regression results for final model
Estimate Std. Error t value Pr(>|t|) vif
(Intercept) -2.7523 5.1453 -0.5349 0.5940 NA
pctUrb 0.1215 0.0379 3.2036 0.0019 2.8667
simlang 7.6906 2.6917 2.8572 0.0053 1.4669
pctYlow40 0.5399 0.1370 3.9407 0.0002 1.5231
sanTot 0.1289 0.0340 3.7934 0.0003 4.3245
watTot 0.1573 0.0570 2.7609 0.0069 3.8197
CALORIE97 0.0035 0.0017 2.0826 0.0400 2.8525
prcpXtemp 1.7533 0.7173 2.4443 0.0164 1.8877
GCGDP -0.3499 0.1210 -2.8922 0.0048 1.6359
X2009.Overall 0.1934 0.0694 2.7875 0.0064 1.7612
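
The homework below asks for N and \(R^2\) in the table notes; both are easy to pull from the fitted model. A minimal sketch:

s<-summary(q)
c(N=length(s$residuals),R2=round(s$r.squared,3),adjR2=round(s$adj.r.squared,3)) # values for the table notes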

5 Influential variables

You have used a t-statistic to judge whether an independent variable is significantly related to the dependent variable. Here we look at a related question: of all the significant independent variables, which exerts the greatest influence on the dependent variable? There are three main ways to answer this question.

  • The first is the use of standardized coefficients (often called beta coefficients). The standardized coefficients show by how many standard deviations the dependent variable will change, for a one standard deviation change in the independent variable. This is equivalent to the estimated coefficient times the standard deviation of the independent variable divided by the standard deviation of the dependent variable.
  • The second is the calculation of elasticities. An elasticity is the percentage change in a dependent variable caused by a one percent change in the independent variable (e.g., the percentage change in quantity demanded caused by a one percent change in price). In a model in which all variables are converted to their natural logs (such as the Cobb-Douglas production function), the coefficient estimates can be directly interpreted as elasticities. In a linear regression, one multiplies the estimated coefficient by the mean of the independent variable divided by the mean of the dependent variable.
  • The third way to determine the relative influences of the independent variables is to decompose the model \(R^2\) into the portion attributable to each independent variable. The \(R^2\) is the percent of the variation in the dependent variable that can be explained by the model. If the independent variables are perfectly orthogonal (that is, not correlated with each other), then each independent variable will explain a unique portion of the variation in the dependent variable. But independent variables are almost never orthogonal: they share some variation, so two or more independent variables will jointly account for a portion of the variation in the dependent variable. Decomposition of \(R^2\) into the portion accounted for by each independent variable is thus not straightforward. R contains several algorithms to perform this decomposition.

5.1 Standardized coefficients

We rescale our coefficients so they can be interpreted as the change in the dependent variable (measured in units of the standard deviation of the dependent variable) for a one standard deviation increase in the independent variable. The formula is simple: multiply each estimated coefficient by the ratio of the standard deviation of the independent variable over the standard deviation of the dependent variable.

\[\hat\beta_{j}^\prime=\hat\beta_{j} \frac{sd(x_j)}{sd(y)} \]

When calculating the standard deviations, one should use the same data that were used in estimating the coefficients. R will automatically drop any observation that contains a missing value (this dropping of observations is called listwise deletion), so some of the observations in the original data set may not have been used in the regression. Fortunately, R stores, in the lm object, the exact set of data used to estimate the coefficients; this is q$model below.

#--standardized coefficients---
rr<-q$model # The data that were actually used in the regression
sdv<-apply(rr,2,sd) # calculate the standard deviation for each column
cf<-coef(q) # extract the coefficient estimates
stdcf<-cf[mhh]*sdv[mhh]/sdv[dpv] # use the names to make sure everything is matched up properly
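
As a cross-check, the same numbers can be obtained by refitting the model after standardizing every variable. A minimal sketch (the slopes from the standardized fit should match stdcf):

qz<-lm(fr,data=data.frame(scale(rr))) # refit with every column rescaled to mean 0, sd 1
round(cbind(stdcf,coef(qz)[mhh]),4) # the two columns should agree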

5.2 Elasticities

We rescale our coefficients so they can be interpreted as the percent change in the dependent variable for a one percent increase in the independent variable.

\[\xi_{x_j}^y=\frac{\frac{\Delta y}{y}}{\frac{\Delta x_j}{x_j}}=\frac{\Delta y}{\Delta x_j}\frac{x_j}{y} = \hat\beta_j \frac{\bar x_j}{\bar y} \]

The above formula is for a linear function. It is customary to take the values of \(x_j\) and \(y\) at their mean, but elasticity will actually change over the range of the data.

#--elasticities---
rr<-q$model
mnv<-apply(rr,2,mean)
cf<-coef(q)
elast<-cf[mhh]*mnv[mhh]/mnv[dpv]
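
Because the elasticity of a linear model changes over the range of the data, it can also be evaluated at some other point, such as the median. A minimal sketch:

mdv<-apply(rr,2,median) # medians of the variables actually used in the regression
elastMed<-cf[mhh]*mdv[mhh]/mdv[dpv] # elasticities evaluated at the median instead of the mean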

5.3 R2 decomposition

Since independent variables are almost always somewhat collinear, much of R2 cannot be assigned to a single variable. A popular algorithm to tease apart these overlapping effects is to enter the independent variables one at a time in the model, recording the change in R2 that results. The first variables to enter grab all of the overlapping effects, but by trying all possible orders of variable entry, and then taking the average across all of these orders, one can produce a reasonable measure of the contribution to the model by an independent variable.

#--partitioning R2--this is slow with more than 10 or so variables--
library(relaimpo)
r2p<-calc.relimp(q)$lmg
# --put all three measures in the same data.frame. Note that they do not completely agree--
j<-data.frame(elast,stdcf,r2p)
j$z<-rowMeans(scale(abs(j))) # quick and dirty way to get the aggregate influence
j<-data.frame(round(summary(q)$coefficients[rownames(j),c(1,4)],4),mnv=round(mnv[rownames(j)],4),j)
j<-j[order(j$z),]
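
A useful sanity check, under calc.relimp's default settings, is that the lmg shares add up to (approximately) the model \(R^2\):

sum(r2p) # total of the per-variable shares
summary(q)$r.squared # should be essentially the same number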

The table below shows the estimated coefficients and p-values; the mean values of the independent variables; our three influence measures; and our quick-and-dirty aggregate influence measure. Clearly, the estimated coefficients cannot be used as a direct measure of influence: coefficients take on large values when the mean variable value is small (e.g., simlang), and take on small values when the mean variable value is large (e.g., CALORIE97).

Most influential variables
Estimate Pr(>|t|) mnv elast stdcf r2p z
prcpXtemp 1.7533 0.0164 -0.1250 -0.0039 0.1353 0.0202 -1.3020
GCGDP -0.3499 0.0048 14.2990 -0.0899 -0.1491 0.0194 -0.8163
simlang 7.6906 0.0053 0.7701 0.1064 0.1394 0.0561 -0.5995
pctYlow40 0.5399 0.0002 16.4757 0.1598 0.1960 0.0424 -0.1001
CALORIE97 0.0035 0.0400 2,604.5728 0.1639 0.1417 0.1262 0.0531
X2009.Overall 0.1934 0.0064 60.2699 0.2094 0.1491 0.0990 0.1717
pctUrb 0.1215 0.0019 51.1748 0.1117 0.2186 0.1207 0.1991
watTot 0.1573 0.0069 83.8641 0.2370 0.2174 0.1654 1.0275
sanTot 0.1289 0.0003 66.9515 0.1551 0.3179 0.1996 1.3665

6 Influential observations

OLS coefficients are derived so as to minimize the sum of squared errors. Outlier observations can have a big effect, since errors are squared. It is often useful, particularly with small datasets, to see how the presence of a particular observation affects one's results. The seminal source for influence diagnostics is:

  • Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression diagnostics: Identifying influential data and sources of collinearity. Wiley-IEEE.

The plot below illustrates how an outlier (dark point at lower right) can affect the regression line. The green line was estimated with data that included the outlier; the red line used data that excluded it.

An outlier exerts undue influence on a regression slope

6.1 dffits

The measure dffits gives an overall measure of the influence of a particular observation: the ith observation is deleted, the parameters are re-estimated, and a new predicted value for the ith observation is calculated. The dffits measure is the scaled change in the predicted value when the observation is included. Values larger (in absolute value) than \(2\sqrt{\frac{p}{n}}\) are worth thinking about (\(n=\) number of observations; \(p=\) number of estimated parameters).

Below are only those observations that were especially influential. It can often help in model building to think about the forces that make these places different: perhaps we left an important variable out of the model?

#--dffits: just look at the most influential---
o<-as.matrix(dffits(q))
cutoff<-2*(q$rank/length(q$residuals))^.5 # cutoff=2*sqrt(p/n)
length(zii<-rownames(o)[which(abs(o)>cutoff)])
## [1] 10
j<-data.frame(cntry=xr[zii,"country"],dft=o[zii,],rr[zii,])
Most influential observations
cntry dft HLEbirth pctUrb simlang pctYlow40 sanTot watTot CALORIE97 prcpXtemp GCGDP X2009.Overall
BWA Botswana -1.5699 35.6932 59 0.8105 9 47 96 2,183 -0.46693740 26.6 69.7
SLE Sierra Leone -0.9241 28.5607 42 0.3338 17 11 53 2,035 1.64796571 11.7 47.8
SWZ Swaziland -0.8899 34.1755 25 0.9731 13 50 60 2,483 -0.02092433 21.9 59.1
ZWE Zimbabwe -0.7210 33.5699 37 0.7643 13 46 81 2,145 -0.11224243 18.0 22.7
LSO Lesotho -0.7118 31.3823 19 1.0000 6 36 78 2,243 -0.51255707 18.3 49.7
MWI Malawi -0.6843 34.9106 18 0.7806 18 60 76 2,043 0.33412548 14.4 53.7
MDG Madagascar 0.6255 48.5957 27 0.9585 13 12 47 2,021 0.81384149 7.2 62.2
NER Niger 0.7735 35.5151 17 0.4106 10 7 42 2,097 -0.53133366 13.3 53.8
CHN China 0.7780 64.1457 42 0.8947 13 65 88 2,897 -1.09896244 12.1 53.2
BOL Bolivia 0.8110 54.4196 65 0.4279 7 43 86 2,174 0.25099103 14.2 53.6
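
A quick way to see these influential observations in context is to plot the dffits values against the cutoff; a minimal sketch using base graphics:

dft<-dffits(q) # the same values held in o above, as a named vector
plot(dft,ylab="dffits") # one point per observation used in the regression
abline(h=c(-cutoff,cutoff),lty=2) # points beyond the dashed lines are the observations listed above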

6.2 dfbeta

The dfbeta measure is especially useful: it gives a scaled measure of how a particular regression coefficient has changed because of the inclusion of a particular observation (i.e., positive values mean that the coefficient is higher because of inclusion; negative values mean that it is lower). The recommended cutoff for especially influential observations is \(\frac{2}{\sqrt{n}}\).

#--dfbetas: just look at the most influential---
o<-data.frame(dfbetas(q))[,-1] #remove first column--for the intercept
mx<-NULL
nnx<-NROW(o)
for (i in names(o)){
  z<-which(abs(o[,i])>2/nnx^.5) #cutoff=2/sqrt(n)
  pz<-cbind(data.frame(dfbeta=round(o[z,i],4)),varb=i,ISO3=row.names(o)[z],xval=q$model[z,i],yval=q$model[z,dpv])
  mx<-rbind(mx,pz[order(pz[,1]),])
}
mx<-merge(xr[,c("ISO3","country")],mx,by="ISO3")
mx<-mx[order(mx$varb,mx$dfbeta),]

Running these diagnostics allows one to identify observations that are out of sync with the rest of the data. This could be because of data entry errors, or it could be because the model omits a variable that exerts an important influence on that particular observation. Examining these diagnostics might therefore suggest new independent variables, or it may result in dropping one or more observations.
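
One concrete way to act on these diagnostics is to refit the model without the single most influential observation (by dffits) and compare the coefficients. A minimal sketch:

dft<-dffits(q)
worst<-names(dft)[which.max(abs(dft))] # ISO3 code of the most influential observation overall
q2<-lm(fr,data=xr[rownames(xr)!=worst,]) # refit on every country except that one
round(cbind(with=coef(q),without=coef(q2)),4) # compare the two sets of coefficients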

7 Group homework assignment

Due in one week, two hours before class.

Download to your working directory the file worldData.xlsx. The workbook contains two worksheets: variables and data. The data consist of 92 variables, mostly having to do with economics and demographics. The observations are 191 countries.

Using these data, build a model explaining GDPpcPPP (GDP per capita, using purchasing power parity, 2009). Follow the same modeling procedure that you followed in the previous homework:

  • In a short (1/2 to 1 page) ex ante discussion, explain why each of your independent variables is likely to help explain GDP per capita, and include the expected sign of each estimated coefficient.
  • Present a table of descriptive statistics for the variables in your unrestricted model (include variable descriptions in this table).
  • Present a table of your unrestricted model estimated coefficients (include VIFs as a table column and include N, \(R^2\), and RESET p-value in the notes; one way to obtain a RESET p-value is sketched after this list).
  • Perform an F-test to determine whether you can drop any insignificant independent variables.
  • Present a table of your restricted model estimated coefficients (include VIFs as a table column and include N, \(R^2\), RESET p-value, and F-test p-value in the notes).
  • Present a table for your restricted model of the standardized coefficients, elasticities, and partitioned \(R^2\) for your independent variables (best to combine this with your table of restricted model results, if you have room).
  • Identify the most influential observation overall (dffits), and the most influential observation for one of your independent variables (dfbeta) and place these results in a table.
  • In a short (no more than 1 full page) ex post discussion tell us the meaning of what you found.
  • Please also turn in your R script.
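
This tutorial did not demonstrate the RESET test. One common way to obtain a RESET p-value, offered here only as a sketch (it assumes you are willing to use the lmtest package), is:

library(lmtest) # resettest() comes from the lmtest package
resettest(q) # Ramsey RESET test applied to the fitted lm object q above; do the same for your own models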

I expect two documents from each group: 1) an MS Word document containing your four tables and discussions; 2) your R script.