The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Please download worldData.xlsx to your working directory. The workbook contains two sheets: variables, which holds the variable descriptions, and data, which holds the data. Each observation is a country, and the variables are drawn from a variety of sources.
xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data"))
rownames(xr)<-xr$ISO3 # using as rowname the ISO 3-character code for each country
xr$prcpXtemp<-scale(xr$prcp*xr$temp) # values will be high when the climate is warm and wet; low when dry and cold
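To see what each variable measures, the variables sheet can be read in the same way; a minimal sketch (the object name vdesc is just an example):
vdesc<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="variables")) # the variable descriptions
head(vdesc) # glance at the first few descriptions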
We can extract some variables to build a model explaining the variation across countries in Healthy Life Expectancy at birth. This is the number of healthy years the average newborn is expected to live (the figure is from around 2012).
mhh<-c("pctUrb","simlang","pctYlow40","NetPrimEd","sanTot","watTot","CALORIE97","prcpXtemp","GCGDP","X2009.Overall","Christian")
dpv<-"HLEbirth"
xur<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(xur,xr))
##
## Call:
## lm(formula = xur, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6870 -3.0476 0.1886 3.0076 9.8266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.707440 5.548952 -0.848 0.39850
## pctUrb 0.132423 0.038673 3.424 0.00093 ***
## simlang 7.104192 2.851683 2.491 0.01456 *
## pctYlow40 0.478732 0.147753 3.240 0.00168 **
## NetPrimEd 0.067901 0.058044 1.170 0.24516
## sanTot 0.114436 0.037552 3.047 0.00303 **
## watTot 0.135329 0.059723 2.266 0.02585 *
## CALORIE97 0.003235 0.001699 1.904 0.06006 .
## prcpXtemp 1.704599 0.742838 2.295 0.02408 *
## GCGDP -0.317173 0.126362 -2.510 0.01386 *
## X2009.Overall 0.213336 0.071512 2.983 0.00367 **
## Christian -1.977590 1.563444 -1.265 0.20918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.878 on 90 degrees of freedom
## (89 observations deleted due to missingness)
## Multiple R-squared: 0.8523, Adjusted R-squared: 0.8343
## F-statistic: 47.22 on 11 and 90 DF, p-value: < 2.2e-16
The above is our unrestricted model. Next we produce the descriptive statistics table.
u<-data.frame(psych::describe(xr[,c(dpv,mhh)])) # we will output our descriptive statistics now
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file called "descrip.csv"
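The table below can also be rendered directly in an R Markdown report rather than going through a csv file; a minimal sketch, assuming the knitr package is installed:
knitr::kable(u,digits=4) # format the descriptive statistics as a table, rounded to 4 decimals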
| | n | mean | sd | min | max |
|---|---|---|---|---|---|
| HLEbirth | 189 | 57.5197 | 11.1386 | 28.5607 | 74.9935 |
| pctUrb | 190 | 55.2158 | 23.6784 | 11.0000 | 100.0000 |
| simlang | 191 | 0.7874 | 0.2160 | 0.2551 | 1.0000 |
| pctYlow40 | 125 | 16.8480 | 4.3125 | 6.0000 | 25.0000 |
| NetPrimEd | 183 | 85.0164 | 15.8001 | 22.0000 | 100.0000 |
| sanTot | 155 | 67.1935 | 30.1783 | 5.0000 | 100.0000 |
| watTot | 159 | 83.4591 | 18.3033 | 22.0000 | 100.0000 |
| CALORIE97 | 139 | 2,686.6475 | 513.3608 | 1,685.0000 | 3,699.0000 |
| prcpXtemp | 179 | 0.0000 | 1.0000 | -1.8454 | 2.8088 |
| GCGDP | 140 | 15.1893 | 5.6270 | 4.6000 | 32.0000 |
| X2009.Overall | 172 | 59.5581 | 10.5864 | 22.7000 | 87.1000 |
| Christian | 191 | 0.5753 | 0.3738 | 0.0000 | 0.9945 |
Next, we will identify the coefficients with a p-value above 0.10, and use an F-test to confirm that those variables may be dropped from the model.
library(car) # provides linearHypothesis() here and vif() later on
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
drpt<-intersect(rownames(j)[which(j[,4]>.1)],mhh) # get the names of the variables with p-values above 0.10
linearHypothesis(q,drpt) # perform the F-test
## Linear hypothesis test
##
## Hypothesis:
## NetPrimEd = 0
## Christian = 0
##
## Model 1: restricted model
## Model 2: HLEbirth ~ pctUrb + simlang + pctYlow40 + NetPrimEd + sanTot +
## watTot + CALORIE97 + prcpXtemp + GCGDP + X2009.Overall +
## Christian
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 2208.3
## 2 90 2141.3 2 67.078 1.4097 0.2496
The p-value of the F-test is above 0.05, so we cannot reject the null hypothesis that the true values of these coefficients are zero. We therefore drop the variables; the resulting restricted model is our final model.
mhh<-setdiff(mhh,drpt)
fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
summary(q<-lm(fr,xr))
##
## Call:
## lm(formula = fr, data = xr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.023 -2.780 0.112 2.842 11.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.752275 5.145257 -0.535 0.593985
## pctUrb 0.121508 0.037929 3.204 0.001860 **
## simlang 7.690603 2.691687 2.857 0.005274 **
## pctYlow40 0.539948 0.137020 3.941 0.000157 ***
## sanTot 0.128905 0.033981 3.793 0.000264 ***
## watTot 0.157277 0.056965 2.761 0.006944 **
## CALORIE97 0.003503 0.001682 2.083 0.040036 *
## prcpXtemp 1.753336 0.717317 2.444 0.016399 *
## GCGDP -0.349862 0.120966 -2.892 0.004763 **
## X2009.Overall 0.193388 0.069378 2.787 0.006441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.875 on 93 degrees of freedom
## (88 observations deleted due to missingness)
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8344
## F-statistic: 58.1 on 9 and 93 DF, p-value: < 2.2e-16
Now that we have the final model, we can output our regression results table.
j<-summary(q)$coefficients # get the table of regression coefficients, p-values, etc.
j<-data.frame(j,vif=c(NA,vif(q))) # adding the VIF as the last column of the table
write.csv(j,file="regres.csv") # this writes the object j to a csv-format file called "regres.csv"
| | Estimate | Std. Error | t value | Pr(>\|t\|) | vif |
|---|---|---|---|---|---|
| (Intercept) | -2.7523 | 5.1453 | -0.5349 | 0.5940 | NA |
| pctUrb | 0.1215 | 0.0379 | 3.2036 | 0.0019 | 2.8667 |
| simlang | 7.6906 | 2.6917 | 2.8572 | 0.0053 | 1.4669 |
| pctYlow40 | 0.5399 | 0.1370 | 3.9407 | 0.0002 | 1.5231 |
| sanTot | 0.1289 | 0.0340 | 3.7934 | 0.0003 | 4.3245 |
| watTot | 0.1573 | 0.0570 | 2.7609 | 0.0069 | 3.8197 |
| CALORIE97 | 0.0035 | 0.0017 | 2.0826 | 0.0400 | 2.8525 |
| prcpXtemp | 1.7533 | 0.7173 | 2.4443 | 0.0164 | 1.8877 |
| GCGDP | -0.3499 | 0.1210 | -2.8922 | 0.0048 | 1.6359 |
| X2009.Overall | 0.1934 | 0.0694 | 2.7875 | 0.0064 | 1.7612 |
You have used a t-statistic to judge whether an independent variable is significantly related to the dependent variable. Here we look at a related question: of all the significant independent variables, which exerts the greatest influence on the dependent variable? There are three main ways to answer this question.
The first approach uses standardized coefficients. We rescale our coefficients so they can be interpreted as the change in the dependent variable (measured in units of the standard deviation of the dependent variable) for a one standard deviation increase in the independent variable. The formula is simple: multiply each estimated coefficient by the ratio of the standard deviation of the independent variable to the standard deviation of the dependent variable.
\[\hat\beta_{j}^\prime=\hat\beta_{j} \frac{sd(x_j)}{sd(y)} \]
When calculating the standard deviations, one should use the same data that were used to estimate the coefficients. R automatically drops any observation that contains a missing value (this is called listwise deletion), so some of the observations in the original data set may not have been used in the regression. Fortunately, R stores the exact set of data used to estimate the coefficients in the lm object; this is q$model below.
#--standardized coefficients---
rr<-q$model # The data that were actually used in the regression
sdv<-apply(rr,2,sd) # calculate the standard deviation for each column
cf<-coef(q) # extract the coefficient estimates
stdcf<-cf[mhh]*sdv[mhh]/sdv[dpv] # use the names to make sure everything is matched up properly
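A quick way to inspect the result is to print the standardized coefficients sorted by absolute size (a small usage sketch):
round(stdcf[order(abs(stdcf),decreasing=TRUE)],4) # largest standardized effects first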
The second approach uses elasticities. We rescale our coefficients so they can be interpreted as the percent change in the dependent variable for a one percent increase in the independent variable.
\[\xi_{x_j}^y=\frac{\frac{\Delta y}{y}}{\frac{\Delta x_j}{x_j}}=\frac{\Delta y}{\Delta x_j}\frac{x_j}{y} = \hat\beta_j \frac{\bar x_j}{\bar y} \]
The rightmost equality holds for a linear function. It is customary to evaluate \(x_j\) and \(y\) at their means, but the elasticity of a linear function actually changes over the range of the data.
#--elasticities---
rr<-q$model
mnv<-apply(rr,2,mean)
cf<-coef(q)
elast<-cf[mhh]*mnv[mhh]/mnv[dpv]
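Because the elasticity of a linear model changes over the range of the data, it can also be checked at other evaluation points; a small sketch using the medians instead of the means (the choice of median is arbitrary):
mdv<-apply(rr,2,median) # medians of the variables actually used in the regression
elastMed<-cf[mhh]*mdv[mhh]/mdv[dpv] # the same formula, evaluated at the medians
round(cbind(at.mean=elast,at.median=elastMed),4) # compare the two evaluation points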
The third approach partitions \(R^2\). Since independent variables are almost always somewhat collinear, much of \(R^2\) cannot be assigned to a single variable. A popular algorithm for teasing apart these overlapping effects enters the independent variables into the model one at a time, recording the change in \(R^2\) at each step. The first variables to enter grab all of the overlapping effects, but by trying all possible orders of entry and averaging across those orders, one obtains a reasonable measure of each independent variable's contribution to the model (this is the lmg measure computed below).
#--partitioning R2--this is slow with more than 10 or so variables--
library(relaimpo)
r2p<-calc.relimp(q)$lmg
# --put all three measures in the same data.frame. Note that they do not completely agree--
j<-data.frame(elast,stdcf,r2p)
j$z<-rowMeans(scale(abs(j))) # quick and dirty way to get the aggregate influence
j<-data.frame(round(summary(q)$coefficients[rownames(j),c(1,4)],4),mnv=round(mnv[rownames(j)],4),j)
j<-j[order(j$z),]
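If this table is wanted for the written report, it can be exported in the same way as the earlier tables; a minimal sketch (the file name is just an example):
write.csv(round(j,4),file="influence.csv") # write the influence-measure table to a csv-format file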
The table below shows the estimated coefficients and p-values, the mean values of the independent variables, our three influence measures, and our quick-and-dirty aggregate influence measure. Clearly, the estimated coefficients cannot be used as a direct measure of influence: coefficients take on large values when the mean of the variable is small (e.g., simlang) and small values when the mean is large (e.g., CALORIE97).
| | Estimate | Pr(>\|t\|) | mnv | elast | stdcf | r2p | z |
|---|---|---|---|---|---|---|---|
| prcpXtemp | 1.7533 | 0.0164 | -0.1250 | -0.0039 | 0.1353 | 0.0202 | -1.3020 |
| GCGDP | -0.3499 | 0.0048 | 14.2990 | -0.0899 | -0.1491 | 0.0194 | -0.8163 |
| simlang | 7.6906 | 0.0053 | 0.7701 | 0.1064 | 0.1394 | 0.0561 | -0.5995 |
| pctYlow40 | 0.5399 | 0.0002 | 16.4757 | 0.1598 | 0.1960 | 0.0424 | -0.1001 |
| CALORIE97 | 0.0035 | 0.0400 | 2,604.5728 | 0.1639 | 0.1417 | 0.1262 | 0.0531 |
| X2009.Overall | 0.1934 | 0.0064 | 60.2699 | 0.2094 | 0.1491 | 0.0990 | 0.1717 |
| pctUrb | 0.1215 | 0.0019 | 51.1748 | 0.1117 | 0.2186 | 0.1207 | 0.1991 |
| watTot | 0.1573 | 0.0069 | 83.8641 | 0.2370 | 0.2174 | 0.1654 | 1.0275 |
| sanTot | 0.1289 | 0.0003 | 66.9515 | 0.1551 | 0.3179 | 0.1996 | 1.3665 |
OLS coefficients are derived so as to minimize the sum of squared errors. Because the errors are squared, outlier observations can have a large effect. It is often useful, particularly with small data sets, to see how the presence of a particular observation affects one's results. The seminal source for influence diagnostics is:
The plot below illustrates how an outlier (dark point at lower right) can affect the regression line: the green line was estimated with data that included the outlier; the red line, with data that excluded it.
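A similar demonstration can be generated with a few lines of simulated data; a minimal sketch (all of the numbers are arbitrary):
set.seed(1) # arbitrary seed, for reproducibility
x<-runif(30,0,10) # simulated independent variable
y<-2+0.8*x+rnorm(30) # well-behaved dependent variable
xo<-c(x,9.5) # add a single outlying observation at the lower right...
yo<-c(y,1) # ...with a response far below the trend
plot(xo,yo,pch=19) # scatter of all points, outlier included
abline(lm(yo~xo),col="green") # green line: estimated with the outlier included
abline(lm(y~x),col="red") # red line: estimated with the outlier excluded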
The dffits measure gives an overall sense of the influence of a particular observation: the ith observation is deleted, the parameters are re-estimated, and a new predicted value for the ith observation is calculated. dffits is the scaled change in the predicted value when the observation is included. Values larger (in absolute value) than \(2\sqrt{\frac{p}{n}}\) are worth thinking about (\(n=\) number of observations; \(p=\) number of estimated parameters).
Below are only those observations that were especially influential. It can often help in model building to think about the forces that make these places different: perhaps we left an important variable out of the model?
#--dffits: just look at the most influential---
o<-as.matrix(dffits(q))
cutoff<-2*(q$rank/length(q$residuals))^.5 # cutoff=2*sqrt(p/n)
length(zii<-rownames(o)[which(abs(o)>cutoff)])
## [1] 10
j<-data.frame(cntry=xr[zii,"country"],dft=o[zii,],rr[zii,])
| ISO3 | cntry | dft | HLEbirth | pctUrb | simlang | pctYlow40 | sanTot | watTot | CALORIE97 | prcpXtemp | GCGDP | X2009.Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BWA | Botswana | -1.5699 | 35.6932 | 59 | 0.8105 | 9 | 47 | 96 | 2,183 | -0.46693740 | 26.6 | 69.7 |
| SLE | Sierra Leone | -0.9241 | 28.5607 | 42 | 0.3338 | 17 | 11 | 53 | 2,035 | 1.64796571 | 11.7 | 47.8 |
| SWZ | Swaziland | -0.8899 | 34.1755 | 25 | 0.9731 | 13 | 50 | 60 | 2,483 | -0.02092433 | 21.9 | 59.1 |
| ZWE | Zimbabwe | -0.7210 | 33.5699 | 37 | 0.7643 | 13 | 46 | 81 | 2,145 | -0.11224243 | 18.0 | 22.7 |
| LSO | Lesotho | -0.7118 | 31.3823 | 19 | 1.0000 | 6 | 36 | 78 | 2,243 | -0.51255707 | 18.3 | 49.7 |
| MWI | Malawi | -0.6843 | 34.9106 | 18 | 0.7806 | 18 | 60 | 76 | 2,043 | 0.33412548 | 14.4 | 53.7 |
| MDG | Madagascar | 0.6255 | 48.5957 | 27 | 0.9585 | 13 | 12 | 47 | 2,021 | 0.81384149 | 7.2 | 62.2 |
| NER | Niger | 0.7735 | 35.5151 | 17 | 0.4106 | 10 | 7 | 42 | 2,097 | -0.53133366 | 13.3 | 53.8 |
| CHN | China | 0.7780 | 64.1457 | 42 | 0.8947 | 13 | 65 | 88 | 2,897 | -1.09896244 | 12.1 | 53.2 |
| BOL | Bolivia | 0.8110 | 54.4196 | 65 | 0.4279 | 7 | 43 | 86 | 2,174 | 0.25099103 | 14.2 | 53.6 |
The dfbeta measure is especially useful: it gives a scaled measure of how a particular regression coefficient has changed because of the inclusion of a particular observation (positive values mean that the coefficient is higher because of inclusion; negative values mean that it is lower). The recommended cutoff for especially influential observations is \(\frac{2}{\sqrt{n}}\).
#--dfbetas: just look at the most influential---
o<-data.frame(dfbetas(q))[,-1] #remove first column--for the intercept
mx<-NULL
nnx<-NROW(o)
for (i in names(o)){
z<-which(abs(o[,i])>2/nnx^.5) #cutoff=2/sqrt(n)
pz<-cbind(data.frame(dfbeta=round(o[z,i],4)),varb=i,ISO3=row.names(o)[z],xval=q$model[z,i],yval=q$model[z,dpv])
mx<-rbind(mx,pz[order(pz[,1]),])
}
mx<-merge(xr[,c("ISO3","country")],mx,by="ISO3")
mx<-mx[order(mx$varb,mx$dfbeta),]
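The resulting table can then be examined directly; for example (usage sketch):
head(mx,10) # first ten rows: grouped by coefficient, most negative changes first within each group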
Running these diagnostics allows one to identify observations that are out of sync with the rest of the data. This could be because of data entry errors, or it could be because the model omits a variable that exerts an important influence on that particular observation. Examining these diagnostics might therefore suggest new independent variables, or it may result in dropping one or more observations.
Homework: due in one week, two hours before class.
Download to your working directory the file worldData.xlsx. The workbook contains two worksheets: variables and data. The data consist of 92 variables, mostly having to do with economics and demographics. The observations are 191 countries.
Using these data, build a model explaining GDPpcPPP (GDP per capita at purchasing power parity, 2009). Follow the same modeling procedure that you followed in the previous homework.
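A minimal starting sketch, assuming the same workflow as above (the candidate variable list is yours to choose, so the lines that depend on it are left as comments; the names shown are placeholders):
xr<-data.frame(readxl::read_xlsx(path="worldData.xlsx",sheet="data")) # read the data sheet
rownames(xr)<-xr$ISO3 # ISO 3-character codes as row names
dpv<-"GDPpcPPP" # the new dependent variable
# mhh<-c("var1","var2") # your candidate independent variables go here
# fr<-formula(paste(dpv,paste(mhh,collapse="+"),sep="~"))
# summary(q<-lm(fr,xr))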
I expect two documents from each group: 1) an MS Word document containing your four tables and discussions; 2) your R script.