Compiled on 2023-01-23 by E. Anthon Eff
You might find it easier to read and navigate this html file if you download it to your own computer.
We will use international data for most of the topics this semester. I recommend that you use these data for your term paper.
The HDI is produced by the United Nations Development Programme. See information about the HDI here and here. The following data come from the 2020 Human Development Report.
The WDI are produced by the World Bank. See information about the WDI here.
Open RStudio. The best way to arrange the panes is to have Source on the left and Console on the right. On the top menu, click Tools > Global Options > Pane Layout to arrange the panes.
The Source pane is where you will write your R scripts. The Console pane
is where output appears when you run your script.
On the first line of your script, set your working directory to a location where you want to read and write your data. For example:
setwd("C:/Users/EAEFF/Dropbox (MTSU)/class/4620/")
Note that R paths use forward slashes (/), not the backslashes (\) that Microsoft Windows usually uses.
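If you want to try this on your own machine, here is a minimal sketch (the path below is hypothetical; substitute a folder that actually exists on your computer):
setwd("C:/Users/yourname/econ4620") # forward slashes work on Windows
# setwd("C:\\Users\\yourname\\econ4620") # if you use backslashes, they must be doubled
getwd() # confirms the current working directory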
R is an open-source project. Volunteers have created over 13,000
packages to perform specific tasks in a variety of fields, ranging from
econometrics to textual analysis. A nice overview of the highest quality
packages by field can be found in the CRAN Task Views.
To use a specific package, one must first install it. A package
that we will use quite often this semester is AER.
install.packages("AER")
You will only need to install a package once. So, after installing a package, remove the install.packages command, or comment it out by placing a hashtag (#) in front of it:
# install.packages("AER")
Once a package is installed, you can load that package like this:
library(AER)
Install the following packages now: AER, psych, and readxl. After installing, load them.
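For example, you can install all three in one command and then load each of them (a sketch; once the install has succeeded, comment out the install.packages line):
install.packages(c("AER","psych","readxl")) # run once, then comment out
library(AER)
library(psych)
library(readxl)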
Data analysis essentially consists of bringing data into R, doing something to that data, then writing results to files. R can read and write many kinds of data. In the following example we will load two Excel files: HDI_educationAchievements.xlsx and WDI_miscIndicators01.xlsx.
Each of these workbooks has two worksheets. The first sheet, called data, consists of a data table where each row (also called an observation, or record) is a country or territory, and each column (also called a variable, or field) gives values for various socio-economic measures. The second sheet, called variable, provides a short description of each data column.
Click on the file names above and save the two Excel files in your working directory. You should open these files in Excel and take a quick look at their contents.
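Once the files are saved, you can also confirm the worksheet names from R with readxl::excel_sheets(); a quick sketch, assuming the files are already in your working directory:
readxl::excel_sheets("WDI_miscIndicators01.xlsx") # should list the sheets "data" and "variable"
readxl::excel_sheets("HDI_educationAchievements.xlsx")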
Once the file is in your working directory, you can read it into R using the read_excel function from the readxl package. The following command reads the worksheet data from the workbook WDI_miscIndicators01.xlsx and puts it into an object called dd in the R global environment.
dd<-readxl::read_excel(path="WDI_miscIndicators01.xlsx",sheet="data")
To see the help file for a function like readxl::read_excel, simply highlight the function name with your cursor in RStudio and press the F1 key.
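You can also open the same help page from the console; a small sketch:
?readxl::read_excel # opens the help page for read_excel
help("read_excel", package="readxl") # an equivalent command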
The object dd is of a class called tibble, which we will not use in this course, so we will convert dd into the class data.frame, which is the most common way in which data is used in R.
dd<-data.frame(dd)
One should always develop an understanding of one's data before using them. The data.frame dd is 238 x 56 (238 observations, 56 variables), small enough that we can view it in a spreadsheet. But economists usually use much larger datasets (I have used several that had millions of records), so commands that provide summary information about a dataset are very useful.
dd<-readxl::read_excel(path="WDI_miscIndicators01.xlsx",sheet="data")
dd<-data.frame(dd)
dim(dd) #gives the dimensions of the data.frame
## [1] 238 56
class(dd) #gives the object dd's class
## [1] "data.frame"
head(dd,3) #shows the top 3 rows of the data.frame
## country iso3 iso2c year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER
## 1 Afghanistan AFG AF 2018 -1.4968340 10 4.326923 0.00000
## 2 Albania ALB AL 2018 -0.5219214 11 35.096150 24.51923
## 3 Algeria DZA DZ 2018 -0.6301624 10 28.365390 17.30769
## CC.PER.RNK.UPPER CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK GE.PER.RNK.LOWER
## 1 9.615385 0.1573170 -1.4572850 8 7.692307 1.442308
## 2 44.230770 0.1360851 0.1147880 7 57.692310 42.788460
## 3 41.346150 0.1520321 -0.4439246 10 37.019230 20.192310
## GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE HD.HCI.AMRT.MA
## 1 12.50000 0.2380511 0.7803819 0.8037745 0.7587757
## 2 70.19231 0.2275700 0.9277925 0.9501160 0.9055499
## 3 48.07692 0.1942946 0.9069255 0.9181952 0.8959428
## HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS HD.HCI.HLOS.FE
## 1 8.72034 6.795722 9.544137 354.7588 350.1461
## 2 12.88551 13.201755 12.610832 429.3290 445.0104
## 3 11.77532 12.108990 11.442765 374.0891 383.3708
## HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE HD.HCI.MORT.MA
## 1 357.7947 4.949788 0.9352661 0.9389164 0.9318808
## 2 413.5843 8.851396 0.9909778 0.9916161 0.9903747
## 3 365.7584 7.048028 0.9759771 0.9775091 0.9745197
## HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST PV.NO.SRC PV.PER.RNK
## 1 0.3934892 0.3810325 0.4053017 -2.7516110 6 0.4761905
## 2 0.6286663 0.6159194 0.6409156 0.3783375 8 59.0476200
## 3 0.5319939 0.5249978 0.5387299 -0.8265918 7 18.0952400
## PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR RL.EST RL.NO.SRC RL.PER.RNK
## 1 0.00000 2.380952 0.2267807 -1.6680210 11 4.326923
## 2 47.61905 70.000000 0.2149327 -0.3923339 11 38.942310
## 3 11.42857 29.047620 0.2199717 -0.7752355 12 22.115390
## RL.PER.RNK.LOWER RL.PER.RNK.UPPER RL.STD.ERR RQ.EST RQ.NO.SRC RQ.PER.RNK
## 1 1.442308 6.730769 0.1586558 -1.1301910 9 10.576920
## 2 28.846150 48.076920 0.1456045 0.2684412 8 63.461540
## 3 16.346150 32.692310 0.1553647 -1.2625400 8 8.173077
## RQ.PER.RNK.LOWER RQ.PER.RNK.UPPER RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK
## 1 6.250000 18.75000 0.1931015 -0.9929534 10 20.68966
## 2 52.403850 73.07692 0.2055196 0.2076107 12 53.20197
## 3 4.326923 14.42308 0.1951701 -0.9816775 12 21.67488
## VA.PER.RNK.LOWER VA.PER.RNK.UPPER VA.STD.ERR
## 1 13.7931 25.61576 0.1347938
## 2 45.3202 60.59113 0.1280210
## 3 13.7931 25.61576 0.1315684
tail(dd,2) #shows the bottom 2 rows of the data.frame
## country iso3 iso2c year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER
## 237 Zambia ZMB ZM 2018 -0.6562852 14 27.88461 18.269230
## 238 Zimbabwe ZWE ZW 2018 -1.2347430 15 10.09615 5.288462
## CC.PER.RNK.UPPER CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK
## 237 39.42308 0.1304875 -0.5592093 11 33.17308
## 238 14.42308 0.1272638 -1.1988600 12 10.57692
## GE.PER.RNK.LOWER GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE
## 237 17.788460 43.75000 0.1874648 0.7248627 0.7747966
## 238 6.730769 16.34615 0.1805988 0.6420019 0.6577585
## HD.HCI.AMRT.MA HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS
## 237 0.6749445 8.764444 8.745135 8.787398 358.1404
## 238 0.6228139 11.054537 10.974918 11.136672 396.1388
## HD.HCI.HLOS.FE HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE
## 237 NA NA 5.022243 0.9406496 0.9455552
## 238 NA NA 7.006610 0.9506910 0.9554852
## HD.HCI.MORT.MA HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST
## 237 0.9359885 0.3911756 0.3769210 0.4038159 0.0938291
## 238 0.9462116 0.4612416 0.4397229 0.4822915 -0.7134061
## PV.NO.SRC PV.PER.RNK PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR
## 237 7 50.95238 37.14286 61.42857 0.2192343
## 238 8 20.47619 12.85714 33.33333 0.2149327
## RL.EST RL.NO.SRC RL.PER.RNK RL.PER.RNK.LOWER RL.PER.RNK.UPPER
## 237 -0.3450496 14 41.346150 29.807690 50.48077
## 238 -1.2733430 15 8.173077 5.288462 14.42308
## RL.STD.ERR RQ.EST RQ.NO.SRC RQ.PER.RNK RQ.PER.RNK.LOWER
## 237 0.1434674 -0.4842296 10 34.134620 22.115390
## 238 0.1387014 -1.5035220 11 5.288462 2.884615
## RQ.PER.RNK.UPPER RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK
## 237 45.192310 0.1663302 -0.3190484 14 35.96059
## 238 8.653846 0.1644797 -1.1232210 15 17.24138
## VA.PER.RNK.LOWER VA.PER.RNK.UPPER VA.STD.ERR
## 237 31.03448 40.39409 0.1272391
## 238 12.31527 22.16749 0.1242625
dd[c(12,7,19),3:6] #show rows 12,7,19 for columns 3,4,5,6.
## iso2c year CC.EST CC.NO.SRC
## 12 AU 2018 1.8064600 11
## 7 AI 2018 1.2387970 1
## 19 BY 2018 -0.1922859 11
We can see that three variables (country, iso3, and iso2c) are character variables, and the others are numeric. We will use the country name as the rowname, and then remove all character variables from the data.frame, so that all remaining variables are numeric.
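A quick way to check which columns are character and which are numeric is sapply(); a small sketch:
sapply(dd, class) # returns the class of every column
table(sapply(dd, class)) # counts columns by class (here: 3 character, 53 numeric)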
rownames(dd)<-dd$country #assigns country names as the rownames
dd<-dd[,-(1:3)] #removes the first three columns in dd (the non-numeric columns)
dd[c("Denmark","Turkey"),] # shows the rows for Denmark and Turkey
## year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER CC.PER.RNK.UPPER
## Denmark 2018 2.1481000 9 98.55769 94.71154 100.00000
## Turkey 2018 -0.3354134 12 43.75000 33.65385 52.40385
## CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK GE.PER.RNK.LOWER
## Denmark 0.1451558 1.8716310 7 97.11539 91.82692
## Turkey 0.1323154 0.0060968 9 53.84615 39.42308
## GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE HD.HCI.AMRT.MA
## Denmark 100.00000 0.2221446 0.9296045 0.9469248 0.9127313
## Turkey 66.34615 0.2085700 0.9078032 0.9365753 0.8785881
## HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS HD.HCI.HLOS.FE
## Denmark 13.39460 13.47162 13.32221 530.7278 532.2063
## Turkey 12.06618 11.99553 12.13585 459.1798 463.1666
## HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE HD.HCI.MORT.MA
## Denmark 529.1732 11.374220 0.9957407 0.9961299 0.9953738
## Turkey 455.3064 8.864876 0.9887604 0.9894558 0.9881020
## HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST PV.NO.SRC
## Denmark 0.7708318 0.7620665 0.7794603 0.9532585 9
## Turkey 0.6253099 0.6135694 0.6366453 -1.3086830 8
## PV.PER.RNK PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR RL.EST
## Denmark 82.38095 66.190480 94.28571 0.2107918 1.8323330
## Turkey 10.47619 6.666667 14.28571 0.2142458 -0.3194206
## RL.NO.SRC RL.PER.RNK RL.PER.RNK.LOWER RL.PER.RNK.UPPER RL.STD.ERR
## Denmark 10 96.63461 90.38461 100.00000 0.1569019
## Turkey 12 42.30769 30.76923 51.92308 0.1502382
## RQ.EST RQ.NO.SRC RQ.PER.RNK RQ.PER.RNK.LOWER RQ.PER.RNK.UPPER
## Denmark 1.6370360 8 93.75000 86.53846 98.07692
## Turkey 0.0175277 10 54.80769 42.78846 63.94231
## RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK VA.PER.RNK.LOWER
## Denmark 0.2215324 1.6118670 10 97.53695 92.61084
## Turkey 0.1766253 -0.8313113 12 25.12315 19.21182
## VA.PER.RNK.UPPER VA.STD.ERR
## Denmark 100.00000 0.1455202
## Turkey 30.04926 0.1298985
colMeans(dd[,c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA")]) #gives the mean of each of these three columns (using colnames)
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## NA NA NA
colMeans(dd[,c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA")],na.rm=TRUE) #if there are missing values, they must be removed with na.rm=TRUE
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 428.1475 434.4287 425.8587
apply(dd[,20:22],2,mean,na.rm=TRUE) #another way to get the mean of each of these three columns (using column numbers)
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 428.1475 434.4287 425.8587
rxx<-c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA") # can store colnames in object
apply(dd[,rxx],2,sd,na.rm=TRUE) #get the standard deviation of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 69.28723 70.83195 71.66607
apply(dd[,rxx],2,range,na.rm=TRUE) #get the range of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## [1,] 304.9222 302.0893 307.0905
## [2,] 580.8655 584.6268 577.2593
apply(dd[,rxx],2,quantile,na.rm=TRUE) #get the quartiles of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 0% 304.9222 302.0893 307.0905
## 25% 373.4452 382.8199 369.6537
## 50% 416.2702 426.1908 415.9109
## 75% 495.3233 503.6029 499.7190
## 100% 580.8655 584.6268 577.2593
When publishing an empirical paper in economics, one typically presents a table of descriptive statistics, showing the mean, standard deviation, minimum, maximum, and number of non-missing observations for each of the variables used in one's analysis. Below, we calculate this table and write the results to a file that can readily be used in MS Office.
u<-data.frame(psych::describe(dd)) # using the describe function from the psych package
colnames(u) # the function produces more information than we want to use
## [1] "vars" "n" "mean" "sd" "median" "trimmed"
## [7] "mad" "min" "max" "range" "skew" "kurtosis"
## [13] "se"
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
u # simply putting the name of the object causes it to print on the console
## n mean sd min max
## year 238 2.018000e+03 0.00000000 2018.0000000 2018.0000000
## CC.EST 205 -4.849561e-05 0.99746575 -1.8003120 2.2127440
## CC.NO.SRC 205 9.639024e+00 4.05675895 1.0000000 16.0000000
## CC.PER.RNK 205 5.002814e+01 28.99778868 0.0000000 100.0000000
## CC.PER.RNK.LOWER 205 4.156660e+01 29.06409059 0.0000000 95.1923100
## CC.PER.RNK.UPPER 205 5.760553e+01 28.03314615 2.8846150 100.0000000
## CC.STD.ERR 205 1.743017e-01 0.08379772 0.1191657 0.8767205
## GE.EST 205 -4.745274e-03 0.99526451 -2.4494090 2.2314740
## GE.NO.SRC 205 7.468293e+00 3.10205806 1.0000000 13.0000000
## GE.PER.RNK 205 4.985694e+01 28.95864127 0.0000000 100.0000000
## GE.PER.RNK.LOWER 205 3.837946e+01 28.61175022 0.0000000 97.1153900
## GE.PER.RNK.UPPER 205 6.045028e+01 28.03374140 0.9615384 100.0000000
## GE.STD.ERR 205 2.427437e-01 0.09085091 0.1731845 1.0484390
## HD.HCI.AMRT 173 8.443381e-01 0.08737014 0.5116570 0.9601836
## HD.HCI.AMRT.FE 173 8.761004e-01 0.08347202 0.5751923 0.9751105
## HD.HCI.AMRT.MA 173 8.129396e-01 0.09491627 0.4484606 0.9578469
## HD.HCI.EYRS 173 1.126578e+01 2.33297253 4.1569886 13.9364252
## HD.HCI.EYRS.FE 167 1.132775e+01 2.43761034 3.8481028 13.9298334
## HD.HCI.EYRS.MA 167 1.126428e+01 2.24040536 4.1481204 13.9701385
## HD.HCI.HLOS 166 4.281475e+02 69.28722577 304.9222412 580.8654785
## HD.HCI.HLOS.FE 152 4.344287e+02 70.83194719 302.0892944 584.6267700
## HD.HCI.HLOS.MA 152 4.258587e+02 71.66606804 307.0905151 577.2592773
## HD.HCI.LAYS 166 7.904075e+00 2.59345365 2.2065024 12.9378729
## HD.HCI.MORT 173 9.713640e-01 0.02892175 0.8773022 0.9981502
## HD.HCI.MORT.FE 173 9.737100e-01 0.02701427 0.8844790 0.9983334
## HD.HCI.MORT.MA 173 9.691398e-01 0.03075042 0.8705665 0.9979782
## HD.HCI.OVRL 166 5.654461e-01 0.14697786 0.2990290 0.8870836
## HD.HCI.OVRL.LB 166 5.516884e-01 0.15067062 0.2696179 0.8775859
## HD.HCI.OVRL.UB 166 5.777465e-01 0.14446884 0.3156515 0.8966209
## PV.EST 207 -1.276698e-02 1.00187293 -2.9933110 1.9354980
## PV.NO.SRC 207 6.526570e+00 2.29351541 1.0000000 9.0000000
## PV.PER.RNK 207 4.963653e+01 29.08487260 0.0000000 100.0000000
## PV.PER.RNK.LOWER 207 3.777778e+01 25.30539109 0.0000000 94.2857100
## PV.PER.RNK.UPPER 207 6.048769e+01 30.40167453 0.9523810 100.0000000
## PV.STD.ERR 207 2.340465e-01 0.05200456 0.2107918 0.5993918
## RL.EST 205 -2.556260e-03 0.99162415 -2.3320440 2.0451420
## RL.NO.SRC 205 9.960976e+00 3.90681143 1.0000000 16.0000000
## RL.PER.RNK 205 4.996951e+01 28.88854697 0.0000000 100.0000000
## RL.PER.RNK.LOWER 205 4.102486e+01 28.32456119 0.0000000 95.1923100
## RL.PER.RNK.UPPER 205 5.788227e+01 28.61483793 0.4807692 100.0000000
## RL.STD.ERR 205 1.834239e-01 0.08503842 0.1338077 0.7548596
## RQ.EST 205 1.985990e-03 0.99009942 -2.3639190 2.2269640
## RQ.NO.SRC 205 7.702439e+00 2.78374904 1.0000000 12.0000000
## RQ.PER.RNK 205 4.994606e+01 28.98178616 0.0000000 100.0000000
## RQ.PER.RNK.LOWER 205 3.909709e+01 28.65894884 0.0000000 97.5961500
## RQ.PER.RNK.UPPER 205 5.996482e+01 28.04701335 1.9230770 100.0000000
## RQ.STD.ERR 205 2.262221e-01 0.09385629 0.1600714 1.0725610
## VA.EST 200 -3.714988e-03 0.99002901 -2.1749270 1.7335840
## VA.NO.SRC 200 9.930000e+00 3.99309706 1.0000000 16.0000000
## VA.PER.RNK 200 4.984729e+01 28.96615364 0.4926108 100.0000000
## VA.PER.RNK.LOWER 200 4.236207e+01 26.96678251 0.0000000 95.5665100
## VA.PER.RNK.UPPER 200 5.660345e+01 30.18555150 1.9704430 100.0000000
## VA.STD.ERR 200 1.525038e-01 0.04322136 0.1241545 0.2962780
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file in your working directory called "descrip.csv" (can be opened in excel)
getwd() # shows where your working directory is located, so that you can find your file
## [1] "C:/Users/EAEFF/Dropbox/4620/rmd"
Economists usually employ some variant of regression analysis. They do two different kinds of tasks with regression: estimate values (also called prediction), and test hypotheses. The former is more common in the applied work done for business and government; the latter is more important in academia and will be the primary focus of this class.
Hypotheses are drawn from theory, or perhaps simply from common-sense views of how the world works. For example, suppose that most people believe that a country attains high test scores on international exams such as the PISA either by spending a lot on education, or by rationing secondary education to only the smartest students (only secondary education students take the test). We can use regression analysis to see whether those beliefs are consistent with reality.
Our hypothesis (really two hypotheses) is that the test score is a function of educational expenditures and the proportion of the population that enters secondary education: \(score = f(EducExp,PctSecondary)\). We express that hypothesis mathematically as a simple linear function:
\[score_i = \alpha_0 +\alpha_1 EducExp_i +\alpha_2 PctSecondary_i +\varepsilon_i~~~~~~~\forall~i \in countries \tag {1}\]
In this equation, score, EducExp, and PctSecondary are variables, and the subscript i indicates that the variables take on the value found in a specific country i. The left hand side variable is called the dependent variable (or response variable); the variables on the right hand side are called the independent variables (or predictor variables or explanatory variables). The direction of causation is from the independent to the dependent variable: the independent variable is cause, the dependent is effect.
The scalars \(\alpha\) are coefficients (also called parameters), where \(\alpha_0\) is the intercept and \(\alpha_1\) and \(\alpha_2\) are slope coefficients. For any country i, given the values of \(EducExp_i\) and \(PctSecondary_i\), the equation will return the score, but with error, since the world is infinitely complicated and nothing can really be reduced to an equation with two determinants. \(\varepsilon_i\) represents that error (we call \(\varepsilon_i\) the error term or the disturbance); it will always have a mean of zero and have both positive and negative values.
Note that \(\alpha_{1}\) equals the partial derivative of score with respect to EducExp (\({\partial score}/{\partial EducExp}=\alpha_{1}\)). Thus, for every one-unit increase in educational expenditures, score will change by \(\alpha_{1}\) points. With a linear model, the marginal effect (the effect on the dependent variable of a unit increase in the independent variable) is easily understood: it is simply the estimated coefficient.
If our hypothesis is correct, \(\alpha_1>0\) (an increase in educational expenditures leads to an increase in scores) and \(\alpha_2<0\) (an increase in the proportion of persons who enter secondary education leads to a decrease in scores).
Once we are clear on a hypothesis, we need to find data that can serve as variables in Equation [1] above. We have relevant variables in our excel files, as described in the table below.
variable | long name | description | excel file |
---|---|---|---|
ger_secon | Gross enrolment ratio: Secondary | Gross enrolment ratio: Total enrolment in a given level of education (pre-primary, primary, secondary or tertiary), regardless of age, expressed as a percentage of the official school-age population for the same level of education. | HDI_educationAchievements.xlsx |
govtExpEduc | Government expenditure on education | Government expenditure on education: Current, capital and transfer spending on education, expressed as a percentage of GDP. | HDI_educationAchievements.xlsx |
HD.HCI.HLOS | Harmonized Test Scores | Harmonized Test Scores from major international student achievement testing programs. They are measured in TIMSS-equivalent units, where 300 is minimal attainment and 625 is advanced attainment. Test scores from the following testing programs are included: TIMSS (Trends in International Mathematics and Science Study); PIRLS (Progress in International Reading Literacy Study); PISA (Programme for International Student Assessment); SACMEQ (Southern and Eastern Africa Consortium for Monitoring Educational Quality); PASEC (Program of Analysis of Education Systems); LLECE (Latin American Laboratory for Assessment of the Quality of Education); EGRA (Early Grade Reading Assessments). | WDI_miscIndicators01.xlsx |
A problem is that the variables are scattered across two different Excel files. We will have to open both files and merge the datasets. Merging requires a key: a variable that exists in both datasets and can be used to match rows correctly. In these international Excel files, the variable iso3 serves as the key for merging.
dim(bb<-data.frame(read_excel("HDI_educationAchievements.xlsx",sheet="data"))) # note that we can nest functions (nest read_excel within data.frame)
## [1] 200 14
dim(cc<-data.frame(read_excel("WDI_miscIndicators01.xlsx",sheet="data"))) # note also how we can nest the assignment within the dim function so that we learn the size of the object without typing a new line of code
## [1] 238 56
dim(m<-merge(bb,cc,by="iso3")) # by default, merge keeps only rows whose key value appears in both data frames. Thus the merged dataset usually has fewer observations than either of the original datasets.
## [1] 198 69
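If you instead want to keep every row from both files, you can pass all=TRUE to merge(); unmatched rows are then filled with NA. A sketch (mAll is just a name chosen for this example):
dim(mAll<-merge(bb,cc,by="iso3",all=TRUE)) # an outer merge: rows with no match in the other data frame are kept, with NA in the missing columns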
We will use the lm() function to carry out regression analysis. HD.HCI.HLOS is the dependent variable, and govtExpEduc and ger_secon are the independent variables. We save the regression results to an object zz and look at our results using the summary() command.
zz<-lm(HD.HCI.HLOS~govtExpEduc+ger_secon,data=m) #the lm function uses OLS to estimate a model with HD.HCI.HLOS as the dependent variable and govtExpEduc and ger_secon as the independent variables, with variables found in data.frame m
summary(zz)#displays the most important information from the lm object zz
##
## Call:
## lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.530 -34.750 3.472 36.159 100.591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 299.1965 15.0311 19.905 <2e-16 ***
## govtExpEduc -4.0468 3.0413 -1.331 0.186
## ger_secon 1.8507 0.1514 12.223 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.64 on 114 degrees of freedom
## (81 observations deleted due to missingness)
## Multiple R-squared: 0.5975, Adjusted R-squared: 0.5904
## F-statistic: 84.61 on 2 and 114 DF, p-value: < 2.2e-16
str(zz) #the str function shows the structure of an object; you can see that zz contains a lot of different parts
## List of 13
## $ coefficients : Named num [1:3] 299.2 -4.05 1.85
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ residuals : Named num [1:117] -29.99 -31.75 -52.47 -4.29 -32.82 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ effects : Named num [1:117] -4757.6 203.1 -557.9 4 -21.5 ...
## ..- attr(*, "names")= chr [1:117] "(Intercept)" "govtExpEduc" "ger_secon" "" ...
## $ rank : int 3
## $ fitted.values: Named num [1:117] 385 461 477 447 557 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## ..$ qr : num [1:117, 1:3] -10.8167 0.0925 0.0925 0.0925 0.0925 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:117] "1" "3" "6" "7" ...
## .. .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..- attr(*, "assign")= int [1:3] 0 1 2
## ..$ qraux: num [1:3] 1.09 1.04 1.06
## ..$ pivot: int [1:3] 1 2 3
## ..$ tol : num 1e-07
## ..$ rank : int 3
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 114
## $ na.action : 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## $ xlevels : Named list()
## $ call : language lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
## $ terms :Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## $ model :'data.frame': 117 obs. of 3 variables:
## ..$ HD.HCI.HLOS: num [1:117] 355 429 424 443 524 ...
## ..$ govtExpEduc: num [1:117] 3.93 3.95 5.57 2.76 5.32 ...
## ..$ ger_secon : num [1:117] 54.8 96.1 108 86 151 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## ..- attr(*, "na.action")= 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## .. ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## - attr(*, "class")= chr "lm"
str(summary(zz)) # the summary of zz also contains a lot of parts
## List of 12
## $ call : language lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
## $ terms :Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## $ residuals : Named num [1:117] -29.99 -31.75 -52.47 -4.29 -32.82 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ coefficients : num [1:3, 1:4] 299.2 -4.05 1.85 15.03 3.04 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:3] FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ sigma : num 45.6
## $ df : int [1:3] 3 114 3
## $ r.squared : num 0.597
## $ adj.r.squared: num 0.59
## $ fstatistic : Named num [1:3] 84.6 2 114
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:3, 1:3] 0.108448 -0.012256 -0.000501 -0.012256 0.00444 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ na.action : 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## - attr(*, "class")= chr "summary.lm"
When publishing an empirical paper in economics, one typically presents a table of regression results, showing the estimated coefficients, their standard errors, and the associated p-values. We will talk more about this next week, but here is code that will write your results to a csv file.
x<-summary(zz)$coefficients #extract just the part you want for your table
write.csv(x,file="regres.csv") # this writes the object x to a csv-format file called "regres.csv" (in your working directory)
getwd() # will tell you the location of your working directory
## [1] "C:/Users/EAEFF/Dropbox/4620/rmd"
The following table is our regression results table. The estimated coefficients are the estimates of \(\alpha_{0}\), \(\alpha_{1}\), and \(\alpha_{2}\). The standard error is a measure of how uncertain we are about the value of the estimated coefficient: the higher the standard error, the less we can trust our estimate. t value (the t-statistic) is formed by dividing the Estimate by the Std. Error. The null hypothesis (\(H_{0}\)) for the t-statistic is that the true value of the coefficient is zero (that is, changes in the independent variable have no effect on the dependent variable). Pr(>|t|) (the p-value) gives the probability of obtaining a t-statistic at least this large in absolute value if \(H_{0}\) were true. That probability is essentially zero for ger_secon and 0.186 for govtExpEduc. As a rule of thumb, when the p-value is below 0.05 we reject the null hypothesis, allowing us to say with fair confidence that the coefficient is significantly different from zero: that the independent variable does indeed have an effect on the dependent variable. Thus, we conclude that \(\alpha_{1}\) is not significantly different from zero and \(\alpha_{2}>0\).
 | Estimate | Std. Error | t value | Pr(>|t|) |
---|---|---|---|---|
(Intercept) | 299.1965 | 15.0311 | 19.9052 | 0.000 |
govtExpEduc | -4.0468 | 3.0413 | -1.3306 | 0.186 |
ger_secon | 1.8507 | 0.1514 | 12.2233 | 0.000 |
What have we learned? We learned that our initial hypotheses are not consistent with the facts: increases in educational expenditures do not increase test scores; and increases in the proportion of persons who enroll in secondary education do not decrease test scores (in fact, they increase test scores).
Results like these would start several trains of thought going in the mind of the economist. Perhaps the relationship between test scores and educational expenditures is non-linear? Or perhaps we overlooked something when formulating our hypotheses: for example, countries with a small share of the population in secondary education might not be rationing to the brightest students, but instead to the best-connected, or to the most urbanized. The results set us thinking in new directions, so that we develop a more complete and more sophisticated understanding of what is happening in the real world.
Using the regression results we can create a fitted value (or predicted value) for the dependent variable. The chart below illustrates the relationship between the fitted value (on the y-axis) and the original dependent variable (on the x-axis). The diagonal blue line shows where the plotted points would have landed were the predictions 100% accurate. The vertical lines show the residuals, which are the errors in estimating the predicted value, corresponding to the error term in our initial linear model. The green lines show the positive residuals (where actual values are higher than the predicted values) and the red lines show the negative residuals (where actual values are lower than predicted).
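A minimal sketch of how a chart like this could be drawn in base R (not necessarily the exact code behind the figure; yact and yfit are names chosen for the example):
yact<-zz$model$HD.HCI.HLOS # actual values of the dependent variable used in the regression
yfit<-fitted(zz) # fitted (predicted) values from the lm object
plot(yact,yfit,xlab="actual HD.HCI.HLOS",ylab="fitted HD.HCI.HLOS",pch=19)
abline(a=0,b=1,col="blue") # 45-degree line: points on this line are predicted exactly
segments(yact,yfit,yact,yact,col=ifelse(yact>yfit,"green","red")) # vertical residual segments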
A good measure of how well the fitted value corresponds to the dependent variable is \(R^2\), which is the proportion of the variation in the dependent variable that is explained by the model. This equals the squared correlation between the fitted value and the dependent variable. \(R^2\) will be between 0 and 1; in this case it is 0.597.
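You can verify this relationship directly (a quick sketch):
cor(fitted(zz),zz$model$HD.HCI.HLOS)^2 # squared correlation between fitted and actual values
summary(zz)$r.squared # should match the Multiple R-squared reported above (about 0.597)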
You can view the data used for estimation and prediction by inspecting the fitted lm object: zz$model holds the 117 complete observations used in the regression, and fitted(zz) holds the predicted values.
R functions used in this document: setwd, getwd, library, readxl::read_excel, class, head, tail, dim, write.csv, lm.
Economics wants to be considered a science and for that reason the presentation of results in a published paper reflects the general fashion in the sciences. Tables are especially important. A friend who was for many years the editor of a journal told me that he would always take a quick look at the tables in a newly submitted paper. If they were sloppy, or did not follow the usual style in economics, he would immediately begin looking for reasons to desk reject the paper. Acceptable tables are not hard to produce, and can easily be made using Microsoft Office. Look here for instructions.
In MS Word, make acceptable tables from two csv files: one with the descriptive statistics for the variables in your model above, and the second with the regression results for your model. Submit these in the Dropbox for week 1. I am calling this individual homework, but I very much hope that you will all help each other learn this material.