Compiled on 2023-01-23 by E.
Anthon Eff
You might find it easier to read and navigate this html file if you download it to your own computer.
We will use international data for most of the topics this semester. I recommend that you use these data for your term paper.
The HDI is produced by the United Nations Development Programme. See information about the HDI here and here. The following data come from the 2020 Human Development Report.
The WDI are produced by the World Bank. See information about the WDI here.
Open Rstudio. The best way to arrange the panes is to have
Source on the left and Console on the right. On the
top menu you will see Tools: click Tools>Global
Options>Pane Layout to arrange the panes.
The Source pane is where you will write your R scripts. The Console pane
is where output appears when you run your script.
On the first line of your script set your working directory to a
location where you want to read and write your data. For example:
setwd("C:/Users/EAEFF/Dropbox (MTSU)/class/4620/")
. Note
that slashes are different from the usual Microsoft Windows slashes.
R is an open-source project. Volunteers have created over 13,000
packages to perform specific tasks in a variety of fields, ranging from
econometrics to textual analysis. A nice overview of the highest quality
packages by field can be found in the CRAN Task Views.
To use a specific package, one must first install it. A package
that we will use quite often this semester is AER.
install.packages("AER")
You will only need to install a package once. So, after installing a
package, remove the install.packages
command, or comment it
out, by placing a hashtag (#
) in front of it:
# install.packages("AER")
Once a package is installed, you can load that package like this:
library(AER)
Install the following packages now: AER, psych,
readxl.
After installing, load them.
Data analyis essentially consists of bringing data into R, doing
something to that data, then writing results to files. R can read and
write many kinds of data. In the following example we will load two
excel files: HDI_educationAchievements.xlsx
and WDI_miscIndicators01.xlsx.
Each of these workbooks has two worksheets. The first sheet is called
data
, and consists of a data table where each row
(also called observation, or record) is a country or
territory, and each column (also called variable, or
field) gives values for various socio-economic measures. The
second sheet, called variable
, provides a short description
of each data column.
Click on the file names above and save the two excel files in your working directory. You should open these files in excel and take a quick look at their contents.
Once the file is in your working directory, you can read it into R,
using a function called read_excel
from the
readxl
package. The following command reads the worksheet
data
from the workbook
WDI_miscIndicators01.xlsx
and puts it into an object called
dd
in the R general environment.
dd<-readxl::read_excel(path="WDI_miscIndicators01.xlsx",sheet="data")
To see the help files for a function like
readxl::read_excel
, simply highlight the function name with
your cursor in Rstudio, and press the F1 key.
The object dd
is of a class called tibble,
which we will not use in this course, so we will convert dd
into the class data.frame, which is the most common way in
which data is used in R.
dd<-data.frame(dd)
One should always develop an understanding of ones data before using
them. The data.frame dd
is 238 x 56 (238 observations, 56
variables), small enough that we can view it in a spreadsheet. But
economists usually use much larger datasets (I have used several that
had millions of records), so commands that will provide summary
information about a dataset are very useful.
dd<-readxl::read_excel(path="WDI_miscIndicators01.xlsx",sheet="data")
dd<-data.frame(dd)
dim(dd) #gives the dimensions of the data.frame
## [1] 238 56
class(dd) #gives the object dd's class
## [1] "data.frame"
head(dd,3) #shows the top 3 rows of the data.frame
## country iso3 iso2c year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER
## 1 Afghanistan AFG AF 2018 -1.4968340 10 4.326923 0.00000
## 2 Albania ALB AL 2018 -0.5219214 11 35.096150 24.51923
## 3 Algeria DZA DZ 2018 -0.6301624 10 28.365390 17.30769
## CC.PER.RNK.UPPER CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK GE.PER.RNK.LOWER
## 1 9.615385 0.1573170 -1.4572850 8 7.692307 1.442308
## 2 44.230770 0.1360851 0.1147880 7 57.692310 42.788460
## 3 41.346150 0.1520321 -0.4439246 10 37.019230 20.192310
## GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE HD.HCI.AMRT.MA
## 1 12.50000 0.2380511 0.7803819 0.8037745 0.7587757
## 2 70.19231 0.2275700 0.9277925 0.9501160 0.9055499
## 3 48.07692 0.1942946 0.9069255 0.9181952 0.8959428
## HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS HD.HCI.HLOS.FE
## 1 8.72034 6.795722 9.544137 354.7588 350.1461
## 2 12.88551 13.201755 12.610832 429.3290 445.0104
## 3 11.77532 12.108990 11.442765 374.0891 383.3708
## HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE HD.HCI.MORT.MA
## 1 357.7947 4.949788 0.9352661 0.9389164 0.9318808
## 2 413.5843 8.851396 0.9909778 0.9916161 0.9903747
## 3 365.7584 7.048028 0.9759771 0.9775091 0.9745197
## HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST PV.NO.SRC PV.PER.RNK
## 1 0.3934892 0.3810325 0.4053017 -2.7516110 6 0.4761905
## 2 0.6286663 0.6159194 0.6409156 0.3783375 8 59.0476200
## 3 0.5319939 0.5249978 0.5387299 -0.8265918 7 18.0952400
## PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR RL.EST RL.NO.SRC RL.PER.RNK
## 1 0.00000 2.380952 0.2267807 -1.6680210 11 4.326923
## 2 47.61905 70.000000 0.2149327 -0.3923339 11 38.942310
## 3 11.42857 29.047620 0.2199717 -0.7752355 12 22.115390
## RL.PER.RNK.LOWER RL.PER.RNK.UPPER RL.STD.ERR RQ.EST RQ.NO.SRC RQ.PER.RNK
## 1 1.442308 6.730769 0.1586558 -1.1301910 9 10.576920
## 2 28.846150 48.076920 0.1456045 0.2684412 8 63.461540
## 3 16.346150 32.692310 0.1553647 -1.2625400 8 8.173077
## RQ.PER.RNK.LOWER RQ.PER.RNK.UPPER RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK
## 1 6.250000 18.75000 0.1931015 -0.9929534 10 20.68966
## 2 52.403850 73.07692 0.2055196 0.2076107 12 53.20197
## 3 4.326923 14.42308 0.1951701 -0.9816775 12 21.67488
## VA.PER.RNK.LOWER VA.PER.RNK.UPPER VA.STD.ERR
## 1 13.7931 25.61576 0.1347938
## 2 45.3202 60.59113 0.1280210
## 3 13.7931 25.61576 0.1315684
tail(dd,2) #shows the bottom 2 rows of the data.frame
## country iso3 iso2c year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER
## 237 Zambia ZMB ZM 2018 -0.6562852 14 27.88461 18.269230
## 238 Zimbabwe ZWE ZW 2018 -1.2347430 15 10.09615 5.288462
## CC.PER.RNK.UPPER CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK
## 237 39.42308 0.1304875 -0.5592093 11 33.17308
## 238 14.42308 0.1272638 -1.1988600 12 10.57692
## GE.PER.RNK.LOWER GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE
## 237 17.788460 43.75000 0.1874648 0.7248627 0.7747966
## 238 6.730769 16.34615 0.1805988 0.6420019 0.6577585
## HD.HCI.AMRT.MA HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS
## 237 0.6749445 8.764444 8.745135 8.787398 358.1404
## 238 0.6228139 11.054537 10.974918 11.136672 396.1388
## HD.HCI.HLOS.FE HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE
## 237 NA NA 5.022243 0.9406496 0.9455552
## 238 NA NA 7.006610 0.9506910 0.9554852
## HD.HCI.MORT.MA HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST
## 237 0.9359885 0.3911756 0.3769210 0.4038159 0.0938291
## 238 0.9462116 0.4612416 0.4397229 0.4822915 -0.7134061
## PV.NO.SRC PV.PER.RNK PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR
## 237 7 50.95238 37.14286 61.42857 0.2192343
## 238 8 20.47619 12.85714 33.33333 0.2149327
## RL.EST RL.NO.SRC RL.PER.RNK RL.PER.RNK.LOWER RL.PER.RNK.UPPER
## 237 -0.3450496 14 41.346150 29.807690 50.48077
## 238 -1.2733430 15 8.173077 5.288462 14.42308
## RL.STD.ERR RQ.EST RQ.NO.SRC RQ.PER.RNK RQ.PER.RNK.LOWER
## 237 0.1434674 -0.4842296 10 34.134620 22.115390
## 238 0.1387014 -1.5035220 11 5.288462 2.884615
## RQ.PER.RNK.UPPER RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK
## 237 45.192310 0.1663302 -0.3190484 14 35.96059
## 238 8.653846 0.1644797 -1.1232210 15 17.24138
## VA.PER.RNK.LOWER VA.PER.RNK.UPPER VA.STD.ERR
## 237 31.03448 40.39409 0.1272391
## 238 12.31527 22.16749 0.1242625
dd[c(12,7,19),3:6] #show rows 12,7,19 for columns 3,4,5,6.
## iso2c year CC.EST CC.NO.SRC
## 12 AU 2018 1.8064600 11
## 7 AI 2018 1.2387970 1
## 19 BY 2018 -0.1922859 11
We can see that three variables are character variables, the others are numeric. We will use country name as the rowname, and then remove all character variables from the data.frame, so that all remaining variables are numeric.
rownames(dd)<-dd$country #assigns country names as the rownames
dd<-dd[,-(1:3)] #removes the first three columns in dd (the non-numeric columns)
dd[c("Denmark","Turkey"),] # shows the rows for Denmark and Turkey
## year CC.EST CC.NO.SRC CC.PER.RNK CC.PER.RNK.LOWER CC.PER.RNK.UPPER
## Denmark 2018 2.1481000 9 98.55769 94.71154 100.00000
## Turkey 2018 -0.3354134 12 43.75000 33.65385 52.40385
## CC.STD.ERR GE.EST GE.NO.SRC GE.PER.RNK GE.PER.RNK.LOWER
## Denmark 0.1451558 1.8716310 7 97.11539 91.82692
## Turkey 0.1323154 0.0060968 9 53.84615 39.42308
## GE.PER.RNK.UPPER GE.STD.ERR HD.HCI.AMRT HD.HCI.AMRT.FE HD.HCI.AMRT.MA
## Denmark 100.00000 0.2221446 0.9296045 0.9469248 0.9127313
## Turkey 66.34615 0.2085700 0.9078032 0.9365753 0.8785881
## HD.HCI.EYRS HD.HCI.EYRS.FE HD.HCI.EYRS.MA HD.HCI.HLOS HD.HCI.HLOS.FE
## Denmark 13.39460 13.47162 13.32221 530.7278 532.2063
## Turkey 12.06618 11.99553 12.13585 459.1798 463.1666
## HD.HCI.HLOS.MA HD.HCI.LAYS HD.HCI.MORT HD.HCI.MORT.FE HD.HCI.MORT.MA
## Denmark 529.1732 11.374220 0.9957407 0.9961299 0.9953738
## Turkey 455.3064 8.864876 0.9887604 0.9894558 0.9881020
## HD.HCI.OVRL HD.HCI.OVRL.LB HD.HCI.OVRL.UB PV.EST PV.NO.SRC
## Denmark 0.7708318 0.7620665 0.7794603 0.9532585 9
## Turkey 0.6253099 0.6135694 0.6366453 -1.3086830 8
## PV.PER.RNK PV.PER.RNK.LOWER PV.PER.RNK.UPPER PV.STD.ERR RL.EST
## Denmark 82.38095 66.190480 94.28571 0.2107918 1.8323330
## Turkey 10.47619 6.666667 14.28571 0.2142458 -0.3194206
## RL.NO.SRC RL.PER.RNK RL.PER.RNK.LOWER RL.PER.RNK.UPPER RL.STD.ERR
## Denmark 10 96.63461 90.38461 100.00000 0.1569019
## Turkey 12 42.30769 30.76923 51.92308 0.1502382
## RQ.EST RQ.NO.SRC RQ.PER.RNK RQ.PER.RNK.LOWER RQ.PER.RNK.UPPER
## Denmark 1.6370360 8 93.75000 86.53846 98.07692
## Turkey 0.0175277 10 54.80769 42.78846 63.94231
## RQ.STD.ERR VA.EST VA.NO.SRC VA.PER.RNK VA.PER.RNK.LOWER
## Denmark 0.2215324 1.6118670 10 97.53695 92.61084
## Turkey 0.1766253 -0.8313113 12 25.12315 19.21182
## VA.PER.RNK.UPPER VA.STD.ERR
## Denmark 100.00000 0.1455202
## Turkey 30.04926 0.1298985
colMeans(dd[,c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA")]) #gives the mean of each of these three columns (using colnames)
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## NA NA NA
colMeans(dd[,c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA")],na.rm=TRUE) #if there are missing values must remove them
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 428.1475 434.4287 425.8587
apply(dd[,20:22],2,mean,na.rm=TRUE) #another way to get the mean of each of these three columns (using column numbers)
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 428.1475 434.4287 425.8587
rxx<-c("HD.HCI.HLOS","HD.HCI.HLOS.FE","HD.HCI.HLOS.MA") # can store colnames in object
apply(dd[,rxx],2,sd,na.rm=TRUE) #get the standard deviation of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 69.28723 70.83195 71.66607
apply(dd[,rxx],2,range,na.rm=TRUE) #get the range of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## [1,] 304.9222 302.0893 307.0905
## [2,] 580.8655 584.6268 577.2593
apply(dd[,rxx],2,quantile,na.rm=TRUE) #get the quartiles of each of these three columns
## HD.HCI.HLOS HD.HCI.HLOS.FE HD.HCI.HLOS.MA
## 0% 304.9222 302.0893 307.0905
## 25% 373.4452 382.8199 369.6537
## 50% 416.2702 426.1908 415.9109
## 75% 495.3233 503.6029 499.7190
## 100% 580.8655 584.6268 577.2593
When publishing an empirical paper in economics, one typically presents a table of descriptive statistics, showing the mean, standard deviation, minimum, maximum, and number of non-missing observations for each of the variables one used in ones analysis. Below, we calculate this table and write the results in a file that can readily be used in MS Office.
u<-data.frame(psych::describe(dd)) # using the describe function from the psych package
colnames(u) # the function produces more information than we want to use
## [1] "vars" "n" "mean" "sd" "median" "trimmed"
## [7] "mad" "min" "max" "range" "skew" "kurtosis"
## [13] "se"
u<-u[,c("n","mean","sd","min","max")] # so here we select only the columns that we want
u # simply putting the name of the object causes it to print on the console
## n mean sd min max
## year 238 2.018000e+03 0.00000000 2018.0000000 2018.0000000
## CC.EST 205 -4.849561e-05 0.99746575 -1.8003120 2.2127440
## CC.NO.SRC 205 9.639024e+00 4.05675895 1.0000000 16.0000000
## CC.PER.RNK 205 5.002814e+01 28.99778868 0.0000000 100.0000000
## CC.PER.RNK.LOWER 205 4.156660e+01 29.06409059 0.0000000 95.1923100
## CC.PER.RNK.UPPER 205 5.760553e+01 28.03314615 2.8846150 100.0000000
## CC.STD.ERR 205 1.743017e-01 0.08379772 0.1191657 0.8767205
## GE.EST 205 -4.745274e-03 0.99526451 -2.4494090 2.2314740
## GE.NO.SRC 205 7.468293e+00 3.10205806 1.0000000 13.0000000
## GE.PER.RNK 205 4.985694e+01 28.95864127 0.0000000 100.0000000
## GE.PER.RNK.LOWER 205 3.837946e+01 28.61175022 0.0000000 97.1153900
## GE.PER.RNK.UPPER 205 6.045028e+01 28.03374140 0.9615384 100.0000000
## GE.STD.ERR 205 2.427437e-01 0.09085091 0.1731845 1.0484390
## HD.HCI.AMRT 173 8.443381e-01 0.08737014 0.5116570 0.9601836
## HD.HCI.AMRT.FE 173 8.761004e-01 0.08347202 0.5751923 0.9751105
## HD.HCI.AMRT.MA 173 8.129396e-01 0.09491627 0.4484606 0.9578469
## HD.HCI.EYRS 173 1.126578e+01 2.33297253 4.1569886 13.9364252
## HD.HCI.EYRS.FE 167 1.132775e+01 2.43761034 3.8481028 13.9298334
## HD.HCI.EYRS.MA 167 1.126428e+01 2.24040536 4.1481204 13.9701385
## HD.HCI.HLOS 166 4.281475e+02 69.28722577 304.9222412 580.8654785
## HD.HCI.HLOS.FE 152 4.344287e+02 70.83194719 302.0892944 584.6267700
## HD.HCI.HLOS.MA 152 4.258587e+02 71.66606804 307.0905151 577.2592773
## HD.HCI.LAYS 166 7.904075e+00 2.59345365 2.2065024 12.9378729
## HD.HCI.MORT 173 9.713640e-01 0.02892175 0.8773022 0.9981502
## HD.HCI.MORT.FE 173 9.737100e-01 0.02701427 0.8844790 0.9983334
## HD.HCI.MORT.MA 173 9.691398e-01 0.03075042 0.8705665 0.9979782
## HD.HCI.OVRL 166 5.654461e-01 0.14697786 0.2990290 0.8870836
## HD.HCI.OVRL.LB 166 5.516884e-01 0.15067062 0.2696179 0.8775859
## HD.HCI.OVRL.UB 166 5.777465e-01 0.14446884 0.3156515 0.8966209
## PV.EST 207 -1.276698e-02 1.00187293 -2.9933110 1.9354980
## PV.NO.SRC 207 6.526570e+00 2.29351541 1.0000000 9.0000000
## PV.PER.RNK 207 4.963653e+01 29.08487260 0.0000000 100.0000000
## PV.PER.RNK.LOWER 207 3.777778e+01 25.30539109 0.0000000 94.2857100
## PV.PER.RNK.UPPER 207 6.048769e+01 30.40167453 0.9523810 100.0000000
## PV.STD.ERR 207 2.340465e-01 0.05200456 0.2107918 0.5993918
## RL.EST 205 -2.556260e-03 0.99162415 -2.3320440 2.0451420
## RL.NO.SRC 205 9.960976e+00 3.90681143 1.0000000 16.0000000
## RL.PER.RNK 205 4.996951e+01 28.88854697 0.0000000 100.0000000
## RL.PER.RNK.LOWER 205 4.102486e+01 28.32456119 0.0000000 95.1923100
## RL.PER.RNK.UPPER 205 5.788227e+01 28.61483793 0.4807692 100.0000000
## RL.STD.ERR 205 1.834239e-01 0.08503842 0.1338077 0.7548596
## RQ.EST 205 1.985990e-03 0.99009942 -2.3639190 2.2269640
## RQ.NO.SRC 205 7.702439e+00 2.78374904 1.0000000 12.0000000
## RQ.PER.RNK 205 4.994606e+01 28.98178616 0.0000000 100.0000000
## RQ.PER.RNK.LOWER 205 3.909709e+01 28.65894884 0.0000000 97.5961500
## RQ.PER.RNK.UPPER 205 5.996482e+01 28.04701335 1.9230770 100.0000000
## RQ.STD.ERR 205 2.262221e-01 0.09385629 0.1600714 1.0725610
## VA.EST 200 -3.714988e-03 0.99002901 -2.1749270 1.7335840
## VA.NO.SRC 200 9.930000e+00 3.99309706 1.0000000 16.0000000
## VA.PER.RNK 200 4.984729e+01 28.96615364 0.4926108 100.0000000
## VA.PER.RNK.LOWER 200 4.236207e+01 26.96678251 0.0000000 95.5665100
## VA.PER.RNK.UPPER 200 5.660345e+01 30.18555150 1.9704430 100.0000000
## VA.STD.ERR 200 1.525038e-01 0.04322136 0.1241545 0.2962780
write.csv(u,file="descrip.csv") # this writes the object u to a csv-format file in your working directory called "descrip.csv" (can be opened in excel)
getwd() # shows where your working directory is located, so that you can find your file
## [1] "C:/Users/EAEFF/Dropbox/4620/rmd"
Economists usually employ some variant of regression analysis. They do two different kinds of tasks with regression: estimate values (also called prediction), and test hypotheses. The former is more common in the applied work done for business and government; the latter is more important in academia and will be the primary focus of this class.
Hypotheses are drawn from theory, or perhaps simply from common-sense views of how the world works. For example, suppose that most people believe that a country attains high test scores on international exams such as the PISA either by spending a lot on education, or by rationing secondary education to only the smartest students (only secondary education students take the test). We can use regression analysis to see whether those beliefs are consistent with reality.
Our hypothesis (really two hypotheses) is that the test score is a function of educational expenditures and the proportion of the population that enters secondary education: score=f(EducExp,PctSecondary). We express that hypothesis mathematically as a simple linear function:
scorei=α0+α1EducExpi+α2PctSecondaryi+εi ∀ i∈countries
In this equation, score, EducExp, and PctSecondary are variables, and the subscript i indicates that the variables take on the value found in a specific country i. The left hand side variable is called the dependent variable (or response variable); the variables on the right hand side are called the independent variables (or predictor variables or explanatory variables). The direction of causation is from the independent to the dependent variable: the independent variable is cause, the dependent is effect.
The scalars α are coefficients (also called parameters), where α0 is the intercept and α1 and α2 are slope coefficients. For any country i, given the values of EducExpi and PctSecondaryi, the equation will return the score, but with error, since the world is infinitely complicated and nothing can really be reduced to an equation with two determinants. εi represents that error (we call εi the error term or the disturbance); it will always have a mean of zero and have both positive and negative values.
Note that α1 equals the first derivative of score with respect to EducExp (δscore/δEducExp=α1). Thus, for every one unit increase in educational expenditures, score will change by α1 points. With a linear model, the marginal effect - the effect on the dependent variable of a unit increase in the independent variable - is easily understood: it is simply the estimated coefficient.
If our hypothesis is correct, α1>0 (an increase in educational expenditures leads to an increase in scores) and α2<0 (an increase in the proportion of persons who enter secondary education leads to a decrease in scores).
Once we are clear on a hypothesis, we need to find data that can serve as variables in Equation [1] above. We have relevant variables in our excel files, as described in the table below.
variable | long.name | description | excel.file |
---|---|---|---|
ger_secon | Gross enrolment ratio: Secondary | Gross enrolment ratio: Total enrolment in a given level of education (pre-primary, primary, secondary or tertiary), regardless of age, expressed as a percentage of the official school-age population for the same level of education. | HDI_educationAchievements.xlsx |
govtExpEduc | Government expenditure on education | Government expenditure on education: Current, capital and transfer spending on education, expressed as a percentage of GDP. | HDI_educationAchievements.xlsx |
HD.HCI.HLOS | Harmonized Test Scores | Harmonized Test Scores from major international student achievement testing programs. They are measured in TIMMS-equivalent units, where 300 is minimal attainment and 625 is advanced attainment. Test scores from the following testing programs are included: TIMSS (Trends in International Mathematics and Science Study); PIRLS (Progress in International Reading Literacy Study); PISA (Programme for International Student Assessment); SACMEQ (Southern and Eastern Africa Consortium for Monitoring Educational Quality); PASEC (Program of Analysis of Education Systems); LLECE (Latin American Laboratory for Assessment of the Quality of Education); EGRA (Early Grade Reading Assessments). | WDI_miscIndicators01.xlsx |
A problem is that the variables are scattered across two different excel files. We will have to open both files and merge the datasets. Merging requires a key, a variable that exists in both datasets that can be used to match rows correctly. In the international excel files the variable iso3 serves as the key for merging.
dim(bb<-data.frame(read_excel("HDI_educationAchievements.xlsx",sheet="data"))) # note that we can nest functions (nest read_excel within data.frame)
## [1] 200 14
dim(cc<-data.frame(read_excel("WDI_miscIndicators01.xlsx",sheet="data"))) # note also how we can nest the assignment within the dim function so that we learn the size of the object without typing a new line of code
## [1] 238 56
dim(m<-merge(bb,cc,by="iso3")) # when we merge two data frames, we drop observations where the key value is missing in one of the data frames. Thus the merged dataset usually has fewer observations than either of the original datasets.
## [1] 198 69
We will use the lm()
function to carry out regression
analysis. HD.HCI.HLOS is the dependent variable, and
govtExpEduc and ger_secon are the indepdendent
variables. We save the regression results to an object zz. We
look at our results using the summary()
command.
zz<-lm(HD.HCI.HLOS~govtExpEduc+ger_secon,data=m) #the lm function uses OLS to estimate a model with HD.HCI.HLOS as the dependent variable and govtExpEduc and ger_secon as the independent variables, with variables found in data.frame m
summary(zz)#displays the most important information from the lm object zz
##
## Call:
## lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.530 -34.750 3.472 36.159 100.591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 299.1965 15.0311 19.905 <2e-16 ***
## govtExpEduc -4.0468 3.0413 -1.331 0.186
## ger_secon 1.8507 0.1514 12.223 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.64 on 114 degrees of freedom
## (81 observations deleted due to missingness)
## Multiple R-squared: 0.5975, Adjusted R-squared: 0.5904
## F-statistic: 84.61 on 2 and 114 DF, p-value: < 2.2e-16
str(zz) #the str function shows the structure of an object; you can see that zz contains a lot of different parts
## List of 13
## $ coefficients : Named num [1:3] 299.2 -4.05 1.85
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ residuals : Named num [1:117] -29.99 -31.75 -52.47 -4.29 -32.82 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ effects : Named num [1:117] -4757.6 203.1 -557.9 4 -21.5 ...
## ..- attr(*, "names")= chr [1:117] "(Intercept)" "govtExpEduc" "ger_secon" "" ...
## $ rank : int 3
## $ fitted.values: Named num [1:117] 385 461 477 447 557 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## ..$ qr : num [1:117, 1:3] -10.8167 0.0925 0.0925 0.0925 0.0925 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:117] "1" "3" "6" "7" ...
## .. .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..- attr(*, "assign")= int [1:3] 0 1 2
## ..$ qraux: num [1:3] 1.09 1.04 1.06
## ..$ pivot: int [1:3] 1 2 3
## ..$ tol : num 1e-07
## ..$ rank : int 3
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 114
## $ na.action : 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## $ xlevels : Named list()
## $ call : language lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
## $ terms :Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## $ model :'data.frame': 117 obs. of 3 variables:
## ..$ HD.HCI.HLOS: num [1:117] 355 429 424 443 524 ...
## ..$ govtExpEduc: num [1:117] 3.93 3.95 5.57 2.76 5.32 ...
## ..$ ger_secon : num [1:117] 54.8 96.1 108 86 151 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## ..- attr(*, "na.action")= 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## .. ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## - attr(*, "class")= chr "lm"
str(summary(zz)) # the summary of zz also contains a lot of parts
## List of 12
## $ call : language lm(formula = HD.HCI.HLOS ~ govtExpEduc + ger_secon, data = m)
## $ terms :Classes 'terms', 'formula' language HD.HCI.HLOS ~ govtExpEduc + ger_secon
## .. ..- attr(*, "variables")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## .. .. .. ..$ : chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "term.labels")= chr [1:2] "govtExpEduc" "ger_secon"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(HD.HCI.HLOS, govtExpEduc, ger_secon)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "HD.HCI.HLOS" "govtExpEduc" "ger_secon"
## $ residuals : Named num [1:117] -29.99 -31.75 -52.47 -4.29 -32.82 ...
## ..- attr(*, "names")= chr [1:117] "1" "3" "6" "7" ...
## $ coefficients : num [1:3, 1:4] 299.2 -4.05 1.85 15.03 3.04 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:3] FALSE FALSE FALSE
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ sigma : num 45.6
## $ df : int [1:3] 3 114 3
## $ r.squared : num 0.597
## $ adj.r.squared: num 0.59
## $ fstatistic : Named num [1:3] 84.6 2 114
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:3, 1:3] 0.108448 -0.012256 -0.000501 -0.012256 0.00444 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## .. ..$ : chr [1:3] "(Intercept)" "govtExpEduc" "ger_secon"
## $ na.action : 'omit' Named int [1:81] 2 4 5 8 11 19 20 21 22 23 ...
## ..- attr(*, "names")= chr [1:81] "2" "4" "5" "8" ...
## - attr(*, "class")= chr "summary.lm"
When publishing an empirical paper in economics, one typically presents a table of regression results, showing the estimated coefficients, their standard errors, and the associated p-values. We will talk more about this next week, but here is code that will write your results to a csv file.
x<-summary(zz)$coefficients #extract just the part you want for your table
write.csv(x,file="regres.csv") # this writes the object x to a csv-format file called "regres.csv" (in your working directory)
getwd() # will tell you the location of your working directory
## [1] "C:/Users/EAEFF/Dropbox/4620/rmd"
The following table is our regression results table. The estimated coefficients are the α0, α1, and α2 estimates. The standard error is a measure of how uncertain we are about the value of the estimated coefficient: the higher the standard error, the less we can trust our estimate. t value (the t-statistic) is formed by dividing the Estimate by the Std.Error. The null hypothesis (H0) for the t-statistic is that the true value of the coefficient is zero (that is, changes in the independent variable have no effect on the dependent variable). Pr(>|t|) (the p-value) gives the probability that H0 is true. The probability that H0 is true (i.e., that the true value of the coefficient equals zero) is zero for ger_secon and 0.186 for govtExpEduc. As a rule of thumb, when the p-value is below 0.05 we reject the null hypothesis, allowing us to say with fair confidence that the coefficient is significantly different from zero: that the independent variable does indeed have an effect on the dependent variable. Thus, we conclude that α1=0 and α2>0.
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 299.1965 | 15.0311 | 19.9052 | 0.000 |
govtExpEduc | -4.0468 | 3.0413 | -1.3306 | 0.186 |
ger_secon | 1.8507 | 0.1514 | 12.2233 | 0.000 |
What have we learned? We learned that our initial hypotheses are not consistent with the facts: increases in educational expenditures do not increase test scores; and increases in the proportion of persons who enroll in secondary education do not decrease test scores (in fact, they increase test scores).
Results like these would start several trains of thought going in the mind of the economist. Perhaps the relationship between test scores and educational expenditures is non-linear? Or perhaps we overlooked something when formulating our hypotheses: for example, countries with a small share of the population in secondary education might not be rationing to the brightest students, but instead to the best-connected, or to the most urbanized. The results set us thinking in new directions, so that we develop a more complete and more sophisticated understanding of what is happening in the real world.
Using the regression results we can create a fitted value (or predicted value) for the dependent variable. The chart below illustrates the relationship between the fitted value (on the y-axis) and original dependent variable (on the x-axis). The diagonal blue line show where the plotted points would have landed were the predictions 100% accurate. The vertical lines show the residuals, which are the error in estimating the predicted value, corresponding to the error term in our initial linear model. The green lines show the positive residuals (where actual values are higher than the predicted values) and the red lines show the negative residuals (where actual values are lower than predicted).
A good measure of how well the fitted value corresponds to the dependent variable is R2, which is the proportion of the variation in the dependent variable that is explained by the model. This equals the squared correlation beteen the fitted value and the dependent variable. R2 will be between 0 and 1; in this case it is 0.597.
You can view the data used for estimation and prediction below:
setwd
getwd
library
readxl::read_excel
class
head
tail
dim
write.csv
lm
Economics wants to be considered a science and for that reason the presentation of results in a published paper reflects the general fashion in the sciences. Tables are especially important. A friend who was for many years the editor of a journal told me that he would always take a quick look at the tables in a newly submitted paper. If they were sloppy, or did not follow the usual style in economics, he would immediately begin looking for reasons to desk reject the paper. Acceptable tables are not hard to produce, and can easily be made using Microsoft Office. Look here for instructions.
Make, in MS Word, acceptable tables for two csv files: one with the descriptive statistics for the variables in your above model, and the second with the model regression results. Submit these in the Dropbox for week 1. I am calling this individual homework, but I very much hope that you all will help each other learn this material.