Week 1: Entering data and running a regression in R

R is an open-source project, built and maintained largely by volunteers. Over 4000 “packages” have been contributed to perform specific tasks in a variety of fields, ranging from econometrics to textual analysis. A nice overview of the highest-quality packages by field can be found in the CRAN Task Views.

Working directory and libraries

The first line in an R script typically sets the working directory: that is, the folder where R will look for input data and write output. If you have a flash drive, you should write to a folder there. Otherwise, write to your My Documents folder. Note that the path must be written with forward slashes (Linux-style) rather than single backslashes (Windows-style). R treats a backslash inside a string as an escape character, so a backslash would have to be doubled.

setwd("C:/Users/eaeff/Documents")
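A minimal sketch of how one might check the result of setwd(). It assumes nothing beyond base R; the temporary folder stands in for a real data folder, and the names docs and old_wd are made up for illustration.

```r
# Build the path from its parts; file.path() always joins with "/".
docs <- file.path("C:", "Users", "eaeff", "Documents")
print(docs)

# Switch to a folder that is guaranteed to exist (R's temporary directory),
# confirm the change with getwd(), then restore the original working directory.
old_wd <- getwd()
setwd(tempdir())
print(getwd())
setwd(old_wd)
```

Calling getwd() right after setwd() is a quick way to catch a mistyped path before any data is read or written.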

To use a specific package, one must first install it. A package that we will use quite often this semester is AER.

install.packages("AER")

You will be prompted to pick a site from which the package will be downloaded. Once it is downloaded and installed, you load the package like this:

library(AER)
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
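Installation only has to happen once per machine, while library() must be called in every session. One possible way to avoid reinstalling each time is sketched below. The helper name ensure_package is hypothetical (it is not part of R), and the example uses the stats package, which ships with R, so that it runs without a download.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string.
  library(pkg, character.only = TRUE)
}

ensure_package("stats")   # for this course one would pass "AER" instead
print(search())           # loaded packages appear as "package:<name>"
```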

Inputting data

The load command allows one to bring in an R format data set (a “workspace”):

load(file="S:/TEFF/6060/finXam.Rdata")

The “csv” format, which can be directly opened by spreadsheet software, can be read with the read.csv() command. A csv format file can be created with the write.csv() command. Both commands are part of base R.

The package foreign provides the capability to read or write a number of other popular formats, such as dbf. You must first load the foreign library.

library(foreign)
uu<-read.csv("S:/TEFF/6060/examZ.csv",as.is=TRUE)
write.csv(uu, file="uu.csv")
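The round trip can be tried without access to the course files. The sketch below uses a small made-up dataframe (toy) and temporary files. It also shows that write.csv() stores row names by default, which read.csv() then returns as a column named X. That may be why a dataframe read from csv has an X column, but that is only a guess about how any particular file was produced.

```r
# A toy dataframe; the values are for illustration only.
toy <- data.frame(country = c("Albania", "Algeria"),
                  IQ = c(90L, 84L),
                  stringsAsFactors = FALSE)

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(toy, file = f1)                     # row names are written by default
write.csv(toy, file = f2, row.names = FALSE)  # row names are suppressed

with_rn    <- read.csv(f1, as.is = TRUE)
without_rn <- read.csv(f2, as.is = TRUE)

print(names(with_rn))     # includes an extra first column, "X"
print(names(without_rn))  # only the original columns
```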

The “dbf” format can be read and written in a very similar way.

gg<-read.dbf("S:/TEFF/6060/williamson.dbf",as.is=TRUE)
write.dbf(uu,file="uu.dbf")

Tab-delimited data can be most easily read using the read.delim() command.

ll<-read.delim("S:/TEFF/6060/languagecodes.tab")

The command readLines() will read in lines of text, and can often be used to collect unstructured data.

tt<-readLines("S:/TEFF/6060/pareto.txt")
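To see the difference between the two commands, one can read the same file both ways. This is a sketch with a made-up three-line file written to a temporary location; the object names ll_demo and tt_demo are illustrative.

```r
# Create a small tab-delimited file: one header line and two data lines.
tmp <- tempfile(fileext = ".tab")
writeLines(c("code\tlanguage", "en\tEnglish", "fr\tFrench"), tmp)

# read.delim() parses the file into a dataframe, using the first line as names.
ll_demo <- read.delim(tmp, as.is = TRUE)
print(ll_demo)

# readLines() returns the raw text, one character string per line.
tt_demo <- readLines(tmp)
print(tt_demo)
```

The first result is ready for analysis, while the second must still be split or cleaned, which is what makes readLines() suitable for unstructured text.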

Some data is included in R packages. Use the data() command to see which datasets are available. The datasets that ship with base R live in the package datasets, so that is the package name to ask for:

data(package="datasets")

To load a dataset, for example the dataset LifeCycleSavings, use the data() function this way:

data(LifeCycleSavings)
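Once loaded, the dataset behaves like any other dataframe. A short sketch of a first look, using only base R:

```r
data(LifeCycleSavings)

# The dataset is a dataframe with one row per country.
print(class(LifeCycleSavings))
print(dim(LifeCycleSavings))    # number of rows, number of columns
print(names(LifeCycleSavings))  # variable names
print(head(LifeCycleSavings, 3))
```

Typing ?LifeCycleSavings at the console opens the help page that documents each variable.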

Looking at your data

You can see which objects exist in your environment by typing the command

ls()
## [1] "finXam"           "gg"               "LifeCycleSavings"
## [4] "ll"               "tt"               "uu"

There are many classes of objects; data will usually be in a dataframe or a matrix. Learn the class of the object uu with the command

class(uu)
## [1] "data.frame"

Objects have attributes. The attributes of a dataframe include its column and row names, and its dimensions (number of rows and columns). We call each row an observation or a record, and each column a variable or a field. One should always understand the data one is using, and a good way to do that is to look at its attributes and at a few observations. Some useful commands for looking at a dataframe uu are as follows:

Look at the first six rows:

head(uu)
##   X             country ISO2 MRunder5 MRinfant Pop2007 Births2007 DeathsU5
## 1 1         Afghanistan   AF      257      165   27145       1314      338
## 2 2             Albania   AL       15       13    3190         52        1
## 3 3             Algeria   DZ       37       33   33858        704       26
## 4 4             Andorra   AD        3        3      75          0        0
## 5 5              Angola   AO      158      116   17024        810      128
## 6 6 Antigua and Barbuda   AG       11       10      85          0        0
##   GNIpercap LEXbirth LRadult NetPrimEd pctYlow40 pctYlow20 pctInfLBW
## 1       250       44      28        61        NA        NA        NA
## 2      3290       76      99        94        21        40         7
## 3      3620       72      75        95        19        43         6
## 4        NA       NA      NA        83        NA        NA        NA
## 5      2560       42      67        58        NA        NA        12
## 6     11520       NA      NA        NA        NA        NA         5
##   u5uweight u5wasting u5stunting iodize watTot watUrb watRur sanTot sanUrb
## 1        39         7         54     28     22     37     17     30     45
## 2         8         7         22     60     97     97     97     97     98
## 3         4         3         11     61     85     87     81     94     98
## 4        NA        NA         NA     NA    100    100    100    100    100
## 5        31         6         45     35     51     62     39     50     79
## 6        NA        NA         NA     NA     NA     95     NA     NA     98
##   sanRur GovFinVac pneumDoc pneumAB popU18 popU5 popGR CDR CBR LEX fertR
## 1     25         0       28      NA  14526  5002   4.5  20  48  44   7.1
## 2     97       100       45      38    986   250  -0.2   6  16  76   2.1
## 3     87       100       53      59  11780  3271   1.7   5  21  72   2.4
## 4    100        NA       NA      NA     14     4   2.0  NA  NA  NA    NA
## 5     16        18       58      NA   9022  3162   2.8  21  48  42   6.5
## 6     NA        NA       NA      NA     28     8   1.9  NA  NA  NA    NA
##   pctUrb urbanGR Latitude Longitude EdXpctGDP homi100K    MILLF  WARLIKE
## 1     24     6.4    33.00      65.0        NA      3.4       NA       NA
## 2     47     1.4    41.00      20.0        NA      6.6 1.990494 0.004838
## 3     65     3.2    28.00       3.0        NA      9.6 1.301976 0.001362
## 4     91     2.0    42.50       1.5  2.589995      1.4       NA       NA
## 5     55     5.4   -12.50      18.5        NA     36.0 1.775038 0.004215
## 6     38     2.5    17.05     -61.8        NA      7.7       NA       NA
##   LINGSIML PCGDP IMO9503 SOCCER2004 MISSWORLD MISSINTERN IQ CALORIE97
## 1       NA    NA      NA         NA        NA         NA NA        NA
## 2 0.871083  3050    70.1         81         0          0 90      2961
## 3 0.740776  4956    82.0         75         0          0 84      2853
## 4       NA    NA      NA         NA        NA         NA NA        NA
## 5 0.751893  2160     0.0         82         0          0 69      1903
## 6       NA    NA      NA         NA        NA         NA NA        NA
##   CALORIE70 PROTEIN97 PROTEIN70 FAT97 FAT70 TBRATE HIVRATE   PHYSPP
## 1        NA        NA        NA    NA    NA     NA      NA       NA
## 2      2424      98.7      69.8  78.8  52.3   19.1 0.00545 1.349445
## 3      1829      78.6      47.3  69.6  36.0   45.8 0.07238 0.923000
## 4        NA        NA        NA    NA    NA     NA      NA       NA
## 5      2103      40.5      44.9  37.5  34.1  123.8 2.11623 0.077030
## 6        NA        NA        NA    NA    NA     NA      NA       NA
##   PCTMPFEM    MF014    MFTOT FPCTLF CHLF GCGDP
## 1       NA       NA       NA     NA   NA    NA
## 2      5.2 1.071017 1.048135   41.1  0.6  11.6
## 3      4.0 1.047417 1.026386   26.1  0.7  16.5
## 4       NA       NA       NA     NA   NA    NA
## 5     15.5 1.001721 0.977901   46.4 26.5  32.0
## 6       NA       NA       NA     NA   NA    NA

Look at the last six rows:

tail(uu)
##       X                            country ISO2 MRunder5 MRinfant Pop2007
## 187 187                            Vanuatu   VU       34       28     226
## 188 188 Venezuela (Bolivarian Republic of)   VE       19       17   27657
## 189 189                           Viet Nam   VN       15       13   87375
## 190 190                              Yemen   YE       73       55   22389
## 191 191                             Zambia   ZM      170      103   11922
## 192 192                           Zimbabwe   ZW       90       59   13349
##     Births2007 DeathsU5 GNIpercap LEXbirth LRadult NetPrimEd pctYlow40
## 187          7        0      1840       70      78        87        NA
## 188        597       11      7320       74      93        91        12
## 189       1653       25       790       74      90        95        18
## 190        860       63       870       62      59        75        19
## 191        473       80       800       42      68        57        12
## 192        373       34       340       43      91        88        13
##     pctYlow20 pctInfLBW u5uweight u5wasting u5stunting iodize watTot
## 187        NA         6        NA        NA         NA     NA     NA
## 188        52         9         5         4         12     90     NA
## 189        45         7        20         8         36     93     92
## 190        45        32        46        12         53     30     66
## 191        55        12        19         5         39     77     58
## 192        56        11        17         6         29     91     81
##     watUrb watRur sanTot sanUrb sanRur GovFinVac pneumDoc pneumAB popU18
## 187     NA     NA     NA     NA     NA       100       NA      NA    103
## 188     NA     NA     NA     NA     NA        NA       72      NA  10089
## 189     98     90     65     88     56        87       83      55  30263
## 190     68     65     46     88     30        31       NA      38  11729
## 191     90     41     52     55     51        24       68      NA   6270
## 192     98     72     46     63     37         0       25       8   6175
##     popU5 popGR CDR CBR LEX fertR pctUrb urbanGR Latitude Longitude
## 187    31   2.4   5  29  70   3.8     24     4.2      -16       167
## 188  2896   2.0   5  22  74   2.6     94     2.8        8       -66
## 189  8109   1.6   5  19  74   2.2     27     3.6       16       106
## 190  3740   3.5   8  38  62   5.5     28     5.6       15        48
## 191  2030   2.3  19  40  42   5.2     35     1.7      -15        30
## 192  1706   1.4  19  28  43   3.2     37     3.0      -20        30
##     EdXpctGDP homi100K    MILLF  WARLIKE LINGSIML PCGDP IMO9503 SOCCER2004
## 187        NA      1.0       NA       NA       NA    NA      NA         NA
## 188        NA     37.0 0.830040 0.005048 0.978824  5940    70.6         52
## 189        NA      3.8 1.601692 0.015830 0.814990  1752     6.3        128
## 190        NA      3.2 1.350834 0.008555 0.931964   776     0.0        166
## 191  2.773961     22.9 0.440700 0.005797 0.711336   771     0.0         87
## 192        NA     32.9 0.726816 0.005808 0.764259  2744     0.0         78
##     MISSWORLD MISSINTERN IQ CALORIE97 CALORIE70 PROTEIN97 PROTEIN70 FAT97
## 187        NA         NA NA        NA        NA        NA        NA    NA
## 188     269.5      258.5 89      2321      2352      59.0      58.8  65.8
## 189       0.0        7.5 96      2484      2146      56.7      50.0  36.3
## 190       0.0        0.0 83      2051      1768      54.4      49.7  36.5
## 191       0.0        0.0 77      1970      2173      51.5      63.7  29.7
## 192       0.0        0.0 66      2145      2225      52.3      61.2  53.3
##     FAT70 TBRATE  HIVRATE   PHYSPP PCTMPFEM    MF014    MFTOT FPCTLF CHLF
## 187    NA     NA       NA       NA       NA       NA       NA     NA   NA
## 188  53.7   26.3  0.68538 2.151490     28.6 1.042256 1.013751   34.0  0.4
## 189  21.1  111.0  0.21767 0.524345     26.0 1.030326 0.993503   49.1  6.8
## 190  28.5   73.7  0.01243 0.225620      0.7 1.048498 0.986403   28.0 19.3
## 191  40.7  488.4 19.06989 0.069040     10.1 1.021479 1.005128   45.1 15.9
## 192  49.9  374.6 25.83682 0.138990     14.0 1.004648 0.996861   44.5 28.0
##     GCGDP
## 187    NA
## 188   6.8
## 189   7.9
## 190  13.9
## 191  11.1
## 192  18.0

Matrices and dataframes are subsettable: one may extract just part of them. The following command looks at rows 100 to 105 and columns 1 to 8 of dataframe uu.

uu[100:105,1:8]
##       X    country ISO2 MRunder5 MRinfant Pop2007 Births2007 DeathsU5
## 100 100 Luxembourg   LU        3        2     467          5        0
## 101 101 Madagascar   MG      112       70   19683        722       81
## 102 102     Malawi   MW      111       71   13925        573       64
## 103 103   Malaysia   MY       11       10   26572        555        6
## 104 104   Maldives   MV       30       26     306          7        0
## 105 105       Mali   ML      196      117   12337        595      117

The number of rows and columns in a matrix or dataframe can be found as follows:

dim(uu)
## [1] 192  65
# or
NROW(uu);NCOL(uu)
## [1] 192
## [1] 65

Note how the # symbol indicates a comment (R will not execute anything to the right of #), and a semicolon separates executable commands on the same line.
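The upper-case NROW() and NCOL() differ from the lower-case nrow() and ncol() in one useful respect: they also work on plain vectors, treating a vector as a single column. A small sketch:

```r
v <- 1:5                  # a plain vector, with no dim attribute
m <- matrix(1:6, nrow = 2)

print(dim(v))             # NULL: a vector has no dimensions
print(nrow(v))            # NULL for the same reason
print(NROW(v)); print(NCOL(v))   # 5 and 1: treated as a one-column matrix

print(nrow(m)); print(NROW(m))   # both give 2 for a matrix
```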

Matrices have rownames and colnames. Dataframes have rownames and names.

names(uu)
##  [1] "X"          "country"    "ISO2"       "MRunder5"   "MRinfant"  
##  [6] "Pop2007"    "Births2007" "DeathsU5"   "GNIpercap"  "LEXbirth"  
## [11] "LRadult"    "NetPrimEd"  "pctYlow40"  "pctYlow20"  "pctInfLBW" 
## [16] "u5uweight"  "u5wasting"  "u5stunting" "iodize"     "watTot"    
## [21] "watUrb"     "watRur"     "sanTot"     "sanUrb"     "sanRur"    
## [26] "GovFinVac"  "pneumDoc"   "pneumAB"    "popU18"     "popU5"     
## [31] "popGR"      "CDR"        "CBR"        "LEX"        "fertR"     
## [36] "pctUrb"     "urbanGR"    "Latitude"   "Longitude"  "EdXpctGDP" 
## [41] "homi100K"   "MILLF"      "WARLIKE"    "LINGSIML"   "PCGDP"     
## [46] "IMO9503"    "SOCCER2004" "MISSWORLD"  "MISSINTERN" "IQ"        
## [51] "CALORIE97"  "CALORIE70"  "PROTEIN97"  "PROTEIN70"  "FAT97"     
## [56] "FAT70"      "TBRATE"     "HIVRATE"    "PHYSPP"     "PCTMPFEM"  
## [61] "MF014"      "MFTOT"      "FPCTLF"     "CHLF"       "GCGDP"
rownames(uu)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
##  [12] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22" 
##  [23] "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33" 
##  [34] "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44" 
##  [45] "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55" 
##  [56] "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66" 
##  [67] "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77" 
##  [78] "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88" 
##  [89] "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99" 
## [100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
## [111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
## [122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"
## [144] "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
## [155] "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165"
## [166] "166" "167" "168" "169" "170" "171" "172" "173" "174" "175" "176"
## [177] "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"
## [188] "188" "189" "190" "191" "192"
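The distinction between matrices and dataframes can be checked on a small made-up example. In this sketch, m and d are illustrative objects, not part of the course data.

```r
# A 2 x 3 matrix with row and column names.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

print(rownames(m))
print(colnames(m))
print(names(m))      # NULL: names() does not return a matrix's column names

# The same data as a dataframe.
d <- as.data.frame(m)
print(names(d))      # the column names
print(colnames(d))   # colnames() works on dataframes too
print(rownames(d))
```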

A useful way to learn something about an object is to look at its structure:

str(uu)
## 'data.frame':    192 obs. of  65 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ ISO2      : chr  "AF" "AL" "DZ" "AD" ...
##  $ MRunder5  : int  257 15 37 3 158 11 16 24 6 4 ...
##  $ MRinfant  : int  165 13 33 3 116 10 15 22 5 4 ...
##  $ Pop2007   : int  27145 3190 33858 75 17024 85 39531 3002 20743 8361 ...
##  $ Births2007: int  1314 52 704 0 810 0 693 37 256 77 ...
##  $ DeathsU5  : int  338 1 26 0 128 0 11 1 2 0 ...
##  $ GNIpercap : int  250 3290 3620 NA 2560 11520 6050 2640 35960 42700 ...
##  $ LEXbirth  : int  44 76 72 NA 42 NA 75 72 81 80 ...
##  $ LRadult   : int  28 99 75 NA 67 NA 98 100 NA NA ...
##  $ NetPrimEd : int  61 94 95 83 58 NA 99 99 96 97 ...
##  $ pctYlow40 : int  NA 21 19 NA NA NA 11 21 18 22 ...
##  $ pctYlow20 : int  NA 40 43 NA NA NA 55 43 41 38 ...
##  $ pctInfLBW : int  NA 7 6 NA 12 5 7 8 7 7 ...
##  $ u5uweight : int  39 8 4 NA 31 NA 4 4 NA NA ...
##  $ u5wasting : int  7 7 3 NA 6 NA 1 5 NA NA ...
##  $ u5stunting: int  54 22 11 NA 45 NA 4 13 NA NA ...
##  $ iodize    : int  28 60 61 NA 35 NA 90 97 NA NA ...
##  $ watTot    : int  22 97 85 100 51 NA 96 98 100 100 ...
##  $ watUrb    : int  37 97 87 100 62 95 98 99 100 100 ...
##  $ watRur    : int  17 97 81 100 39 NA 80 96 100 100 ...
##  $ sanTot    : int  30 97 94 100 50 NA 91 91 100 100 ...
##  $ sanUrb    : int  45 98 98 100 79 98 92 96 100 100 ...
##  $ sanRur    : int  25 97 87 100 16 NA 83 81 100 100 ...
##  $ GovFinVac : int  0 100 100 NA 18 NA NA 33 88 NA ...
##  $ pneumDoc  : int  28 45 53 NA 58 NA NA 36 NA NA ...
##  $ pneumAB   : int  NA 38 59 NA NA NA NA 11 NA NA ...
##  $ popU18    : int  14526 986 11780 14 9022 28 12279 760 4802 1573 ...
##  $ popU5     : int  5002 250 3271 4 3162 8 3364 167 1272 393 ...
##  $ popGR     : num  4.5 -0.2 1.7 2 2.8 1.9 1.1 -1 1.2 0.5 ...
##  $ CDR       : int  20 6 5 NA 21 NA 8 10 7 9 ...
##  $ CBR       : int  48 16 21 NA 48 NA 18 12 12 9 ...
##  $ LEX       : int  44 76 72 NA 42 NA 75 72 81 80 ...
##  $ fertR     : num  7.1 2.1 2.4 NA 6.5 NA 2.3 1.4 1.8 1.4 ...
##  $ pctUrb    : int  24 47 65 91 55 38 90 64 89 66 ...
##  $ urbanGR   : num  6.4 1.4 3.2 2 5.4 2.5 1.5 -1.4 1.5 0.5 ...
##  $ Latitude  : num  33 41 28 42.5 -12.5 ...
##  $ Longitude : num  65 20 3 1.5 18.5 ...
##  $ EdXpctGDP : num  NA NA NA 2.59 NA ...
##  $ homi100K  : num  3.4 6.6 9.6 1.4 36 7.7 NA 3.3 1.5 0.8 ...
##  $ MILLF     : num  NA 1.99 1.3 NA 1.78 ...
##  $ WARLIKE   : num  NA 0.00484 0.00136 NA 0.00421 ...
##  $ LINGSIML  : num  NA 0.871 0.741 NA 0.752 ...
##  $ PCGDP     : int  NA 3050 4956 NA 2160 NA 12034 2216 23555 24298 ...
##  $ IMO9503   : num  NA 70.1 82 NA 0 NA 30.6 38.7 19.3 46 ...
##  $ SOCCER2004: int  NA 81 75 NA 82 NA 2 99 14 46 ...
##  $ MISSWORLD : num  NA 0 0 NA 0 ...
##  $ MISSINTERN: num  NA 0 0 NA 0 ...
##  $ IQ        : int  NA 90 84 NA 69 NA 96 93 98 102 ...
##  $ CALORIE97 : int  NA 2961 2853 NA 1903 NA 3093 2371 3224 3536 ...
##  $ CALORIE70 : int  NA 2424 1829 NA 2103 NA 3347 NA 3251 3227 ...
##  $ PROTEIN97 : num  NA 98.7 78.6 NA 40.5 ...
##  $ PROTEIN70 : num  NA 69.8 47.3 NA 44.9 ...
##  $ FAT97     : num  NA 78.8 69.6 NA 37.5 ...
##  $ FAT70     : num  NA 52.3 36 NA 34.1 ...
##  $ TBRATE    : num  NA 19.1 45.8 NA 123.8 ...
##  $ HIVRATE   : num  NA 0.00545 0.07238 NA 2.11623 ...
##  $ PHYSPP    : num  NA 1.349 0.923 NA 0.077 ...
##  $ PCTMPFEM  : num  NA 5.2 4 NA 15.5 NA 21.3 3.1 25.1 25.1 ...
##  $ MF014     : num  NA 1.07 1.05 NA 1 ...
##  $ MFTOT     : num  NA 1.048 1.026 NA 0.978 ...
##  $ FPCTLF    : num  NA 41.1 26.1 NA 46.4 NA 32.1 48.3 43.2 40.3 ...
##  $ CHLF      : num  NA 0.6 0.7 NA 26.5 NA 3.3 0 0 0 ...
##  $ GCGDP     : num  NA 11.6 16.5 NA 32 NA 13 11.4 18.3 20 ...

For most object classes, summary() will return useful information.

summary(uu)
##        X            country              ISO2              MRunder5     
##  Min.   :  1.00   Length:192         Length:192         Min.   :  3.00  
##  1st Qu.: 48.75   Class :character   Class :character   1st Qu.: 10.00  
##  Median : 96.50   Mode  :character   Mode  :character   Median : 25.00  
##  Mean   : 96.50                                         Mean   : 50.85  
##  3rd Qu.:144.25                                         3rd Qu.: 72.50  
##  Max.   :192.00                                         Max.   :262.00  
##                                                         NA's   :1       
##     MRinfant        Pop2007          Births2007         DeathsU5      
##  Min.   :  2.0   Min.   :      2   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:  9.0   1st Qu.:   1334   1st Qu.:   27.0   1st Qu.:   0.00  
##  Median : 22.0   Median :   6796   Median :  135.5   Median :   3.00  
##  Mean   : 35.6   Mean   :  34526   Mean   :  705.1   Mean   :  48.15  
##  3rd Qu.: 56.0   3rd Qu.:  21676   3rd Qu.:  586.8   3rd Qu.:  26.00  
##  Max.   :165.0   Max.   :1328630   Max.   :27119.0   Max.   :1953.00  
##  NA's   :1                                           NA's   :1        
##    GNIpercap        LEXbirth        LRadult         NetPrimEd     
##  Min.   :  110   Min.   :40.00   Min.   : 23.00   Min.   : 22.00  
##  1st Qu.:  880   1st Qu.:61.00   1st Qu.: 71.00   1st Qu.: 78.00  
##  Median : 3225   Median :71.00   Median : 89.00   Median : 91.00  
##  Mean   : 9976   Mean   :67.42   Mean   : 80.68   Mean   : 84.97  
##  3rd Qu.:10328   3rd Qu.:76.00   3rd Qu.: 97.00   3rd Qu.: 96.00  
##  Max.   :76450   Max.   :83.00   Max.   :100.00   Max.   :100.00  
##  NA's   :10      NA's   :15      NA's   :55       NA's   :8       
##    pctYlow40       pctYlow20      pctInfLBW       u5uweight    
##  Min.   : 6.00   Min.   :35.0   Min.   : 0.00   Min.   : 1.00  
##  1st Qu.:14.00   1st Qu.:42.0   1st Qu.: 6.00   1st Qu.: 5.00  
##  Median :17.00   Median :46.0   Median : 9.00   Median :12.50  
##  Mean   :16.85   Mean   :47.3   Mean   :10.54   Mean   :16.69  
##  3rd Qu.:20.00   3rd Qu.:53.0   3rd Qu.:13.00   3rd Qu.:26.00  
##  Max.   :25.00   Max.   :67.0   Max.   :32.00   Max.   :49.00  
##  NA's   :67      NA's   :67     NA's   :12      NA's   :60     
##    u5wasting        u5stunting        iodize          watTot      
##  Min.   : 0.000   Min.   : 1.00   Min.   :  0.0   Min.   : 22.00  
##  1st Qu.: 2.000   1st Qu.:12.00   1st Qu.: 44.0   1st Qu.: 71.00  
##  Median : 5.000   Median :22.00   Median : 72.0   Median : 90.00  
##  Mean   : 6.323   Mean   :24.17   Mean   : 63.6   Mean   : 83.49  
##  3rd Qu.: 9.000   3rd Qu.:38.00   3rd Qu.: 90.0   3rd Qu.: 99.00  
##  Max.   :25.000   Max.   :54.00   Max.   :100.0   Max.   :100.00  
##  NA's   :68       NA's   :67      NA's   :71      NA's   :32      
##      watUrb           watRur           sanTot           sanUrb      
##  Min.   : 37.00   Min.   : 10.00   Min.   :  5.00   Min.   : 14.00  
##  1st Qu.: 90.00   1st Qu.: 58.75   1st Qu.: 39.75   1st Qu.: 57.00  
##  Median : 98.00   Median : 82.50   Median : 78.00   Median : 89.00  
##  Mean   : 92.89   Mean   : 76.76   Mean   : 67.28   Mean   : 76.92  
##  3rd Qu.:100.00   3rd Qu.: 98.00   3rd Qu.: 96.00   3rd Qu.: 98.00  
##  Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :100.00  
##  NA's   :15       NA's   :32       NA's   :36       NA's   :27      
##      sanRur         GovFinVac         pneumDoc        pneumAB     
##  Min.   :  3.00   Min.   :  0.00   Min.   :12.00   Min.   : 3.00  
##  1st Qu.: 30.25   1st Qu.: 24.00   1st Qu.:41.25   1st Qu.:24.25  
##  Median : 63.50   Median : 96.00   Median :56.00   Median :38.00  
##  Mean   : 60.56   Mean   : 65.35   Mean   :54.57   Mean   :40.10  
##  3rd Qu.: 95.75   3rd Qu.:100.00   3rd Qu.:69.00   3rd Qu.:55.75  
##  Max.   :100.00   Max.   :100.00   Max.   :93.00   Max.   :87.00  
##  NA's   :34       NA's   :60       NA's   :90      NA's   :142    
##      popU18             popU5              popGR             CDR        
##  Min.   :     1.0   Min.   :     0.0   Min.   :-1.700   Min.   : 1.000  
##  1st Qu.:   459.8   1st Qu.:   126.2   1st Qu.: 0.600   1st Qu.: 6.000  
##  Median :  2160.0   Median :   652.5   Median : 1.500   Median : 8.000  
##  Mean   : 11488.1   Mean   :  3267.0   Mean   : 1.487   Mean   : 9.356  
##  3rd Qu.:  8936.5   3rd Qu.:  2739.5   3rd Qu.: 2.350   3rd Qu.:12.000  
##  Max.   :446646.0   Max.   :126808.0   Max.   : 5.000   Max.   :22.000  
##                                        NA's   :1        NA's   :15      
##       CBR             LEX            fertR           pctUrb     
##  Min.   : 8.00   Min.   :40.00   Min.   :1.200   Min.   : 11.0  
##  1st Qu.:13.00   1st Qu.:61.00   1st Qu.:1.800   1st Qu.: 36.0  
##  Median :21.00   Median :71.00   Median :2.500   Median : 57.0  
##  Mean   :23.18   Mean   :67.42   Mean   :3.027   Mean   : 55.3  
##  3rd Qu.:30.00   3rd Qu.:76.00   3rd Qu.:3.900   3rd Qu.: 73.0  
##  Max.   :50.00   Max.   :83.00   Max.   :7.200   Max.   :100.0  
##  NA's   :15      NA's   :15      NA's   :15      NA's   :1      
##     urbanGR          Latitude         Longitude          EdXpctGDP     
##  Min.   :-1.700   Min.   :-41.000   Min.   :-175.000   Min.   : 1.557  
##  1st Qu.: 0.950   1st Qu.:  3.062   1st Qu.:  -8.375   1st Qu.: 3.404  
##  Median : 2.450   Median : 16.000   Median :  21.000   Median : 4.670  
##  Mean   : 2.433   Mean   : 18.611   Mean   :  19.532   Mean   : 4.766  
##  3rd Qu.: 3.875   3rd Qu.: 39.625   3rd Qu.:  49.388   3rd Qu.: 5.603  
##  Max.   :10.300   Max.   : 65.000   Max.   : 178.000   Max.   :11.831  
##  NA's   :2                                             NA's   :79      
##     homi100K         MILLF           WARLIKE           LINGSIML     
##  Min.   : 0.50   Min.   :0.0000   Min.   :0.00000   Min.   :0.1785  
##  1st Qu.: 2.20   1st Qu.:0.3946   1st Qu.:0.00076   1st Qu.:0.5712  
##  Median : 6.80   Median :0.8883   Median :0.00276   Median :0.8156  
##  Mean   :10.96   Mean   :1.2459   Mean   :0.00596   Mean   :0.7579  
##  3rd Qu.:16.88   3rd Qu.:1.4600   3rd Qu.:0.00657   3rd Qu.:0.9612  
##  Max.   :69.00   Max.   :7.8606   Max.   :0.05159   Max.   :1.0000  
##  NA's   :12      NA's   :50       NA's   :48        NA's   :48      
##      PCGDP          IMO9503        SOCCER2004       MISSWORLD     
##  Min.   :  491   Min.   : 0.00   Min.   :  1.00   Min.   :  0.00  
##  1st Qu.: 1658   1st Qu.: 0.00   1st Qu.: 39.75   1st Qu.:  0.00  
##  Median : 4256   Median : 6.75   Median : 82.50   Median :  0.00  
##  Mean   : 7724   Mean   :23.39   Mean   : 87.05   Mean   : 35.37  
##  3rd Qu.: 9762   3rd Qu.:47.08   3rd Qu.:126.25   3rd Qu.: 43.62  
##  Max.   :39510   Max.   :82.00   Max.   :213.00   Max.   :402.00  
##  NA's   :50      NA's   :48      NA's   :48       NA's   :48      
##    MISSINTERN           IQ           CALORIE97      CALORIE70   
##  Min.   :  0.00   Min.   : 63.00   Min.   :1685   Min.   :1628  
##  1st Qu.:  0.00   1st Qu.: 73.00   1st Qu.:2272   1st Qu.:2109  
##  Median :  0.00   Median : 86.00   Median :2622   Median :2337  
##  Mean   : 32.68   Mean   : 84.97   Mean   :2687   Mean   :2456  
##  3rd Qu.: 33.00   3rd Qu.: 95.00   3rd Qu.:3100   3rd Qu.:2860  
##  Max.   :323.00   Max.   :106.00   Max.   :3699   Max.   :3480  
##  NA's   :48       NA's   :48       NA's   :53     NA's   :72    
##    PROTEIN97        PROTEIN70          FAT97            FAT70       
##  Min.   : 28.10   Min.   :  0.00   Min.   : 11.00   Min.   :  0.00  
##  1st Qu.: 57.15   1st Qu.: 51.48   1st Qu.: 47.15   1st Qu.: 34.77  
##  Median : 69.70   Median : 63.55   Median : 69.60   Median : 51.85  
##  Mean   : 73.44   Mean   : 65.27   Mean   : 75.70   Mean   : 59.88  
##  3rd Qu.: 90.75   3rd Qu.: 81.72   3rd Qu.: 96.40   3rd Qu.: 77.72  
##  Max.   :114.90   Max.   :124.40   Max.   :164.00   Max.   :149.00  
##  NA's   :53       NA's   :48       NA's   :53       NA's   :48      
##      TBRATE          HIVRATE             PHYSPP           PCTMPFEM    
##  Min.   :  2.30   Min.   : 0.00500   Min.   :0.03048   Min.   : 0.00  
##  1st Qu.: 19.07   1st Qu.: 0.06559   1st Qu.:0.25445   1st Qu.: 7.15  
##  Median : 39.55   Median : 0.28413   Median :1.27968   Median :10.20  
##  Mean   : 71.22   Mean   : 2.47451   Mean   :1.57534   Mean   :12.66  
##  3rd Qu.: 85.60   3rd Qu.: 2.12329   3rd Qu.:2.75499   3rd Qu.:17.00  
##  Max.   :587.90   Max.   :25.83682   Max.   :5.68000   Max.   :42.70  
##  NA's   :48       NA's   :57         NA's   :60        NA's   :61     
##      MF014            MFTOT            FPCTLF           CHLF      
##  Min.   :0.9905   Min.   :0.8591   Min.   :14.00   Min.   : 0.00  
##  1st Qu.:1.0215   1st Qu.:0.9604   1st Qu.:34.67   1st Qu.: 0.00  
##  Median :1.0381   Median :0.9810   Median :41.10   Median : 2.10  
##  Mean   :1.0369   Mean   :0.9959   Mean   :39.65   Mean   :10.44  
##  3rd Qu.:1.0501   3rd Qu.:1.0108   3rd Qu.:46.20   3rd Qu.:16.30  
##  Max.   :1.1138   Max.   :1.9727   Max.   :52.00   Max.   :52.50  
##  NA's   :48       NA's   :48       NA's   :52      NA's   :51     
##      GCGDP      
##  Min.   : 4.60  
##  1st Qu.:11.10  
##  Median :14.10  
##  Mean   :15.19  
##  3rd Qu.:19.05  
##  Max.   :32.00  
##  NA's   :52
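The NA's row of summary() reports missing values one variable at a time. When only the missing-value counts are of interest, they can be computed directly. The following sketch uses the built-in airquality data, which contains some missing values, in place of the course data.

```r
data(airquality)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# complete.cases() flags rows that have no missing value in any column.
n_complete <- sum(complete.cases(airquality))
print(n_complete)
```

This matters for regression because, by default, lm() uses only rows that are complete for the variables in the model.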

Variables inside a dataframe can be accessed in several ways. For example, a variable named “IQ” in a dataframe called uu can be accessed as:

uu$IQ
##   [1]  NA  90  84  NA  69  NA  96  93  98 102  87  NA  83  81  78  NA 100
##  [18]  83  69  NA  85  NA  72  87  NA  93  67  70  89  70  97  78  68  72
##  [35]  93 100  89  NA  73  NA  91  71  90  NA  92  NA  65  98  68  NA  NA
##  [52]  80  83  84  NA  NA  97  63  84  97  98  66  65  93 102  71  92  NA
##  [69]  79  66  66  84  72  84  99  98  81  89  84  NA  93  94 102  72 105
##  [86]  87  93  72  NA  83  87  89  97  86  72  NA  NA  NA  97 101  79  71
## [103]  92  NA  69  95  NA  74  81  87  NA  95  NA  98  85  72  NA  NA  78
## [120] 102 100  84  67  67  NA  98  NA  83  81  NA  85  NA  85  90  86  99
## [137]  95  NA 106  94  96  70  NA  NA  NA  87  NA  NA  83  65  NA  NA  64
## [154] 103  96  95  NA  NA  72  97  81  72  NA  72 101 101  87  87  91  93
## [171]  NA  69  NA  80  84  90  87  NA  73  96  83  NA  72  98  96  87  NA
## [188]  89  96  83  77  66
# or
uu[,"IQ"]
##   [1]  NA  90  84  NA  69  NA  96  93  98 102  87  NA  83  81  78  NA 100
##  [18]  83  69  NA  85  NA  72  87  NA  93  67  70  89  70  97  78  68  72
##  [35]  93 100  89  NA  73  NA  91  71  90  NA  92  NA  65  98  68  NA  NA
##  [52]  80  83  84  NA  NA  97  63  84  97  98  66  65  93 102  71  92  NA
##  [69]  79  66  66  84  72  84  99  98  81  89  84  NA  93  94 102  72 105
##  [86]  87  93  72  NA  83  87  89  97  86  72  NA  NA  NA  97 101  79  71
## [103]  92  NA  69  95  NA  74  81  87  NA  95  NA  98  85  72  NA  NA  78
## [120] 102 100  84  67  67  NA  98  NA  83  81  NA  85  NA  85  90  86  99
## [137]  95  NA 106  94  96  70  NA  NA  NA  87  NA  NA  83  65  NA  NA  64
## [154] 103  96  95  NA  NA  72  97  81  72  NA  72 101 101  87  87  91  93
## [171]  NA  69  NA  80  84  90  87  NA  73  96  83  NA  72  98  96  87  NA
## [188]  89  96  83  77  66
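A third form, double square brackets, gives the same result. The sketch below uses a six-row made-up dataframe d (with missing values placed to resemble a variable like IQ) and also shows why missing values need attention when computing statistics.

```r
# A toy dataframe with some missing values.
d <- data.frame(country = c("A", "B", "C", "D", "E", "F"),
                IQ = c(NA, 90, 84, NA, 69, NA))

# Three equivalent ways to extract one variable as a vector.
v1 <- d$IQ
v2 <- d[, "IQ"]
v3 <- d[["IQ"]]
print(identical(v1, v2) && identical(v1, v3))

# Any NA makes the mean NA unless missing values are removed first.
print(mean(d$IQ))
print(mean(d$IQ, na.rm = TRUE))
```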

Plotting is a nice way to gain an understanding of a specific variable. For example, a histogram can be drawn like this:

hist(uu$IQ) 
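A histogram can also be inspected numerically or written to a file. This is a sketch using the built-in LifeCycleSavings data; the output file is a temporary pdf, and the title and axis label are made up.

```r
data(LifeCycleSavings)

# plot = FALSE returns the bins and counts without drawing anything.
h <- hist(LifeCycleSavings$sr, plot = FALSE)
print(h$breaks)
print(h$counts)

# Open a pdf device, draw the histogram into it, then close the device.
out <- tempfile(fileext = ".pdf")
pdf(out)
hist(LifeCycleSavings$sr,
     main = "Savings ratio across countries",
     xlab = "sr")
invisible(dev.off())
print(file.exists(out))
```

The file is only complete after dev.off() is called, so that line should not be omitted.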

Writing output to disk

A good way to save objects in R is as an R workspace.

save(finXam,file="z.Rdata")

But objects can also be saved in other formats, such as csv or dbf.

write.csv(finXam,file="z1.csv")
write.dbf(finXam,file="z1.dbf")
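One feature of the workspace format is that load() restores objects under the names they had when saved. A sketch of the round trip, using a made-up object x_demo and a temporary file:

```r
x_demo <- data.frame(a = 1:3, b = c("u", "v", "w"))
original <- x_demo                 # keep a copy for comparison

f <- tempfile(fileext = ".Rdata")
save(x_demo, file = f)

rm(x_demo)                         # remove the object from the environment
print(exists("x_demo"))            # FALSE at this point

# load() recreates the object and returns the names of what it restored.
loaded <- load(f)
print(loaded)
print(identical(x_demo, original))
```

Because load() silently overwrites any existing object of the same name, it is worth checking the returned names.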

Linear regression

This semester we will focus on linear regression, using ordinary least squares. The function that we will use nearly every week is lm().

The dataframe gg contains home sales for Williamson County, 1996 to 1999. If we want to estimate a model where price (our dependent variable) is a function of age and sqft (our independent variables), we would set up the function like this:

pm<-lm(price~age+sqft,data=gg)
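The same pattern can be tried on data that ships with R. The sketch below is an illustration on the built-in LifeCycleSavings data, not the Williamson County model. It regresses the savings ratio (sr) on two of the other variables, and the object name fit is arbitrary.

```r
data(LifeCycleSavings)

# Dependent variable on the left of ~, independent variables on the right.
fit <- lm(sr ~ pop15 + dpi, data = LifeCycleSavings)

# Printing the object shows the call and the estimated coefficients.
print(fit)

# summary() adds standard errors, t statistics, and R-squared.
print(summary(fit))
```

An intercept is included automatically, so the model has three coefficients.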

We have created an object pm that contains relevant information about our estimated model. To see the structure of pm, one can use the str() function. Or if one simply wants to look at the names of the various parts of pm, one can use the names() function.

names(pm)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
str(pm)
## List of 12
##  $ coefficients : Named num [1:3] 21722.7 -1224.1 92.4
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "age" "sqft"
##  $ residuals    : Named num [1:15773] -22529 5095 7679 5905 -47471 ...
##   ..- attr(*, "names")= chr [1:15773] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:15773] -29072565 4387506 -10656500 6017 -47372 ...
##   ..- attr(*, "names")= chr [1:15773] "(Intercept)" "age" "sqft" "" ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:15773] 102529 304905 282321 358995 377471 ...
##   ..- attr(*, "names")= chr [1:15773] "1" "2" "3" "4" ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:15773, 1:3] -1.26e+02 7.96e-03 7.96e-03 7.96e-03 7.96e-03 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:15773] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:3] "(Intercept)" "age" "sqft"
##   .. ..- attr(*, "assign")= int [1:3] 0 1 2
##   ..$ qraux: num [1:3] 1.01 1.01 1
##   ..$ pivot: int [1:3] 1 2 3
##   ..$ tol  : num 1e-07
##   ..$ rank : int 3
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 15770
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = price ~ age + sqft, data = gg)
##  $ terms        :Classes 'terms', 'formula'  language price ~ age + sqft
##   .. ..- attr(*, "variables")= language list(price, age, sqft)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "price" "age" "sqft"
##   .. .. .. ..$ : chr [1:2] "age" "sqft"
##   .. ..- attr(*, "term.labels")= chr [1:2] "age" "sqft"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(price, age, sqft)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "price" "age" "sqft"
##  $ model        :'data.frame':   15773 obs. of  3 variables:
##   ..$ price: int [1:15773] 80000 310000 290000 364900 330000 145000 145000 346785 312000 55000 ...
##   ..$ age  : int [1:15773] 17 26 4 4 4 1 1 1 2 1 ...
##   ..$ sqft : int [1:15773] 1100 3410 2874 3704 3904 4097 3414 4465 3517 4047 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language price ~ age + sqft
##   .. .. ..- attr(*, "variables")= language list(price, age, sqft)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "price" "age" "sqft"
##   .. .. .. .. ..$ : chr [1:2] "age" "sqft"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "age" "sqft"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(price, age, sqft)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "price" "age" "sqft"
##  - attr(*, "class")= chr "lm"
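Because an lm object is a list, its parts can be pulled out with the $ operator or with extractor functions. The following is a minimal sketch on a toy data frame built from the first eight rows of price, age and sqft shown in the str() output; the names toy and fit are assumptions, and the estimates will not match those of the full model.

```r
# First eight observations of price, age and sqft, copied from the str() output
toy <- data.frame(
  price = c(80000, 310000, 290000, 364900, 330000, 145000, 145000, 346785),
  age   = c(17, 26, 4, 4, 4, 1, 1, 1),
  sqft  = c(1100, 3410, 2874, 3704, 3904, 4097, 3414, 4465)
)

fit <- lm(price ~ age + sqft, data = toy)

# Two ways to get the coefficients: list element or extractor function
fit$coefficients
coef(fit)

# Residuals are the observed values minus the fitted values
res_by_hand <- toy$price - fitted(fit)
all.equal(unname(res_by_hand), unname(residuals(fit)))
```

Extractor functions such as coef(), fitted() and residuals() are usually preferable to $, since they keep working even if the internal layout of the object changes.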

The information in pm can be extracted with various functions. For example, if we just want to see the usual coefficients with p-values, along with the R^2, we can do this:

summary(pm)
## 
## Call:
## lm(formula = price ~ age + sqft, data = gg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470048  -45397  -23055    3705 7822646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21722.713   8104.002   2.680  0.00736 ** 
## age         -1224.115    278.994  -4.388 1.15e-05 ***
## sqft           92.378      2.844  32.482  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 328100 on 15770 degrees of freedom
## Multiple R-squared:  0.07257,    Adjusted R-squared:  0.07245 
## F-statistic: 616.9 on 2 and 15770 DF,  p-value: < 2.2e-16
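The printed summary is convenient to read, but the same numbers can also be stored and reused. The sketch below again assumes a toy data frame built from the first eight rows shown in the str() output (toy, fit and s are hypothetical names), so its values differ from the table above.

```r
# First eight observations of price, age and sqft, copied from the str() output
toy <- data.frame(
  price = c(80000, 310000, 290000, 364900, 330000, 145000, 145000, 346785),
  age   = c(17, 26, 4, 4, 4, 1, 1, 1),
  sqft  = c(1100, 3410, 2874, 3704, 3904, 4097, 3414, 4465)
)

fit <- lm(price ~ age + sqft, data = toy)
s   <- summary(fit)

# The coefficient table is a matrix: one row per term, four columns
ctab <- coef(s)
colnames(ctab)

# Pick out single pieces by row and column name
ctab["sqft", "Estimate"]   # slope on sqft
ctab["sqft", "Pr(>|t|)"]   # its p-value

# R-squared and adjusted R-squared are stored as list elements
s$r.squared
s$adj.r.squared
```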

Another example: if we want to get the value of price predicted by our model (the fitted or predicted value) we can create it this way:

phat<-predict(pm)
# or we could directly create it as a variable in the dataframe gg:
gg$priceHat<-predict(pm)
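Two further uses of predict() are worth a sketch: predicting for observations that are not in the data, and handling rows with missing values. The example assumes a toy data frame built from the first eight rows shown in the str() output; the house with age 10 and sqft 2000 and the inserted NA are made up for illustration.

```r
# First eight observations of price, age and sqft, copied from the str() output
toy <- data.frame(
  price = c(80000, 310000, 290000, 364900, 330000, 145000, 145000, 346785),
  age   = c(17, 26, 4, 4, 4, 1, 1, 1),
  sqft  = c(1100, 3410, 2874, 3704, 3904, 4097, 3414, 4465)
)
fit <- lm(price ~ age + sqft, data = toy)

# Predict the price of a hypothetical house that is not in the data
new_house <- data.frame(age = 10, sqft = 2000)
p_new <- predict(fit, newdata = new_house)

# The same prediction by hand, from the estimated coefficients
b <- coef(fit)
p_hand <- b["(Intercept)"] + b["age"] * 10 + b["sqft"] * 2000

# With missing data, lm() drops incomplete rows by default, so predict()
# would return fewer values than the dataframe has rows. Setting
# na.action = na.exclude pads the predictions with NA instead.
toy_na <- toy
toy_na$sqft[2] <- NA
fit_na <- lm(price ~ age + sqft, data = toy_na, na.action = na.exclude)
toy_na$priceHat <- predict(fit_na)   # one value per row, NA where data are missing
```

In the gg example the model uses all 15773 observations, so assigning predict(pm) to a new column works directly; the na.exclude option matters only when some rows are incomplete.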