Week 1: Entering data and running a regression in R

R is an open-source project, and most of its add-on software has been contributed by volunteers. Thousands of “packages” have been created to perform specific tasks in a variety of fields, ranging from econometrics to textual analysis. A nice overview of the highest-quality packages by field can be found in the CRAN Task Views.

Working directory and libraries

The first line in an R script typically sets the working directory: that is, the folder where R will look for input data and write output. If you have a flash drive, you should write to a folder there. Otherwise, write to your My Documents folder. Note that paths must be written with forward slashes (/), Linux-style, rather than with the single backslashes (\) that Windows uses; a doubled backslash (\\) also works.

setwd("C:/Users/teff/Documents")
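The point about slashes can be checked directly. The following is a minimal sketch: the folder is the one used above, and tempdir() merely stands in for a real project folder so that the lines run on any machine.

```r
# Both spellings denote the same Windows folder. A single backslash would
# be read by R as the start of an escape sequence, so it must be doubled.
p_forward <- "C:/Users/teff/Documents"
p_escaped <- "C:\\Users\\teff\\Documents"
identical(gsub("\\", "/", p_escaped, fixed = TRUE), p_forward)

# file.path() joins path components with forward slashes.
file.path("C:", "Users", "teff", "Documents")

# Check the current working directory, change it, and change it back.
old <- getwd()
setwd(tempdir())
getwd()
setwd(old)
```

Calling getwd() before and after setwd() is a quick way to confirm that R is really looking in the folder you intended.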

To use a specific package, one must first install it. A package that we will use quite often this semester is AER.

install.packages("AER", repos = "http://mirrors.nics.utk.edu/cran/")

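Because the repos argument names a mirror, the package is downloaded from that site; if you omit repos, you may be prompted to pick a site from which the package will be downloaded. Once the package is downloaded and installed, you load it like this: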

library(AER)
## Loading required package: car
## Loading required package: MASS
## Loading required package: nnet
## Loading required package: Formula
## Loading required package: lmtest
## Loading required package: zoo
## Attaching package: 'zoo'
## The following object is masked from 'package:base':
## 
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: strucchange
## Loading required package: survival
## Loading required package: splines
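Installing is needed only once per machine, while library must be called in every session. One way to avoid reinstalling each time a script runs is sketched below; the helper name load_or_install is made up for this illustration, and the package stats is used only because it ships with every R installation.

```r
# Hypothetical helper: install a package only when it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Reached only for packages that are not yet installed.
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

load_or_install("stats")   # already installed, so nothing is downloaded
```

In a script for this course the call would be load_or_install("AER"), assuming the machine can reach a CRAN mirror.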

Inputting data

The load command allows one to bring in an R format data set (a “workspace”):

load(file = "S:/TEFF/6060/finXam.Rdata", envir = .GlobalEnv)
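Since the course file is not available everywhere, here is a small self-contained sketch of what load does. The dataframe scores and the temporary file are invented for the illustration.

```r
# Create a small object, save it to an R workspace file, and remove it.
scores <- data.frame(id = 1:3, score = c(90, 85, 77))
f <- tempfile(fileext = ".Rdata")
save(scores, file = f)
rm(scores)

# load() restores the object under its original name and returns,
# invisibly, the names of everything it restored.
loaded <- load(f)
loaded
exists("scores")
```

Note that load does not let you choose the object's name: whatever name was used when the file was saved is the name that appears in your environment.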

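The package foreign provides the capability to read or write a number of popular formats, and it must be loaded with library before its functions can be used. The “csv” format, which can be opened directly by spreadsheet software, is read with read.csv; that function is part of base R (package utils), so it works even without foreign: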

library(foreign)
uu <- read.csv("S:/TEFF/6060/examZ.csv", as.is = TRUE)
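The argument as.is = TRUE asks read.csv to leave text columns as character vectors rather than converting them to factors. A minimal sketch, using a two-row file written on the spot (the file contents are taken from the first rows of uu, the file itself is hypothetical):

```r
# Write a tiny csv file to a temporary location.
f <- tempfile(fileext = ".csv")
writeLines(c("country,IQ", "Albania,90", "Algeria,84"), f)

# Read it back; text stays character, whole numbers become integer.
d <- read.csv(f, as.is = TRUE)
class(d$country)
class(d$IQ)
```

In R 4.0.0 and later, character columns are no longer converted to factors by default, so as.is = TRUE mainly matters for older versions, but it does no harm to keep it.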

This is the “dbf” format, read with read.dbf from the foreign package:

gg <- read.dbf("S:/TEFF/6060/williamson.dbf", as.is = TRUE)

The command read.table can be modified to read tab-delimited data; readLines is often useful for pulling in large amounts of text for parsing into variables; the package XML (note the capital letters) can scrape html tables from the web with its function readHTMLTable.
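As a rough sketch of the first two commands, the lines below write a small tab-delimited file to a temporary location and read it back both ways. The file and its contents are invented for the illustration.

```r
# A tiny tab-delimited file ("\t" is the tab character).
f <- tempfile(fileext = ".txt")
writeLines(c("country\tIQ", "Albania\t90", "Algeria\t84"), f)

# read.table needs to be told about the header row and the separator.
tab <- read.table(f, header = TRUE, sep = "\t", as.is = TRUE)
tab

# readLines returns one character string per line, with no parsing at all.
raw <- readLines(f)
raw
strsplit(raw[2], "\t")[[1]]   # split the second line at the tab
```

The function read.delim is a shortcut for read.table with these tab-delimited settings.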

Some data is included in R packages. Use the data command to see which datasets ship with base R; they are stored in the package datasets.

data(package = "datasets")

To load a dataset, for example the dataset LifeCycleSavings, use the data function this way:

data(LifeCycleSavings)
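Once loaded, a packaged dataset behaves like any other object. A short sketch of a first look at it:

```r
data(LifeCycleSavings)

class(LifeCycleSavings)   # a dataframe
dim(LifeCycleSavings)     # 50 rows (countries) and 5 columns
names(LifeCycleSavings)

# Packaged datasets come with a help page describing each variable.
# ?LifeCycleSavings
```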

Looking at your data

You can see which objects exist in your environment by typing the command

ls()
## [1] "finXam"           "gg"               "LifeCycleSavings"
## [4] "uu"

There are many classes of objects; data will usually be in a dataframe or a matrix. Learn the class of the object uu with the command

class(uu)
## [1] "data.frame"

Objects have attributes. The attributes of a dataframe include its column and row names, and its dimensions (number of rows and columns). We call each row an observation or a record, and each column a variable or a field. One should always understand the data one is using, and a good way to do that is to look at its attributes and at a few observations. Some useful commands for looking at a dataframe uu are as follows:

Look at first six rows:

head(uu)
##   X             country ISO2 MRunder5 MRinfant Pop2007 Births2007 DeathsU5
## 1 1         Afghanistan   AF      257      165   27145       1314      338
## 2 2             Albania   AL       15       13    3190         52        1
## 3 3             Algeria   DZ       37       33   33858        704       26
## 4 4             Andorra   AD        3        3      75          0        0
## 5 5              Angola   AO      158      116   17024        810      128
## 6 6 Antigua and Barbuda   AG       11       10      85          0        0
##   GNIpercap LEXbirth LRadult NetPrimEd pctYlow40 pctYlow20 pctInfLBW
## 1       250       44      28        61        NA        NA        NA
## 2      3290       76      99        94        21        40         7
## 3      3620       72      75        95        19        43         6
## 4        NA       NA      NA        83        NA        NA        NA
## 5      2560       42      67        58        NA        NA        12
## 6     11520       NA      NA        NA        NA        NA         5
##   u5uweight u5wasting u5stunting iodize watTot watUrb watRur sanTot sanUrb
## 1        39         7         54     28     22     37     17     30     45
## 2         8         7         22     60     97     97     97     97     98
## 3         4         3         11     61     85     87     81     94     98
## 4        NA        NA         NA     NA    100    100    100    100    100
## 5        31         6         45     35     51     62     39     50     79
## 6        NA        NA         NA     NA     NA     95     NA     NA     98
##   sanRur GovFinVac pneumDoc pneumAB popU18 popU5 popGR CDR CBR LEX fertR
## 1     25         0       28      NA  14526  5002   4.5  20  48  44   7.1
## 2     97       100       45      38    986   250  -0.2   6  16  76   2.1
## 3     87       100       53      59  11780  3271   1.7   5  21  72   2.4
## 4    100        NA       NA      NA     14     4   2.0  NA  NA  NA    NA
## 5     16        18       58      NA   9022  3162   2.8  21  48  42   6.5
## 6     NA        NA       NA      NA     28     8   1.9  NA  NA  NA    NA
##   pctUrb urbanGR Latitude Longitude EdXpctGDP homi100K MILLF  WARLIKE
## 1     24     6.4    33.00      65.0        NA      3.4    NA       NA
## 2     47     1.4    41.00      20.0        NA      6.6 1.990 0.004838
## 3     65     3.2    28.00       3.0        NA      9.6 1.302 0.001362
## 4     91     2.0    42.50       1.5      2.59      1.4    NA       NA
## 5     55     5.4   -12.50      18.5        NA     36.0 1.775 0.004215
## 6     38     2.5    17.05     -61.8        NA      7.7    NA       NA
##   LINGSIML PCGDP IMO9503 SOCCER2004 MISSWORLD MISSINTERN IQ CALORIE97
## 1       NA    NA      NA         NA        NA         NA NA        NA
## 2   0.8711  3050    70.1         81         0          0 90      2961
## 3   0.7408  4956    82.0         75         0          0 84      2853
## 4       NA    NA      NA         NA        NA         NA NA        NA
## 5   0.7519  2160     0.0         82         0          0 69      1903
## 6       NA    NA      NA         NA        NA         NA NA        NA
##   CALORIE70 PROTEIN97 PROTEIN70 FAT97 FAT70 TBRATE HIVRATE  PHYSPP
## 1        NA        NA        NA    NA    NA     NA      NA      NA
## 2      2424      98.7      69.8  78.8  52.3   19.1 0.00545 1.34945
## 3      1829      78.6      47.3  69.6  36.0   45.8 0.07238 0.92300
## 4        NA        NA        NA    NA    NA     NA      NA      NA
## 5      2103      40.5      44.9  37.5  34.1  123.8 2.11623 0.07703
## 6        NA        NA        NA    NA    NA     NA      NA      NA
##   PCTMPFEM MF014  MFTOT FPCTLF CHLF GCGDP
## 1       NA    NA     NA     NA   NA    NA
## 2      5.2 1.071 1.0481   41.1  0.6  11.6
## 3      4.0 1.047 1.0264   26.1  0.7  16.5
## 4       NA    NA     NA     NA   NA    NA
## 5     15.5 1.002 0.9779   46.4 26.5  32.0
## 6       NA    NA     NA     NA   NA    NA

Look at last six rows:

tail(uu)
##       X                            country ISO2 MRunder5 MRinfant Pop2007
## 187 187                            Vanuatu   VU       34       28     226
## 188 188 Venezuela (Bolivarian Republic of)   VE       19       17   27657
## 189 189                           Viet Nam   VN       15       13   87375
## 190 190                              Yemen   YE       73       55   22389
## 191 191                             Zambia   ZM      170      103   11922
## 192 192                           Zimbabwe   ZW       90       59   13349
##     Births2007 DeathsU5 GNIpercap LEXbirth LRadult NetPrimEd pctYlow40
## 187          7        0      1840       70      78        87        NA
## 188        597       11      7320       74      93        91        12
## 189       1653       25       790       74      90        95        18
## 190        860       63       870       62      59        75        19
## 191        473       80       800       42      68        57        12
## 192        373       34       340       43      91        88        13
##     pctYlow20 pctInfLBW u5uweight u5wasting u5stunting iodize watTot
## 187        NA         6        NA        NA         NA     NA     NA
## 188        52         9         5         4         12     90     NA
## 189        45         7        20         8         36     93     92
## 190        45        32        46        12         53     30     66
## 191        55        12        19         5         39     77     58
## 192        56        11        17         6         29     91     81
##     watUrb watRur sanTot sanUrb sanRur GovFinVac pneumDoc pneumAB popU18
## 187     NA     NA     NA     NA     NA       100       NA      NA    103
## 188     NA     NA     NA     NA     NA        NA       72      NA  10089
## 189     98     90     65     88     56        87       83      55  30263
## 190     68     65     46     88     30        31       NA      38  11729
## 191     90     41     52     55     51        24       68      NA   6270
## 192     98     72     46     63     37         0       25       8   6175
##     popU5 popGR CDR CBR LEX fertR pctUrb urbanGR Latitude Longitude
## 187    31   2.4   5  29  70   3.8     24     4.2      -16       167
## 188  2896   2.0   5  22  74   2.6     94     2.8        8       -66
## 189  8109   1.6   5  19  74   2.2     27     3.6       16       106
## 190  3740   3.5   8  38  62   5.5     28     5.6       15        48
## 191  2030   2.3  19  40  42   5.2     35     1.7      -15        30
## 192  1706   1.4  19  28  43   3.2     37     3.0      -20        30
##     EdXpctGDP homi100K  MILLF  WARLIKE LINGSIML PCGDP IMO9503 SOCCER2004
## 187        NA      1.0     NA       NA       NA    NA      NA         NA
## 188        NA     37.0 0.8300 0.005048   0.9788  5940    70.6         52
## 189        NA      3.8 1.6017 0.015830   0.8150  1752     6.3        128
## 190        NA      3.2 1.3508 0.008555   0.9320   776     0.0        166
## 191     2.774     22.9 0.4407 0.005797   0.7113   771     0.0         87
## 192        NA     32.9 0.7268 0.005808   0.7643  2744     0.0         78
##     MISSWORLD MISSINTERN IQ CALORIE97 CALORIE70 PROTEIN97 PROTEIN70 FAT97
## 187        NA         NA NA        NA        NA        NA        NA    NA
## 188     269.5      258.5 89      2321      2352      59.0      58.8  65.8
## 189       0.0        7.5 96      2484      2146      56.7      50.0  36.3
## 190       0.0        0.0 83      2051      1768      54.4      49.7  36.5
## 191       0.0        0.0 77      1970      2173      51.5      63.7  29.7
## 192       0.0        0.0 66      2145      2225      52.3      61.2  53.3
##     FAT70 TBRATE  HIVRATE  PHYSPP PCTMPFEM MF014  MFTOT FPCTLF CHLF GCGDP
## 187    NA     NA       NA      NA       NA    NA     NA     NA   NA    NA
## 188  53.7   26.3  0.68538 2.15149     28.6 1.042 1.0138   34.0  0.4   6.8
## 189  21.1  111.0  0.21767 0.52434     26.0 1.030 0.9935   49.1  6.8   7.9
## 190  28.5   73.7  0.01243 0.22562      0.7 1.048 0.9864   28.0 19.3  13.9
## 191  40.7  488.4 19.06989 0.06904     10.1 1.021 1.0051   45.1 15.9  11.1
## 192  49.9  374.6 25.83682 0.13899     14.0 1.005 0.9969   44.5 28.0  18.0

Matrices and dataframes are subsettable: one may extract just part of them. The following command looks at rows 100 to 105 and columns 1 to 8 of dataframe uu.

uu[100:105, 1:8]
##       X    country ISO2 MRunder5 MRinfant Pop2007 Births2007 DeathsU5
## 100 100 Luxembourg   LU        3        2     467          5        0
## 101 101 Madagascar   MG      112       70   19683        722       81
## 102 102     Malawi   MW      111       71   13925        573       64
## 103 103   Malaysia   MY       11       10   26572        555        6
## 104 104   Maldives   MV       30       26     306          7        0
## 105 105       Mali   ML      196      117   12337        595      117
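Rows and columns can also be selected by name, by a logical condition, or by exclusion. The sketch below uses a four-row dataframe d built from values that appear in uu, so that it runs without the course file.

```r
d <- data.frame(country = c("Albania", "Algeria", "Angola", "Zambia"),
                IQ      = c(90, 84, 69, 77),
                PCGDP   = c(3050, 4956, 2160, 771))

d[2:3, 1:2]               # rows 2 to 3, columns 1 to 2
d[, c("country", "IQ")]   # all rows, columns chosen by name
d[d$IQ > 80, ]            # only the rows where IQ exceeds 80
d[-1, ]                   # every row except the first
```

Leaving the row or column position empty, as in d[, c("country", "IQ")], means "all of them". When the variable in a logical condition contains NA, wrapping the condition in which(), as in d[which(d$IQ > 80), ], keeps rows of NA from appearing in the result.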

The number of rows and columns in a matrix or dataframe can be found as follows:

dim(uu)
## [1] 192  65
# or
NROW(uu)
## [1] 192
NCOL(uu)
## [1] 65

Note how the # symbol indicates a comment (R will not execute anything to the right of #). A semicolon separates executable commands written on the same line, so the last two commands could also be written as NROW(uu); NCOL(uu).
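A small sketch of these points, using a matrix m and a vector v made up for the illustration. It also shows one difference between the upper-case and lower-case versions of the functions.

```r
m <- matrix(1:6, nrow = 2)   # a matrix with 2 rows and 3 columns
NROW(m); NCOL(m)             # two commands on one line

v <- 1:5                     # a plain vector has no dim attribute
nrow(v)                      # NULL
NROW(v); NCOL(v)             # the vector is treated as a single column
```

For dataframes and matrices, nrow and NROW give the same answer; the upper-case versions are simply safer when the object might be a vector.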

Matrices have rownames and colnames. Dataframes have rownames and names.

names(uu)
##  [1] "X"          "country"    "ISO2"       "MRunder5"   "MRinfant"  
##  [6] "Pop2007"    "Births2007" "DeathsU5"   "GNIpercap"  "LEXbirth"  
## [11] "LRadult"    "NetPrimEd"  "pctYlow40"  "pctYlow20"  "pctInfLBW" 
## [16] "u5uweight"  "u5wasting"  "u5stunting" "iodize"     "watTot"    
## [21] "watUrb"     "watRur"     "sanTot"     "sanUrb"     "sanRur"    
## [26] "GovFinVac"  "pneumDoc"   "pneumAB"    "popU18"     "popU5"     
## [31] "popGR"      "CDR"        "CBR"        "LEX"        "fertR"     
## [36] "pctUrb"     "urbanGR"    "Latitude"   "Longitude"  "EdXpctGDP" 
## [41] "homi100K"   "MILLF"      "WARLIKE"    "LINGSIML"   "PCGDP"     
## [46] "IMO9503"    "SOCCER2004" "MISSWORLD"  "MISSINTERN" "IQ"        
## [51] "CALORIE97"  "CALORIE70"  "PROTEIN97"  "PROTEIN70"  "FAT97"     
## [56] "FAT70"      "TBRATE"     "HIVRATE"    "PHYSPP"     "PCTMPFEM"  
## [61] "MF014"      "MFTOT"      "FPCTLF"     "CHLF"       "GCGDP"
rownames(uu)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
##  [12] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22" 
##  [23] "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33" 
##  [34] "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44" 
##  [45] "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55" 
##  [56] "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66" 
##  [67] "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77" 
##  [78] "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88" 
##  [89] "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99" 
## [100] "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110"
## [111] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
## [122] "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143"
## [144] "144" "145" "146" "147" "148" "149" "150" "151" "152" "153" "154"
## [155] "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165"
## [166] "166" "167" "168" "169" "170" "171" "172" "173" "174" "175" "176"
## [177] "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187"
## [188] "188" "189" "190" "191" "192"
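The difference between matrices and dataframes can be seen on a small example. The objects m and d below are invented for the illustration.

```r
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b")))
d <- as.data.frame(m)

rownames(m); colnames(m)
names(m)                  # NULL: names() does not apply to matrix columns
names(d); colnames(d)     # for a dataframe these are the same thing

attributes(d)             # names, class, and row.names in one list

# Names can be assigned as well as read.
names(d)[2] <- "beta"
names(d)
```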

A useful way to learn something about an object is to look at its structure:

str(uu)
## 'data.frame':    192 obs. of  65 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ ISO2      : chr  "AF" "AL" "DZ" "AD" ...
##  $ MRunder5  : int  257 15 37 3 158 11 16 24 6 4 ...
##  $ MRinfant  : int  165 13 33 3 116 10 15 22 5 4 ...
##  $ Pop2007   : int  27145 3190 33858 75 17024 85 39531 3002 20743 8361 ...
##  $ Births2007: int  1314 52 704 0 810 0 693 37 256 77 ...
##  $ DeathsU5  : int  338 1 26 0 128 0 11 1 2 0 ...
##  $ GNIpercap : int  250 3290 3620 NA 2560 11520 6050 2640 35960 42700 ...
##  $ LEXbirth  : int  44 76 72 NA 42 NA 75 72 81 80 ...
##  $ LRadult   : int  28 99 75 NA 67 NA 98 100 NA NA ...
##  $ NetPrimEd : int  61 94 95 83 58 NA 99 99 96 97 ...
##  $ pctYlow40 : int  NA 21 19 NA NA NA 11 21 18 22 ...
##  $ pctYlow20 : int  NA 40 43 NA NA NA 55 43 41 38 ...
##  $ pctInfLBW : int  NA 7 6 NA 12 5 7 8 7 7 ...
##  $ u5uweight : int  39 8 4 NA 31 NA 4 4 NA NA ...
##  $ u5wasting : int  7 7 3 NA 6 NA 1 5 NA NA ...
##  $ u5stunting: int  54 22 11 NA 45 NA 4 13 NA NA ...
##  $ iodize    : int  28 60 61 NA 35 NA 90 97 NA NA ...
##  $ watTot    : int  22 97 85 100 51 NA 96 98 100 100 ...
##  $ watUrb    : int  37 97 87 100 62 95 98 99 100 100 ...
##  $ watRur    : int  17 97 81 100 39 NA 80 96 100 100 ...
##  $ sanTot    : int  30 97 94 100 50 NA 91 91 100 100 ...
##  $ sanUrb    : int  45 98 98 100 79 98 92 96 100 100 ...
##  $ sanRur    : int  25 97 87 100 16 NA 83 81 100 100 ...
##  $ GovFinVac : int  0 100 100 NA 18 NA NA 33 88 NA ...
##  $ pneumDoc  : int  28 45 53 NA 58 NA NA 36 NA NA ...
##  $ pneumAB   : int  NA 38 59 NA NA NA NA 11 NA NA ...
##  $ popU18    : int  14526 986 11780 14 9022 28 12279 760 4802 1573 ...
##  $ popU5     : int  5002 250 3271 4 3162 8 3364 167 1272 393 ...
##  $ popGR     : num  4.5 -0.2 1.7 2 2.8 1.9 1.1 -1 1.2 0.5 ...
##  $ CDR       : int  20 6 5 NA 21 NA 8 10 7 9 ...
##  $ CBR       : int  48 16 21 NA 48 NA 18 12 12 9 ...
##  $ LEX       : int  44 76 72 NA 42 NA 75 72 81 80 ...
##  $ fertR     : num  7.1 2.1 2.4 NA 6.5 NA 2.3 1.4 1.8 1.4 ...
##  $ pctUrb    : int  24 47 65 91 55 38 90 64 89 66 ...
##  $ urbanGR   : num  6.4 1.4 3.2 2 5.4 2.5 1.5 -1.4 1.5 0.5 ...
##  $ Latitude  : num  33 41 28 42.5 -12.5 ...
##  $ Longitude : num  65 20 3 1.5 18.5 ...
##  $ EdXpctGDP : num  NA NA NA 2.59 NA ...
##  $ homi100K  : num  3.4 6.6 9.6 1.4 36 7.7 NA 3.3 1.5 0.8 ...
##  $ MILLF     : num  NA 1.99 1.3 NA 1.78 ...
##  $ WARLIKE   : num  NA 0.00484 0.00136 NA 0.00421 ...
##  $ LINGSIML  : num  NA 0.871 0.741 NA 0.752 ...
##  $ PCGDP     : int  NA 3050 4956 NA 2160 NA 12034 2216 23555 24298 ...
##  $ IMO9503   : num  NA 70.1 82 NA 0 NA 30.6 38.7 19.3 46 ...
##  $ SOCCER2004: int  NA 81 75 NA 82 NA 2 99 14 46 ...
##  $ MISSWORLD : num  NA 0 0 NA 0 ...
##  $ MISSINTERN: num  NA 0 0 NA 0 ...
##  $ IQ        : int  NA 90 84 NA 69 NA 96 93 98 102 ...
##  $ CALORIE97 : int  NA 2961 2853 NA 1903 NA 3093 2371 3224 3536 ...
##  $ CALORIE70 : int  NA 2424 1829 NA 2103 NA 3347 NA 3251 3227 ...
##  $ PROTEIN97 : num  NA 98.7 78.6 NA 40.5 ...
##  $ PROTEIN70 : num  NA 69.8 47.3 NA 44.9 ...
##  $ FAT97     : num  NA 78.8 69.6 NA 37.5 ...
##  $ FAT70     : num  NA 52.3 36 NA 34.1 ...
##  $ TBRATE    : num  NA 19.1 45.8 NA 123.8 ...
##  $ HIVRATE   : num  NA 0.00545 0.07238 NA 2.11623 ...
##  $ PHYSPP    : num  NA 1.349 0.923 NA 0.077 ...
##  $ PCTMPFEM  : num  NA 5.2 4 NA 15.5 NA 21.3 3.1 25.1 25.1 ...
##  $ MF014     : num  NA 1.07 1.05 NA 1 ...
##  $ MFTOT     : num  NA 1.048 1.026 NA 0.978 ...
##  $ FPCTLF    : num  NA 41.1 26.1 NA 46.4 NA 32.1 48.3 43.2 40.3 ...
##  $ CHLF      : num  NA 0.6 0.7 NA 26.5 NA 3.3 0 0 0 ...
##  $ GCGDP     : num  NA 11.6 16.5 NA 32 NA 13 11.4 18.3 20 ...

For most object classes, summary will return useful information.

summary(uu)
##        X           country              ISO2              MRunder5    
##  Min.   :  1.0   Length:192         Length:192         Min.   :  3.0  
##  1st Qu.: 48.8   Class :character   Class :character   1st Qu.: 10.0  
##  Median : 96.5   Mode  :character   Mode  :character   Median : 25.0  
##  Mean   : 96.5                                         Mean   : 50.9  
##  3rd Qu.:144.2                                         3rd Qu.: 72.5  
##  Max.   :192.0                                         Max.   :262.0  
##                                                        NA's   :1      
##     MRinfant        Pop2007          Births2007       DeathsU5     
##  Min.   :  2.0   Min.   :      2   Min.   :    0   Min.   :   0.0  
##  1st Qu.:  9.0   1st Qu.:   1334   1st Qu.:   27   1st Qu.:   0.0  
##  Median : 22.0   Median :   6796   Median :  136   Median :   3.0  
##  Mean   : 35.6   Mean   :  34526   Mean   :  705   Mean   :  48.1  
##  3rd Qu.: 56.0   3rd Qu.:  21676   3rd Qu.:  587   3rd Qu.:  26.0  
##  Max.   :165.0   Max.   :1328630   Max.   :27119   Max.   :1953.0  
##  NA's   :1                                         NA's   :1       
##    GNIpercap        LEXbirth       LRadult        NetPrimEd  
##  Min.   :  110   Min.   :40.0   Min.   : 23.0   Min.   : 22  
##  1st Qu.:  880   1st Qu.:61.0   1st Qu.: 71.0   1st Qu.: 78  
##  Median : 3225   Median :71.0   Median : 89.0   Median : 91  
##  Mean   : 9976   Mean   :67.4   Mean   : 80.7   Mean   : 85  
##  3rd Qu.:10328   3rd Qu.:76.0   3rd Qu.: 97.0   3rd Qu.: 96  
##  Max.   :76450   Max.   :83.0   Max.   :100.0   Max.   :100  
##  NA's   :10      NA's   :15     NA's   :55      NA's   :8    
##    pctYlow40      pctYlow20      pctInfLBW      u5uweight   
##  Min.   : 6.0   Min.   :35.0   Min.   : 0.0   Min.   : 1.0  
##  1st Qu.:14.0   1st Qu.:42.0   1st Qu.: 6.0   1st Qu.: 5.0  
##  Median :17.0   Median :46.0   Median : 9.0   Median :12.5  
##  Mean   :16.9   Mean   :47.3   Mean   :10.5   Mean   :16.7  
##  3rd Qu.:20.0   3rd Qu.:53.0   3rd Qu.:13.0   3rd Qu.:26.0  
##  Max.   :25.0   Max.   :67.0   Max.   :32.0   Max.   :49.0  
##  NA's   :67     NA's   :67     NA's   :12     NA's   :60    
##    u5wasting       u5stunting       iodize          watTot     
##  Min.   : 0.00   Min.   : 1.0   Min.   :  0.0   Min.   : 22.0  
##  1st Qu.: 2.00   1st Qu.:12.0   1st Qu.: 44.0   1st Qu.: 71.0  
##  Median : 5.00   Median :22.0   Median : 72.0   Median : 90.0  
##  Mean   : 6.32   Mean   :24.2   Mean   : 63.6   Mean   : 83.5  
##  3rd Qu.: 9.00   3rd Qu.:38.0   3rd Qu.: 90.0   3rd Qu.: 99.0  
##  Max.   :25.00   Max.   :54.0   Max.   :100.0   Max.   :100.0  
##  NA's   :68      NA's   :67     NA's   :71      NA's   :32     
##      watUrb          watRur          sanTot          sanUrb     
##  Min.   : 37.0   Min.   : 10.0   Min.   :  5.0   Min.   : 14.0  
##  1st Qu.: 90.0   1st Qu.: 58.8   1st Qu.: 39.8   1st Qu.: 57.0  
##  Median : 98.0   Median : 82.5   Median : 78.0   Median : 89.0  
##  Mean   : 92.9   Mean   : 76.8   Mean   : 67.3   Mean   : 76.9  
##  3rd Qu.:100.0   3rd Qu.: 98.0   3rd Qu.: 96.0   3rd Qu.: 98.0  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :15      NA's   :32      NA's   :36      NA's   :27     
##      sanRur        GovFinVac        pneumDoc       pneumAB    
##  Min.   :  3.0   Min.   :  0.0   Min.   :12.0   Min.   : 3.0  
##  1st Qu.: 30.2   1st Qu.: 24.0   1st Qu.:41.2   1st Qu.:24.2  
##  Median : 63.5   Median : 96.0   Median :56.0   Median :38.0  
##  Mean   : 60.6   Mean   : 65.3   Mean   :54.6   Mean   :40.1  
##  3rd Qu.: 95.8   3rd Qu.:100.0   3rd Qu.:69.0   3rd Qu.:55.8  
##  Max.   :100.0   Max.   :100.0   Max.   :93.0   Max.   :87.0  
##  NA's   :34      NA's   :60      NA's   :90     NA's   :142   
##      popU18           popU5            popGR            CDR       
##  Min.   :     1   Min.   :     0   Min.   :-1.70   Min.   : 1.00  
##  1st Qu.:   460   1st Qu.:   126   1st Qu.: 0.60   1st Qu.: 6.00  
##  Median :  2160   Median :   652   Median : 1.50   Median : 8.00  
##  Mean   : 11488   Mean   :  3267   Mean   : 1.49   Mean   : 9.36  
##  3rd Qu.:  8936   3rd Qu.:  2740   3rd Qu.: 2.35   3rd Qu.:12.00  
##  Max.   :446646   Max.   :126808   Max.   : 5.00   Max.   :22.00  
##                                    NA's   :1       NA's   :15     
##       CBR            LEX           fertR          pctUrb     
##  Min.   : 8.0   Min.   :40.0   Min.   :1.20   Min.   : 11.0  
##  1st Qu.:13.0   1st Qu.:61.0   1st Qu.:1.80   1st Qu.: 36.0  
##  Median :21.0   Median :71.0   Median :2.50   Median : 57.0  
##  Mean   :23.2   Mean   :67.4   Mean   :3.03   Mean   : 55.3  
##  3rd Qu.:30.0   3rd Qu.:76.0   3rd Qu.:3.90   3rd Qu.: 73.0  
##  Max.   :50.0   Max.   :83.0   Max.   :7.20   Max.   :100.0  
##  NA's   :15     NA's   :15     NA's   :15     NA's   :1      
##     urbanGR         Latitude        Longitude         EdXpctGDP    
##  Min.   :-1.70   Min.   :-41.00   Min.   :-175.00   Min.   : 1.56  
##  1st Qu.: 0.95   1st Qu.:  3.06   1st Qu.:  -8.38   1st Qu.: 3.40  
##  Median : 2.45   Median : 16.00   Median :  21.00   Median : 4.67  
##  Mean   : 2.43   Mean   : 18.61   Mean   :  19.53   Mean   : 4.77  
##  3rd Qu.: 3.88   3rd Qu.: 39.62   3rd Qu.:  49.39   3rd Qu.: 5.60  
##  Max.   :10.30   Max.   : 65.00   Max.   : 178.00   Max.   :11.83  
##  NA's   :2                                          NA's   :79     
##     homi100K        MILLF         WARLIKE        LINGSIML   
##  Min.   : 0.5   Min.   :0.00   Min.   :0.00   Min.   :0.18  
##  1st Qu.: 2.2   1st Qu.:0.39   1st Qu.:0.00   1st Qu.:0.57  
##  Median : 6.8   Median :0.89   Median :0.00   Median :0.82  
##  Mean   :11.0   Mean   :1.25   Mean   :0.01   Mean   :0.76  
##  3rd Qu.:16.9   3rd Qu.:1.46   3rd Qu.:0.01   3rd Qu.:0.96  
##  Max.   :69.0   Max.   :7.86   Max.   :0.05   Max.   :1.00  
##  NA's   :12     NA's   :50     NA's   :48     NA's   :48    
##      PCGDP          IMO9503        SOCCER2004      MISSWORLD    
##  Min.   :  491   Min.   : 0.00   Min.   :  1.0   Min.   :  0.0  
##  1st Qu.: 1658   1st Qu.: 0.00   1st Qu.: 39.8   1st Qu.:  0.0  
##  Median : 4256   Median : 6.75   Median : 82.5   Median :  0.0  
##  Mean   : 7724   Mean   :23.39   Mean   : 87.0   Mean   : 35.4  
##  3rd Qu.: 9762   3rd Qu.:47.08   3rd Qu.:126.2   3rd Qu.: 43.6  
##  Max.   :39510   Max.   :82.00   Max.   :213.0   Max.   :402.0  
##  NA's   :50      NA's   :48      NA's   :48      NA's   :48     
##    MISSINTERN          IQ        CALORIE97      CALORIE70   
##  Min.   :  0.0   Min.   : 63   Min.   :1685   Min.   :1628  
##  1st Qu.:  0.0   1st Qu.: 73   1st Qu.:2272   1st Qu.:2109  
##  Median :  0.0   Median : 86   Median :2622   Median :2337  
##  Mean   : 32.7   Mean   : 85   Mean   :2687   Mean   :2456  
##  3rd Qu.: 33.0   3rd Qu.: 95   3rd Qu.:3100   3rd Qu.:2860  
##  Max.   :323.0   Max.   :106   Max.   :3699   Max.   :3480  
##  NA's   :48      NA's   :48    NA's   :53     NA's   :72    
##    PROTEIN97       PROTEIN70         FAT97           FAT70      
##  Min.   : 28.1   Min.   :  0.0   Min.   : 11.0   Min.   :  0.0  
##  1st Qu.: 57.1   1st Qu.: 51.5   1st Qu.: 47.1   1st Qu.: 34.8  
##  Median : 69.7   Median : 63.5   Median : 69.6   Median : 51.9  
##  Mean   : 73.4   Mean   : 65.3   Mean   : 75.7   Mean   : 59.9  
##  3rd Qu.: 90.8   3rd Qu.: 81.7   3rd Qu.: 96.4   3rd Qu.: 77.7  
##  Max.   :114.9   Max.   :124.4   Max.   :164.0   Max.   :149.0  
##  NA's   :53      NA's   :48      NA's   :53      NA's   :48     
##      TBRATE         HIVRATE          PHYSPP        PCTMPFEM    
##  Min.   :  2.3   Min.   : 0.00   Min.   :0.03   Min.   : 0.00  
##  1st Qu.: 19.1   1st Qu.: 0.07   1st Qu.:0.25   1st Qu.: 7.15  
##  Median : 39.5   Median : 0.28   Median :1.28   Median :10.20  
##  Mean   : 71.2   Mean   : 2.47   Mean   :1.58   Mean   :12.66  
##  3rd Qu.: 85.6   3rd Qu.: 2.12   3rd Qu.:2.75   3rd Qu.:17.00  
##  Max.   :587.9   Max.   :25.84   Max.   :5.68   Max.   :42.70  
##  NA's   :48      NA's   :57      NA's   :60     NA's   :61     
##      MF014          MFTOT          FPCTLF          CHLF     
##  Min.   :0.99   Min.   :0.86   Min.   :14.0   Min.   : 0.0  
##  1st Qu.:1.02   1st Qu.:0.96   1st Qu.:34.7   1st Qu.: 0.0  
##  Median :1.04   Median :0.98   Median :41.1   Median : 2.1  
##  Mean   :1.04   Mean   :1.00   Mean   :39.6   Mean   :10.4  
##  3rd Qu.:1.05   3rd Qu.:1.01   3rd Qu.:46.2   3rd Qu.:16.3  
##  Max.   :1.11   Max.   :1.97   Max.   :52.0   Max.   :52.5  
##  NA's   :48     NA's   :48     NA's   :52     NA's   :51    
##      GCGDP     
##  Min.   : 4.6  
##  1st Qu.:11.1  
##  Median :14.1  
##  Mean   :15.2  
##  3rd Qu.:19.1  
##  Max.   :32.0  
##  NA's   :52
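The NA's rows in the summary report missing values, which many functions do not skip by default. A minimal sketch, using the first five values of IQ and PCGDP from uu in a small dataframe d:

```r
d <- data.frame(IQ    = c(NA, 90, 84, NA, 69),
                PCGDP = c(NA, 3050, 4956, NA, 2160))

colSums(is.na(d))          # number of missing values in each column
mean(d$IQ)                 # NA, because some values are missing
mean(d$IQ, na.rm = TRUE)   # 81, the mean of the non-missing values
complete.cases(d)          # TRUE for rows with no missing values
```

Applied to the full data, colSums(is.na(uu)) would give the same counts that summary prints in its NA's rows.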

Variables inside a dataframe can be accessed in several ways. For example, a variable named “IQ” in a dataframe called uu can be accessed as:

uu$IQ
##   [1]  NA  90  84  NA  69  NA  96  93  98 102  87  NA  83  81  78  NA 100
##  [18]  83  69  NA  85  NA  72  87  NA  93  67  70  89  70  97  78  68  72
##  [35]  93 100  89  NA  73  NA  91  71  90  NA  92  NA  65  98  68  NA  NA
##  [52]  80  83  84  NA  NA  97  63  84  97  98  66  65  93 102  71  92  NA
##  [69]  79  66  66  84  72  84  99  98  81  89  84  NA  93  94 102  72 105
##  [86]  87  93  72  NA  83  87  89  97  86  72  NA  NA  NA  97 101  79  71
## [103]  92  NA  69  95  NA  74  81  87  NA  95  NA  98  85  72  NA  NA  78
## [120] 102 100  84  67  67  NA  98  NA  83  81  NA  85  NA  85  90  86  99
## [137]  95  NA 106  94  96  70  NA  NA  NA  87  NA  NA  83  65  NA  NA  64
## [154] 103  96  95  NA  NA  72  97  81  72  NA  72 101 101  87  87  91  93
## [171]  NA  69  NA  80  84  90  87  NA  73  96  83  NA  72  98  96  87  NA
## [188]  89  96  83  77  66
# or
uu[, "IQ"]
##   [1]  NA  90  84  NA  69  NA  96  93  98 102  87  NA  83  81  78  NA 100
##  [18]  83  69  NA  85  NA  72  87  NA  93  67  70  89  70  97  78  68  72
##  [35]  93 100  89  NA  73  NA  91  71  90  NA  92  NA  65  98  68  NA  NA
##  [52]  80  83  84  NA  NA  97  63  84  97  98  66  65  93 102  71  92  NA
##  [69]  79  66  66  84  72  84  99  98  81  89  84  NA  93  94 102  72 105
##  [86]  87  93  72  NA  83  87  89  97  86  72  NA  NA  NA  97 101  79  71
## [103]  92  NA  69  95  NA  74  81  87  NA  95  NA  98  85  72  NA  NA  78
## [120] 102 100  84  67  67  NA  98  NA  83  81  NA  85  NA  85  90  86  99
## [137]  95  NA 106  94  96  70  NA  NA  NA  87  NA  NA  83  65  NA  NA  64
## [154] 103  96  95  NA  NA  72  97  81  72  NA  72 101 101  87  87  91  93
## [171]  NA  69  NA  80  84  90  87  NA  73  96  83  NA  72  98  96  87  NA
## [188]  89  96  83  77  66
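A third form, double square brackets, gives the same result, while single brackets around the name alone return a one-column dataframe rather than a vector. A short sketch on a small dataframe d built from values in uu:

```r
d <- data.frame(country = c("Albania", "Algeria", "Angola"),
                IQ      = c(90, 84, 69))

identical(d$IQ, d[, "IQ"])    # TRUE
identical(d$IQ, d[["IQ"]])    # TRUE

class(d[["IQ"]])              # a numeric vector
class(d["IQ"])                # still a dataframe, with one column
```

The distinction matters because functions such as mean or hist expect a vector, not a dataframe.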

Plotting is a nice way to gain an understanding of a specific variable. For example, a histogram can be drawn like this:

hist(uu$IQ)

[Figure: histogram of uu$IQ]
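A histogram can also be given a title and axis label, and written to a file instead of the screen. The sketch below uses the savings ratio sr from the built-in LifeCycleSavings data so that it runs without the course file; the file name sr_hist.pdf is arbitrary.

```r
data(LifeCycleSavings)

# With plot = FALSE, hist returns the bin edges and counts without drawing.
h <- hist(LifeCycleSavings$sr, plot = FALSE)
h$breaks
h$counts

# Open a pdf device, draw the histogram into it, and close the device.
f <- file.path(tempdir(), "sr_hist.pdf")
pdf(f)
hist(LifeCycleSavings$sr, main = "Savings ratio", xlab = "sr")
dev.off()
```

The file is not complete until dev.off() is called, so forgetting that line is a common reason for an empty or unreadable pdf.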

Writing output to disk

A good way to save objects in R is as an R workspace.

save(finXam, file = "z.Rdata")

But objects can also be saved in other formats, such as csv or dbf.

write.csv(finXam, file = "z1.csv")
write.dbf(finXam, file = "z1.dbf")
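By default write.csv also writes the row names as an extra first column, which comes back as a column named X when the file is read again. A minimal sketch with a two-row dataframe d and temporary files, all invented for the illustration:

```r
d <- data.frame(country = c("Albania", "Algeria"), IQ = c(90, 84))
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(d, file = f1)                      # row names are written too
write.csv(d, file = f2, row.names = FALSE)   # only the data columns

names(read.csv(f1))   # "X" "country" "IQ"
names(read.csv(f2))   # "country" "IQ"
```

A leading column such as X in uu is what this typically looks like, although nothing here says how examZ.csv was actually written.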

Linear regression

This semester we will focus on linear regression, using ordinary least squares. The function that we will use nearly every week is lm.

The dataframe gg contains home sales for Williamson County, 1996 to 1999. If we want to estimate a model where price (our dependent variable) is a function of age and sqft (our independent variables), we would set up the function like this:

pm <- lm(price ~ age + sqft, data = gg)

We have created an object pm that contains relevant information about our estimated model. Any part of that information can be extracted. For example, if we just want to see the usual coefficients with p-values, along with the R², we can do this:

summary(pm)
## 
## Call:
## lm(formula = price ~ age + sqft, data = gg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470048  -45397  -23055    3705 7822646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21722.71    8104.00    2.68   0.0074 ** 
## age         -1224.11     278.99   -4.39  1.2e-05 ***
## sqft           92.38       2.84   32.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 328000 on 15770 degrees of freedom
## Multiple R-squared:  0.0726, Adjusted R-squared:  0.0724 
## F-statistic:  617 on 2 and 15770 DF,  p-value: <2e-16
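Individual pieces of this output can be pulled out as ordinary R objects. Because the Williamson County data is not generally available, the sketch below fits a different model, on the built-in LifeCycleSavings data; the choice of sr, pop15, and ddpi is only for illustration.

```r
data(LifeCycleSavings)
fit <- lm(sr ~ pop15 + ddpi, data = LifeCycleSavings)

coef(fit)              # the estimated coefficients as a named vector

s <- summary(fit)
s$coefficients         # estimates, standard errors, t values, p-values
s$r.squared            # multiple R-squared
s$adj.r.squared        # adjusted R-squared

confint(fit)           # 95% confidence intervals for the coefficients
```

The same calls work on pm: for example, summary(pm)$r.squared would return the value printed above as Multiple R-squared.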

Another example: if we want to get the value of price predicted by our model (the fitted or predicted value) we can create it this way:

phat <- predict(pm)
# or we could directly create it as a variable in the dataframe gg:
gg$priceHat <- predict(pm)
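One caveat when storing predictions in the dataframe: lm drops rows with missing values by default, so predict can return fewer values than the dataframe has rows. The sketch below shows one way around this, on a five-row dataframe d invented for the illustration; whether gg has missing values in price, age, or sqft is not stated here.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(1.1, 1.9, NA, 4.2, 4.8))

# Default behaviour: the row with the missing y is dropped.
fit_omit <- lm(y ~ x, data = d)
length(predict(fit_omit))    # 4, one fewer than nrow(d)

# With na.exclude, predictions and residuals are padded with NA
# so that they line up with the rows of d.
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
d$yHat  <- predict(fit_excl)
d$resid <- residuals(fit_excl)
d

# Predictions for new observations are requested with newdata.
predict(fit_omit, newdata = data.frame(x = 6))
```

The observed value is always the fitted value plus the residual, which is a handy check that the columns have been lined up correctly.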