Using the Dow-Eff functions in R

The latest version of the Dow-Eff functions (Manual: pdf; html) can perform analyses on four different ethnological datasets:

abbreviation  dataset                          codebook
WNAI          Western North American Indians   codebook
SCCS          Standard Cross-Cultural Sample   codebook
EA            Ethnographic Atlas               codebook
LRB           Louis R. Binford's forager data  codebook

The code below outlines the workflow for working with the LRB.

You will need a number of R packages to run the Dow-Eff functions. These are loaded using the “library” command. If a package is “not found”, it must first be installed. For example, the following command installs a package named “mice”:

install.packages("mice")
# --set working directory and load needed libraries--
setwd("/home/yagmur/Dropbox/functions")  # replace with a directory that exists on your machine
library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(spdep)
library(geosphere)
library(relaimpo)
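
Before calling library(), it can help to check which of the required packages are actually installed. The following is a minimal sketch using only base R; the helper name find_missing is hypothetical and is not part of the Dow-Eff functions.

```r
# Packages loaded in the workflow above
needed <- c("mice", "foreign", "stringr", "psych", "AER",
            "spdep", "geosphere", "relaimpo")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
# requireNamespace() returns FALSE (rather than stopping) when a package is absent.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- find_missing(needed)
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

The install step is left commented out so that nothing is downloaded unless you choose to run it.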

The Dow-Eff functions, as well as the four ethnological datasets, are contained in an R-workspace, located in the cloud.

load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
ls()  # list the objects contained in DEf01.Rdata
##  [1] "addesc"    "chK"       "chkpmc"    "CSVwrite"  "doMI"     
##  [6] "doOLS"     "EA"        "EAcov"     "EAfact"    "EAkey"    
## [11] "gSimpStat" "kln"       "llm"       "LRB"       "LRBcov"   
## [16] "LRBfact"   "LRBkey"    "mkdummy"   "SCCS"      "SCCScov"  
## [21] "SCCSfact"  "SCCSkey"   "setDS"     "WNAI"      "WNAIcov"  
## [26] "WNAIfact"  "WNAIkey"
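
If you would rather inspect a workspace before adding its contents to the global environment, load() can place the objects in a separate environment. The sketch below uses a small toy workspace saved to a temporary file, not the real DEf01.Rdata; the object names toy_data and toy_fun are made up for illustration.

```r
# Build a small stand-in workspace (toy objects, not the real DEf01.Rdata)
toy_path <- tempfile(fileext = ".Rdata")
toy_data <- data.frame(society = c("A", "B"), area = c(10, 20))
toy_fun  <- function(x) x + 1
save(toy_data, toy_fun, file = toy_path)

# Load into a fresh environment instead of .GlobalEnv;
# load() returns the names of the objects it restored
ws <- new.env()
loaded <- load(toy_path, envir = ws)

# Classify each loaded object as a function or as data
is_fun <- vapply(loaded, function(nm) is.function(get(nm, envir = ws)),
                 logical(1))
print(split(loaded, ifelse(is_fun, "functions", "data")))
```

Applied to the real workspace, the same classification would separate functions such as doMI and doOLS from data objects such as LRB and LRBkey.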

The setDS(xx) command sets one of the four ethnological datasets as the source for the subsequent analysis. The four valid options for xx are: “WNAI”, “LRB”, “EA”, “SCCS”. The setDS() command creates the following objects:

object name  description
cov          Names of covariates to use during imputation step
dx           The selected ethnological dataset is now called dx
dxf          The factor version of dx
key          A metadata file for dx
wdd          A geographic proximity weight matrix for the societies in dx
wee          An ecological similarity weight matrix for the societies in dx
wll          A linguistic proximity weight matrix for the societies in dx

setDS("LRB")
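
A quick way to confirm that the objects listed above are present is to test for each name. The sketch below is hypothetical: the helper missing_objects is not part of the Dow-Eff functions, and it assumes that setDS() places its objects in the global environment. Note that base R already provides a function called cov (in the stats package), so a plain exists("cov") returns TRUE even before setDS() has been run.

```r
# Object names that setDS() is described as creating
expected <- c("cov", "dx", "dxf", "key", "wdd", "wee", "wll")

# Hypothetical helper: which names are NOT defined directly in `env`?
# inherits = FALSE keeps the search from finding stats::cov on the search path.
missing_objects <- function(nms, env = globalenv()) {
  found <- vapply(nms, exists, logical(1), envir = env, inherits = FALSE)
  nms[!found]
}

# Demonstration on a toy environment that holds only two of the objects
toy_env <- new.env()
assign("dx", data.frame(id = 1:3), envir = toy_env)
assign("key", data.frame(variable = "id"), envir = toy_env)
print(missing_objects(expected, toy_env))
```

In a real session, missing_objects(expected) with the default environment should return an empty character vector after setDS() has run, if the assumption about the global environment holds.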

The next step in the workflow is to create any new variables and add them to the dataset dx. New variables can be created directly, as in the following example. When created in this way, one should also record a description of the new variable, using the command addesc(). The syntax takes first the name of the new variable, and then the description.

dx$lnarea <- log(dx$area)
addesc("lnarea", "log of total land area occupied by the group")
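
When a new variable is a log transform, it is worth guarding against non-positive values, since log(0) is -Inf and the log of a negative number is NaN. The following sketch uses a toy data frame in place of dx; the values are illustrative only.

```r
# Toy data standing in for dx (values are illustrative only)
toy <- data.frame(area = c(12.5, 0, NA, 300))

# Take logs only where the value is present and strictly positive;
# everything else is left as NA so it can be imputed later.
positive <- !is.na(toy$area) & toy$area > 0
toy$lnarea <- NA_real_
toy$lnarea[positive] <- log(toy$area[positive])

print(toy)
```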

Dummy variables (variables taking on the values zero or one) should be added using the command mkdummy(). This command will, in most cases, automatically record a variable description. Dummy variables are appropriate for categorical variables. The syntax of mkdummy() takes first the categorical variable name, and then the category number (these can be found in the codebook for each ethnological dataset). Note that the resulting dummy variable will be called variable name+“.d”+category number.

mkdummy("systate3", 1)
## [1] "This dummy variable is named systate3.d1"
mkdummy("systate3", 2)
## [1] "This dummy variable is named systate3.d2"
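
To make the naming rule concrete, here is a minimal base R sketch of what a dummy variable of this kind looks like. The function make_dummy is a hypothetical stand-in, not the implementation of mkdummy(): it works on a data frame passed as an argument and does not record a variable description.

```r
# Hypothetical stand-in for mkdummy(), for illustration only
make_dummy <- function(df, varname, category) {
  newname <- paste0(varname, ".d", category)
  # 1 where the variable equals the category, 0 otherwise; NA stays NA
  df[[newname]] <- as.integer(df[[varname]] == category)
  df
}

toy <- data.frame(systate3 = c(1, 2, 3, NA, 1))
toy <- make_dummy(toy, "systate3", 1)
toy <- make_dummy(toy, "systate3", 2)
print(toy)
```

Keeping missing values as NA, rather than coding them as zero, leaves them available for the imputation step.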

After making any new variables, list the variables you intend to use in your analysis in the following form.

evm <- c("group2", "hunting", "gatherin", "fishing", "war1", "reven", "nomov", 
    "dismov", "store", "subdiv2", "systate3.d2", "systate3.d1", "lnarea", "nagp")

Missing values of these variables are then imputed, using the command doMI(). Below, the number of imputed datasets is 2, and 3 iterations are used to estimate each imputed value. These values are too low for a real analysis: the defaults, nimp=10 and maxit=7, are reasonable for most purposes. The stacked imputed datasets are collected into a single dataframe which here is called smi.

This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These squared variables are given names in the format variable name+“Sq”.

Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name.
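
The rule for squared variables described above can be sketched in base R as follows. This is only an illustration of the stated rule (at least three distinct values, maximum absolute value no more than 300); the function name qualifies_for_square and the toy data are assumptions, and doMI() may apply the rule differently in detail.

```r
# Hypothetical check for the stated rule
qualifies_for_square <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 0 && length(unique(x)) >= 3 && max(abs(x)) <= 300
}

# Toy columns (illustrative values only)
toy <- data.frame(
  store  = c(1, 2, 3, 2),        # three distinct values, small -> squared
  dummy  = c(0, 1, 0, 1),        # only two distinct values     -> skipped
  dismov = c(4, 50, 240, 570)    # maximum above 300            -> skipped
)

# Add a squared copy, named variable+"Sq", for each qualifying column
to_square <- names(toy)[vapply(toy, qualifies_for_square, logical(1))]
for (v in to_square) toy[[paste0(v, "Sq")]] <- toy[[v]]^2
print(names(toy))
```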

smi <- doMI(evm, nimp = 2, maxit = 3)
## [1] "group2"
## [1] "nomov"
## [1] "dismov"
## [1] "store"
## [1] "systate3.d2"
## [1] "systate3.d1"
## Time difference of 2.65 secs
dim(smi)  # dimensions of new dataframe smi
## [1] 678  85
smi[1:2, ]  # first two rows of new dataframe smi
##   .imp   .id group2 nomov dismov store systate3.d2 systate3.d1 hunting
## 1    1 Punan     30    45    240     1           0           0      30
## 2    1 Batek     58     6     50     2           1           0      30
##   gatherin fishing war1 reven subdiv2 lnarea nagp mht.name.d2 mht.name.d8
## 1       65       5    1  1.26   69.86  3.388 4738           0           0
## 2       65       5    1  2.10   69.86  2.282 3852           0           0
##   mht.name.d11 koeppengei.d4 koeppengei.d13 koeppengei.d18 continent.d3
## 1            0             0              0              0            0
## 2            0             1              0              0            0
##   continent.d4 region.d2 region.d7 bio.1  bio.2 bio.3  bio.4  bio.5 bio.6
## 1            0         0         0 1.240 -1.339 2.971 -1.571 0.2588 1.566
## 2            0         0         0 1.258 -1.327 2.079 -1.489 0.4143 1.538
##   bio.8  bio.9 bio.10 bio.11 bio.12 bio.13  bio.14  bio.15 bio.16   bio.17
## 1 1.125 0.9913 0.8121  1.403  4.036 1.9917  7.7985 -1.5882  2.163  7.68694
## 2 1.097 0.9936 0.9058  1.399  1.322 0.9502 -0.2036  0.1907  1.126 -0.05763
##   bio.18 bio.19 meanalt mnnpp  sdalt     x       y    x2     y2     xy
## 1 3.6274  2.635 -0.4907 3.250 0.2440 1.500 -0.7185 2.251 0.5163 -1.078
## 2 0.2692  1.765 -0.3770 0.378 0.7853 1.549 -0.5004 2.398 0.2504 -0.775
##   Australian NaDene UtoAztecan nomovSq storeSq huntingSq gatherinSq
## 1          0      0          0    2025       1       900       4225
## 2          0      0          0      36       4       900       4225
##   fishingSq war1Sq revenSq subdiv2Sq lnareaSq bio.1Sq bio.2Sq bio.3Sq
## 1        25      1   1.588      4880   11.477   1.538   1.792   8.828
## 2        25      1   4.410      4880    5.209   1.581   1.762   4.324
##   bio.4Sq bio.5Sq bio.6Sq bio.8Sq bio.9Sq bio.10Sq bio.11Sq bio.12Sq
## 1   2.469 0.06697   2.451   1.266  0.9826   0.6596    1.968   16.291
## 2   2.217 0.17168   2.364   1.202  0.9872   0.8205    1.958    1.747
##   bio.13Sq bio.14Sq bio.15Sq bio.16Sq  bio.17Sq bio.18Sq bio.19Sq
## 1    3.967 60.81624  2.52237    4.680 59.089006 13.15782    6.941
## 2    0.903  0.04144  0.03638    1.269  0.003321  0.07247    3.115
##   meanaltSq mnnppSq sdaltSq
## 1    0.2408 10.5599 0.05954
## 2    0.1421  0.1429 0.61662
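
The 678 rows of smi are the 2 imputed datasets of 339 societies each, stacked one above the other and distinguished by “.imp”. To show how such a stacked dataframe can be used, the sketch below builds a toy stacked dataset, fits the same regression to each imputed dataset, and combines the results with Rubin's rules. It is an illustration under stated assumptions: the data are simulated, the “imputed” values are random draws rather than output from doMI(), and doOLS() may combine its results differently.

```r
set.seed(1)

# Toy data: 50 cases, with the first 10 values of x treated as missing
n <- 50; m <- 2
x_obs <- rnorm(n)
y     <- 2 * x_obs + rnorm(n)
miss  <- 1:10

# Stack m "imputed" datasets; only the missing x values differ between them
toy_smi <- do.call(rbind, lapply(seq_len(m), function(i) {
  x_i <- x_obs
  x_i[miss] <- rnorm(length(miss))   # crude stand-in for imputed values
  data.frame(.imp = i, .id = paste0("soc", seq_len(n)), x = x_i, y = y)
}))

# Fit the same model once per imputed dataset
fits <- lapply(split(toy_smi, toy_smi$.imp), function(d) lm(y ~ x, data = d))
est  <- sapply(fits, function(f) coef(f)["x"])
se2  <- sapply(fits, function(f) vcov(f)["x", "x"])

# Rubin's rules: pooled estimate plus within- and between-imputation variance
q_bar     <- mean(est)
w_bar     <- mean(se2)
b         <- var(est)
total_var <- w_bar + (1 + 1 / m) * b

print(c(estimate = q_bar, se = sqrt(total_var)))
```

The between-imputation term b is what makes the pooled standard error reflect the uncertainty due to missing data, which is why a very small number of imputations gives unstable results.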

All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.

# --dependent variable--
dpV <- "group2"
# --independent variables in UNrestricted model--
UiV <- c("hunting", "gatherin", "fishing", "war1", "reven", "nomov", "dismov", 
    "store", "subdiv2", "systate3.d2", "systate3.d1", "lnarea")
# --additional exogenous variables (use in Hausman tests)--
oxog <- c("nagp")
# --independent variables in restricted model (all must be in UiV above)--
RiV <- c("hunting", "gatherin", "fishing", "war1", "reven")
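
Because a misspelled variable name is an easy mistake to make at this stage, a short consistency check can be run before estimation. The helper check_roles below is hypothetical and not part of the Dow-Eff functions; the example call uses a shortened set of names with a deliberate inconsistency.

```r
# Hypothetical helper: report role variables that are inconsistent
check_roles <- function(depvar, indpv, rindpv, othexog, available) {
  list(
    # restricted-model variables must all appear in the unrestricted model
    restricted_not_in_unrestricted = setdiff(rindpv, indpv),
    # every variable must exist in the imputed dataframe
    not_in_data = setdiff(unique(c(depvar, indpv, rindpv, othexog)), available)
  )
}

# Stand-in for names(smi); in a real session pass names(smi) instead
available <- c(".imp", ".id", "group2", "hunting", "gatherin",
               "fishing", "war1", "reven", "nagp")

problems <- check_roles(
  depvar  = "group2",
  indpv   = c("hunting", "gatherin", "fishing", "war1"),
  rindpv  = c("hunting", "reven"),   # "reven" is deliberately not in indpv
  othexog = "nagp",
  available = available
)
print(problems)
```

With the vectors defined above, the call would be check_roles(dpV, UiV, RiV, oxog, names(smi)), and both elements of the result should be empty.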

The command doOLS() estimates the model on each of the imputed datasets, collecting output from each estimation and processing them to obtain final results. To control for Galton's Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
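
The idea behind a network lag term can be illustrated with a small example: each society's lagged value is a weighted average of the dependent variable in the other societies, with weights given by proximity. The sketch below uses toy coordinates and an inverse-distance weight matrix; these are assumptions for illustration, and the matrices wdd, wll and wee in the workspace may be constructed differently.

```r
# Toy coordinates and outcome for five societies (illustrative values)
coords <- cbind(x = c(0, 1, 2, 5, 6), y = c(0, 1, 0, 5, 5))
yvar   <- c(30, 35, 28, 80, 90)

# Inverse-distance proximity, with a zero diagonal so that a society
# is not its own neighbor
d <- as.matrix(dist(coords))
W <- 1 / d
diag(W) <- 0

# Row-standardize so that each row of weights sums to 1
W <- W / rowSums(W)

# Network lag: for each society, a weighted average of the others' outcomes
Wy <- as.vector(W %*% yvar)
print(round(Wy, 2))
```

Because the rows of W sum to one, every element of Wy lies between the smallest and largest values of the dependent variable.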

There are several options that increase the time doOLS() takes to run: stepW runs a background stepwise regression to find which variables perform best over the set of estimations; relimp calculates the relative importance of each variable in the restricted model, using a technique to partition R²; slmtests calculates Lagrange multiplier tests for spatial dependence using the three weight matrices. All of these should be set to FALSE if one wishes to speed up estimation times.

h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE, 
    lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE)
## [1] "--finding optimal weight matrix------"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## Time difference of 23 secs
names(h)
##  [1] "DependVarb"       "URmodel"          "Rmodel"          
##  [4] "EndogeneityTests" "Diagnostics"      "OtherStats"      
##  [7] "DescripStats"     "totry"            "didwell"         
## [10] "dfbetas"          "data"

The output from doOLS, here called h, is a list containing 11 items.

name              description
DependVarb        Description of dependent variable
URmodel           Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs). Two p-values are given for H0: \( \beta \)=0. One is the usual p-value; the other (hcpval) is heteroskedasticity consistent. If stepW=TRUE, the table will also include the proportion of times a variable is retained in the model using stepwise regression (column stepkept).
Rmodel            Coefficient estimates from the restricted model. If relimp=TRUE, the R² assigned to each independent variable is shown here.
EndogeneityTests  Hausman tests (H0: variable is exogenous), with F-statistic for weak instruments (a rule of thumb is that the instrument is weak if the F-stat is below 10), and Sargan test (H0: instrument is uncorrelated with second-stage 2SLS residuals).
Diagnostics       Regression diagnostics for the restricted model: RESET test (H0: model has correct functional form); Wald test (H0: appropriate variables dropped); Breusch-Pagan test (H0: residuals homoskedastic); Shapiro-Wilk test (H0: residuals normal); Hausman test (H0: Wy is exogenous); Sargan test (H0: residuals uncorrelated with instruments for Wy). If slmtests=TRUE, the Lagrange multiplier tests (H0: spatial lag term not needed) are reported here.
OtherStats        Other statistics: Composite weight matrix weights (see details); R² for restricted model and unrestricted model; number of imputations; number of observations; F-stat for weak instruments for Wy.
DescripStats      Descriptive statistics for variables in unrestricted model.
totry             Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model.
didwell           Character string of variables that were most significant in the unrestricted model.
dfbetas           Influential observations for dfbetas (see details)
data              Data as used in the estimations. Observations with missing values of the dependent variable have been dropped.

The last two items in the list are large, but the first nine provide a nice overview.

h[1:9]
## $DependVarb
## [1] "Dependent variable='group2': the mean size of the consumer group that regularly camps together during the most aggregated phase of the yearly economic cycles; (Table: 5.01 & 8.01); (Binford 2001:117)"
## 
## $URmodel
##                  coef  stdcoef      VIF stepkept  hcpval    pval star
## (Intercept) 113.69410      NaN      NaN        1 0.90675 0.92474     
## dismov       -0.00539 -0.00922    4.172        0 0.89684 0.91644     
## fishing      -0.64579 -0.20906 8245.517        0 0.94682 0.95708     
## gatherin     -0.57577 -0.17291 7103.915        0 0.95258 0.96175     
## hunting      -0.48762 -0.11715 4564.148        0 0.96009 0.96767     
## lnarea       -4.56222 -0.09918    1.958        1 0.02751 0.09841    *
## nomov        -0.33721 -0.03723    1.980        0 0.42697 0.53756     
## reven         5.25852  0.04181    1.230        0 0.16742 0.37843     
## store        -4.00381 -0.04312    1.995        0 0.40697 0.47714     
## subdiv2      -0.99775 -0.12553    1.224        1 0.00621 0.00800  ***
## systate3.d1  39.25214  0.13450    1.736        1 0.02809 0.01713   **
## systate3.d2  -5.33676 -0.01531    1.073        0 0.46762 0.72970     
## war1         16.39673  0.19027    1.871        1 0.00091 0.00115  ***
## Wy            1.07421  0.50513    2.836        1 0.00001 0.00000  ***
##                                                                                                                         desc
## (Intercept)                                                                                                             <NA>
## dismov                 Total distance residence moved in a year (sum of all moves); (Table: 5.01 & 8.04); (Binford 2001:117)
## fishing                                          Percent dependence on aquatic organisms ; (Table: 5.01); (Binford 2001:117)
## gatherin                                         Percent dependence on terrestrial plants; (Table: 5.01); (Binford 2001:117)
## hunting                                         Percent dependence on terrestrial animals; (Table: 5.01); (Binford 2001:117)
## lnarea                                                                          log of total land area occupied by the group
## nomov                Total number of annual moves in residence of a household unit; (Table: 5.01 & 8.04); (Binford 2001:117)
## reven                                             Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)
## store                                                                            Dependence upon storage; (Binford 2001:388)
## subdiv2               Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
## systate3.d1                 Classification of foragers: system's state; (Table: 9.01); (Binford 2001:375) == mounted hunters
## systate3.d2 Classification of foragers: system's state; (Table: 9.01); (Binford 2001:375) == horticulturally augmented cases
## war1                                   Scale of intensity of warfare.  How frequent and how widespread it may be regionally.
## Wy                                                                                                          Network lag term
## 
## $Rmodel
##                 coef stdcoef      VIF  relimp  hcpval    pval star
## (Intercept) -306.511     NaN      NaN     NaN 0.76198 0.80269     
## fishing        2.328 0.75350 8157.171 0.01229 0.81854 0.84950     
## gatherin       2.622 0.78745 7025.986 0.02739 0.79589 0.83080     
## hunting        2.372 0.56988 4509.124 0.00830 0.81685 0.84694     
## reven          7.396 0.05881    1.075 0.00234 0.03407 0.19693     
## war1          14.872 0.17258    1.758 0.11945 0.00175 0.00308  ***
## Wy             1.267 0.59563    2.173 0.27697 0.00000 0.00000  ***
##                                                                                              desc
## (Intercept)                                                                                  <NA>
## fishing               Percent dependence on aquatic organisms ; (Table: 5.01); (Binford 2001:117)
## gatherin              Percent dependence on terrestrial plants; (Table: 5.01); (Binford 2001:117)
## hunting              Percent dependence on terrestrial animals; (Table: 5.01); (Binford 2001:117)
## reven                  Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)
## war1        Scale of intensity of warfare.  How frequent and how widespread it may be regionally.
## Wy                                                                               Network lag term
## 
## $EndogeneityTests
##          weakidF p.Sargan n.IV Fstat        df pvalue star
## fishing    0.285    0.000   16 0.000 6.441e+06  1.000     
## gatherin   0.301    0.440   16 0.092 1.075e+07  0.761     
## hunting    0.000    0.440   16 0.008 7.526e+06  0.928     
## reven     39.114    0.923   11 0.031 4.132e+06  0.861     
## war1       8.893    0.000    5 1.000 3.100e+09  0.000  ***
## 
## $Diagnostics
##                                                           Fstat        df
## RESET test. H0: model has correct functional form        45.338 9.737e+02
## Wald test. H0: appropriate variables dropped              5.000 1.025e+04
## Breusch-Pagan test. H0: residuals homoskedastic          15.111 1.350e+07
## Shapiro-Wilkes test. H0: residuals normal                90.202 5.161e+09
## Hausman test. H0: Wy is exogenous                         4.000 5.237e+05
## Sargan test. H0: residuals uncorrelated with instruments  0.692 6.244e+08
##                                                          pvalue star
## RESET test. H0: model has correct functional form         0.000  ***
## Wald test. H0: appropriate variables dropped              0.000  ***
## Breusch-Pagan test. H0: residuals homoskedastic           0.000  ***
## Shapiro-Wilkes test. H0: residuals normal                 0.000  ***
## Hausman test. H0: Wy is exogenous                         0.000  ***
## Sargan test. H0: residuals uncorrelated with instruments  0.405     
## 
## $OtherStats
##   d l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp nobs
## 1 1 0 0                     16.05         0.4394      0.4821    2  297
## 
## $DescripStats
##                                                                                                                                                                                   desc
## group2      the mean size of the consumer group that regularly camps together during the most aggregated phase of the yearly economic cycles; (Table: 5.01 & 8.01); (Binford 2001:117)
## hunting                                                                                                   Percent dependence on terrestrial animals; (Table: 5.01); (Binford 2001:117)
## gatherin                                                                                                   Percent dependence on terrestrial plants; (Table: 5.01); (Binford 2001:117)
## fishing                                                                                                    Percent dependence on aquatic organisms ; (Table: 5.01); (Binford 2001:117)
## war1                                                                                             Scale of intensity of warfare.  How frequent and how widespread it may be regionally.
## reven                                                                                                       Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)
## nomov                                                                          Total number of annual moves in residence of a household unit; (Table: 5.01 & 8.04); (Binford 2001:117)
## dismov                                                                           Total distance residence moved in a year (sum of all moves); (Table: 5.01 & 8.04); (Binford 2001:117)
## store                                                                                                                                      Dependence upon storage; (Binford 2001:388)
## subdiv2                                                                         Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
## systate3.d2                                                           Classification of foragers: system's state; (Table: 9.01); (Binford 2001:375) == horticulturally augmented cases
## systate3.d1                                                                           Classification of foragers: system's state; (Table: 9.01); (Binford 2001:375) == mounted hunters
## lnarea                                                                                                                                    log of total land area occupied by the group
##             nobs    mean      sd    min     max
## group2       297  74.908  85.420 19.500 650.000
## hunting      339  33.119  20.033  0.000  90.000
## gatherin     339  34.525  24.888  0.010  90.300
## fishing      339  32.391  27.316  0.000  95.000
## war1         339   1.808   0.980  1.000   5.000
## reven        339   2.250   0.672  1.190   5.260
## nomov        261   9.695   9.336  0.100  58.000
## dismov       236 171.661 143.703  4.000 570.000
## store        337   2.318   0.934  1.000   3.000
## subdiv2      339  72.343  10.780 46.550  94.230
## systate3.d2  338   0.056   0.231  0.000   1.000
## systate3.d1  338   0.083   0.276  0.000   1.000
## lnarea       339   4.559   1.802 -0.223   8.795
## 
## $totry
##  [1] "bio.15"          "fishingSq"       "gatherin:war1"  
##  [4] "gatherin:Wy"     "hunting:fishing" "hunting:war1"   
##  [7] "hunting:Wy"      "huntingSq"       "subdiv2"        
## [10] "subdiv2Sq"       "systate3.d1"     "war1:Wy"        
## [13] "lnarea"          "subdiv2"         "systate3.d1"    
## 
## $didwell
## [1] "war1"  "reven"
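
Since h is an ordinary list, individual tables can be extracted by name and filtered like any other dataframe. The sketch below assumes that h$Rmodel is a dataframe with variable names as row names, as the printed output suggests; it uses a toy list holding three rows copied from the $Rmodel table above rather than the real h.

```r
# Toy stand-in for part of h: three rows copied from the $Rmodel output above
toy_h <- list(
  Rmodel = data.frame(
    coef   = c(7.396, 14.872, 1.267),
    hcpval = c(0.03407, 0.00175, 0.00000),
    pval   = c(0.19693, 0.00308, 0.00000),
    row.names = c("reven", "war1", "Wy")
  )
)

# Variables for which the usual and the heteroskedasticity-consistent
# p-values lead to different conclusions at the 0.05 level
rmod <- toy_h$Rmodel
disagree <- rownames(rmod)[(rmod$pval < 0.05) != (rmod$hcpval < 0.05)]
print(disagree)
```

A disagreement of this kind is a reason to look at the Breusch-Pagan result in $Diagnostics before deciding which p-value to report.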

One can also write the list h to a CSV file that can be opened as a spreadsheet. The following command writes h to a file called “olsresults.csv” in the working directory.

CSVwrite(h, "olsresults", FALSE)
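
For readers who want to see how a list of tables can end up in a single spreadsheet file, here is a base R sketch. The function write_list_csv is hypothetical and is not the implementation of CSVwrite(); the layout it produces (element name, table, blank line) is an assumption, and the toy list holds two small items taken from the output above.

```r
# Hypothetical helper: write each element of a named list to one CSV file,
# preceded by its name and followed by a blank line
write_list_csv <- function(lst, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  for (nm in names(lst)) {
    x <- lst[[nm]]
    if (!is.data.frame(x)) x <- data.frame(value = x)
    writeLines(nm, con)
    write.csv(x, con)
    writeLines("", con)
  }
  invisible(path)
}

# Toy list with two elements modeled on h
toy_h <- list(
  DependVarb = "Dependent variable='group2'",
  OtherStats = data.frame(nimp = 2, nobs = 297)
)

csv_path <- tempfile(fileext = ".csv")
write_list_csv(toy_h, csv_path)
print(readLines(csv_path))
```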