Using the Dow-Eff functions in R

The latest version of the Dow-Eff functions (Manual: pdf; html) can perform analyses on four different ethnological datasets:

abbreviation	dataset	codebook
WNAI	Western North American Indians	codebook
SCCS	Standard Cross-Cultural Sample	codebook
EA	Ethnographic Atlas	codebook
LRB	Louis R. Binford's forager data	codebook

The code below outlines the workflow for working with the SCCS.

You will need a number of R packages to run the Dow-Eff functions. These are loaded using the library() command. If R reports that a package is not found, install it first. The following command, for example, installs the package “mice”:

install.packages("mice")

# --set working directory and load needed libraries--
# (replace this path with a directory that exists on your own machine)
setwd("/home/yagmur/Dropbox/functions")

## Error: cannot change working directory

library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(spdep)
library(geosphere)
library(relaimpo)
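
Since several packages are needed, it can be convenient to check for missing ones in a single step. The following is a minimal sketch using only base R; the helper name missing_pkgs is hypothetical and is not part of the Dow-Eff functions.

```r
# Packages loaded above with library()
needed <- c("mice", "foreign", "stringr", "psych", "AER", "spdep",
            "geosphere", "relaimpo")

# Hypothetical helper: return the names of packages that cannot be loaded
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(needed)
print(to_install)

# Install only in an interactive session, and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```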

The Dow-Eff functions, as well as the four ethnological datasets, are contained in an R workspace that is hosted online and can be loaded directly from its URL.

load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
ls()  # list the objects contained in DEf01.Rdata

##  [1] "addesc"    "chK"       "chkpmc"    "CSVwrite"  "doMI"     
##  [6] "doOLS"     "EA"        "EAcov"     "EAfact"    "EAkey"    
## [11] "gSimpStat" "kln"       "llm"       "LRB"       "LRBcov"   
## [16] "LRBfact"   "LRBkey"    "mkdummy"   "SCCS"      "SCCScov"  
## [21] "SCCSfact"  "SCCSkey"   "setDS"     "WNAI"      "WNAIcov"  
## [26] "WNAIfact"  "WNAIkey"
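
load() can also place a workspace's objects in a separate environment, which avoids overwriting objects that already exist in the global environment. The sketch below uses a small temporary workspace in place of DEf01.Rdata; the objects toy_data and toy_key are invented for illustration. The Dow-Eff functions may expect their objects to be in the global environment, so this approach is probably best suited to inspecting a workspace rather than running the analysis.

```r
# Build a small stand-in workspace (toy objects, not the Dow-Eff data)
toy_data <- data.frame(id = 1:3, v1 = c(2, 5, 1))
toy_key  <- data.frame(variable = "v1", description = "A toy variable")
ws_file  <- tempfile(fileext = ".Rdata")
save(toy_data, toy_key, file = ws_file)

# Load into a fresh environment instead of .GlobalEnv
ws <- new.env()
loaded <- load(ws_file, envir = ws)

print(loaded)              # names of the objects that were loaded
print(ls(ws))              # the same names, listed from the environment
print(class(ws$toy_data))  # objects are accessed with ws$name
```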

The setDS(xx) command sets one of the four ethnological datasets as the source for the subsequent analysis. The four valid options for xx are: “WNAI”, “LRB”, “EA”, “SCCS”. The setDS() command creates the following objects:

object name	description
cov	Names of covariates to use during imputation step
dx	The selected ethnological dataset is now called dx
dxf	The factor version of dx
key	A metadata file for dx
wdd	A geographic proximity weight matrix for the societies in dx
wee	An ecological similarity weight matrix for the societies in dx
wll	A linguistic proximity weight matrix for the societies in dx

setDS("SCCS")

The next step in the workflow is to create any new variables and add them to the dataset dx. New variables can be created directly, as in the following example. When created in this way, one should also record a description of the new variable, using the command addesc(). The syntax takes first the name of the new variable, and then the description.

dx$valchild = (dx$v473 + dx$v474 + dx$v475 + dx$v476)
addesc("valchild", "Degree to which society values children")
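
When a composite is built as a plain sum, any society with a missing component gets a missing composite, because NA propagates through addition in R. The sketch below shows this with invented values; the column names a, b, c and d are placeholders, not SCCS variables.

```r
# Toy data: four component scores for three societies (invented values)
toy <- data.frame(a = c(5, 7, NA), b = c(6, 8, 4), c = c(4, 6, 5), d = c(5, 7, 6))

# Same pattern as the valchild example: a plain sum of components
toy$composite <- toy$a + toy$b + toy$c + toy$d
print(toy$composite)

# Count how many rows lost the composite because a component was missing
print(sum(is.na(toy$composite)))
```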

Dummy variables (variables taking on the values zero or one) should be added using the command mkdummy(). This command will, in most cases, automatically record a variable description. Dummy variables are appropriate for categorical variables. The syntax of mkdummy() takes first the categorical variable name, and then the category number (these can be found in the codebook for each ethnological dataset). The resulting dummy variable is named by joining the variable name, “.d”, and the category number (for example, v244.d7).

mkdummy("v244", 7)

## [1] "This dummy variable is named v244.d7"

mkdummy("v245", 2)

## [1] "This dummy variable is named v245.d2"

mkdummy("v233", 6)

## [1] "This dummy variable is named v233.d6"
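
To make the idea concrete, the following base R sketch builds a dummy variable by hand and follows the same naming convention. It is an illustration only, not the implementation of mkdummy(); the helper make_dummy and the data values are invented.

```r
# Toy categorical variable with category codes (invented values)
toy <- data.frame(v244 = c(7, 2, 7, NA, 1))

# Hypothetical helper mimicking the naming convention, not mkdummy() itself
make_dummy <- function(data, varname, category) {
  newname <- paste0(varname, ".d", category)
  # 1 where the variable equals the category, 0 otherwise, NA where missing
  data[[newname]] <- as.integer(data[[varname]] == category)
  data
}

toy <- make_dummy(toy, "v244", 7)
print(toy)
```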

After making any new variables, list the variables you intend to use in your analysis in the following form.

evm <- c("v203", "v204", "v1685", "v156", "v72", "v234", "v236", "v238", "v1648", 
    "v2125", "v155", "valchild", "v244.d7", "v245.d2", "v233.d6", "v1260", "v872")

Missing values of these variables are then imputed using the command doMI(). In the example below, 2 imputed datasets are created (nimp=2) and 3 iterations are used to estimate each imputed value (maxit=3). These values are too low for a real analysis: the defaults, nimp=10 and maxit=7, are reasonable for most purposes. The stacked imputed datasets are collected into a single dataframe, here called smi.

This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These squared variables are given names in the format variable name+“Sq”.

Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name.

smi <- doMI(evm, nimp = 2, maxit = 3)

## [1] "--finding covariates for  valchild --"
## [1] "v1685"
## [1] "v72"
## [1] "v238"
## [1] "v1648"
## [1] "v2125"
## [1] "v872"
## [1] "valchild"
## Time difference of 5.963 secs

dim(smi)  # dimensions of new dataframe smi

## [1] 372  92
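
The 372 rows are consistent with two imputed datasets stacked on top of one another (2 × 186, the number of societies in the SCCS). A stacked dataframe of this kind can be separated again with the “.imp” variable. The sketch below uses an invented toy dataframe, toy_smi, in place of smi.

```r
# Toy stacked data: 2 imputations of 3 societies (invented values)
toy_smi <- data.frame(
  .imp = rep(1:2, each = 3),
  .id  = rep(c("A", "B", "C"), times = 2),
  x    = c(1, 2, 3, 1, 2, 4)
)

# One data frame per imputed dataset
by_imp <- split(toy_smi, toy_smi$.imp)
print(length(by_imp))
print(sapply(by_imp, nrow))

# Mean of x within each imputed dataset
print(sapply(by_imp, function(d) mean(d$x)))
```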

smi[1:2, ]  # first two rows of new dataframe smi

##   .imp            .id v1685 v72 v238 v1648 v2125 v872 valchild v203 v204
## 1    1 Nama Hottentot     3   5    1    18     2   12       20    1    3
## 2    1   Kung Bushmen     4   4    3     1     2   19       28    8    2
##   v156 v234 v236 v155 v244.d7 v245.d2 v233.d6 v1260 mht.name.d12
## 1    1    1    2    1       1       1       0     8            0
## 2    1    1    3    1       0       0       0    10            1
##   mht.name.d13 koeppengei.d1 koeppengei.d4 continent.d1 continent.d2
## 1            0             0             0            1            0
## 2            0             0             0            1            0
##   continent.d5 continent.d7 region.d13 region.d16   bio.1  bio.2    bio.3
## 1            0            0          0          0 -0.2685 0.7742 -0.06142
## 2            0            0          0          0  0.3117 1.6157 -0.02294
##       bio.4   bio.5   bio.6   bio.8   bio.9  bio.10  bio.11  bio.12
## 1 -0.036559 -0.4034 -0.2414 -0.1120 -0.2703 -0.4757 -0.1788 -1.3137
## 2 -0.008244  0.5805 -0.1186  0.5895 -0.1000  0.3065  0.1758 -0.9683
##    bio.13  bio.14  bio.15  bio.16  bio.17  bio.18  bio.19 meanalt   mnnpp
## 1 -1.4884 -0.6640 -0.2948 -1.4912 -0.7019 -1.1292 -0.8585  0.8507 -1.2002
## 2 -0.7272 -0.7231  1.4300 -0.8169 -0.7679 -0.5504 -0.9183  0.7684 -0.7372
##      sdalt       x      y        x2    y2       xy Austronesian NigerCongo
## 1  0.04956 0.02732 -1.677 0.0007464 2.811 -0.04581            0          0
## 2 -0.81449 0.06604 -1.372 0.0043618 1.882 -0.09060            0          0
##   v1685Sq v72Sq v238Sq v1648Sq v2125Sq v872Sq valchildSq v203Sq v204Sq
## 1       9    25      1     324       4    144        400      1      9
## 2      16    16      9       1       4    361        784     64      4
##   v156Sq v234Sq v236Sq v155Sq v1260Sq bio.1Sq bio.2Sq   bio.3Sq   bio.4Sq
## 1      1      1      4      1      64 0.07212  0.5993 0.0037730 1.337e-03
## 2      1      1      9      1     100 0.09717  2.6104 0.0005261 6.796e-05
##   bio.5Sq bio.6Sq bio.8Sq  bio.9Sq bio.10Sq bio.11Sq bio.12Sq bio.13Sq
## 1  0.1627 0.05826 0.01254 0.073047  0.22625  0.03196   1.7259   2.2155
## 2  0.3370 0.01407 0.34756 0.009999  0.09395  0.03090   0.9377   0.5288
##   bio.14Sq bio.15Sq bio.16Sq bio.17Sq bio.18Sq bio.19Sq meanaltSq mnnppSq
## 1   0.4408  0.08691   2.2236   0.4927   1.2750   0.7371    0.7237  1.4405
## 2   0.5229  2.04479   0.6674   0.5897   0.3029   0.8433    0.5905  0.5435
##    sdaltSq
## 1 0.002456
## 2 0.663390
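
The rule for squared variables described above (at least three discrete values, maximum absolute value no more than 300) can be sketched in base R as follows. This is one reading of the rule, with “discrete values” taken to mean distinct non-missing values; the helper needs_square and the data are invented, and doMI() may apply the rule differently in detail.

```r
# Hypothetical helper: does a variable qualify for a squared term?
needs_square <- function(x) {
  x <- x[!is.na(x)]
  length(unique(x)) >= 3 && max(abs(x)) <= 300
}

toy <- data.frame(
  ordinal = c(1, 2, 3, 4),       # 4 distinct values, small range: squared
  dummy   = c(0, 1, 1, 0),       # only 2 distinct values: not squared
  big     = c(10, 500, 900, 50)  # maximum exceeds 300: not squared
)

# Add variable name + "Sq" for each qualifying variable
for (v in names(toy)) {
  if (needs_square(toy[[v]])) {
    toy[[paste0(v, "Sq")]] <- toy[[v]]^2
  }
}
print(names(toy))
```

Under this reading, dummy variables never receive a squared term, which matches the output above: there is no v244.d7Sq column in smi.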

All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.

# --dependent variable--
dpV <- "valchild"
# --independent variables in UNrestricted model--
UiV <- c("v203", "v204", "v1685", "v156", "v72", "v234", "v236", "v238", "v1648", 
    "v2125", "v155", "v244.d7", "v245.d2", "v233.d6", "v872")
# --additional exogenous variables (use in Hausman tests)--
oxog <- c("v1260")
# --independent variables in restricted model (all must be in UiV above)--
RiV <- c("v234", "v236", "v238", "v1648", "v2125", "v244.d7", "v245.d2", "v233.d6", 
    "v872")

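Before estimating, it can help to confirm that every variable is present in the data and that the restricted model is a subset of the unrestricted one. The sketch below uses setdiff() on short, invented stand-ins for names(smi) and the role vectors.

```r
# Column names standing in for names(smi) (a shortened, illustrative list)
smi_names <- c(".imp", ".id", "valchild", "v203", "v234", "v1260", "v872")

dep_var    <- "valchild"
unrestr    <- c("v203", "v234", "v872")
other_exog <- c("v1260")
restr      <- c("v234", "v872")

# Variables listed for the model but missing from the data
not_in_data <- setdiff(c(dep_var, unrestr, other_exog), smi_names)
# Variables in the restricted model that are not in the unrestricted model
not_in_unrestr <- setdiff(restr, unrestr)

print(not_in_data)
print(not_in_unrestr)
stopifnot(length(not_in_data) == 0, length(not_in_unrestr) == 0)
```

With the objects defined above, the corresponding checks would be setdiff(c(dpV, UiV, oxog), names(smi)) and setdiff(RiV, UiV); both should return an empty character vector.
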
The command doOLS() estimates the model on each of the imputed datasets, then collects and combines the output from each estimation to obtain final results. To control for Galton's Problem, a network lag model is used, and the user can choose any combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should keep the default of dw=TRUE, lw=TRUE, ew=FALSE.
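
The network lag term (reported as Wy in the output) can be pictured as a weighted average of the dependent variable over each society's neighbors. The sketch below shows only how such a term is constructed, assuming a row-normalized weight matrix; the matrix and values are invented, and it does not reproduce how doOLS() combines the weight matrices or estimates the model.

```r
# Toy proximity matrix for 3 societies (invented values, zero diagonal)
W <- matrix(c(0, 2, 2,
              1, 0, 3,
              4, 0, 0), nrow = 3, byrow = TRUE)

# Row-normalize so that each row sums to 1
W <- W / rowSums(W)

# Toy dependent variable
y <- c(10, 20, 30)

# Network lag term: for each society, a weighted average of the others' values
Wy <- as.vector(W %*% y)
print(Wy)
```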

There are several options that increase the time doOLS() takes to run: stepW runs a background stepwise regression to find which variables perform best over the set of estimations; relimp calculates the relative importance of each variable in the restricted model, using a technique to partition R²; slmtests calculates Lagrange multiplier tests for spatial dependence using the three weight matrices. Set all of these to FALSE to speed up estimation.

h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE, 
    lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE)

## [1] "--finding optimal weight matrix------"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## Time difference of 22.03 secs
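
A standard way to combine estimates from several imputed datasets is Rubin's rules, in which the pooled variance adds the between-imputation variance to the average within-imputation variance. The sketch below applies these rules to invented numbers; whether doOLS() pools its results in exactly this way is not stated here, so treat it as background rather than a description of the function.

```r
# Coefficient estimates and standard errors from m imputed datasets (invented)
est <- c(0.70, 0.66, 0.74)
se  <- c(0.25, 0.27, 0.24)
m   <- length(est)

pooled_est  <- mean(est)     # pooled point estimate
within_var  <- mean(se^2)    # average within-imputation variance
between_var <- var(est)      # between-imputation variance
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)

print(c(estimate = pooled_est, se = pooled_se))
```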

names(h)

##  [1] "DependVarb"       "URmodel"          "Rmodel"          
##  [4] "EndogeneityTests" "Diagnostics"      "OtherStats"      
##  [7] "DescripStats"     "totry"            "didwell"         
## [10] "dfbetas"          "data"

The output from doOLS, here called h, is a list containing 11 items.

name	description
DependVarb	Description of dependent variable
URmodel	Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs). Two p-values are given for H0: \( \beta \)=0. One is the usual p-value (pval); the other (hcpval) is heteroskedasticity consistent. If stepW=TRUE, the table will also include, in the column stepkept, the proportion of times a variable is retained in the model using stepwise regression.
Rmodel	Coefficient estimates from the restricted model. If relimp=TRUE, the R² assigned to each independent variable is shown here.
EndogeneityTests	Hausman tests (H0: variable is exogenous), with F-statistic for weak instruments (a rule of thumb is that the instrument is weak if the F-statistic is below 10), and Sargan test (H0: instrument is uncorrelated with second-stage 2SLS residuals).
Diagnostics	Regression diagnostics for the restricted model: RESET test (H0: model has correct functional form); Wald test (H0: appropriate variables dropped); Breusch-Pagan test (H0: residuals homoskedastic); Shapiro-Wilk test (H0: residuals normal); Hausman test (H0: Wy is exogenous); Sargan test (H0: residuals uncorrelated with instruments for Wy). If slmtests=TRUE, the Lagrange multiplier tests (H0: spatial lag term not needed) are reported here.
OtherStats	Other statistics: Composite weight matrix weights (see details); R² for restricted model and unrestricted model; number of imputations; number of observations; F-statistic for weak instruments for Wy.
DescripStats	Descriptive statistics for variables in unrestricted model.
totry	Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model.
didwell	Character string of variables that were most significant in the unrestricted model.
dfbetas	Influential observations for dfbetas (see details)
data	Data as used in the estimations. Observations with missing values of the dependent variable have been dropped.

The last two items in the list are large, but the first nine provide a nice overview.

h[1:9]

## $DependVarb
## [1] "Dependent variable='valchild': Degree to which society values children"
## 
## $URmodel
##                 coef  stdcoef   VIF stepkept  hcpval    pval star
## (Intercept) 17.40971      NaN   NaN        1 0.00440 0.01723   **
## v155         0.22874  0.05976 1.591        0 0.55618 0.53735     
## v156         0.17864  0.04979 2.922        0 0.68153 0.70475     
## v1648       -0.09212 -0.11235 1.127        0 0.16588 0.17333     
## v1685       -0.34420 -0.07709 1.103        0 0.33082 0.33852     
## v203        -0.68437 -0.18118 1.716        1 0.05594 0.07135    *
## v204        -0.46587 -0.14443 2.262        1 0.19446 0.21074     
## v2125       -0.61545 -0.09756 1.307        0 0.22312 0.26795     
## v233.d6      1.66883  0.14674 1.633        1 0.14347 0.13415     
## v234        -0.64816 -0.27745 2.487        1 0.00690 0.02181   **
## v236        -0.43455 -0.04549 1.251        0 0.56154 0.59570     
## v238        -0.08301 -0.01743 1.642        0 0.84989 0.85923     
## v244.d7      1.05684  0.08973 2.812        0 0.46735 0.48584     
## v245.d2     -2.14642 -0.17471 2.777        1 0.14197 0.17136     
## v72         -0.79979 -0.16563 1.148        1 0.02350 0.04413   **
## v872         0.00487  0.02376 1.097        0 0.75570 0.76802     
## Wy           0.68902  0.21352 1.174        1 0.00293 0.01015   **
##                                                                                                     desc
## (Intercept)                                                                                         <NA>
## v155                                                                                      Scale 7- Money
## v156                                                                      Scale 8- Density of Population
## v1648                                                     Overall Frequency of Warfare (Resolved Rating)
## v1685                                                       Chronic Resource Problems (Resolved Ratings)
## v203                                                                   Dependence on Gathering (Atlas 1)
## v204                                                                     Dependence on Hunting (Atlas 3)
## v2125                     Importance of Wage Labor inside the Community or outside (If Return Migration)
## v233.d6                                                                 Major Crop Type == Cereal grains
## v234                                                                                 Settlement Patterns
## v236                                                         Jurisdictional Hierarchy of Local Community
## v238                                                                                           High Gods
## v244.d7     Predominant Type of Animal Husbandry == Bovine animals (cattle, mithun, water buffalo, yaks)
## v245.d2                               Milking of Domestic Animals == Milked more often than sporadically
## v72                                                                              Intercommunity Marriage
## v872          Percentage of Married Women Polygynously Married (Share Husband with One or More Co-wives)
## Wy                                                                                      Network lag term
## 
## $Rmodel
##                 coef  stdcoef   VIF  relimp  hcpval    pval star
## (Intercept)  9.29992      NaN   NaN     NaN 0.11608 0.16886     
## v1648       -0.06457 -0.07869 1.081 0.00522 0.34034 0.34414     
## v2125       -0.38562 -0.06116 1.205 0.00143 0.43440 0.47610     
## v233.d6      1.97238  0.17343 1.594 0.01897 0.08123 0.07894    *
## v234        -0.19098 -0.08175 1.428 0.00242 0.34510 0.38160     
## v236        -0.19247 -0.02015 1.165 0.00029 0.80847 0.81131     
## v238         0.11720  0.02461 1.550 0.00051 0.79133 0.80053     
## v244.d7      1.38313  0.11743 2.672 0.00660 0.34556 0.35813     
## v245.d2     -1.63251 -0.13288 2.463 0.00372 0.24735 0.27899     
## v872        -0.00507 -0.02453 1.046 0.00059 0.75437 0.75864     
## Wy           0.69456  0.21525 1.054 0.04396 0.00268 0.00734  ***
##                                                                                                     desc
## (Intercept)                                                                                         <NA>
## v1648                                                     Overall Frequency of Warfare (Resolved Rating)
## v2125                     Importance of Wage Labor inside the Community or outside (If Return Migration)
## v233.d6                                                                 Major Crop Type == Cereal grains
## v234                                                                                 Settlement Patterns
## v236                                                         Jurisdictional Hierarchy of Local Community
## v238                                                                                           High Gods
## v244.d7     Predominant Type of Animal Husbandry == Bovine animals (cattle, mithun, water buffalo, yaks)
## v245.d2                               Milking of Domestic Animals == Milked more often than sporadically
## v872          Percentage of Married Women Polygynously Married (Share Husband with One or More Co-wives)
## Wy                                                                                      Network lag term
## 
## $EndogeneityTests
##         weakidF p.Sargan n.IV Fstat       df pvalue star
## v1648     3.995    0.000    6 0.000   117811  1.000     
## v2125     5.554    0.169    5 0.024        9  0.879     
## v233.d6   9.000    0.039    6 1.547   125012  0.214     
## v234     12.219    0.085    7 0.070   713974  0.792     
## v236      5.529    0.000    3 0.000  1156641  1.000     
## v238      4.905    0.147    5 0.587 16598184  0.444     
## v244.d7   4.000    0.129    6 0.331      224  0.566     
## v245.d2   5.875    0.073    7 1.322    22293  0.250     
## v872      7.354    0.000    9 0.000       94  1.000     
## 
## $Diagnostics
##                                                          Fstat       df
## RESET test. H0: model has correct functional form        1.393    368.8
## Wald test. H0: appropriate variables dropped             4.000  32191.0
## Breusch-Pagan test. H0: residuals homoskedastic          0.820 151437.2
## Shapiro-Wilkes test. H0: residuals normal                0.251   7259.3
## Hausman test. H0: Wy is exogenous                        2.000   1207.0
## Sargan test. H0: residuals uncorrelated with instruments 0.265 131690.0
##                                                          pvalue star
## RESET test. H0: model has correct functional form         0.239     
## Wald test. H0: appropriate variables dropped              0.000  ***
## Breusch-Pagan test. H0: residuals homoskedastic           0.365     
## Shapiro-Wilkes test. H0: residuals normal                 0.616     
## Hausman test. H0: Wy is exogenous                         0.000  ***
## Sargan test. H0: residuals uncorrelated with instruments  0.606     
## 
## $OtherStats
##   d l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp nobs
## 1 1 0 0                     8.423        0.02203     0.09522    2  171
## 
## $DescripStats
##                                                                                                  desc
## valchild                                                      Degree to which society values children
## v203                                                                Dependence on Gathering (Atlas 1)
## v204                                                                  Dependence on Hunting (Atlas 3)
## v1685                                                    Chronic Resource Problems (Resolved Ratings)
## v156                                                                   Scale 8- Density of Population
## v72                                                                           Intercommunity Marriage
## v234                                                                              Settlement Patterns
## v236                                                      Jurisdictional Hierarchy of Local Community
## v238                                                                                        High Gods
## v1648                                                  Overall Frequency of Warfare (Resolved Rating)
## v2125                  Importance of Wage Labor inside the Community or outside (If Return Migration)
## v155                                                                                   Scale 7- Money
## v244.d7  Predominant Type of Animal Husbandry == Bovine animals (cattle, mithun, water buffalo, yaks)
## v245.d2                            Milking of Domestic Animals == Milked more often than sporadically
## v233.d6                                                              Major Crop Type == Cereal grains
## v872       Percentage of Married Women Polygynously Married (Share Husband with One or More Co-wives)
##          nobs   mean     sd min max
## valchild  171 24.023  5.698   8  36
## v203      186  1.108  1.492   0   8
## v204      186  1.554  1.730   0   9
## v1685     144  2.139  1.277   1   5
## v156      186  2.860  1.557   1   5
## v72       185  3.195  1.200   1   5
## v234      186  4.925  2.411   1   8
## v236      186  2.887  0.600   2   4
## v238      168  2.149  1.192   1   4
## v1648     160 10.262  6.863   1  18
## v2125     156  1.974  0.908   1   3
## v155      186  2.511  1.479   1   5
## v244.d7   186  0.366  0.483   0   1
## v245.d2   186  0.306  0.462   0   1
## v233.d6   186  0.489  0.501   0   1
## v872      143 25.287 27.855   0  97
## 
## $totry
##  [1] "bio.14Sq"   "bio.15Sq"   "bio.17Sq"   "mnnpp"      "v203"      
##  [6] "v72"        "valchildSq" "v203"       "v204"       "v72"       
## 
## $didwell
## [1] "v233.d6" "v234"    "v245.d2" "v1648"   "v244.d7"
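
The tables in h can be filtered and sorted like any other table in R. The sketch below assumes that h$URmodel behaves as a data frame with the column names shown in the output; a small data frame, ur, built from a few rows copied from the $URmodel output above, stands in for it.

```r
# A few rows copied from the $URmodel output above
ur <- data.frame(
  coef = c(-0.68437, -0.64816, -0.79979, 0.00487, 0.68902),
  pval = c(0.07135, 0.02181, 0.04413, 0.76802, 0.01015),
  row.names = c("v203", "v234", "v72", "v872", "Wy")
)

# Keep rows with a p-value below 0.05, ordered by p-value
sig <- ur[ur$pval < 0.05, ]
sig <- sig[order(sig$pval), ]
print(sig)
```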

One can also write the list h to a CSV file that can be opened as a spreadsheet. The following command writes h to a file called “olsresults.csv” in the working directory.

CSVwrite(h, "olsresults", FALSE)
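
For readers curious how a list of tables might end up in a single CSV file, the following base R sketch writes each table under its own title line. It is not the implementation of CSVwrite(); the list res and the file name toy_results.csv are invented for illustration.

```r
# Toy list of tables (invented values)
res <- list(
  Coefficients = data.frame(term = c("x1", "x2"), coef = c(0.5, -1.2)),
  Stats        = data.frame(nobs = 171, nimp = 2)
)

out_file <- file.path(tempdir(), "toy_results.csv")
con <- file(out_file, open = "w")
for (nm in names(res)) {
  writeLines(nm, con)                            # section title
  write.csv(res[[nm]], con, row.names = FALSE)   # the table itself
  writeLines("", con)                            # blank line between sections
}
close(con)

print(readLines(out_file))
```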