The latest version of the Dow-Eff functions (Manual: pdf; html) can perform analyses on four different ethnological datasets:
abbreviation | dataset | codebook |
---|---|---|
WNAI | Western North American Indians | codebook |
SCCS | Standard Cross-Cultural Sample | codebook |
EA | Ethnographic Atlas | codebook |
LRB | Louis R. Binford's forager data | codebook |
The code below outlines the workflow for working with the EA.
You will need a number of R packages to run the Dow-Eff functions. These are loaded using the library() command. If a package is “not found”, it must first be installed. For example, the following command installs a package named “mice”:
install.packages("mice")
# --set working directory and load needed libraries--
setwd("e:/Dropbox/functions/")
library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(spdep)
library(geosphere)
library(relaimpo)
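The install-then-load step can be checked in one pass. The following is a minimal sketch, assuming the packages are available from your configured CRAN mirror; the helper name `missing_packages` is hypothetical and is not part of the Dow-Eff functions.

```r
# Packages loaded with library() above
needed <- c("mice", "foreign", "stringr", "psych", "AER",
            "spdep", "geosphere", "relaimpo")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
# system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
  paths <- vapply(pkgs, function(p) system.file(package = p), character(1))
  pkgs[!nzchar(paths)]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Leaving the install.packages() call commented out keeps the check free of side effects; uncomment it once the list looks right.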
The Dow-Eff functions, as well as the four ethnological datasets, are contained in an R-workspace, located in the cloud.
load(url("http://dl.dropbox.com/u/9256203/DEf01.Rdata"), .GlobalEnv)
ls() #-can see the objects contained in DEf01.Rdata
## [1] "addesc" "chK" "chkpmc" "CSVwrite" "doMI"
## [6] "doOLS" "EA" "EAcov" "EAfact" "EAkey"
## [11] "gSimpStat" "kln" "llm" "LRB" "LRBcov"
## [16] "LRBfact" "LRBkey" "mkdummy" "SCCS" "SCCScov"
## [21] "SCCSfact" "SCCSkey" "setDS" "WNAI" "WNAIcov"
## [26] "WNAIfact" "WNAIkey"
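Loading a workspace into .GlobalEnv overwrites any existing objects that share a name with the loaded ones. If that is a concern, one option is to load into a separate environment first and inspect the contents there. The sketch below uses a small temporary file with made-up objects as a stand-in for DEf01.Rdata; the names `toy_data`, `toy_fun` and `ws` are illustrative only.

```r
# Stand-in workspace: two toy objects saved to a temporary .Rdata file
tmp <- tempfile(fileext = ".Rdata")
toy_data <- data.frame(v1 = 1:3)
toy_fun  <- function() "hello"
save(toy_data, toy_fun, file = tmp)

# Load into a fresh environment instead of .GlobalEnv;
# load() returns the names of the objects it created
ws <- new.env()
loaded <- load(tmp, envir = ws)

# Separate functions from data objects
is_fun <- vapply(loaded, function(nm) is.function(get(nm, envir = ws)),
                 logical(1))
fun_names  <- loaded[is_fun]
data_names <- loaded[!is_fun]
print(fun_names)
print(data_names)
```

Note that the Dow-Eff functions may expect to find their datasets in the global environment, so this is a way to inspect the workspace rather than a replacement for the load() call above.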
The setDS( xx ) command sets one of the four ethnological datasets as the source for the subsequent analysis. The four valid options for xx are: “WNAI”, “LRB”, “EA”, “SCCS”. The setDS() command creates the following objects:
object name | description |
---|---|
cov | Names of covariates to use during imputation step |
dx | The selected ethnological dataset is now called dx |
dxf | The factor version of dx |
key | A metadata file for dx |
wdd | A geographic proximity weight matrix for the societies in dx |
wee | An ecological similarity weight matrix for the societies in dx |
wll | A linguistic proximity weight matrix for the societies in dx |
setDS("EA")
The next step in the workflow is to create any new variables and add them to the dataset dx. New variables can be created directly, as in the following example. When a variable is created in this way, one should also record a description of it using the command addesc(), which takes the name of the new variable first and then the description.
dx$matriland <- ((dx$v74 == 2) + (dx$v74 == 3)) * 1
addesc("matriland", "Land inherited matrilineally")
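The expression used for matriland gives 1 where v74 is 2 or 3, 0 for any other code, and NA where v74 is missing, so missing values are preserved for the later imputation step. The sketch below shows this on a made-up vector standing in for dx$v74, and contrasts it with a `%in%` shortcut that would recode missing values to 0.

```r
# Toy stand-in for dx$v74 (values are made up for illustration)
v74 <- c(1, 2, 3, NA, 7)

# Same construction as above: NA stays NA
matriland <- ((v74 == 2) + (v74 == 3)) * 1
print(matriland)

# Shortcut that silently recodes NA to 0, which hides missingness
matriland_in <- as.integer(v74 %in% c(2, 3))
print(matriland_in)
```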
Dummy variables (variables taking on the values zero or one) should be added using the command mkdummy(). This command will, in most cases, automatically record a variable description. Dummy variables are appropriate for categorical variables. The syntax of mkdummy() takes first the categorical variable name, and then the category number (these can be found in the codebook for each ethnological dataset). Note that the resulting dummy variable will be called variable name+“.d”+category number.
mkdummy("v11", 3)
## [1] "This dummy variable is named v11.d3"
mkdummy("v41", 2)
## [1] "This dummy variable is named v41.d2"
mkdummy("v44", 9)
## [1] "This dummy variable is named v44.d9"
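The idea behind a dummy variable and the naming convention can be sketched in a few lines. This is an illustration only: `make_dummy` is a hypothetical helper, not the source of mkdummy(), which also works on dx directly and records a description.

```r
# Hypothetical helper illustrating the idea; NOT the mkdummy() source
make_dummy <- function(data, varname, category) {
  new_name <- paste0(varname, ".d", category)
  # 1 where the variable equals the category, 0 otherwise, NA stays NA
  data[[new_name]] <- as.integer(data[[varname]] == category)
  data
}

# Toy data (made-up values)
toy <- data.frame(v11 = c(1, 3, NA, 3, 2))
toy <- make_dummy(toy, "v11", 3)
print(toy)
```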
After making any new variables, list the variables you intend to use in your analysis in the following form.
evm <- c("v1", "v2", "v3", "v4", "v5", "v30", "v31", "v32", "v33", "v11.d3",
"v41.d2", "v44.d9", "v34", "matriland")
Missing values of these variables are then imputed, using the command doMI(). Below, the number of imputed datasets is 2, and 3 iterations are used to estimate each imputed value (these values are too low: nimp=10 and maxit=7 are the defaults and are reasonable for most purposes). The stacked imputed datasets are collected into a single dataframe which here is called smi.
This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These squared variables are given names in the format variable name+“Sq”.
Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name.
smi <- doMI(evm, nimp = 2, maxit = 3)
## [1] "--finding covariates for v41.d2, matriland --"
## [1] "v1"
## [1] "v2"
## [1] "v3"
## [1] "v4"
## [1] "v5"
## [1] "v30"
## [1] "v31"
## [1] "v32"
## [1] "v33"
## [1] "v11.d3"
## [1] "v44.d9"
## [1] "v34"
## [1] "v41.d2"
## [1] "matriland"
## Time difference of 22.38 secs
dim(smi) # dimensions of new dataframe smi
## [1] 2534 86
smi[1:2, ] # first two rows of new dataframe smi
## .imp .id v1 v2 v3 v4 v5 v30 v31 v32 v33 v11.d3 v44.d9 v34 v41.d2
## 1 1 Kung 8 2 0 0 0 1 1 2 1 0 1 3 0
## 2 1 Herero 1 3 0 6 0 1 1 2 1 0 0 2 1
## matriland mht.name.d2 mht.name.d13 mht.name.d14 koeppengei.d1
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## koeppengei.d4 continent.d1 continent.d2 continent.d5 region.d6
## 1 0 1 0 0 0
## 2 0 1 0 0 0
## region.d14 region.d22 bio.1 bio.2 bio.3 bio.4 bio.5 bio.6
## 1 0 0 -0.0007323 2.055 -0.09546 0.2601 0.3418 -0.4992
## 2 0 0 -0.0144016 1.841 0.31145 -0.1412 0.1073 -0.2458
## bio.8 bio.9 bio.10 bio.11 bio.12 bio.13 bio.14 bio.15 bio.16
## 1 0.3975 -0.4677 0.09138 -0.129634 -1.0900 -1.0593 -0.5991 0.7683 -1.0361
## 2 0.1821 -0.3235 -0.25997 -0.004034 -0.9801 -0.8133 -0.6182 1.4108 -0.7776
## bio.17 bio.18 bio.19 meanalt mnnpp sdalt x y x2
## 1 -0.6400 -0.6337 -0.8421 0.9607 -1.0551 -0.7410 0.2096 -1.701 0.04395
## 2 -0.6689 -0.6223 -0.8565 1.1489 -0.9896 -0.5175 0.1744 -1.599 0.03041
## y2 xy Austronesian NigerCongo v1Sq v2Sq v3Sq v4Sq v5Sq v30Sq
## 1 2.893 -0.3566 0 0 64 4 0 0 0 1
## 2 2.555 -0.2787 0 1 1 9 0 36 0 1
## v31Sq v32Sq v33Sq v34Sq bio.1Sq bio.2Sq bio.3Sq bio.4Sq bio.5Sq
## 1 1 4 1 9 5.362e-07 4.221 0.009112 0.06766 0.11682
## 2 1 4 1 4 2.074e-04 3.390 0.097001 0.01993 0.01152
## bio.6Sq bio.8Sq bio.9Sq bio.10Sq bio.11Sq bio.12Sq bio.13Sq bio.14Sq
## 1 0.24922 0.15803 0.2187 0.008351 1.680e-02 1.1880 1.1220 0.3589
## 2 0.06042 0.03315 0.1047 0.067584 1.627e-05 0.9606 0.6614 0.3822
## bio.15Sq bio.16Sq bio.17Sq bio.18Sq bio.19Sq meanaltSq mnnppSq sdaltSq
## 1 0.5903 1.0736 0.4096 0.4016 0.7091 0.9229 1.1133 0.5491
## 2 1.9904 0.6047 0.4474 0.3873 0.7336 1.3199 0.9794 0.2679
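The rule for squared variables (at least three discrete values, maximum absolute value no more than 300) can be sketched as follows. This is a re-creation of the stated rule on made-up data, not the doMI() source; `needs_square` and `add_squares` are hypothetical names.

```r
# Hypothetical re-creation of the stated rule; NOT the doMI() source
needs_square <- function(x) {
  x <- x[!is.na(x)]
  length(unique(x)) >= 3 && max(abs(x)) <= 300
}

add_squares <- function(data) {
  for (nm in names(data)) {
    if (is.numeric(data[[nm]]) && needs_square(data[[nm]])) {
      data[[paste0(nm, "Sq")]] <- data[[nm]]^2
    }
  }
  data
}

toy <- data.frame(
  v1    = c(8, 1, 3, 0),       # 4 distinct values, small: squared
  dummy = c(0, 1, 1, 0),       # only 2 distinct values: not squared
  big   = c(10, 500, 20, 30)   # max |x| above 300: not squared
)
toy <- add_squares(toy)
print(names(toy))
```

This matches the pattern in the output above, where v1 has a companion v1Sq but the dummy variables do not.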
All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.
# --dependent variable--
dpV <- "v11.d3"
# --independent variables in UNrestricted model--
UiV <- c("v1", "v2", "v3", "v4", "v5", "v30", "v31", "v32", "v33", "v41.d2",
"v44.d9", "v34", "matriland")
# --additional exogenous variables (use in Hausman tests)--
oxog <- NULL
# --independent variables in restricted model (all must be in UiV above)--
RiV <- c("v30", "v31", "v32", "v33", "v41.d2", "v44.d9", "v34")
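Before estimation it is worth confirming that every model variable exists in the imputed data frame and that the restricted set is nested in the unrestricted set. A minimal sketch on a toy stand-in follows; the `_toy` objects are illustrative, and with the real objects one would use smi, dpV, UiV and RiV instead.

```r
# Toy stand-in for smi with a few of the columns shown above
smi_toy <- data.frame(.imp = 1, .id = "Kung", v11.d3 = 0, v30 = 1, v31 = 1,
                      matriland = 0)

dpV_toy <- "v11.d3"
UiV_toy <- c("v30", "v31", "matriland")
RiV_toy <- c("v30", "v31")

# Variables named in the model but absent from the data frame
not_found <- setdiff(c(dpV_toy, UiV_toy), names(smi_toy))
# Restricted-model variables that are not in the unrestricted model
not_nested <- setdiff(RiV_toy, UiV_toy)

# Stop with an error if either check fails
stopifnot(length(not_found) == 0, length(not_nested) == 0)
```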
The command doOLS() estimates the model on each of the imputed datasets, collects the output from each estimation, and combines it to obtain final results. To control for Galton's Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
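To make the network lag term concrete, the sketch below builds a lag Wy from two toy weight matrices. It assumes row-standardized matrices with a zero diagonal and a composite matrix formed as a weighted sum; how doOLS() actually standardizes and combines its matrices is not stated here, so treat every step and every value as an assumption. The sketch only constructs the term and does not reproduce the estimation.

```r
# Toy proximity matrices for three societies (made-up values, zero diagonal)
Wd <- matrix(c(0, 2, 1,
               2, 0, 1,
               1, 1, 0), nrow = 3, byrow = TRUE)   # geographic proximity
Wl <- matrix(c(0, 1, 0,
               1, 0, 3,
               0, 3, 0), nrow = 3, byrow = TRUE)   # linguistic proximity

# Scale each row to sum to 1 so that W %*% y is a weighted average
row_standardize <- function(W) W / rowSums(W)

# Assumed composite: a weighted sum of the two standardized matrices,
# with illustrative weights that add up to 1
a_d <- 0.25
a_l <- 0.75
W <- a_d * row_standardize(Wd) + a_l * row_standardize(Wl)

y  <- c(1, 0, 1)          # toy dependent variable
Wy <- as.vector(W %*% y)  # network lag: weighted mean of the other societies' y
print(rowSums(W))
print(Wy)
```

Because each row of W sums to 1, each element of Wy is an average of the dependent variable over the other societies, weighted by how closely they are connected.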
There are several options that increase the time doOLS() takes to run: stepW runs a background stepwise regression to find which variables perform best over the set of estimations; relimp calculates the relative importance of each variable in the restricted model, using a technique to partition R²; slmtests calculates Lagrange multiplier tests for spatial dependence using the three weight matrices. All of these should be set to FALSE if one wishes to speed up estimation times.
h <- doOLS(smi, depvar = dpV, indpv = UiV, rindpv = RiV, othexog = oxog, dw = TRUE,
lw = TRUE, ew = FALSE, stepW = TRUE, relimp = TRUE, slmtests = FALSE)
## [1] "--finding optimal weight matrix------"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## Time difference of 4.064 mins
names(h)
## [1] "DependVarb" "URmodel" "Rmodel"
## [4] "EndogeneityTests" "Diagnostics" "OtherStats"
## [7] "DescripStats" "totry" "didwell"
## [10] "dfbetas" "data"
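doOLS() fits the model once per imputed dataset and then combines the results. This page does not say how the combination is done; a standard approach for multiply imputed data is Rubin's rules, sketched below on simulated data. This illustrates the arithmetic only and is not a description of the doOLS() internals. The data are random draws, not real imputations.

```r
set.seed(1)
# Toy stacked data: 3 "imputed" datasets of 50 rows each (simulated values)
m <- 3
stacked <- do.call(rbind, lapply(seq_len(m), function(i) {
  x <- rnorm(50)
  data.frame(.imp = i, x = x, y = 0.5 * x + rnorm(50))
}))

# Fit the same model on each imputed dataset
fits <- lapply(split(stacked, stacked$.imp), function(d) lm(y ~ x, data = d))
est  <- sapply(fits, function(f) coef(f)["x"])        # point estimates
vars <- sapply(fits, function(f) vcov(f)["x", "x"])   # squared std. errors

# Rubin's rules
pooled_est  <- mean(est)    # pooled coefficient
within_var  <- mean(vars)   # average sampling variance
between_var <- var(est)     # variance of estimates across imputations
total_var   <- within_var + (1 + 1 / m) * between_var
pooled_se   <- sqrt(total_var)
print(c(estimate = pooled_est, se = pooled_se))
```

The between-imputation term is what makes the pooled standard error larger than the one from any single imputed dataset, reflecting the uncertainty about the missing values.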
The output from doOLS, here called h, is a list containing 11 items.
name | description |
---|---|
DependVarb | Description of dependent variable |
URmodel | Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs). Two p-values are given for H0: \( \beta \)=0. One is the usual p-value, the other (hcpval) is heteroskedasticity consistent. If stepW=TRUE, the table will also include the proportion of times a variable is retained in the model using stepwise regression (stepkept). |
Rmodel | Coefficient estimates from the restricted model. If relimp=TRUE, the R² assigned to each independent variable is shown here. |
EndogeneityTests | Hausman tests (H0: variable is exogenous), with F-statistic for weak instruments (a rule of thumb is that the instrument is weak if the F-stat is below 10), and Sargan test (H0: instrument is uncorrelated with second-stage 2SLS residuals). |
Diagnostics | Regression diagnostics for the restricted model: RESET test (H0: model has correct functional form); Wald test (H0: appropriate variables dropped); Breusch-Pagan test (H0: residuals homoskedastic); Shapiro-Wilk test (H0: residuals normal); Hausman test (H0: Wy is exogenous); Sargan test (H0: residuals uncorrelated with instruments for Wy). If slmtests=TRUE, the Lagrange multiplier tests (H0: spatial lag term not needed) are reported here. |
OtherStats | Other statistics: composite weight matrix weights (see details); R² for restricted model and unrestricted model; number of imputations; number of observations; F-statistic for weak instruments for Wy. |
DescripStats | Descriptive statistics for variables in unrestricted model. |
totry | Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model. |
didwell | Character string of variables that were most significant in the unrestricted model. |
dfbetas | Influential observations for dfbetas (see details) |
data | Data as used in the estimations. Observations with missing values of the dependent variable have been dropped. |
The last two items in the list are large, but the first nine provide a nice overview.
h[1:9]
## $DependVarb
## [1] "Dependent variable='v11.d3': Transfer of Residence at Marriage: After First Years == Husband to wife's group"
##
## $URmodel
## coef stdcoef VIF stepkept hcpval pval star
## (Intercept) -1.33619 NaN NaN 1 0.34800 0.32953
## matriland 0.36971 0.30755 1.119 1 0.00074 0.01064 **
## v1 0.09803 0.42110 147.268 1 0.46576 0.44669
## v2 0.10689 0.44936 141.663 1 0.43602 0.41625
## v3 0.09774 0.45493 171.301 1 0.47179 0.52491
## v30 -0.00990 -0.05898 2.728 1 0.14775 0.13903
## v31 -0.00276 -0.01519 1.813 1 0.71130 0.72348
## v32 0.02828 0.04944 1.200 1 0.03251 0.05893 *
## v33 -0.00335 -0.00933 1.642 0 0.77789 0.80020
## v34 -0.00617 -0.01916 1.420 0 0.47548 0.51238
## v4 0.10468 0.50825 188.091 1 0.43697 0.41801
## v41.d2 -0.01190 -0.01484 2.389 0 0.60856 0.70304
## v44.d9 0.00576 0.00778 2.208 0 0.83525 0.83250
## v5 0.11011 0.81033 431.568 1 0.42411 0.40543
## Wy 2.64664 0.38923 1.896 1 0.00000 0.00000 ***
## desc
## (Intercept) <NA>
## matriland Land inherited matrilineally
## v1 Gathering
## v2 Hunting
## v3 Fishing
## v30 Settlement Patterns
## v31 Mean Size of Local Communities
## v32 Jurisdictional Hierarchy of Local Community
## v33 Jurisdictional Hierarchy Beyond Local Community
## v34 High Gods
## v4 Animal Husbandry
## v41.d2 Milking of Domestic Animals == milked more often than sporadically
## v44.d9 Sex Differences: Metal Working == absent or unimportant activity
## v5 Agriculture
## Wy Network lag term
##
## $Rmodel
## coef stdcoef VIF relimp hcpval pval star
## (Intercept) -0.36813 NaN NaN NaN 0.00000 0.00000 ***
## v30 0.00239 0.01421 1.432 0.00093 0.64753 0.63905
## v31 -0.00838 -0.04642 1.731 0.00784 0.19854 0.22897
## v32 0.04211 0.07363 1.168 0.00275 0.00361 0.00811 ***
## v33 0.00145 0.00423 1.632 0.00447 0.88993 0.90397
## v34 0.00019 0.00077 1.362 0.00584 0.98439 0.98539
## v41.d2 -0.03291 -0.04103 1.754 0.02103 0.17315 0.24733
## v44.d9 -0.04960 -0.06709 1.834 0.01145 0.05856 0.04976 **
## Wy 3.06380 0.45057 1.675 0.14295 0.00000 0.00000 ***
## desc
## (Intercept) <NA>
## v30 Settlement Patterns
## v31 Mean Size of Local Communities
## v32 Jurisdictional Hierarchy of Local Community
## v33 Jurisdictional Hierarchy Beyond Local Community
## v34 High Gods
## v41.d2 Milking of Domestic Animals == milked more often than sporadically
## v44.d9 Sex Differences: Metal Working == absent or unimportant activity
## Wy Network lag term
##
## $EndogeneityTests
## weakidF p.Sargan n.IV Fstat df pvalue star
## v30 39.354 0.002 11.5 7.245 448620 0.007 ***
## v31 14.574 0.551 7.5 0.243 3 0.656
## v32 7.000 0.000 9.0 0.000 4 1.000
## v33 9.342 0.506 8.0 0.013 1580 0.910
## v34 14.136 0.001 10.0 0.059 38 0.809
## v41.d2 18.791 0.008 15.0 0.310 176 0.579
## v44.d9 37.000 0.000 15.0 0.000 46 1.000
##
## $Diagnostics
## Fstat df
## RESET test. H0: model has correct functional form 37.93 158.15
## Wald test. H0: appropriate variables dropped 49.00 2.00
## Breusch-Pagan test. H0: residuals homoskedastic 174.32 50.89
## Shapiro-Wilkes test. H0: residuals normal 132.21 85680.02
## Hausman test. H0: Wy is exogenous 59.00 476.00
## Sargan test. H0: residuals uncorrelated with instruments 20.77 27798.04
## pvalue star
## RESET test. H0: model has correct functional form 0 ***
## Wald test. H0: appropriate variables dropped 0 ***
## Breusch-Pagan test. H0: residuals homoskedastic 0 ***
## Shapiro-Wilkes test. H0: residuals normal 0 ***
## Hausman test. H0: Wy is exogenous 0 ***
## Sargan test. H0: residuals uncorrelated with instruments 0 ***
##
## $OtherStats
## d l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp
## 1 0.25 0.75 0 48.94 0.237 0.3114 2
## nobs
## 1 1235
##
## $DescripStats
## desc
## v11.d3 Transfer of Residence at Marriage: After First Years == Husband to wife's group
## v1 Gathering
## v2 Hunting
## v3 Fishing
## v4 Animal Husbandry
## v5 Agriculture
## v30 Settlement Patterns
## v31 Mean Size of Local Communities
## v32 Jurisdictional Hierarchy of Local Community
## v33 Jurisdictional Hierarchy Beyond Local Community
## v41.d2 Milking of Domestic Animals == milked more often than sporadically
## v44.d9 Sex Differences: Metal Working == absent or unimportant activity
## v34 High Gods
## matriland Land inherited matrilineally
## nobs mean sd min max
## v11.d3 1235 0.162 0.369 0 1
## v1 1266 1.019 1.588 0 8
## v2 1266 1.441 1.551 0 9
## v3 1266 1.532 1.707 0 9
## v4 1266 1.558 1.798 0 9
## v5 1266 4.451 2.712 0 9
## v30 1163 5.102 2.219 1 8
## v31 586 3.618 2.245 1 8
## v32 1143 1.824 0.644 1 3
## v33 1131 1.905 1.049 1 5
## v41.d2 1158 0.307 0.462 0 1
## v44.d9 951 0.543 0.498 0 1
## v34 748 2.170 1.169 1 4
## matriland 831 0.110 0.312 0 1
##
## $totry
## [1] "bio.1" "bio.10" "bio.10Sq" "bio.11" "bio.12"
## [6] "bio.13" "bio.16" "bio.18" "bio.1Sq" "bio.3"
## [11] "bio.4" "bio.5" "bio.5Sq" "bio.6" "bio.8"
## [16] "bio.8Sq" "bio.9" "bio.9Sq" "matriland" "v1"
## [21] "v1Sq" "v30:v31" "v30Sq" "v31:v44.d9" "v32:v44.d9"
## [26] "v32:Wy" "v32Sq" "v33:v44.d9" "v34:Wy" "v41.d2:Wy"
## [31] "v5" "matriland" "v1" "v2" "v3"
## [36] "v4" "v5"
##
## $didwell
## [1] "v30" "v31" "v32" "v44.d9"
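Individual tables can be pulled out of the list and filtered like any data frame. The sketch below assumes, based on the printed output, that h$URmodel is a data frame with variable names as row names; it rebuilds a few rows of that table by hand so that it runs without the Dow-Eff workspace. With the real results one would start from `ur <- h$URmodel`.

```r
# A few rows copied from the $URmodel table printed above
ur <- data.frame(
  VIF    = c(1.119, 147.268, 431.568, 1.200, 1.896),
  hcpval = c(0.00074, 0.46576, 0.42411, 0.03251, 0.00000),
  row.names = c("matriland", "v1", "v5", "v32", "Wy")
)

# Variables significant at the 5% level (heteroskedasticity-consistent p-value)
significant <- rownames(ur)[ur$hcpval < 0.05]
# Variables with a high variance inflation factor
# (a threshold of 10 is a common rule of thumb, not a Dow-Eff setting)
high_vif <- rownames(ur)[ur$VIF > 10]
print(significant)
print(high_vif)
```

Very large VIFs, such as those for v1 to v5 in the unrestricted model, indicate strong collinearity among those variables, which is one reason to compare against the restricted model.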
One can also write the list h to a CSV file that can be opened as a spreadsheet. The following command writes h to a file in the working directory called “olsresults.csv”.
CSVwrite(h, "olsresults", FALSE)
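If separate files per table are more convenient, the same idea can be sketched in base R. This is not how CSVwrite() works internally; it is an alternative under the assumption that the elements of interest are data frames. The list `res`, its toy values and the file name prefix are illustrative.

```r
# Toy list standing in for h (toy values)
res <- list(
  Rmodel     = data.frame(coef = c(0.04, 3.06), row.names = c("v32", "Wy")),
  OtherStats = data.frame(nimp = 2, nobs = 1235),
  didwell    = c("v30", "v31")   # not a data frame, so it is skipped below
)

# Write each data frame element to its own CSV file in a temporary directory
out_dir <- tempdir()
written <- character(0)
for (nm in names(res)) {
  if (is.data.frame(res[[nm]])) {
    path <- file.path(out_dir, paste0("olsresults_", nm, ".csv"))
    write.csv(res[[nm]], path)
    written <- c(written, path)
  }
}

# Read one table back to confirm the round trip
back <- read.csv(written[1], row.names = 1)
print(back)
```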