The latest version of the Dow-Eff functions (Manual: pdf; html) can perform analyses on five different ethnological datasets:
abbreviation | codebook | dataset |
---|---|---|
SCCS | codebook | Standard Cross-Cultural Sample |
EA | codebook | Ethnographic Atlas |
LRB | codebook | Lewis R. Binford forager data |
WNAI | codebook | Western North American Indians |
XC | codebook | Merged 371 society data |
The code below outlines the workflow for estimating a model using data from the EA.
You will need a number of R packages to run the Dow-Eff functions. These are loaded using the “library” command. If a package is “not found”, it should be first installed. The following command will initiate the installation of a package named “mice”, for example:
install.packages("mice")
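Rather than installing packages one at a time, the missing ones can be found in a single pass. The following is a minimal sketch, assuming the package list is exactly the set of library() calls used in this walkthrough; find_missing() is a hypothetical helper and not part of the Dow-Eff functions.

```r
# Packages loaded by this walkthrough (taken from its library() calls).
needed <- c("Hmisc", "mice", "foreign", "stringr", "AER", "spdep", "psych",
            "geosphere", "relaimpo", "linprog", "dismo", "forward", "pastecs",
            "classInt", "maps", "plyr", "aod", "reshape", "RColorBrewer",
            "XML", "tm", "mlogit", "mapproj")

# Hypothetical helper: returns the names in `needed` that are not installed.
find_missing <- function(needed, installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

to_install <- find_missing(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

The install call is left commented out so that running the sketch only reports what is missing.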
#--set working directory and load needed libraries--
options('width'=150)
# setwd("/home/yagmur/Dropbox/functions")
setwd("e:/Dropbox/functions/")
library(Hmisc)
## Warning: package 'ggplot2' was built under R version 3.2.5
library(mice)
## Warning: package 'Rcpp' was built under R version 3.2.5
library(foreign)
library(stringr)
library(AER)
library(spdep)
library(psych)
library(geosphere)
library(relaimpo)
library(linprog)
library(dismo)
library(forward)
library(pastecs)
library(classInt)
library(maps)
library(plyr)
library(aod)
library(reshape)
library(RColorBrewer)
library(XML)
library(tm)
library(mlogit)
library(mapproj)
The Dow-Eff functions, as well as the five ethnological datasets, are contained in an R-workspace, located in the cloud.
load(url("http://capone.mtsu.edu/eaeff/downloads/DEf01f.Rdata"))
#-show the objects contained in DEf01f.Rdata
data.frame(type=sapply(ls(),function(x) class(get(x))))
## type
## addesc function
## capwrd function
## chK function
## CSVwrite function
## doLogit function
## doMI function
## doMNLogit function
## doOLS function
## EA data.frame
## EAcov list
## EAfact data.frame
## EAkey data.frame
## fv4scale function
## GISaux character
## gSimpStat function
## kln function
## llm matrix
## LRB data.frame
## LRBcov list
## LRBfact data.frame
## LRBkey data.frame
## MEplots function
## mkcatmappng function
## mkdummy function
## mkmappng function
## mknwlag function
## mkscale function
## mkSq function
## mmgg function
## p.gis data.frame
## plotSq function
## quickdesc function
## resc function
## rmcs function
## rnkd function
## SCCS data.frame
## SCCScov list
## SCCSfact data.frame
## SCCSkey data.frame
## setDS function
## showlevs function
## spmang function
## widen function
## WNAI data.frame
## WNAIcov list
## WNAIfact data.frame
## WNAIkey data.frame
## XC data.frame
## XCcov list
## XCfact data.frame
## XCkey data.frame
The setDS( xx ) command sets one of the five ethnological datasets as the source for the subsequent analysis. The five valid options for xx are: XC, LRB, EA, SCCS, and WNAI. The setDS() command creates the following objects:
object name | description |
---|---|
cov | Names of covariates to use during imputation step |
dx | The selected ethnological dataset is now called dx |
dxf | The factor version of dx |
key | A metadata file for dx |
wdd | A geographic proximity weight matrix for the societies in dx |
wee | An ecological similarity weight matrix for the societies in dx |
wll | A linguistic proximity weight matrix for the societies in dx |
setDS("EA")
The next step in the workflow is to create any new variables and add them to the dataset dx. New variables can be created directly, as in the following example. When created in this way, one should also record a description of the new variable, using the command addesc(). The syntax takes first the name of the new variable, and then the description.
dx$matriland<-((dx$v74==2)+(dx$v74==3))*1
addesc("matriland","Land inherited matrilineally")
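When building a variable directly, it helps to check how missing codes are treated. The sketch below uses a made-up vector in place of dx$v74 and compares the pattern used above with a shorter alternative based on %in%; the alternative is only an illustration, not a recommendation from the Dow-Eff manual.

```r
# Toy stand-in for a coded column such as dx$v74; the values are made up.
v <- c(1, 2, 3, NA, 5)

# Pattern used in the text: sum of two logical tests, times 1.
# A missing code stays missing.
a <- ((v == 2) + (v == 3)) * 1

# %in% never returns NA, so a missing code would silently become 0.
b <- as.numeric(v %in% c(2, 3))

# Same as b, but with the missing value restored.
b_na <- ifelse(is.na(v), NA, b)

print(rbind(a = a, b = b, b_na = b_na))
```

Keeping the NA matters here because missing values are meant to be imputed later rather than treated as zeros.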
Dummy variables (variables taking on the values zero or one) should be added using the command mkdummy(). This command will, in most cases, automatically record a variable description. Dummy variables are appropriate for categorical variables. The syntax of mkdummy() takes first the categorical variable name, and then the category number (these can be found in the codebook for each ethnological dataset). Note that the resulting dummy variable will be called variable name+“.d”+category number.
mkdummy("v11",3)
## [1] "This dummy variable is named v11.d3"
## [1] "The variable description is: 'Transfer of Residence at Marriage: After First Years == Husband to wife's group'"
mkdummy("v41",2)
## [1] "This dummy variable is named v41.d2"
## [1] "The variable description is: 'Milking of Domestic Animals == milked more often than sporadically'"
mkdummy("v44",9)
## [1] "This dummy variable is named v44.d9"
## [1] "The variable description is: 'Sex Differences: Metal Working == absent or unimportant activity'"
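The naming rule for dummy variables can be illustrated with a few lines of base R. This is a sketch of the idea only: make_dummy() is a hypothetical function written for this example, it is not the code inside mkdummy(), and it does not record a variable description.

```r
# Hypothetical illustration of the naming rule; NOT the Dow-Eff mkdummy() code.
make_dummy <- function(df, var, category) {
  new_name <- paste0(var, ".d", category)               # e.g. "v11.d3"
  df[[new_name]] <- as.numeric(df[[var]] == category)   # NA stays NA
  df
}

toy <- data.frame(v11 = c(1, 3, 3, NA, 2))  # made-up category codes
toy <- make_dummy(toy, "v11", 3)
print(toy)
```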
After making any new variables, list the variables you intend to use in your analysis in the following form.
evm<-c("v1","v2","v3","v4","v5","v30","v31","v32","v33","v11.d3","v41.d2","v44.d9","v34","matriland")
Missing values of these variables are then imputed, using the command doMI(). Below, the number of imputed datasets is 5, and 7 iterations are used to estimate each imputed value (5 imputations is borderline OK, 10 or 15 would be better). The stacked imputed datasets are collected into a single dataframe which here is called smi.
This new dataframe smi will contain not only the variables in evm, but also a set of normalized (mean=0, sd=1) variables related to climate, location, and ecology (these are used in the OLS analysis to address problems of endogeneity). In addition, squared values are calculated automatically for variables with at least three discrete values and maximum absolute values no more than 300. These squared variables are given names in the format variable name+“Sq”.
Finally, smi contains a variable called “.imp”, which identifies the imputed dataset, and a variable called “.id” which gives the society name.
smi<-doMI(evm,nimp=5,maxit=7)
## [1] "--create variables to use as covariates--"
## [1] "v1"
## [1] "v2"
## [1] "v3"
## [1] "v4"
## [1] "v5"
## [1] "v30"
## [1] "v31"
## [1] "v32"
## [1] "v33"
## [1] "v11.d3"
## [1] "v41.d2"
## [1] "v44.d9"
## [1] "v34"
## [1] "matriland"
## [1] "foo"
## [1] "WARNING: variable may not be ordinal--society" "WARNING: variable may not be ordinal--dxid" "WARNING: variable may not be ordinal--foo"
## Time difference of 37.33296 secs
dim(smi) # dimensions of new dataframe smi
## [1] 6325 93
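The rule for automatically squared variables can be restated as a small predicate. The sketch below is an assumption-laden illustration: needs_square() is a hypothetical helper that encodes the rule as described in the text (at least three distinct values, maximum absolute value no more than 300), and the test actually used inside doMI() may differ in detail.

```r
# Hypothetical helper restating the rule in the text.
needs_square <- function(x) {
  x <- x[!is.na(x)]
  length(unique(x)) >= 3 && max(abs(x)) <= 300
}

toy <- data.frame(
  dummy = c(0, 1, 1, 0, NA),        # two distinct values: no square
  scale = c(1, 2, 3, 4, 5),         # five distinct values, small range: square
  big   = c(10, 500, 1000, 20, 30)  # exceeds 300 in absolute value: no square
)

# Add a "<name>Sq" column for each variable that passes the rule.
for (nm in names(toy)) {
  if (needs_square(toy[[nm]])) toy[[paste0(nm, "Sq")]] <- toy[[nm]]^2
}
print(names(toy))

# Rows per imputed dataset in the stacked data reported by dim(smi):
# 6325 rows spread over 5 imputations.
print(6325 / 5)
```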
All of the variables selected to play a role in the model must be found in the new dataframe smi. Below, the variables are organized according to the role they will play.
# --dependent variable--
dpV<-"matriland"
#--independent variables in UNrestricted model--
UiV<-c("v1","v2","v3","v4","v5","v30","v31","v32","v33","v41.d2","v44.d9","v34")
#--additional exogenous variables (use in Hausman tests)--
oxog<-NULL
#--independent variables in restricted model (all must be in UiV above)--
RiV<-c("v44.d9", "v34", "v41.d2")
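Before estimating, it can be worth confirming that the role vectors are consistent. The following sketch repeats the vectors defined above so that it runs on its own; check_roles() is a hypothetical helper, and the stand-in vector `available` takes the place of names(smi), which would be used in a real session.

```r
dpV <- "matriland"
UiV <- c("v1", "v2", "v3", "v4", "v5", "v30", "v31", "v32", "v33",
         "v41.d2", "v44.d9", "v34")
RiV <- c("v44.d9", "v34", "v41.d2")

# Hypothetical helper, not one of the Dow-Eff functions.
check_roles <- function(dpV, UiV, RiV, available) {
  list(
    restricted_not_in_unrestricted = setdiff(RiV, UiV),
    missing_from_data = setdiff(c(dpV, UiV, RiV), available)
  )
}

# Stand-in for names(smi) so the sketch is self-contained.
available <- c(dpV, UiV)
problems <- check_roles(dpV, UiV, RiV, available)
str(problems)
```

Both elements of the result should be empty character vectors; anything listed points to a variable that needs to be added to evm or removed from the model.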
The command doOLS() estimates the model on each of the imputed datasets, collecting output from each estimation and processing them to obtain final results. To control for Galton’s Problem, a network lag model is used, with the user able to choose a combination of geographic proximity (dw), linguistic proximity (lw), and ecological similarity (ew) weight matrices. In most cases, the user should choose the default of dw=TRUE, lw=TRUE, ew=FALSE.
There are several options that increase the time doOLS() takes to run: stepW runs a background stepwise regression to find which variables perform best over the set of estimations; relimp calculates the relative importance of each variable in the restricted model, using a technique to partition R²; slmtests calculates Lagrange multiplier tests for spatial dependence using the three weight matrices. All of these should be set to FALSE if one wishes to speed up estimation. Bootstrap standard errors are calculated by setting the option doboot to some number between 10 and 10,000 (values between 500 and 1,000 are usually good choices). Bootstrapping also adds considerably to estimation time.
h<-doOLS(smi,depvar=dpV,indpv=UiV,rindpv=RiV,othexog=NULL,
dw=TRUE,lw=TRUE,ew=FALSE,stepW=TRUE,boxcox=FALSE,getismat=FALSE,
relimp=TRUE,slmtests=FALSE,haustest=NULL,mean.data=TRUE,doboot=1000)
## [1] "--finding optimal weight matrix------"
## [1] "Exogenous variables used to instrument Wy: xWv1, xWv2, xWv3, xWv4, xWv5, xWv30, xWv31, xWv32, xWv33, xWv41.d2, xWv44.d9, xWv34, xWv2Sq, xWv3Sq, xWv4Sq, xWv5Sq, xWv31Sq, xWv32Sq, xWv33Sq"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## Time difference of 2.006919 mins
names(h)
## [1] "DependVarb" "URmodel" "model.varbs" "Rmodel" "EndogeneityTests"
## [6] "Diagnostics" "OtherStats" "DescripStats.ImputedData" "DescripStats.OriginalData" "totry"
## [11] "didwell" "usedthese" "dfbetas" "data"
The output from doOLS(), here called h, is a list containing 14 items, explained in more detail in the manual.
name | description |
---|---|
DependVarb | Description of dependent variable. |
URmodel | Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs). Two p-values are given for H0: coefficient = 0. One is the usual p-value, the other (hcpval) is heteroskedasticity consistent. If stepW=TRUE, the table will also include the proportion of times a variable is retained in the model using stepwise regression. |
model.varbs | Short descriptions of model variables: shows the meaning of the lowest and highest values of the variable. This can save a trip to the codebook. |
Rmodel | Coefficient estimates from the restricted model. If relimp=TRUE, the R² assigned to each independent variable is shown here. |
EndogeneityTests | Hausman tests (H0: variable is exogenous), with F-statistic for weak instruments (a rule of thumb is that the instrument is weak if the F-stat is below 10), and Sargan test (H0: instrument is uncorrelated with second-stage 2SLS residuals). |
Diagnostics | Regression diagnostics for the restricted model: RESET test (H0: model has correct functional form); Wald test (H0: appropriate variables dropped); Breusch-Pagan test (H0: residuals homoskedastic); Shapiro-Wilk test (H0: residuals normal); Hausman test (H0: Wy is exogenous); Sargan test (H0: residuals uncorrelated with instruments for Wy). If slmtests=TRUE, the Lagrange multiplier tests (H0: spatial error model not appropriate) are reported here. |
OtherStats | Other statistics: Composite weight matrix weights; R2 for restricted model and unrestricted model; number of imputations; number of observations; Fstat for weak instruments for Wy. |
DescripStats.ImputedData | Descriptive statistics for model variables found only in imputed data. |
DescripStats.OriginalData | Descriptive statistics for model variables found in pre-imputation dataset. |
totry | Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model. |
didwell | Character string of variables that were most significant in the unrestricted model. |
usedthese | Table showing how observations used differ from observations not used, regarding ecology, continent, and subsistence. |
dfbetas | Influential observations for dfbetas. |
data | Data as used in the estimations. Observations with missing values of the dependent variable have been dropped. If mean.data=TRUE, will output format that can be used to make maps. |
The last two items in the list can be quite large. Here are three of the first 12 items:
h$Rmodel
## coef stdcoef VIF relimp pval hcpval bootpval star desc
## (Intercept) -0.14945 NaN NaN NaN 0.00057 0.00022 0.00077 *** <NA>
## v34 0.01923 0.07096 1.29120 0.00181 0.04690 0.04463 0.05610 * High Gods
## v41.d2 -0.07881 -0.11978 1.45405 0.01900 0.00158 0.00133 0.00202 *** Milking of Domestic Animals == milked more often than sporadically
## v44.d9 0.08023 0.12498 2.03961 0.01034 0.00749 0.00939 0.01415 ** Sex Differences: Metal Working == absent or unimportant activity
## Wy 1.75441 0.38924 1.65153 0.10060 0.00000 0.00000 0.00000 *** Network lag term
h$Diagnostics
## Fstat df pvalue star
## RESET test. H0: model has correct functional form 16.3007 41 0.0002 ***
## Wald test. H0: appropriate variables dropped 1.4361 2123 0.2309
## Breusch-Pagan test. H0: residuals homoskedastic 88.9654 3411 0.0000 ***
## Shapiro-Wilkes test. H0: residuals normal 143.8126 53134 0.0000 ***
## Hausman test. H0: Wy is exogenous 138.4708 1727 0.0000 ***
## Sargan test. H0: residuals uncorrelated with instruments 2.3390 1100 0.1265
h$OtherStats
## d l e Weak.Identification.Fstat R2.final.model R2.UR.model nimp nobs BClambda
## 1 0.22 0.78 0 76.64597 0.2486119 0.2548237 5 830 none
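The d, l, and e columns of h$OtherStats can be read as the weights on the distance, language, and ecology matrices. The sketch below assumes that the composite matrix is a simple weighted sum of row-standardized matrices and that the network lag term Wy is that matrix multiplied by the dependent variable; both are assumptions made for illustration, and the matrices and y values are made up.

```r
# Toy row-standardized weight matrices for three societies (made-up numbers);
# in a session these roles would be played by wdd and wll.
wd <- matrix(c(0.0, 0.5, 0.5,
               0.7, 0.0, 0.3,
               0.4, 0.6, 0.0), nrow = 3, byrow = TRUE)
wl <- matrix(c(0.0, 1.0, 0.0,
               0.5, 0.0, 0.5,
               0.0, 1.0, 0.0), nrow = 3, byrow = TRUE)

# Weights reported in h$OtherStats: d = 0.22, l = 0.78, e = 0.
W <- 0.22 * wd + 0.78 * wl

# A convex combination of row-standardized matrices is itself row-standardized.
print(rowSums(W))

# Network lag: each society's weighted average of the other societies' y.
y <- c(1, 0, 1)
Wy <- as.vector(W %*% y)
print(Wy)
```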
The 14th item in list h is a dataframe containing mean values of variables across imputations. This can be used to make maps, employing the functions mkmappng() (for ordinal data) or mkcatmappng() (for categorical data).
mkcatmappng(h$data,"v41.d2")
## png
## 2
Click here to see the map png
One can also write the list h to a csv format file that can be opened as a spreadsheet. The following command writes h to a file in the working directory called “olsresultsEA.csv”.
CSVwrite(h,"olsresultsEA",FALSE)
Click here to see the spreadsheet csv
Models with binary dependent variables are usually estimated with logit or probit ML methods. However, it is a good idea to first estimate the model with OLS, as we did above, to find a good model, and then estimate it with logit, as we do below, using the function doLogit().
q<-doLogit(smi, depvar=dpV, indpv=UiV, rindpv=RiV, dw=TRUE, lw=TRUE, ew=FALSE, doboot=1000, mean.data=TRUE, getismat=FALSE, othexog=NULL)
## [1] "--finding optimal weight matrix------"
## [1] "Exogenous variables used to instrument Wy: xWv1, xWv2, xWv3, xWv4, xWv5, xWv30, xWv31, xWv32, xWv33, xWv41.d2, xWv44.d9, xWv34, xWv2Sq, xWv3Sq, xWv4Sq, xWv5Sq, xWv31Sq, xWv32Sq, xWv33Sq"
## [1] "--looping through the imputed datasets--"
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## Time difference of 3.766827 mins
The output from doLogit(), here called q, is a list containing 8 items.
name | description |
---|---|
DependVarb | Description of dependent variable |
URmodel | Coefficient estimates from the unrestricted model; p-values are from bootstrap standard errors. |
model.varbs | Short description of model variables. Can save a trip to the codebook. |
Rmodel | Coefficient estimates from the restricted model. |
Diagnostics1 | Three likelihood ratio tests: LRtestNull-R (H0: all variables in restricted model have coefficients equal to zero); LRtestNull-UR (H0: all variables in unrestricted model have coefficients equal to zero); LRtestR-R (H0: variables in unrestricted model, not carried over to restricted model, have coefficients equal to zero). One Wald test: waldtestR-R (H0: variables in unrestricted model, not carried over to restricted model, have coefficients equal to zero). |
Diagnostics2 | Statistics without formal hypothesis tests. pLargest: the larger of the proportion of 1s and the proportion of 0s; the model should be able to outperform simply picking the most common outcome. pRight: proportion of fitted values that equal the actual value of the dependent variable. NetpRight=pRight-pLargest; this is positive in a good model. McIntosh.Dorfman: (num. correct 0s/num. 0s) + (num. correct 1s/num. 1s); this exceeds one in a good model. McFadden.R2 and Nagelkerke.R2 are two versions of pseudo R². |
OtherStats | Other statistics: Composite weight matrix weights; number of imputations; number of observations. |
data | Data as used in the estimations. Observations with missing values of the dependent variable have been dropped. |
Here are selected portions of the output:
names(q)
## [1] "DependVarb" "URmodel" "model.varbs" "Rmodel" "Diagnostics1" "Diagnostics2" "OtherStats" "data"
q$Rmodel
## coef fst df pval star desc
## (Intercept) -5.801676 76.03 5 0.0003 *** <NA>
## Wy 20.767710 56.89 5 0.0006 *** Network lag term
## v44.d9 1.145178 10.30 5 0.0238 ** Sex Differences: Metal Working == absent or unimportant activity
## v34 0.181654 2.39 4 0.1968 High Gods
## v41.d2 -0.825259 7.18 4 0.0553 * Milking of Domestic Animals == milked more often than sporadically
q$Diagnostics1
## fst df pval star desc
## LRtestNull-R 94.2103 1600 0.0000 *** H0:All coefficients in restricted model equal zero
## LRtestNull-UR 77.9150 1187 0.0000 *** H0:All coefficients in UNrestricted model equal zero
## LRtestR-R 0.6449 1461 0.4221 H0:Variables dropped from unrestricted model have coefficients equal zero (likelihood ratio test)
## waldtestR-R 0.5778 1203 0.4473 H0:Variables dropped from unrestricted model have coefficients equal zero (Wald test)
q$Diagnostics2
## R.model UR.model desc
## pLargest 0.890361446 0.89036145 max(Prob(y==1),Prob(y==0)) [best guess]
## pRight 0.896144578 0.89783133 Prob(y==yhat) [prop. correct]
## NetpRight 0.005783133 0.00746988 prop. correct net of best guess
## McIntosh.Dorfman 1.070091749 1.10474803 prop. correct 0s + prop. correct 1s
## McFadden.R2 0.195769170 0.21202441 McFadden pseudo R2
## Nagelkerke.R2 0.126604883 0.13636626 Nagelkerke psuedo R2
q$OtherStats
## d l e nimp nobs
## 1 0.22 0.78 0 5 830
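The fit statistics in q$Diagnostics2 follow directly from the definitions given earlier. The sketch below computes them from made-up outcomes and already-classified fitted values; how doLogit() turns fitted probabilities into 0/1 predictions is not stated in this document, so that step is left out.

```r
# Made-up outcomes and fitted classes, for illustration only.
y    <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
yhat <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)

pLargest  <- max(mean(y == 1), mean(y == 0))  # share of the most common outcome
pRight    <- mean(y == yhat)                  # share of correct predictions
NetpRight <- pRight - pLargest                # gain over the best guess
McIntosh.Dorfman <- sum(y == 0 & yhat == 0) / sum(y == 0) +
                    sum(y == 1 & yhat == 1) / sum(y == 1)

print(c(pLargest = pLargest, pRight = pRight,
        NetpRight = NetpRight, McIntosh.Dorfman = McIntosh.Dorfman))

# The restricted-model column printed in q$Diagnostics2 obeys the same
# identity: pRight minus pLargest reproduces NetpRight up to rounding.
print(0.896144578 - 0.890361446)
```

In the output above, NetpRight is positive but small (about 0.006), which reflects how hard it is to beat the best guess when roughly 89 percent of the observations share the same outcome.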
Compiled on 2017-04-19 by E. Anthon Eff