Compiled on 2021-09-08 by E. Anthon Eff
You might find it easier to read and navigate this HTML file if you download it to your own computer.
A typical social science research hypothesis takes the form: the presence of x creates conditions favorable to the appearance of y. An example would be: societies practicing intensive agriculture are more likely to believe that the supernatural supports human morality (this is a hypothesis tested in this paper, using data from the Standard Cross-Cultural Sample (SCCS)). Our hypotheses should be derived from social science theory, or at least be plausible. In economics, plausible usually means that the hypothesis is what one might expect in a world populated with self-interested actors allocating scarce resources to alternative ends.
A regression model formulates the hypothesis as a linear function, \(y_i=\beta_0 + \beta_1 x_i + \epsilon_i\). The usual names for these elements are dependent variable for the vector \(y\); independent variable for the vector \(x\); error term for the vector \(\epsilon\); intercept coefficient for the scalar \(\beta_0\); and slope coefficient for the scalar \(\beta_1\). The dependent and independent variables are drawn from databases; the intercept and slope coefficient are estimated by the regression procedure.
In a test of our initial hypothesis, \(y\) would be a measure of the degree to which a society believes that supernatural beings care about human morality, and \(x\) would be a measure of the degree to which a society practices intensive agriculture. Our attention would focus on the coefficient \(\beta_1\); if that coefficient is greater than zero (and has a p-value less than 0.10) then we can say that the evidence is consistent with our hypothesis.
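The model and test above can be sketched numerically. The code below is a minimal illustration using simulated data rather than the SCCS measures (the true slope of 0.8 is invented for the demonstration): it estimates \(\beta_0\) and \(\beta_1\) by ordinary least squares and computes the p-value on the slope coefficient.

```python
# Minimal OLS sketch of y_i = b0 + b1*x_i + e_i, with simulated data
# standing in for the SCCS measures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 150
x = rng.normal(size=n)                    # stand-in for agricultural intensity
y = 0.5 + 0.8 * x + rng.normal(size=n)    # true slope beta1 set to 0.8

X = np.column_stack([np.ones(n), x])      # design matrix: intercept column plus x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)              # residual variance
se = np.sqrt(s2 * np.linalg.inv(X.T @ X).diagonal())
t = beta / se
p = 2 * stats.t.sf(np.abs(t), df=n - 2)   # two-sided p-values

print(beta[1], p[1])  # slope estimate near 0.8, with its p-value
```

A slope estimate greater than zero with a p-value below 0.10 is the pattern the hypothesis test looks for.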
Unfortunately, the slope coefficient from a simple univariate model like the above will almost certainly be biased, and is very likely to provide a misleading answer to a hypothesis test. Among the many reasons for the bias, the most important is omitted variable bias: there are multiple factors that might make a society more likely to believe in moralizing gods, and intensity of agriculture is just one of these. If our independent variable \(x\) (agricultural intensity) correlates with some of these other causes, then the slope coefficient \(\beta_1\) will represent the net effect of these combined causes, rather than the pure effect of agricultural intensity. In order to provide an unbiased slope coefficient for agricultural intensity, one must include in the regression model the most important of these other causes for moralizing gods. This gives us a multivariate model, with two kinds of independent variables: the true variables of interest, often called treatment variables, and variables included to avoid omitted variable bias, usually called control variables.
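Omitted variable bias can be demonstrated with a short simulation (the variables here are synthetic stand-ins, not SCCS data): when a cause `z` of `y` that correlates with the treatment `x` is left out of the model, the estimated slope on `x` absorbs part of the effect of `z`.

```python
# Sketch of omitted-variable bias: z causes y and correlates with the
# treatment x, so omitting z inflates the slope estimated on x.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)                       # omitted cause of y
x = 0.7 * z + rng.normal(size=n)             # treatment, correlated with z
y = 1.0 * x + 1.5 * z + rng.normal(size=n)   # true effect of x is 1.0

def ols(cols, y):
    """Least-squares coefficients for an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *np.atleast_2d(cols)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(x, y)        # omits z: slope on x is biased upward
b_long = ols([x, z], y)    # controls for z: slope on x near the true 1.0

print(b_short[1], b_long[1])
```

The "short" regression attributes to `x` the variation it shares with `z`; adding `z` as a control recovers the true effect.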
Selecting control variables turns out not to be as easy as it seems: inappropriate controls can introduce bias or make it impossible to identify the effect of the treatment variables. As a solution to this problem it is often a good idea to sketch out the main causal sequences likely to affect the dependent variable. The figure below gives an illustration of what such a schema might look like for our particular hypothesis, where our treatment variable is intensive agriculture and the dependent variable is moralizing gods. A diagram like this is never reported in a publication, but it helps to structure thinking about which control variables to include in a model.
The general rule is that one should not choose control variables that rob the treatment variables of explanatory power. There are two pieces of advice that help one follow that general rule. First, avoid controls that are themselves causes of the treatment variable. So, for example, it would be inappropriate to select high suitability for agriculture as a control variable, since this variable shares so much variation with our treatment variable. Second, avoid controls that are intermediate or alternative outcomes of the treatment variable.¹ Thus, high population or state-level political organization would not be appropriate controls. The two best control variables to include thus seem to be animal husbandry and high frequency of external war.
The variables found in datasets are either ordinal (a category that here includes both ranked and fully quantitative variables) or categorical (sometimes called nominal). For ordinal variables, there is a clear meaning associated with moving from lower to higher values. All variables used in a regression analysis must be ordinal.
SCCS variable v893 is an example of an ordinal variable. Note that higher values signify lower frequency of external war, a somewhat counter-intuitive meaning. Before using variables in a regression one must be clear about how they are ordered.
description | value | freq |
---|---|---|
Continual | 1 | 26 |
Frequent | 2 | 67 |
Infrequent | 3 | 60 |
NA | NA | 33 |
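One common fix for a counter-intuitive ordering is to reverse the scale before analysis. The sketch below uses an illustrative data frame rather than the actual SCCS file, and the new variable name `extwar` is my own; the idea is simply to subtract from one more than the maximum code, so that higher values mean more frequent external war.

```python
# v893 codes lower values as MORE frequent external war (1 = Continual,
# 3 = Infrequent). Reverse it so that higher = more frequent war.
# Illustrative data, not the actual SCCS file.
import numpy as np
import pandas as pd

df = pd.DataFrame({"v893": [1, 2, 3, np.nan, 2]})
df["extwar"] = 4 - df["v893"]   # now 3 = Continual, 1 = Infrequent; NA stays NA

print(df["extwar"].tolist())
```

After the recoding, a positive slope on `extwar` means more frequent external war is associated with higher values of the dependent variable, which is usually the easier direction to interpret.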
Other variables are categorical, in that the values signify different categories, and there is no meaning associated with moving from higher values to lower. SCCS variable v858 is an example of a categorical variable. A categorical variable can only be used in a regression model if it is converted into dummy variables. A dummy variable takes on the values of zero or one. Variable v858 can be used to create a dummy variable for animal husbandry (call it `animHusb`), where `animHusb=1` if the society practices pastoralism, and `animHusb=0` otherwise. The dummy variable is ordinal and may be used in a regression model.
description | value | freq |
---|---|---|
Gathering | 1 | 9 |
Hunting and/or Marine Animals | 2 | 9 |
Fishing | 3 | 12 |
Anadromous Fishing (spawning fish such as Salmon) | 4 | 8 |
Mounted Hunting | 5 | 5 |
Pastoralism | 6 | 18 |
Shifting Cultivation, with digging sticks or wooden hoes | 7 | 33 |
Shifting Cultivation, with metal hoes | 8 | 19 |
Horticultural Gardens or Tree Fruits | 9 | 18 |
Intensive Agriculture, with no plow | 10 | 23 |
Intensive Agriculture, with plow | 11 | 32 |
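The dummy-variable construction described above can be sketched as follows. The data frame is illustrative, not the actual SCCS file; only the Pastoralism code (value 6) is mapped to one.

```python
# Build the animHusb dummy from the categorical variable v858:
# 1 when the subsistence category is Pastoralism (code 6), 0 otherwise.
# Illustrative data, not the actual SCCS file.
import pandas as pd

df = pd.DataFrame({"v858": [6, 11, 2, 6, 9]})
df["animHusb"] = (df["v858"] == 6).astype(int)

print(df["animHusb"].tolist())  # [1, 0, 0, 1, 0]
```

The comparison `df["v858"] == 6` yields a boolean column, and `astype(int)` converts it to the zero/one coding a regression requires.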
An important step in model building is to develop an understanding of the data used. Know whether a variable is ordinal, or whether it is categorical and must be converted to a dummy to use in a regression. Know what higher values of your ordinal variables signify. Only use independent variables that have a theoretically justified relationship to the dependent variable, and only use control variables that do not rob explanatory power from your treatment variables.
Always think carefully about what your variables mean. For example, suppose that you are interested in testing the hypothesis that immigration is encouraged by government spending. You set up a model with total annual immigrants as the dependent variable and total government spending as the independent variable: \(Immig_i=\alpha_0+\alpha_1 GovtExp_i+\epsilon_i\). If the estimated coefficient \(\hat\alpha_1\) is significant and positive, you would take this as support for your hypothesis. Unfortunately, the coefficient \(\hat\alpha_1\) is almost guaranteed to be significant and positive, since larger countries will tend to have both higher immigration and higher government spending. You have run what is called a spurious regression: one in which the independent and dependent variables share some underlying trend (here, country size), so that they appear to be related when in fact they might not be. A much better way to test the hypothesis would be to use immigration as a percent of the population as the dependent variable and government expenditures as a share of GDP as the independent variable.
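The problem can be illustrated with simulated data (the numbers below are invented, not real immigration or spending figures): totals that both scale with country size correlate strongly even when the underlying rates are unrelated, while the ratio versions do not.

```python
# Spurious-regression sketch: country size drives both totals, so the raw
# totals correlate even though the per-capita rates are independent.
# All data are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 500
pop = rng.lognormal(mean=3, sigma=1, size=n)     # country population (size)
immig = pop * rng.uniform(0.01, 0.03, size=n)    # immigrants as a random share of pop
govexp = pop * rng.uniform(0.2, 0.4, size=n)     # spending as a random share of pop

r_totals = np.corrcoef(immig, govexp)[0, 1]              # strong: both scale with pop
r_ratios = np.corrcoef(immig / pop, govexp / pop)[0, 1]  # weak: rates are independent

print(r_totals, r_ratios)
```

Dividing by population (or, for spending, by GDP) removes the shared size trend, so any remaining correlation reflects an actual relationship between the rates.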
Setting up the regression model is called specification, which includes both determining the independent variables to include and the functional form of the model. The next section will look at functional form.
¹ See pages 214–217 in Angrist, J. D., & Pischke, J.-S. (2014). *Mastering Metrics: The Path from Cause to Effect*. Princeton University Press.↩︎