Langston, Research Methods, Notes 9 -- Simple Experiments

I.  Goals:
A.  Choosing an IV.
B.  Simple experiments (single factor, between participants).
C.  Problems in design.
D.  Choosing a DV.
E.  Analysis.
II.  Choosing an IV.
A.  At this point, you've turned your question into a hypothesis.  Now you need to choose an IV from that hypothesis.  What will you manipulate?  The first step is to operationally define the “if” part of the hypothesis.  If you haven't been thinking clearly, then this is the first place you get caught.  Consider, for example, the following hypothesis:

 H:  If people are in love, then they'll have a hard time concentrating

What variable might we manipulate here?  (You might think in love/not in love.)  What does it mean to be in love?  What's the operational definition?  Until you have one, you'll have a hard time deciding what to manipulate.  In fact, what you manipulate will look entirely different depending on your definition.
B.  Construct validity comes into play here.  Look at the operational definitions we came up with in class.  Some of these are clearly better definitions of “in love” than others.  To the extent that a definition captures what are generally considered to be the important aspects of a construct (like love), that definition has high construct validity.  Obviously, we need this to be high.  For example, defining “in love” as “a general positive feeling about someone” seems better than defining it as “being sexually aroused by someone.”
C.  Once we have a good definition, then we can decide what to manipulate, and choose the appropriate levels.  Remember restriction of range.  We need to have enough levels to capture any relationship that might exist between the IV and DV, and we need to have them spaced far enough apart that any relationship that exists can be detected.  Two things:
1.  If you have good operational definitions, this part is easy.
2.  Even though we're trying to pretend that you can take each of these steps independently of the others, it's smart to address restriction of range in the context of the DV you've chosen.  We'll pretend for now that you don't, but that's just to give us time to get to choosing a DV when it's more appropriate.
III.  Simple experiments.
A.  These will all be between participants designs.  This means that each participant in our experiment will participate in one and only one condition.
The designs we'll cover here are also all single factor designs.  This means that you're only manipulating one IV.
B.  2-group, after-only design:
This is the simplest.  Schematically:

2-group because you have two groups, after-only because you only observe after the treatment.
Basic:  One group gets a treatment, the other group gets nothing, and then you measure the two groups, and look for a difference.  If a difference exists, you can say it's due to the treatment.  (Note the similarity to Mill’s Method of Difference.)
The only effort to equate the groups prior to the experiment is through random assignment.  This makes it likely that the groups will be approximately equal, and it means that any differences that do exist will be due to chance and not your own biases.  Randomization is almost always your friend, but it doesn't guarantee that the groups are equal, so you might want to check to be sure that they are.
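Random assignment is easy to hand to a computer.  What follows is only a minimal sketch:  the participant IDs, the ages, and the function name randomly_assign are all made up for illustration, and the seed is there just so the example can be repeated.  The last line shows one rough way to check that the groups came out about equal on something you measured beforehand.

```python
import random
from statistics import mean

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(participants)  # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs and ages (not real data).
participants = [f"P{i:02d}" for i in range(1, 21)]
ages = {p: 18 + (i % 8) for i, p in enumerate(participants)}

treatment, control = randomly_assign(participants, seed=1)

# Randomization doesn't guarantee equal groups, so take a look.
print(mean(ages[p] for p in treatment), mean(ages[p] for p in control))
```

If the two means look very different, that's the kind of chance difference the designs below are meant to handle.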
You can do this design without an explicit control group.  For example, if I'm manipulating alcohol as no alcohol and one beer, my no alcohol group might be non-alcoholic beer.  This helps make the groups more similar (both got a treatment, and that treatment seemed identical from the standpoint of the participants).  Also, all sorts of potential confounds or problems are eliminated.  For example, there might be some effect of beer beyond its intoxicating effect (socially learned consequences of drinking).  The non-alcoholic beer group would also have these same reactions.  The only difference is then the alcohol.
You don't even need an implicit control group.  If you predict that more of a variable will impact participants one way, and less in another, you can just have two levels that are both treatments, and compare those treatments.  This is particularly useful when you're working with variables that don't have natural zero levels (like mood:  is there a truly neutral mood, or should you instead compare happy-sad?).
C.  2-group, pre-post design:


2-group because we have two groups, pre-post because we observe both before and after treating.
Because randomization is not always your friend, you can use this design to check whether or not the two groups were equal prior to running the experiment.  Basic:  Give both groups the DV and get a measurement, treat one group and do nothing to the other, then give them both the DV again to get a measurement.  Two ways to use these scores:
1.  Just look at the pre-test to see if there's a difference between the groups prior to treatment.  If there's not, assume they were the same and continue as in 2-group, after-only.
2.  Subtract each participant's pre-test score from their post-test score and analyze change scores.  This is better because it statistically equates the groups.  Since each participant serves as their own baseline, and all you're looking at is change, it doesn't matter how different the two groups were prior to the treatment (but see ceiling and floor effects in the discussion of DVs below).  The treatment group should change more, and that's what you're really interested in.  In fact, since you're throwing out all of the variability between people and just looking at change, you should have more statistical power.
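Both uses of the pre-test can be sketched in a few lines.  The scores here are invented for illustration, and the helper names change_scores and pretest_mean are assumptions, not part of any standard package.

```python
from statistics import mean

# Made-up (pre-test, post-test) scores, one pair per participant.
treatment = [(10, 15), (12, 16), (9, 15), (11, 14)]
control = [(14, 15), (15, 15), (13, 15), (16, 17)]

def pretest_mean(pairs):
    """Mean pre-test score for a group."""
    return mean(pre for pre, _post in pairs)

def change_scores(pairs):
    """Post-test minus pre-test for each participant."""
    return [post - pre for pre, post in pairs]

# Use 1: were the groups different before the treatment?
print(pretest_mean(treatment), pretest_mean(control))

# Use 2: how much did each group change?
print(mean(change_scores(treatment)), mean(change_scores(control)))
```

In this made-up example the control group starts out higher on the pre-test, but the treatment group changes more, which is the comparison the change-score analysis cares about.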
A couple of additional notes:
1.  You've statistically controlled for differences, but you haven't eliminated them.  Any differences that were there are still there.  These might still come back to haunt you.
2.  You introduce the problem of test-retest effects.  Since people have seen the DV prior to treatment, it might impact the treatment in some way that will then show up in the post-test.  It might take the form of a practice effect (as in a study on math skills where just practicing the test once boosts people's performance regardless of treatment) or it might influence how people attend to the treatment (as in a study of attitude formation where answering questions about your attitudes prior to someone attempting to change those attitudes might make you firm up your beliefs and make you less likely to be affected by the treatment).
D.  2-group, matched design:


Another approach to handling preexisting differences.  Basic:  If some variable is really important to your topic area (as in you're studying representation of spatial texts and spatial ability is an important individual difference) then you screen participants on that variable, match up participants who score the same on the variable, and randomly assign one member of each pair to group one and the other member to group two.  After that, everything is the same as the simple 2-group, after-only.  This way, you know prior to experimentation that the two groups are equal (with respect to the matched variable).
Some notes:
1.  There's nothing really preventing a pre-post design.
2.  Suppose you don't get a set of pairs, but approximately continuous variation.  Then you have some alternatives to the participant-by-participant matching described above.
a.  ABBA:  If you put the highest score in A, next in B, next in A... (ABAB...), then for each pair the person in A is a little better than the person in B.  This introduces a potential bias.  Instead, put first in A, next in B, next in B, next in A...  This will approximately equate those slight differences.
3.  Again:  Differences between the groups are not eliminated.  The matching variable has been accounted for, but there are countless other sources of variation that you haven't controlled.  You can try to match on multiple variables at once, but that's harder, and the more variables you match on, the harder it gets.  This will help you appreciate how randomization can be your friend.
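The ABBA ordering from note 2 can be sketched as follows.  The spatial ability scores are made up, and abba_assign is a hypothetical helper name.

```python
from statistics import mean

def abba_assign(scores):
    """Rank participants from highest to lowest score, then deal them
    into groups in the repeating order A, B, B, A."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    pattern = "ABBA"
    groups = {"A": [], "B": []}
    for position, participant in enumerate(ranked):
        groups[pattern[position % 4]].append(participant)
    return groups

# Made-up screening scores on the matching variable.
scores = {"P1": 98, "P2": 95, "P3": 91, "P4": 90,
          "P5": 84, "P6": 80, "P7": 77, "P8": 71}

groups = abba_assign(scores)
for name, members in groups.items():
    print(name, members, mean(scores[p] for p in members))
```

With these particular scores both groups average 85.75.  Strict alternation (ABAB...) on the same scores would give A a mean of 87.5 and B a mean of 84, which is the bias ABBA is meant to reduce.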
IV.  Problems in design.
A.  Threats to internal validity (whether the changes you see in the DV are really due to the IV).
1.  Is your manipulation effective?  Sometimes, you're not sure if the levels of the IV that you were supposed to have in your experiment were actually present.  For instance, if one group is supposed to be in a sad mood, were they really in that mood?  If the manipulation was not effective, then the results of the experiment aren't very informative (if they're not sad, then you're not measuring the effect of a sad mood).
What do we do in this situation?  Build in a manipulation check.  Either have an extra condition or an extra dependent measure that can tell us whether or not the manipulation was effective.
2.  Do you have a control condition that is appropriate?  Remember, this is supposed to be a baseline for comparison, but sometimes choosing the proper baseline can be difficult.  If you've chosen badly, your comparisons might not be valid.
3.  Do you have any confounds?  We already discussed the two criteria for confounds:  1) they covary with the IV, and 2) they could reasonably be expected to produce a change in the DV.  If you have confounds in your experiment then you don't know if changes were due to the IV or the confound.  Confound stuff:
a.  Looking for them:  This is hard.  Before running the experiment, ask yourself “could anything else reasonably be expected to produce this effect?”  List everything that comes to mind, however trivial.  After you brainstorm, take each item on your list and evaluate it with respect to the two criteria.  Anything that seems a threat should be dealt with.
b.  Fixing confounds:
1).  Control them:  If a variable doesn't covary with the levels of the IV, then it's not an issue.  So, if you're looking at the effect of training on test performance and you're worried that time of day of the test might also affect performance, then run everyone at the same time.  This makes time a constant, and it isn't a problem.
2).  Counterbalance:  If you have two groups (like experimental and control), run half of each group at one time and half at the other.  This way, even though time of day varies (you have at least two times), it doesn't covary with the IV (which would be the case if you ran all of one group at one time and all of the other group at another time).
c.  Using confounds:  Sometimes, confounds can yield useful information.  Spotting confounds can lead you to discover new things about the relationships between your IV and your DV.  For example, if time of day does affect test performance, that information has a lot of important implications.  Even if you don't discover the confound until after the experiment, you've still learned something important.  In other words, you can sometimes profit from your mistakes.
d.  When to control for confounds:  Before the experiment!
e.  How does experimental control affect confounds?  The more control you exercise, the less you need to worry.  Confounds are systematic sources of variation.  To the extent that the only differences between your groups are due to the IV (which is the case in a perfect experiment) there will be no confounds.  As you get sloppier, the potential gets bigger.
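The counterbalancing fix described under “Fixing confounds” can be sketched like this.  The participant IDs, the session labels, and the function name counterbalance are all assumptions made for the example.

```python
import random

def counterbalance(group_members, sessions=("morning", "afternoon"), seed=None):
    """Spread one group's members evenly across the sessions.
    The shuffle decides who lands in which session."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    members = list(group_members)
    rng.shuffle(members)
    return {p: sessions[i % len(sessions)] for i, p in enumerate(members)}

# Hypothetical participants in each condition.
experimental = ["E1", "E2", "E3", "E4"]
control = ["C1", "C2", "C3", "C4"]

schedule = {**counterbalance(experimental, seed=0),
            **counterbalance(control, seed=1)}
for participant, session in sorted(schedule.items()):
    print(participant, session)
```

Each condition ends up with half of its participants in each session, so time of day still varies but no longer covaries with the IV.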
4.  Running the experiment:
a.  Demand characteristics:  When part of your experiment suggests to participants what the hypothesis of the research might be or what the experimenter wants them to do.  This can lead to participants performing in the way they think they're expected to perform, instead of in the way they would perform naturally.  Sometimes, you can have demand characteristics that suggest something other than the real hypothesis, but which will still influence behavior.  You want to watch for these as well.
b.  Hawthorne effects:  The classic story (which may not be entirely accurate but will illustrate the situation suitably for our purposes):  Researchers were investigating the effect of light levels on productivity of factory workers at Western Electric's Hawthorne plant.  First, they turned on more light.  Productivity increased.  Then they turned down the light.  Productivity increased.  Even when the workers were toiling away in semi-darkness, productivity increased.  The problem was that just knowing they were being observed was causing the workers to change their behavior (who wants to look bad in the experiment?).  This is a problem you might have as well, so when observation can influence behavior, the observation should be done as discreetly as possible.
The point of this section:  You're working with humans, and they will not act like machines.  Knowing they're in an experiment can impact what they do.  You should be aware of that fact.
B.  Threats to external validity:  Generalizability:  As you become more and more control oriented, your experimental task ceases to resemble the real world phenomenon that you're trying to study.  You need to try and balance your need for control with your desire to make general statements about the world.  Control can be a two-edged sword.  Three issues here:
1.  Are the participants representative of the population?  Any time you work with a sample, you have to worry about whether they're really representative of the population.  Random sampling procedures (like we discussed with surveys) would help a lot, but they're rarely, if ever, used in experimentation.  Sometimes this is no problem, sometimes it is a problem.
2.  Are the variables you're using representative of the changes that happen in the real world?  If not, then your findings won't apply.  For example, manipulating background noise with a white-noise generator might not relate well to the real kinds of background interruptions people normally face.
3.  Is the experimental setting representative?  This is related to ecological validity.  The lab setting may not be representative of the normal setting for what you want to study.
V.  Choosing a DV.
A.  Just like with the IV, you start with a good operational definition.  In particular, for the “then” part of the hypothesis (what did we mean by “hard time concentrating?”).  Some things to think about as you choose a DV:
1.  Scales of measurement:  Ratio data is more informative than ordinal data, and allows us to perform more interesting tests.  If you keep this in mind during the planning stages, you can shoot for the highest quality of data.
2.  Sensitivity:  You need a DV that will be affected by changes in the IV.  If you did the Stroop experiment one word at a time (instead of in groups), then naming time in seconds would be a bad variable.  It's not sensitive enough to capture the differences.  You don't want to find no effect in your experiment just because your DV was too crude.  One thing you can do is build in a condition that should be very strongly affected, and use it like a manipulation check.  If there's a big difference between that condition and some other, then you know the DV was sensitive to at least some changes.
3.  Ceiling and floor effects:
a.  Ceiling effects:  When your task is too easy, and all participants perform at or near perfect, you have a ceiling effect.
b.  Floor effects:  When the task is too hard and everyone performs at the worst possible level.
Note the sensitivity issue here:  You need variability in performance.  If nothing else, you want differences between the groups in your experiment.  So, you need people to perform around 80-90% of perfect in your experiment to make room for the best participants to be better than the rest and the worst participants to be worse.
4.  Reliability:  As defined before:  How consistently your measure gives the same result.  You want a dependent measure that will consistently yield the same score in the presence of the same circumstances.  For example, if your questionnaire measures my achievement orientation trait, then it should yield the same score every time you use it to measure that trait.  As you just saw in 3, variability is good, but too much variability kills you.  If your measure isn't consistent in the scores it provides, you'll get so much variability that real differences will be obscured.
5.  Validity:
a.  Construct validity:  You need an IV that manipulates the appropriate construct, and you need a DV that measures the appropriate construct.  The issues here are very similar to the ones raised when choosing an IV.
b.  Face validity:  How well it seems to measure what you want to measure (a superficial sort of measure).  On the one hand, this is important to get people to believe you.  If your measure appears to bear no relationship to what you're claiming to measure, nobody will take it seriously.  But, if it's too transparent, you might get into some of those participant compliance issues.
VI.  Analysis.  For a two-group, between-participants design, you will use an independent samples t-test for the analysis.  The computations are complex enough that it's worth letting a computer do them for you.  When you finish, here's a sample of how to write up the results:

“The data were analyzed using an independent samples t-test.  The independent variable was amount of love, and the conditions were in love and not in love.  The dependent variable was concentration.  The mean concentration scores for people in love and not in love were 2.00 (0.71) and 4.80 (0.45) respectively.  With alpha = .05, the two population means were significantly different, t(8) = -7.57, estimated standard error = 0.37.”
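For a sense of what the computer is doing, here is a minimal sketch of a pooled-variance independent samples t-test using only Python's standard library.  The raw scores are made up:  they were chosen so that their means and standard deviations match the ones in the sample write-up, and they are not the actual data behind it.  The function name independent_t is also just for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

def independent_t(group1, group2):
    """Pooled-variance independent samples t-test.
    Returns (t, degrees of freedom, estimated standard error)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se, df, se

# Hypothetical concentration scores (illustration only).
in_love = [1, 2, 2, 2, 3]      # mean 2.00, SD 0.71
not_in_love = [5, 5, 5, 5, 4]  # mean 4.80, SD 0.45

t, df, se = independent_t(in_love, not_in_love)
T_CRIT = 2.306  # two-tailed critical t for df = 8 at alpha = .05, from a t table

print(f"t({df}) = {t:.2f}, SE = {se:.2f}")  # prints: t(8) = -7.48, SE = 0.37
print("significant" if abs(t) > T_CRIT else "not significant")
```

Computed without intermediate rounding, t for these scores is about -7.48.  Dividing the mean difference of -2.80 by the rounded standard error of 0.37 gives about -7.57, the value in the write-up, so small differences like this can come from rounding along the way.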

Where are we now?
We have a hypothesis.
We've chosen an IV.
We've chosen a design.
We've checked that design for problems (like confounds, etc.).
We've chosen a DV.

Now, we'll proceed to add some wrinkles to this basic design.

Research Methods Notes 9
Will Langston
