Langston, Research Methods, Notes 5 -- Measurement

I.  Goals.
A.  Introduction.
B.  Psychophysical scaling.
C.  Psychometric scaling.
D.  Reliability and validity.

II.  Introduction.  The problem is this:  Your experience is private.  I can't look at you and know what is inside.  But I want to measure that stuff.  The more directly I can measure it, the better.  It's been said that “Behavior is the road to subjective experience.”  In other words, I can infer from your behavior what must be in your head, but can I measure your mental life directly?  Trying to answer this question leads us into some very thorny and difficult areas.
A.  First, we have to talk about what numbers are and what they can stand for.  Depending on where the numbers come from (what sorts of measurements produce them), they can be more or less informative.  There are four types of measurement scales:
1.  Nominal:  The data are names or labels for categories.  If they're numerical, the numbers have no meaning outside of their function as labels.  For example, if I collect people's soft-drink preference and assign Pepsi = 1, Coke = 2, Mountain Dew = 3, etc., I couldn't say that Pepsi is better than Coke because it's a lower number.  All I can say is how many of each kind I got.
2.  Ordinal:  The order matters.  So, 1 is now ahead of 2.  But, I don't know anything about the distance between the numbers.  The space between 1 and 2 could be huge, but the space between 2 and 3 could be small.  If you think about the order of finish in a race (first, second, third), you have this kind of data.  The one who finishes first is first, but the places alone don't tell us by how much they beat the one who was second.
3.  Interval:  The intervals between the numbers are equal.  So, 2 - 1 = 4 - 3.  Temperature in degrees Fahrenheit or Celsius is an example of this.  The difference between 50 and 60 degrees is the same as the difference between 20 and 30 degrees.  If you have this kind of data, you can meaningfully do math (like take an average).  But, you can't make ratio statements (like 60 degrees is twice as hot as 30 degrees).
4.  Ratio:  The point that is zero on the scale really has none of the thing being measured.  So, if I said something had zero length, it has no length.  Note that on a Celsius scale, you don't get this property.  Zero is the freezing point of water, not “no temperature.”  With ratio data you can make ratio statements.  So, a line of two inches is twice as long as a line of one inch.
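One way to see the difference between interval and ratio scales is to change units and check which statements survive.  The sketch below is only an illustration:  the function name c_to_f and the particular values are my own choices, not part of any standard procedure.  Equal differences stay equal when Celsius is converted to Fahrenheit, but "twice as hot" does not, while "twice as long" holds in inches and in centimeters.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit (both are interval scales)."""
    return celsius * 9 / 5 + 32


# Interval property: equal differences remain equal after changing units.
diff_a_c = 60 - 50
diff_b_c = 30 - 20
diff_a_f = c_to_f(60) - c_to_f(50)
diff_b_f = c_to_f(30) - c_to_f(20)
print("Equal differences in C:", diff_a_c == diff_b_c)
print("Equal differences in F:", diff_a_f == diff_b_f)

# Ratio statements about temperature depend on the arbitrary zero point,
# so the "ratio" changes when the units change.
ratio_c = 60 / 30
ratio_f = c_to_f(60) / c_to_f(30)
print("60 C / 30 C =", ratio_c)
print("Same temperatures in F:", round(ratio_f, 2))

# Length has a true zero, so the ratio is the same in any unit.
CM_PER_INCH = 2.54
ratio_in = 2 / 1
ratio_cm = (2 * CM_PER_INCH) / (1 * CM_PER_INCH)
print("Length ratio in inches:", ratio_in, "and in cm:", ratio_cm)
```

The temperature ratio comes out as 2 in Celsius but not in Fahrenheit, which is why the "twice as hot" statement is not meaningful on either scale.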


III.  Psychophysical scaling.


IV.  Psychometric scaling.  When we can't measure the thing being presented on a physical scale, we have to do psychometric scaling.
A.  Guttman scaling.  An attempt to get a ratio scale out of otherwise nominal (maybe ordinal) data.  Imagine that I wanted to assess what kind of student you are.  I could have ordinal categories:  1 = Excellent, 2 = Above average, 3 = Average, 4 = Below average, 5 = Poor.  Instead, I can list behaviors in order from the least amount of student activity possible up to extreme amounts.  For example: attends class, does outside reading, does homework, studies regularly, meets with professor, etc.  Then, you say whether or not you do these things.
If the scale works (for example, people who answer “yes” to number five answer “yes” to 1-4, and people don't say “yes” to any item after the first “no”) then we can treat the number of “yes” responses as a score on a ratio scale.  With five items, a person who says “yes” to all of my items would get a five, “no” to all gets zero, and others can be in-between.  Then we can do all statistics with these numbers.  So, we get ratio data out of an otherwise troublesome situation.
Note:  We would be hard-pressed to come up with a solid definition of a good student that could be measured on a physical scale.
I have an example of a Guttman scale for test anxiety.
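The consistency check and the scoring rule for a Guttman scale can be sketched in a few lines.  This is a minimal sketch, assuming responses are recorded as True ("yes") or False ("no") in the same order as the items.  The function names is_guttman_pattern and guttman_score are hypothetical, and the five items are the ones from the student example.

```python
# Items ordered from the least to the most student activity.
ITEMS = [
    "attends class",
    "does outside reading",
    "does homework",
    "studies regularly",
    "meets with professor",
]


def is_guttman_pattern(responses):
    """Return True if no "yes" (True) appears after the first "no" (False)."""
    seen_no = False
    for answer in responses:
        if not answer:
            seen_no = True
        elif seen_no:
            return False
    return True


def guttman_score(responses):
    """Count the "yes" responses, refusing patterns that break the ordering."""
    if not is_guttman_pattern(responses):
        raise ValueError("a 'yes' follows a 'no', so the pattern is not cumulative")
    return sum(responses)


consistent = [True, True, True, False, False]
inconsistent = [True, False, True, False, False]
print(is_guttman_pattern(consistent), guttman_score(consistent))
print(is_guttman_pattern(inconsistent))
```

Raising an error for an inconsistent pattern is one design choice among several.  In practice you might instead count how many people break the ordering to judge whether the scale works at all.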
B.  Likert scales.  Another way to measure stuff that's hard to scale physically.  I have a theory that people can be extroverts or introverts.  I want a measure that's at least interval (to allow statistics) that assesses whether you're an extrovert or an introvert.  For a Likert scale, I give you items and you say how much those items apply.  For example:

At a party, I frequently talk to as many strangers as possible.

 Strongly agree = 1, Agree = 2, Neither agree nor disagree = 3, Disagree = 4, Strongly disagree = 5
You circle the one that applies.  I would also have an item like:

At a party, I usually only talk to the people I already know.

 Strongly agree = 5, Agree = 4, Neither agree nor disagree = 3, Disagree = 2, Strongly disagree = 1
I assign the choices a number which the person taking the test doesn't see.  I can sum your ratings on these items (note that I reversed the numbers for the second one because it's the opposite question) to get your score.  In this case, the higher your score, the more introverted you are.
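Here is a rough sketch of that scoring, using the two party items.  The names AGREEMENT, item_score, and introversion_score are made up for this illustration, and the sample answers are invented.  An item worded in the opposite direction is marked as reverse-keyed, and its number is flipped before summing.

```python
# Hidden numbers for the first (extroversion-worded) item.
AGREEMENT = {
    "Strongly agree": 1,
    "Agree": 2,
    "Neither agree nor disagree": 3,
    "Disagree": 4,
    "Strongly disagree": 5,
}


def item_score(response, reverse_keyed=False):
    """Return the number for one response, flipping it for reverse-keyed items."""
    value = AGREEMENT[response]
    if reverse_keyed:
        # On a 1-5 scale this maps 1 -> 5, 2 -> 4, 3 -> 3, 4 -> 2, 5 -> 1.
        return len(AGREEMENT) + 1 - value
    return value


def introversion_score(answers):
    """Sum item scores; answers is a list of (response, reverse_keyed) pairs."""
    return sum(item_score(response, reverse) for response, reverse in answers)


# Invented answers from one person.
answers = [
    ("Strongly disagree", False),  # "...I frequently talk to as many strangers as possible."
    ("Strongly agree", True),      # "...I usually only talk to the people I already know."
]
print(introversion_score(answers))
```

This person gets the maximum on both items, so their total is as introverted as two items allow.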
A problem here is that people probably don't treat these as interval.  Is the difference between “strongly agree” and “agree” exactly the same as the difference between “neither agree nor disagree” and “disagree”?  If they're not the same distance apart, this is only ordinal data.  The idea is that by summing scores from several items, we overwhelm minor differences, and we get data that we can treat as approximately interval.
I have a Likert scale that also measures test anxiety.


V.  Reliability and validity.
A.  Assessing reliability:  Reliability has to do with consistency of measurement.  I want a measuring tool that always gives the same value to the same amount of the thing being measured.  For example, if a person weighs 100 pounds, a reliable scale will always register 100 pounds when they get on it.  How can I assess reliability?
1.  Test-retest reliability:  I measure the strength of the relationship between people's scores on a test the first time they take it and their scores the second time they take it.  If the correlation is high, the test is reliable.  If it's low, the test is not reliable.
2.  Split-half reliability:  I correlate people's scores on the two halves of the test (usually items in a “half” are chosen at random and not just the first half encountered vs. second half).  If the halves correlate well, the test has high reliability.
Why is this so important?  I want to measure changes.  For example, changes in your depression after therapy.  If there's a change, then you will get a different score on a test before therapy than you'll get after therapy.  If the test is reliable, I can believe that the difference in scores is due to the therapy.  If the test is unreliable, the difference could be due to the therapy or it could be due to error in the test.  In other words, unreliable tests make causal relationships harder to determine.
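Both reliability checks come down to computing a correlation.  The sketch below is one way to do it, assuming a Pearson correlation is the measure of "strength of the relationship."  The function names (pearson_r, retest_reliability, split_half_reliability), the seed parameter, and all of the scores are invented for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def retest_reliability(first_scores, second_scores):
    """Correlate each person's first score with their second score."""
    return pearson_r(first_scores, second_scores)


def split_half_reliability(item_ratings, seed=0):
    """Correlate totals on two randomly chosen halves of the items.

    item_ratings holds one list of item ratings per person.
    """
    n_items = len(item_ratings[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)  # random halves, not first vs. second
    half_a = order[: n_items // 2]
    half_b = order[n_items // 2:]
    totals_a = [sum(person[i] for i in half_a) for person in item_ratings]
    totals_b = [sum(person[i] for i in half_b) for person in item_ratings]
    return pearson_r(totals_a, totals_b)


# Invented scores for five people who took the same test twice.
first_scores = [10, 14, 18, 22, 30]
second_scores = [11, 13, 19, 21, 29]
print("Test-retest r:", round(retest_reliability(first_scores, second_scores), 3))

# Invented ratings on a four-item test, one row per person.
item_ratings = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 4],
    [5, 5, 4, 5],
]
print("Split-half r:", round(split_half_reliability(item_ratings), 3))
```

The closer each value is to 1, the more consistent the measurement.  Note that the correlation is undefined if everyone gets the same score, because there is no variation to correlate.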
B.  Assessing validity:  I also need a measuring tool that measures what it claims to measure.  A depression instrument should measure depression, not phobias.  How can I assess this?
1.  Criterion validity:  We know that depressed people are slower at finger-tapping than non-depressed people.  If you ask a depressed person to tap their finger as fast as possible, they'll have fewer taps per minute than a non-depressed person.  We can use this criterion for depression to check the validity of our test.  If people who score as depressed on my test are also slow finger-tappers, the test has good criterion validity.
2.  Face validity:  On the surface, it looks like a good way to measure what I'm interested in.  Finger-tapping has low face validity for depression, but it works as a test, so you should be suspicious of this type of validity assessment.
3.  Predictive validity:  If your measure is valid, then it should reliably predict certain behaviors.  For example, people who score as depressed on your test should show evidence of a sad mood.  If they do, it has good predictive validity.
In general, what you're trying to do is figure out if your measure is accurately tapping into the correct underlying construct.  The more ways you can demonstrate that you're measuring what you should be measuring, the better your measure.