Langston, Cognitive Psychology, Notes 2 -- Pattern Recognition
 
I.  Goals.
A.  The scheme.
B.  Input.
C.  Identification.
D.  Recognition/meaning.
 
II.  The scheme.  How do you know what you're looking at?  (Hearing?  Touching?  Smelling?  Tasting?)  That's the question for this section.  Keep in mind that our overall program here will be a move from inputs to outputs.  We're going to try to work from the bottom (identifying stuff in the world) to the top (thinking).  Remember that this is the basic model we're working from:
 
[Figure:  Information processing model]
 
This unit will be about the sensory store (and pattern recognition needed to get the system going).  There are three components to recognizing something:
A.  Input.  This is primarily about sensory systems.  Some very brief storage of information takes place and attention samples some of that stuff for us to identify.
B.  Identification:  The first stage in getting the meaning.  What are the parts of the thing (features).  Or, if you're holistically bent, what is the overall gestalt of the thing.  This could be done via low-level feature modules or it could be more complicated.  One thing we know:  More information leads to better identification.
C.  Attach meaning (recognize):  This is the part where it gets interesting.  What is the representation of meaning?  What processes act on that representation?  A lot of books just equate recognition with identification, but that's cheating.  It's one thing to know the letters of a word.  The meaning of the word is entirely different from its letters or even how the letters are arranged.  You might be able to experience this if you repeat a word over and over.  For example, you might never have thought about how strange a word like "over" is until you study it separate from its meaning.
We'll handle each of these stages in turn.
 
III.  Input.  As you should remember from our basic architecture, there's a brief sensory store that holds information for processing.  Whenever we look at one of the boxes we'll generally ask:  What's the capacity, what's the duration, what's the code, and how does information get removed?  Sperling (1960) did an important experiment working out the properties of this store.  Here's his basic task:
You look at a grid of letters presented for a very brief period of time (1/20 second):
 
G T F B
Q Z C R
K P S N
 
CogLab:  Partial Report.  We'll look at the class data.

Then, I ask you to tell me the letters (whole-report technique).  Or, I can play a tone.  If you hear a high tone, report top row.  Middle tone report middle row.  Low tone report bottom row.  This is called the partial report technique.
Using the partial report technique, Sperling found that people could normally report about three letters per row.  So, they can report about 75% of the information in the display.  That's the capacity of your sensory store.
Then he manipulated the time before the tone.  If you play the tone immediately, people get between 9 and 10 letters (three letters per row * three rows gives around 9, which is 75% of the 12 letters).  Then he delayed the tone.  The delays were 150 msec, 300 msec, or 1 second.  By one second, people were back to around 4.5 letters, which is the same amount as the whole report technique gets.  So, the information in the sensory store seems to decay within one second.
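To make the arithmetic behind that estimate explicit, here's a minimal sketch in Python.  The function and argument names are made up for illustration; the numbers (3 of 4 letters per row, 12 letters total) come from the description above.

```python
def estimated_letters_available(reported_per_row, row_length, total_letters):
    """Scale the cued-row proportion up to the whole display.

    The logic of partial report: people don't know which row will be
    cued, so the proportion they report from the cued row estimates
    the proportion available from every row.
    """
    proportion = reported_per_row / row_length
    return proportion * total_letters


# 3 of 4 letters in the cued row, 12 letters in the display
print(estimated_letters_available(3, 4, 12))  # 9.0
```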
You might be asking yourself why this is relevant.  The environment is relatively continuous, so if you miss something, you can look again.  In other words, the duration and amount of information in the sensory store might just be an artifact of the situation, and in the real world, it's all irrelevant.  This is a nice cognitive psychology problem.  Either your mind works contrary to intuition (using brief snapshots of information for further processing even though the original persists in the environment) or a laboratory result has no relevance for real processing.  Your book has a nice discussion of how one might decide which of these views is correct.  Some nice reaction papers could come from this.
Conclusion:  The raw sensory data for pattern recognition is about 75% of the input, and lasts for less than one second.  How do you go about using this information?
 
IV.  Identification.  We want to see how much we can do based entirely on the inputs.  Bottom-up processing is processing that is unaffected by higher cognitive processes (like knowing what you think you're looking at).  It's driven by the stimulus.  Top-down processes are coming from higher cognitive processes.  If we can, we want to avoid talking about top-down stuff.  If there's enough information in the input to do the work, then we don't need anything else.  That's antithetical to cognitive psychology, but it can at least convince us that cognitive stuff is involved.
A.  There are three ways that identification might work.  I'm going to start with a list now, and later we'll discuss potential applications.
1.  Template matching.  You store a perfect copy of everything that you might encounter.  Then, when you see something (say a cat) you compare that image to everything in memory and you take the best match.
Support:
a.  Instance theories of memory (you store some trace of everything you encounter) work pretty well.  For example, Hintzman (1986) presented a model that recorded a trace of experience based on "transducable" features.  The details of an experience would include things like emotional tone, color, odor, etc. as well as primitive abstract relations like above, before, etc.  Memory retrieval (or recognition) was based on an echo of all traces currently in memory.  So, the echo is like a template, but it's abstract.  If instance models can store a trace of every experience, the number problem with templates (discussed below) isn't as big a deal.
b.  Perceptual priming effects (repetition priming).  Example:  Jacoby (1983).  People look at a list of words.  They can do three tasks with those words.  The condition we're interested in is reading the word.  Then, they get either an explicit or implicit memory test.  For explicit, the old words are mixed with new words, and people circle the one they saw before.  For the implicit test, people identify the words under constrained conditions (the words are presented really fast).  As you can see in the results, the implicit task actually showed better performance.  The point:  You could easily store a template of every item that you encounter because you'd have to store a copy of everything to produce repetition priming.
Problems with the template model:
a.  Too much memory (too much variability).  In spite of what's above, there are a lot of things to store with this model.  Maybe too many.
b.  Problems in matching (orientation, position, size, etc.).
c.  Don't know what produces the match (two things can be close, but different; without analyzing features, how can you tell which is closer?).
d.  Can't produce two interpretations of the same thing (for example, Necker cube).
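Here's a rough sketch of template matching, just to make the idea (and problem b) concrete.  Everything in it is an assumption for illustration:  the 3x3 grids, the three stored templates, and the "count the matching cells" rule are not from any particular model.

```python
# Hypothetical 3x3 templates (1 = ink, 0 = blank).
TEMPLATES = {
    "T": ["111",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "I": ["010",
          "010",
          "010"],
}


def overlap(image, template):
    """Count the cells where the image and the template agree."""
    return sum(
        a == b
        for image_row, template_row in zip(image, template)
        for a, b in zip(image_row, template_row)
    )


def best_match(image):
    """Return the name of the stored template with the highest overlap."""
    return max(TEMPLATES, key=lambda name: overlap(image, TEMPLATES[name]))


noisy_t = ["111", "010", "011"]    # a T with one stray mark
shifted_i = ["100", "100", "100"]  # an I moved one column to the left

print(best_match(noisy_t))    # T
print(best_match(shifted_i))  # L
```

The noisy T still matches its template, but simply moving the I one column over makes it look more like the L template than the I template.  That's the position problem in miniature.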
2.  Feature models.  Everything can be broken down into a set of features.  This set has these properties:
a.  The features are critical (they allow you to tell things apart because different things have different features).
b.  Features are the same when brightness, size, and perspective change.
c.  The features yield a unique pattern for every input.
d.  Reasonably small number of features.
If you do the feature part well, you get 2^N patterns that you can identify with N features (each feature is either present or absent).  So, with 2 features, you can identify 4 things.  With 8 features, you can identify 256 patterns.  With 20 features, you can identify 1,048,576 patterns.  (Note that this is why a well-played game of 20 questions should almost always be solved.)
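A quick check of those counts, assuming each feature is simply present (1) or absent (0).  The function name is made up for this sketch.

```python
from itertools import product


def pattern_count(n_features):
    """Number of distinct present/absent patterns over n_features features."""
    return 2 ** n_features


for n in (2, 8, 20):
    print(n, pattern_count(n))

# With 2 features, the 4 patterns can be listed outright.
two_feature_patterns = list(product((0, 1), repeat=2))
print(two_feature_patterns)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```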
Gibson (1968) worked out a model for letters.  This chart has the features (along the left) and the letters (across the top).  With this set, you can distinguish typed, capital letters in English.
Support:
a.  Confusion matrices.  For example, if you compare G and W it takes on average 458 msec, but it takes 571 msec to compare R and P (R and P share more features).  Note that this is also consistent with template matching.  If you work out a complete matrix for letters, you can make an argument about which pairs are harder based on features.  If we look at Gibson's table, we can probably do a little of that.  (Gibson, Schapiro, & Yonas, 1968)
b.  Cluster analyses.  If you look at a bunch of comparisons and figure out what goes together, shared features seem to be an important determinant.  For example, curves separate from straight lines, and C and G form a category.  The complete cluster analysis is presented here.  You can see how letters aggregate into smaller and smaller groups on the basis of shared features.  (Johnson, 1967)
c.  Face recognition.  Caricatures that exaggerate distinctive features are recognized faster than faithful line drawings.  (Rhodes, Brennan, & Carey, 1987)
d.  Brain organization.  Certain cells respond best to lines of a particular orientation of a particular length.  Your brain seems to be looking for features.
Two examples of this.  In frogs, if you record from retinal ganglion cells (which bundle together to form the optic nerve), you can find four cell types:  1)  stationary edges, 2)  moving edges, 3)  dimming, and 4)  small, moving spots.  Most of these cells were also location specific, so they wanted a particular thing in a particular spot.  (Lettvin, et al., 1959)
Farther back in the cortex are cells that are not location or eye specific, but still look for particular features.  For example, simple cortical cells look for edges of a particular orientation, from vertical to horizontal, and angles in between.  (Hubel & Wiesel, 1963)
Both of the neural studies just discussed suggest that feature detection is an integral component of brains doing recognition, so our cognitive models should probably take that into account.
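To see how a feature account predicts which letters get confused, here's a small sketch that counts shared features.  The feature assignments below are hypothetical and much cruder than Gibson's chart; they're only meant to show the logic (more shared features, harder comparison).

```python
# Hypothetical feature sets for four capital letters (not Gibson's actual chart).
FEATURES = {
    "G": {"open_curve", "horizontal"},
    "W": {"diagonal"},
    "R": {"vertical", "closed_curve", "diagonal"},
    "P": {"vertical", "closed_curve"},
}


def shared_features(a, b):
    """Number of features two letters have in common."""
    return len(FEATURES[a] & FEATURES[b])


print(shared_features("G", "W"))  # 0
print(shared_features("R", "P"))  # 2
```

With these made-up features, R and P overlap and G and W don't, which lines up with the slower R/P comparison reported above.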
Problems with feature models:
a.  It's hard to get a set of features with the properties we want.  Letters are pretty easy, but what about things like desks and chairs or dogs and cats.  Do we need custom sets for different classes of things?  Where do they come from?
b.  It's hard to tell if these predictions are different from template theory.
c.  When you ignore how features combine, you're making a big mistake.  You need to know the relationships as well as the features.
3.  Structure models (recognition by components).  As the gestaltists say, the whole is greater than the sum of its parts.  Analyzing the features may be a step, but putting them together is where the action is.  The analysis involves grouping things together that go together (the general problem is figure-ground segregation:  How do you tell the object from the background?).  Here are five grouping principles:
a.  Proximity:  Stuff that's close together is part of the same unit.
b.  Similarity:  Stuff that looks alike goes together.
c.  Continuity:  The perceptual system prefers continuous interpretations to discrete interpretations.
d.  Closure:  Closed figures are preferred.
e.  Connectedness:  Stuff that's joined gets grouped as one object.
Support:
a.  Object recognition stuff shows that eliminating the edges isn't as bad as eliminating the vertices (the vertices show how the parts go together).  So, features without relations are not much help.
 
V.  Recognition/meaning.  Assume we know the features and their relationships.  Are we done?  Not really.  Neuropsychological evidence suggests that meaning is separate from identification.  Luria had a patient who could make out letters and letter features, but could no longer figure out what words meant.  In prosopagnosia, a person loses the ability to recognize faces.  Generally, face discrimination is fine; the faces just lack meaning.  Other disorders also suggest that getting all the information for meaning and the meaning itself are separate.  Let's look at some kinds of recognition that you do and see how you get meaning.
A.  Letter recognition.  A good place to start is with a survey of the inputs for letter recognizers.  These could be printed letters or handwriting.  I'm going to skip handwriting because it's so complicated.  Let's look at some properties of machine type.  (See Crist, W.B., & Lockhead, G.R.  (1980).  Making letters distinctive.  Journal of Educational Psychology, 72, 483-493 for an example of research on this topic.)
1.  Font characteristics:
 
A list of some basic properties, each followed by the best option:
a.  Serif vs. sans-serif:  A serif is the little horizontal mark on the tops and bottoms of vertical lines in fonts (the line on the bottom of this f).  Originally, it was used when cutting letters in stone to prevent the stone from cracking, but it was preserved out of tradition.  Luckily, serif is better.  Best option:  Serif.
b.  Weight difference:  Some lines in a character are thicker than others (e for example).  Best option:  Difference.
c.  Bias:  The fonts can be on a bias or they can be vertical.  Best option:  Biased.
d.  x-height:  How much height is devoted to the body of the characters (how tall the x is).  Best option:  More is better.
e.  Spacing:  Some characters are wider than others (i vs. e).  Typewriters force these to take up the same amount of space, but it's not required in printing (proportional is when they take up only the required space; compare "piece" in typewriter spacing with "piece" in proportional spacing).  Best option:  Proportional.
f.  Proportions:  How big is the x-height relative to the heights of ascenders and descenders (parts going above and below the body of the letter).  There is an optimal proportion for each font.  Best option:  Optimal.
 
 
2.  Impact of features.  Fonts can have a huge influence on identification.  As an example of the importance of features on letter identification, consider an experiment by Neisser on visual search:  You scan for an 'X' in a field of 'Z's and 'N's or a field of 'O's and 'P's.  'X' is easier to see in letters with different features than letters with similar features.  Try it:
 
N N Z N Z N Z N Z
Z N Z Z N Z Z N N
N N N Z N X N Z N
N N Z N Z N Z N Z
Z N Z Z N Z Z N N
 
O O P O P O P O P
P O P P O P P P O
O O P P O X P O P
O O P O P O P O P
P O P P O P P P O
 
The X should "pop out" of the grid with dissimilar features.  A similar process of analysis-by-features probably takes place in reading, making it very important to understand the features.
B.  Word recognition.  Once you have letters, you need to recognize words.  The letter features still apply, but now we add some additional features.
1.  Word envelope.  If you outline the word, that's the word envelope.  This holistic feature may help in identification.
2.  Spelling rules (orthography).  If there are rules that govern spelling patterns, then knowledge of these rules can help the process of identifying words.  The system for English was worked out by Venezky.
a.  Some of the big rules:
1)  Avoid letter doubling.
2)  V C V,  V C C V,  V C:  A vowel before a consonant-vowel is long, a vowel before a C-C-V is short, a vowel before a consonant is short.
a)  To override V C, add a dummy 'e' to get V C V.  Examples:  "fin", "can" vs. "fine," "cane."
b)  To override V C V, you have to double.  Example:  "cunning."
3)  Especially avoid doubling at the beginnings and endings of words.
a)  Except for ff, ll, ss.
b)  Except for 3-letter words (egg, inn, add, ebb).  Why?  18th century editors decided to reserve 2-letter words for function words (to, in).
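The vowel-length patterns in rule 2 are mechanical enough to sketch in code.  This is a deliberately simplified illustration:  it only looks at the first vowel, treats only a, e, i, o, u as vowels, and returns None for any pattern the rule doesn't mention.  The function name is made up.

```python
VOWELS = set("aeiou")  # simplification: 'y' is always treated as a consonant


def first_vowel_length(word):
    """Predict 'long' or 'short' for the first vowel from what follows it."""
    word = word.lower()
    i = next(idx for idx, ch in enumerate(word) if ch in VOWELS)
    # Describe up to three letters after the vowel as a C/V pattern.
    tail = "".join("V" if ch in VOWELS else "C" for ch in word[i + 1:i + 4])
    if tail.startswith("CV"):
        return "long"    # V C V
    if tail.startswith("CCV"):
        return "short"   # V C C V
    if tail == "C":
        return "short"   # V C at the end of the word
    return None          # pattern not covered by the rule


for w in ("fin", "fine", "can", "cane", "cunning"):
    print(w, first_vowel_length(w))
```

The same function gives a pronunciation guess for strings you've never seen:  "mabe" and "mabing" come out long, "mab" and "mabbing" come out short.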
Why do these features matter?
1.  They give you hints at pronunciation even if you've never seen the word before ("mabe," "mab," "mabing," "mabbing").
2.  They help you to know what letters to expect in a particular situation.
3.  Example of orthography constraints:  Word superiority effect.  Letters are perceived better in words than alone or in scrambled words.  Imagine the following experiment (Reicher, 1969):
 
See       Choose
d         d or k?
 
vs.
 
See       Choose
word      d or k?
 
vs.
 
See       Choose
rwod      d or k?
 
People can identify the letter better when it appeared in a word than when it appeared alone.  There are other examples of similar effects.  For example, Huey found that words can be perceived at distances that are too great to perceive the letters that make up the words.  All of this suggests that words are the unit.  But, you clearly have to see letters too.

CogLab:  Word Superiority.  We'll look at the class data.

Note that the finding is paradoxical.  The word helps you identify the letters, but you should have to look at the letters before you can identify the word.
Miller, Bruner, and Postman (1954) show how word superiority could be due in part to spelling rules.  They made strings of letters that get closer and closer to English.  A zero-order word would be YRULPZOC.  This word is unrelated to English spelling.  A fourth-order approximation would be VERNALIT.  All successive sets of four letters match English spelling.  The closer to English a string is, the easier it is for people to remember it.
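One way to see what an "order of approximation" means is to generate strings yourself.  The sketch below is an assumption-laden illustration, not the procedure from the original study:  the sample text is made up and tiny, zero-order picks letters with equal probability, and for order k each new letter is chosen to follow the previous k-1 letters somewhere in the sample.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for "English"; a real version would use a large corpus.
SAMPLE = "THEVERNALEQUINOXMARKSTHESTARTOFSPRINGINTHENORTHERNHEMISPHERE"


def build_table(text, order):
    """Map each (order - 1)-letter context to the letters that follow it."""
    k = order - 1
    table = defaultdict(list)
    for i in range(len(text) - k):
        table[text[i:i + k]].append(text[i + k])
    return table


def approximate(text, order, length, rng):
    """Generate a letter string that is an order-th approximation to text."""
    if order == 0:
        alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        return "".join(rng.choice(alphabet) for _ in range(length))
    k = order - 1
    table = build_table(text, order)
    start = rng.randrange(len(text) - k)
    out = text[start:start + k]
    while len(out) < length:
        followers = table.get(out[len(out) - k:])
        if not followers:  # dead end: context only occurs at the end of the sample
            break
        out += rng.choice(followers)
    return out


rng = random.Random(0)
print(approximate(SAMPLE, 0, 8, rng))  # unrelated to English spelling
print(approximate(SAMPLE, 4, 8, rng))  # every 4-letter window occurs in SAMPLE
```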
The best model for explaining word superiority is the interactive-activation model (McClelland & Rumelhart, 1981).  This model has three layers of nodes.  Feature nodes take features as input.  Letter nodes are activated by feature nodes.  For example, if you have / and ), that would activate 'D', 'R', 'B', etc.  Word nodes are activated by letter nodes.  For example, if the letters 'W', 'O', and 'R' were activated, "WORD" and "WORK" would be coming on.  Information from letters also feeds down to features.  If you're pretty sure you have a 'K', then you can suppress all the curved features.  Words also feed down to letters.  If you think it's "WORD," you can knock out the 'K'.  If you look at graphs of letter activations, you can see how this model produces word superiority.  When you have other letters they can activate a word, and the word can help you with the letters.  When the letter is by itself, it doesn't get this help.
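Here's a toy, one-pass sketch of the feedback idea.  It is not the published model, which updates activations over many cycles and includes inhibition between competing nodes; the two-word lexicon, the evidence values, and the feedback strength are all made up.  It only shows why a letter in a word gets help that a letter by itself doesn't.

```python
WORDS = ["WORD", "WORK"]  # a two-word lexicon, just for illustration
FEEDBACK = 0.5            # hypothetical strength of word-to-letter feedback


def word_activations(evidence):
    """Bottom-up: a word's activation is the mean evidence for its letters."""
    return {
        word: sum(evidence[i].get(ch, 0.0) for i, ch in enumerate(word)) / len(word)
        for word in WORDS
    }


def letter_activation(evidence, position, letter):
    """Letter evidence plus feedback from every word with that letter there."""
    bottom_up = evidence[position].get(letter, 0.0)
    top_down = sum(
        activation
        for word, activation in word_activations(evidence).items()
        if word[position] == letter
    )
    return bottom_up + FEEDBACK * top_down


# Evidence at each of four positions: letter -> strength (made-up numbers).
in_word = [{"W": 0.9}, {"O": 0.9}, {"R": 0.9}, {"D": 0.4, "K": 0.3}]
alone = [{}, {}, {}, {"D": 0.4, "K": 0.3}]

print(letter_activation(in_word, 3, "D"))
print(letter_activation(alone, 3, "D"))
```

The sensory evidence for 'D' is identical in both displays, but its final activation is higher in the word display because "WORD" is strongly activated by the other letters and feeds back to it.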
C.  Speech perception.  This is a special problem for recognition.  You hear lots of speech, it's really hard to identify the sounds, you do it anyway.  How?
The first thing to ask is:  How can we describe speech?  Some terms:  A phone is a sound.  You can make around 4,096 sounds.  However, in a database of (nearly) every human language, only 869 distinct phones are used.  Most of these are very rare (occurring only once or twice), with a smaller set (100 or so) accounting for most of the sounds used in languages.  A phoneme is a sound that changes meaning.  Languages cluster phones together to get a phoneme.  The individual phones within a phoneme are called allophones.  A phoneme conveys meaning in the sense that changing from one phoneme to another will change the meaning of the word being produced (as in going from "bit" to "pit", one sound changed and the meaning changed with it).
How do languages "choose" phonemic differences?  The general plan is to maximize distinctiveness.  Look at the plot of vowel space below:
 
[Figure:  Vowel space]
 
These graphs plot the first frequency component of vowel sounds against the second.  The line represents a boundary between sounds humans can produce and sounds that they can't.  Inside the line is possible, outside isn't.  If a language has just three vowel sounds, odds are it uses the three in the first picture.  With five, it's likely to use the five in the second.  Note how this spreads the vowels as far as possible in the space that humans can produce.  This makes them easier to discriminate while listening to speech.  The general rule is "ease of articulation is secondary to distinctiveness."  For instance, it's easier to say the vowel in "bit" than in "beet," but the one in "beet" is more discriminable, so it's much more likely to be used.
There are two ways to classify speech, and both have their special features so we'll discuss both of them:
1.  Articulatory phonetics:  Articulatory phonetics describes speech sounds in terms of the vocal tract mechanisms used to produce them.  The sounds come out of a system involving an air source (lungs), a sound source (larynx:  vocal cords), and filters (pharynx:  chamber in the throat, mouth, and nasal passages).  Speech signals can be analyzed according to the contributions these parts make.  The sound source can allow the sound to be either voiced (you make sound) or voiceless (no sound).  The filters can disrupt the air flow by stopping it, causing turbulence, or modifying it.  These disruptions can take place in several locations along the vocal tract.  This leads to three dimensions along which sounds vary:  voicing (voiced or voiceless), manner (how the air flow is disrupted), and place (where the air flow is disrupted).  Overall, a sound is:  air + voicing + manner + place.
This matters because a prominent model of speech recognition is the motor theory.  The idea is that you recognize speech by mentally working out how you would have to position your mouth to produce the sounds.  Obviously, knowing something about sound production would make a big difference.
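The three articulatory dimensions can be written down as a small data structure.  This is a sketch with invented names (Sound, SOUNDS, differing_dimensions) covering only four consonants; the voicing, manner, and place values are the standard descriptions of those sounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sound:
    voiced: bool  # voicing: is the sound source (vocal cords) active?
    manner: str   # how the air flow is disrupted
    place: str    # where the air flow is disrupted


SOUNDS = {
    "b": Sound(True, "stop", "bilabial"),
    "p": Sound(False, "stop", "bilabial"),
    "d": Sound(True, "stop", "alveolar"),
    "s": Sound(False, "fricative", "alveolar"),
}


def differing_dimensions(a, b):
    """List the dimensions on which two sounds differ."""
    x, y = SOUNDS[a], SOUNDS[b]
    return [
        name for name in ("voiced", "manner", "place")
        if getattr(x, name) != getattr(y, name)
    ]


print(differing_dimensions("b", "p"))  # ['voiced']
print(differing_dimensions("d", "s"))  # ['voiced', 'manner']
```

So the "bit" vs. "pit" contrast comes down to a single dimension:  same place, same manner, different voicing.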
2.  Acoustic phonetics:  The other way to characterize the speech signal.  We're no longer interested in describing how it's produced.  Instead, we want to know what is produced.
a.  Primary methodology:  Spectrogram:  Plot the frequencies in a sound over time, with darkness showing intensity.  How does it work?  Imagine a long row of tuning forks, each responding to a particular pitch.  No two forks are alike, but the differences between each one are very slight.  Arrange these in order from highest to lowest pitch.  Then, hook an electrode to each that sends a charge when it vibrates.  Hook a pen to the other end of the electrode.  When you pass a sound over these forks, each fork will only vibrate if its pitch is in the sound.  So, there will only be marks on the paper corresponding to the pitches in the sound.  This produces a spectrogram:  a recording of the pitch (frequency) components of a sound.  (The machine that makes the recording is a spectrograph.)
Some important points about spectrograms:
1)  The sounds you make are complex.  In other words, they're composed of many different frequencies.  The spectrogram breaks sounds into these frequencies.  Loosely speaking, each dark band represents a frequency.  Each band is called a formant.  These are numbered from bottom to top.  So, F1 (first formant) is the lowest, then F2, ...
Formant transitions are places where there's a sharp rise or fall in a formant.  Generally, these correspond to consonants.  A steady state is a place where there is little change in a formant.  These generally correspond to vowels.
2)  The darker the band is, the higher the intensity (loudness) of that sound.
3)  As you go from left to right you can see how the sounds change in time.
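The tuning-fork picture can be imitated in a few lines of code.  This is a sketch under stated assumptions:  the "sound" is an artificial mix of two pure tones (500 Hz and 1500 Hz), the bank has only eight "forks," and each fork's response is computed by correlating the sound with a sine and cosine at that fork's pitch (a one-frequency Fourier analysis).  It analyzes a single slice of time; a real spectrogram repeats this for successive slices.

```python
import math

SAMPLE_RATE = 8000  # samples per second
N_SAMPLES = 320     # 0.04 second of sound


def fork_response(signal, pitch_hz):
    """How strongly a 'fork' tuned to pitch_hz responds to the signal."""
    real = sum(
        s * math.cos(2 * math.pi * pitch_hz * n / SAMPLE_RATE)
        for n, s in enumerate(signal)
    )
    imag = sum(
        s * math.sin(2 * math.pi * pitch_hz * n / SAMPLE_RATE)
        for n, s in enumerate(signal)
    )
    return math.hypot(real, imag) / len(signal)


# Artificial sound: a strong 500 Hz tone plus a weaker 1500 Hz tone.
signal = [
    math.sin(2 * math.pi * 500 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1500 * n / SAMPLE_RATE)
    for n in range(N_SAMPLES)
]

forks = range(250, 2001, 250)  # forks at 250, 500, ..., 2000 Hz
responses = {pitch: fork_response(signal, pitch) for pitch in forks}

for pitch in forks:
    print(pitch, round(responses[pitch], 3))
```

Only the 500 Hz and 1500 Hz forks respond, and the 500 Hz fork responds more strongly (a darker band).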
Now that we know how to read these, we can look at a topic in speech perception that makes it such a hard task.
b.  Problems in perception:  Context conditioned variation:  The phoneme is different (physically) depending on what's around it.
For example, the second formant in "di" is totally different from "du", but hearers perceive both as having a /d/ at the beginning.
The question is:  How can two totally different physical stimuli be classed as the same thing?  This problem is also called "lack of invariance."  In order to classify sounds they need to be consistent (invariant).  Since they're not, you suffer from lack of invariance.
D.  How does context affect recognition?  So far, it might sound like the process is data-driven (the inputs dictate what happens).  But, recognition is also conceptually driven (your idea of what you're looking for affects recognition).  Context is one place where this comes into play.  Consider an experiment by Biederman (1972).  If you take a natural scene (like a kitchen) and a rearranged version of it (same stuff, but not in the usual arrangement) people recognize objects more rapidly in the properly arranged scene.  Context (the other items) can help when it's accurate because it directs attention to the right location and tells you what's there.  There are lots of other context effects, and we'll return to this when we get to language.
E.  So, what about meaning?  Let's consider words because they're a little easier.  What is the meaning of a word?  Two simple ideas:  The meaning is what the word refers to in the world, or the meaning is the mental image a word evokes.  These are wrong.  Think about the meaning of the words "young girl."  If they mean what they refer to, then they only mean something in the context of some particular young girl.  You'd probably agree that that isn't correct.  As for images, think of a cat.  We're probably getting different images, but we can easily agree on the meaning.  Therefore, the meaning must be more than the image.  What is it?
How about a two part model.  The meaning is in propositions and models derived from them.  A proposition is an idea unit.  Take a sentence like "the star is right of the circle."  You can make the necessary propositions.  You also need a model that you can match to perceptual experience to see if the statement is true.  With those two things, the meaning is specified.  This is also probably incorrect, but it's an idea.
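As a rough illustration of the two parts, here's a sketch in which the proposition is a (relation, argument, argument) triple and the model is a set of positions it can be checked against.  The representation, the coordinates, and the function name are all assumptions made for this example.

```python
# Proposition for "the star is right of the circle."
proposition = ("right_of", "star", "circle")

# Model: hypothetical left-to-right positions of the objects in a scene.
model = {"circle": 1, "star": 3}


def holds(prop, scene):
    """Check a spatial proposition against a model of the scene."""
    relation, a, b = prop
    if relation == "right_of":
        return scene[a] > scene[b]
    if relation == "left_of":
        return scene[a] < scene[b]
    raise ValueError(f"unknown relation: {relation}")


print(holds(proposition, model))                    # True
print(holds(("left_of", "star", "circle"), model))  # False
```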
F.  How is meaning grounded?  A variant of the Chinese Room problem is for you to imagine yourself just off the plane in China.  You have a Chinese dictionary.  When you encounter a sign, you look up its symbols in your dictionary.  These take you to more symbols, etc.  At what point would you say you know the meaning of the sign?  Without grounding the symbols in some way, the answer is you would probably never figure out the meaning.  How do we ground symbols?  One hypothesis comes from research on embodied cognition.  Symbols are grounded in the relationships between our bodies and the environment.  The conceptual system can start with simple relations and gradually build to complex representations.  We'll look at a couple of areas of research as we think about this:
1.  The action-sentence compatibility effect (ACE).
2.  Constructions.
3.  Gesture.
The basic point of this section is that it's not enough to identify the features or show that people are attending to the features.  Recognition is more than just getting a list or accessing the correct symbol.  For something to have meaning, it has to be more than that.
 

Cognitive Psychology Notes 2
Will Langston
