Langston, Psychology of Language, Notes 3 -- Speech Perception
 
I.  Goals.
A.  What is speech?
B.  Articulatory phonetics.
C.  Acoustic phonetics.
D.  Topics in speech perception.
 
II.  What is speech?  Speech can be analyzed at many levels.  The three big ones:
A.  Acoustic:  This is an analysis at the physical level.  It's made up of frequencies, durations, and intensities.  We'll come back to acoustics a bit later.
B.  Phonetic:  This is the level of speech sounds.  Each of the possible sounds the human vocal system can produce is called a phone.  Theoretically, you can produce around 4,000 phones.  However, in a database of (nearly) every human language, only 869 distinct phones are used.  Most of these are very rare (occurring only once or twice), with a smaller set (100 or so) accounting for most of the sounds used in languages.
C.  Phonemic:  A phoneme is a set of phones that are treated as identical within a language.  The individual phones within a phoneme are called allophones.  A phoneme conveys meaning in the sense that changing from one phoneme to another will change the meaning of the word being produced (as in going from “bit” to “pit”, one sound changed and the meaning changed with it).  Some phonemic changes result in other words, some result in nonsense (“bit” vs. “dit”).  But, phonemes themselves don't convey meaning (they change it, but don't have any of their own).
To map the phonemes of a language, researchers present native speakers with pairs of words that differ by only one phone and ask whether the two words are the same or different.  If the speaker says "same," the two phones are allophones of one phoneme.  If the speaker says "different," they belong to different phonemes.  As you might imagine, this can be a rather tedious process.  As an example, an English speaker would say that the /k/ in "keep" is the same as the /k/ in "cool."  So, if I substitute one of the /k/'s (the /k/ from "keep" in "cool"), the speaker should say "same."  On the other hand, voicing /p/ is a phonemic difference ("pit" vs. "bit").  So, substituting a voiced /p/ (a /b/) will result in the speaker saying "different."
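If you wanted to automate the bookkeeping, the logic is easy to sketch.  Here's a minimal Python sketch (the phones and judgments are made up, purely for illustration) of how "same"/"different" responses could be tallied to group phones into phonemes:

# Grouping phones into phonemes from native-speaker "same"/"different"
# judgments on minimal pairs.  Hypothetical data, for illustration only.
judgments = {
    ("k_front", "k_back"): "same",    # the /k/ of "keep" swapped into "cool"
    ("p", "b"): "different",          # "pit" vs. "bit"
    ("t", "d"): "different",          # "tip" vs. "dip"
}

parent = {}   # union-find structure: phones judged "same" collapse together

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for (a, b), response in judgments.items():
    find(a), find(b)        # register both phones
    if response == "same":
        union(a, b)         # allophones of one phoneme

# Collect the resulting phoneme classes.
phonemes = {}
for phone in parent:
    phonemes.setdefault(find(phone), []).append(phone)
print(phonemes)   # {'k_back': ['k_front', 'k_back'], 'p': ['p'], 'b': ['b'], ...}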
How do languages “choose” phonemic differences?  The general plan is to maximize distinctiveness.  Look at the plot of vowel space below:
 
Vowel space
 
These graphs plot the first formant (F1) of vowel sounds against the second (F2) (approximately copied from Kluender, 1994, Fig. 1).  The line represents a boundary between sounds humans can produce and sounds that they can't:  inside the line is possible, outside isn't.  If a language has just three vowel sounds, odds are it uses the three in the first picture.  With five, it's likely to use the five in the second.  Note how this spreads the vowels as far apart as possible in the space that humans can produce, which makes them easier to discriminate while listening to speech.  The general rule is "ease of articulation is secondary to distinctiveness."  For instance, it's easier to say the vowel in "bit" than the one in "beet," but the one in "beet" is more discriminable, so it's much more likely to be used.
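To make "maximize distinctiveness" concrete, here's a small Python sketch that picks the most spread-out three-vowel inventory from a handful of candidates.  The formant values are rough, illustrative numbers (not measurements), and real dispersion models work in perceptual units rather than raw Hz, but the idea is the same:

# Pick the most "spread out" three-vowel inventory in F1/F2 space.
# Formant values are rough, illustrative approximations in Hz.
from itertools import combinations
import math

vowels = {
    "i": (280, 2250),   # "beet"
    "e": (400, 2100),   # "baby"
    "a": (710, 1100),   # "palm"
    "o": (500,  900),   # "bode"
    "u": (310,  870),   # "boot"
}

def min_separation(subset):
    """Smallest pairwise distance (Hz) among the chosen vowels."""
    return min(math.dist(vowels[a], vowels[b]) for a, b in combinations(subset, 2))

# Which three-vowel inventory keeps its vowels farthest apart?
best = max(combinations(vowels, 3), key=min_separation)
print(best)   # with these numbers: ('i', 'a', 'u'), the corners of the space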
 
 
III.  Articulatory phonetics.  There are two ways to describe speech.  Articulatory phonetics describes speech sounds in terms of the vocal tract mechanisms used to produce them.  The sounds come out of a system involving an air source (the lungs), a sound source (the larynx:  vocal folds), and filters (the pharynx, a chamber in the throat, the mouth, and the nasal passages).  Speech signals can be analyzed according to the contributions these parts make.  The sound source can make the sound either voiced (the vocal folds vibrate) or voiceless (they don't).  The filters can disrupt the air flow by stopping it, causing turbulence, or modifying it.  These disruptions can take place in several locations along the vocal tract.  This leads to three dimensions along which sounds vary:  voicing (voiced or voiceless), manner (how the air flow is disrupted), and place (where the air flow is disrupted).  Overall, a sound is:  air + voicing + manner + place (a small code sketch at the end of the consonant section below makes this concrete).  That said, let's look at the sounds.
 
What belongs here is a picture of your vocal tract.  To avoid drawing it, I've handed it out.  Get one in class.
 
A.  Consonants:  6 manners (the first three are all orals:  Sound comes out through the mouth, using the velum to close off the nasal passages).
 

1.  Stops:  Complete blockage of the air flow (manner = stop).
a.  Bilabials (stop with lips):  voiced /b/ "big";  voiceless /p/ "pig"
b.  Alveolars (tongue to alveolar ridge):  voiced /d/ "dip";  voiceless /t/ "tip"
c.  Velars (tongue to velum):  voiced /g/ "got";  voiceless /k/ "cot"
2.  Fricatives:  Interrupt the air flow to create turbulence (manner = turbulence).
a.  Labiodental (lips to teeth):  voiced /v/ "vat";  voiceless /f/ "fat"
b.  Dental (tongue to teeth):  voiced /ð/ "then";  voiceless /θ/ "thin"
c.  Alveolar (tongue to alveolar ridge):  voiced /z/ "zap";  voiceless /s/ "sap"
d.  Palatal (tongue to palate):  voiced /ʒ/ "azure";  voiceless /ʃ/ "sure"
e.  Glottal (constrict the vocal cords):  voiceless /h/ "hat" (no voiced counterpart in English)
3.  Affricates:  A stop released to a fricative (manner = stop -> turbulence).
a.  Palatal:  voiced /dʒ/ "jug";  voiceless /tʃ/ "chug"
4.  Nasals:  Sound comes through the nasal passages (manner = nasal).  All are voiced.
a.  Bilabial (stop with lips):  /m/ "maze"
b.  Alveolar (tongue to alveolar ridge):  /n/ "near"
c.  Velar (tongue to velum):  /ŋ/ "bring"

5.  Liquids:  Partial obstruction, no stoppage, no turbulence (manner = modify).
a.  Alveolar (tongue to alveolar ridge):  /l/ "look"
b.  Palatal (tongue to palate):  /r/ "rook"

6.  Glides (semivowels):  Glide into a vowel (manner = glide).
a.  Bilabial (with lips):  /w/ "work"
b.  Palatal (tongue to palate):  /y/ "your"

 
7.  Additional manners not phonemic in English (but phonemic in other languages):
a.  Aspiration:  A release of a puff of air as the sound is produced.  Try holding your hand to your mouth and saying “pin” and “spin.”  Which /p/ is aspirated?
b.  Labialization (lip rounding):  Rounding the lips to produce the sound.  Try saying “table” and “twin.”  Which is labialized?
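Before moving on to vowels, here's a tiny Python sketch of the "air + voicing + manner + place" idea, hand-encoding a few of the consonants above (illustration only) and comparing two consonants feature by feature:

# A few English consonants from the tables above, encoded as
# voicing/place/manner feature bundles (hand-entered, for illustration).
consonants = {
    "b": ("voiced",    "bilabial", "stop"),
    "p": ("voiceless", "bilabial", "stop"),
    "d": ("voiced",    "alveolar", "stop"),
    "g": ("voiced",    "velar",    "stop"),
    "m": ("voiced",    "bilabial", "nasal"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced",    "alveolar", "fricative"),
}

def feature_difference(a, b):
    """List which of the three dimensions differ between two consonants."""
    names = ("voicing", "place", "manner")
    return [n for n, x, y in zip(names, consonants[a], consonants[b]) if x != y]

print(feature_difference("b", "p"))   # ['voicing']                    ("bit" vs. "pit")
print(feature_difference("b", "d"))   # ['place']                      ("bit" vs. "dit")
print(feature_difference("b", "s"))   # ['voicing', 'place', 'manner']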
B.  Vowels:  Vowels are harder to classify because they're more continuous.  3 manners:
1.  Orals:  The two dimensions used both involve tongue position.  One dimension is the part of the tongue that's raised (front, center, and back).  The other dimension is how high it's raised (high, medium, low).
 

            front                     center                               back
High        /i/ "beet", /I/ "bit"                                          /u/ "boot", /U/ "book"
Middle      /e/ "baby", /ɛ/ "bet"     /ɝ/ "bird", /ə/ "sofa", /ʌ/ "but"    /o/ "bode", /ɔ/ "bought"
Low         /ae/ "bat"                                                     /ɑ/ "palm"
 
As you say the sounds down a column, you should feel your tongue come down.  As you say the sounds across a row, you should feel the rise in your tongue go from front to back.
2.  Nasals:  Vowel sounds come out through the nasal cavity.  Other than noting that French uses them phonemically, we won't talk about these.
3.  Rounding:  Rounding your lips as you produce a vowel can change its characteristics.  English doesn't use rounding as a phonemic cue.
C.  Suprasegmentals:  In addition to consonants and vowels you have a class of phonemic sound patterns that are added to another phoneme:
1.  Stress:  Say “black*bird” vs. “blackbird*” (* = stress).  One is a particular bird (the black one), and the other is a kind of bird.  Which is which?
2.  Length:  The length of a phoneme can be meaningful.  English doesn't use length phonemically.
3.  Tone contour:  High pitch vs. low pitch can be meaningful.  Chinese uses tone contour.
 
 
IV.  Acoustic phonetics.  The other way to characterize the speech signal.  We're no longer interested in describing how it's produced.  Instead, we want to know what is produced.
A.  Primary methodology:  Spectrogram:  A plot of the frequency components of a sound over time, with intensity shown as well.  How does it work?  Imagine a long row of tuning forks, each responding to a particular pitch.  No two forks are alike, but the differences between adjacent forks are very slight.  Arrange these in order from highest to lowest pitch.  Then, hook an electrode to each that sends a charge when it vibrates.  Hook a pen to the other end of the electrode.  When you pass a sound over these forks, each fork will vibrate only if its pitch is in the sound.  So, there will only be marks on the paper corresponding to the pitches in the sound.  This produces a spectrogram:  a recording of the pitch (frequency) components of a sound.  (A small code sketch after the points below does the same thing digitally.)  They look like this (it says "this is me talking"):
 
Speech example
 
Some important points about spectrograms:
1.  The sounds you make are complex.  In other words, they're composed of many different frequencies.  The spectrogram breaks sounds into these frequencies.  Loosely speaking, each dark band represents a frequency.  Each band is called a formant.  These are numbered from bottom to top.  So, F1 (first formant) is the lowest, then F2, ...
Formant transitions are places where there's a sharp rise or fall in a formant.  Generally, these correspond to consonants.  A steady state is a place where there is little change in a formant.  These generally correspond to vowels.
2.  The darker the band is, the higher the intensity (loudness) of that sound.
3.  As you go from left to right you can see how the sounds change in time.
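The row of tuning forks is essentially what speech software does digitally with a short-time Fourier transform.  Here's a minimal Python sketch (assuming a mono recording saved as "speech.wav"; the filename is just a placeholder) that computes and displays a spectrogram:

# Minimal spectrogram sketch: the software version of the row of tuning forks.
# Assumes a mono recording in "speech.wav" (placeholder filename).
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("speech.wav")

# Break the signal into short windows and measure the energy at each frequency
# in each window -- one "tuning fork" per frequency bin.
freqs, times, power = spectrogram(samples, fs=rate, nperseg=512, noverlap=384)

plt.pcolormesh(times, freqs, power, shading="auto", cmap="gray_r")
plt.ylim(0, 5000)              # the formants of interest live below about 5 kHz
plt.xlabel("Time (s)")         # left to right = time (point 3)
plt.ylabel("Frequency (Hz)")   # dark horizontal bands = formants (point 1)
plt.show()                     # darker = more intense (point 2)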
Now that we know how to read these, we can look at some topics in speech perception.
 
 
V.  Topics in speech perception.  This is intended to introduce you to some of the biggest phenomena in the study of spoken language.
A.  Problems in perception:
1.  Parallel transmission:  Phonemes aren't produced one after the other.  Instead, you produce them simultaneously.  It's a function of using your mouth to make sounds.  You might intend to produce each phoneme distinctly, but if your tongue is making a /d/ just before an /i/, it's in a different position than when it makes a /d/ before an /e/.  The mechanics of getting your tongue and mouth into position while speaking at a normal rate mean that the sounds get mashed together.  A way to think about it:  Imagine a bunch of painted Easter eggs on a conveyor belt.  Each egg is a particular phoneme in what you plan to say.  Now, run these eggs under a big wheel that smashes them.  This is what your mouth does when it produces the sounds.  Try to separate one egg from another after the roller.  That's the task of the person listening to you.  Let me illustrate with an example:
 
bag
 
This is the spectrogram for "bag."  The sounds are /b/, /ae/, and /g/.  If I want to change from "bag" to "gag," I have to change the first two thirds (the /ae/ is modified by the /b/; if I don't replace it, it won't sound like an /ae/ after a /g/).  The last two thirds would need replacing to go from "bag" to "bad."  To go from "bag" to "big," you'd have to start over.
The problem is:  Given that phonemes are not distinct, how does your brain make any sense out of speech?
2.  Context-conditioned variation:  The phoneme is physically different depending on what's around it.  Look at the /d/ in the spectrograms below:
 
di-du
 
The second formant in “di” is totally different from “du”, but hearers perceive both as having a /d/ at the beginning.
The question is:  How can two totally different physical stimuli be classed as the same thing?  This problem is also called “lack of invariance.”  In order to classify sounds they need to be consistent (invariant).  Since they're not, you suffer from lack of invariance.
3.  Categorical perception:  People seem to perceive speech sounds categorically.  To illustrate, imagine a smooth transition from "ba" to "da" in six steps.  The change is in the second formant (its transition becomes less pronounced at each step).  Around the third step, people don't hear a sound that's in between "ba" and "da."  Instead, they hear "ba", "ba", "ba", "da", "da", "da."  It's like at some point you cross a boundary and go all at once from "ba" to "da."
 
A picture of this goes here.  I'll include it in the handout.
 
The problem is that if you clip these sounds down to just the first few milliseconds, people can easily perceive continuous variation.  How is it categorical when it sounds like speech, but continuous when it doesn't?  In other words, how can the same system perceive in two different ways?  These effects are what led many researchers to argue that speech perception requires a special module that deals only with speech.  When the sounds are speech-like, you get categorical perception from the module.  When they aren't, you don't, because the module isn't involved.
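To see what "categorical" looks like in the data, here's a small Python sketch that plots an idealized identification function for a "ba"-"da" continuum.  The response percentages are made up to show the characteristic shape (flat at the ends, abrupt at the boundary); they are not real listener data:

# Idealized identification function for a "ba"-"da" continuum.
# The curve is generated, not measured -- real data come from listeners
# labeling each synthetic stimulus along the continuum.
import numpy as np
import matplotlib.pyplot as plt

steps = np.arange(1, 8)                 # 7 synthetic stimuli, "ba" -> "da"
boundary, sharpness = 4.0, 3.0          # illustrative values

# Percent "da" responses: near 0% then near 100%, with a sharp crossover.
percent_da = 100 / (1 + np.exp(-sharpness * (steps - boundary)))

plt.plot(steps, percent_da, "o-")
plt.axvline(boundary, linestyle="--")   # the category boundary
plt.xlabel("Continuum step (second-formant transition)")
plt.ylabel('Percent "da" responses')
plt.title("Categorical perception (idealized)")
plt.show()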
B.  Speech phenomena:  These are general speech perception findings.  I've included them because they hint at the mechanism your brain uses to comprehend speech.  They're also effects that any model of speech perception has to include.
1.  Prosodic influences on perception:
a.  Stress:  Stress can be used to disambiguate information.  For example:
 (1)  Because Bill left the room seemed empty.
When reading this sentence, most people “garden-path.”  The analogy is to walking in a garden.  Some paths look like promising ways out, but then leave you in a dead-end.  Analyzing this sentence works the same way.  You analyze it one way, and don't realize it's a dead-end until “seemed.”  However, if you stress “Bill” when speaking, there won't be a problem.  Where does stress have its effect on processing of the sentence?
b.  Rate:  When people change their rate of speaking, the changes in the phonemes produced are not identical.  For instance, if you try to talk more slowly some phonemes are stretched out more than others.  But, perceptually speaking, the changes are all the same.  Why aren't differential changes in rate perceived?
2.  Semantic and syntactic factors:  So far, we've discussed speech from the bottom up (the stimulus to understanding).  But, there are a lot of top-down influences on speech perception (understanding influences what is perceived):
a.  Context effects:  People are better able to identify speech sounds in grammatically correct sentences than sentences that are not grammatically correct.  How can grammar influence perception?
b.  Phonemic restoration:  If you replace a phoneme with a cough, people will still hear it.  What mechanism is responsible for filling in the gap?  (Check the language software on my software page for a demonstration.)
c.  Phonemic restoration in context:  If you delete a speech sound and change the context around it, people replace it with whatever sound fits the context.  (Check the language software on my software page for a demonstration.)
d.  Mispronunciation detection:  The more you know about a topic area, the harder it is to detect mispronunciations.  This seems contrary to expectation:  if you don't have to listen as carefully to what's being said, you should have more capacity to spend looking for errors.  Why does it go the wrong way?
 
 
