Psychology of Language, Notes 3 -- Speech Perception
A. What is speech?
B. Articulatory phonetics.
C. Acoustic phonetics.
D. Topics in speech perception.
II. What is speech? Speech can be analyzed at many
levels. The three big ones:
A. Acoustic: This is an analysis at the physical level.
It's made up of frequencies, durations, and intensities. We'll come
back to acoustics a bit later.
B. Phonetic: This is the level of speech sounds.
Each of the possible sounds the human vocal system can produce is
a phone. Theoretically, you can produce around 4,000 phones.
However, in a database of (nearly) every human language, only 869
phones are used. Most of these are very rare (occurring only once
or twice), with a smaller set (100 or so) accounting for most of the
phones used in languages.
C. Phonemic: A phoneme is a set of phones that are treated
as identical within a language. The individual phones within a
phoneme are called allophones. A phoneme conveys meaning in the sense
that changing from one phoneme to another will change the meaning of
the word being produced (as in going from “bit” to “pit”: one sound
changed and the meaning changed with it). Some phonemic changes result
in new words, some result in nonsense (“bit” vs. “dit”). But phonemes
themselves don't convey meaning (they change it, but don't have any of
their own).
To map the phonemes of a language, researchers present native speakers
with pairs of words that differ by only one phone and ask if the two
words are the same or different. If the speaker says “same,” the
phones that changed are allophones. If the speaker says “different,”
then they're different phonemes. As you might imagine, this can be a
rather lengthy process. As an example, an English speaker would say
that the /k/ in “keep” is the same as the /k/ in “cool.” So, if I
substitute one of the /k/’s for the other (the /k/ from “keep” in
“cool”), the speaker should say “same.” On the other hand, voicing /p/
is a phonemic difference (“pit” vs. “bit”). So, substituting a voiced
/p/ (a /b/) will result in the speaker saying “different.”
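The same/different procedure amounts to sorting phones into groups:
phones judged “same” land in one phoneme, phones judged “different”
stay apart. Here is a minimal sketch in Python. The phone labels
(k_keep, k_cool) and the judgments are hypothetical stand-ins that
mirror the examples in these notes, not real data, and the function
name is made up for illustration.

```python
def group_into_phonemes(phones, judgments):
    """Group phones into phonemes from native-speaker judgments.

    judgments maps a pair of phones to "same" or "different".
    Phones judged "same" are allophones and end up in one group.
    """
    parent = {p: p for p in phones}  # each phone starts in its own group

    def find(p):
        # Follow parent links to the representative of p's group.
        while parent[p] != p:
            p = parent[p]
        return p

    for (a, b), verdict in judgments.items():
        if verdict == "same":
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for p in phones:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical labels: the /k/ from "keep", the /k/ from "cool", /p/, /b/.
phones = ["k_keep", "k_cool", "p", "b"]
judgments = {
    ("k_keep", "k_cool"): "same",   # allophones of one phoneme
    ("p", "b"): "different",        # separate phonemes ("pit" vs. "bit")
}
phonemes = group_into_phonemes(phones, judgments)
print(phonemes)  # [['b'], ['k_cool', 'k_keep'], ['p']]
```

The two /k/ phones collapse into one group, while /p/ and /b/ remain
separate, which is the outcome the text describes for English.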
How do languages “choose” phonemic differences? The general plan
is to maximize distinctiveness. Look at the plots of vowel space.
These graphs plot the first frequency component of vowel sounds against
the second (approximately copied from Kluender, 1994, Fig. 1). The
line represents a boundary between sounds humans can produce and sounds
that they can't. Inside the line is possible, outside is not.
If a language has just three vowel sounds, odds are it uses the three
in the first picture. With five, it's likely to use the five in the
second. Note how this spreads the vowels as far apart as possible in
the space that humans can produce. This makes them easier to distinguish
while listening to speech. The general rule is “ease of articulation
is secondary to distinctiveness.” For instance, it's easier to produce
the vowel in “bit” than in “beet”, but the one in “beet” is more
distinctive, so it's much more likely to be used.
III. Articulatory phonetics. There are two ways
to describe speech. Articulatory phonetics describes speech sounds
in terms of the vocal tract mechanisms used to produce them. The
sounds come out of a system involving an air source (lungs), a sound
source (larynx: vocal folds), and filters (the pharynx, which is the
chamber in the throat; the mouth; and the nasal passages). Speech
signals can be classified according to the contributions these parts
make. The sound source can allow the sound to be either voiced (the
vocal folds make sound) or voiceless (no sound). The filters can
disrupt the air flow by stopping it, creating turbulence, or modifying
it. These disruptions can take place in several locations along the
vocal tract. This leads to three dimensions along which sounds vary:
voicing (voiced or voiceless), manner (how the air flow is disrupted),
and place (where the air flow is disrupted).
Overall, a sound is: air + voicing + manner + place. That said,
let's look at the sounds.
What belongs here is a picture of your vocal tract. To avoid
drawing it, I've handed it out. Get one in class.
A. Consonants: 6 manners (the first three are all oral:
sound comes out through the mouth, using the velum to close off the
nasal passages).
NOTE: Some phonetic symbols couldn't be drawn here. I left
them blank.
1. Stops: Complete blockage of the air flow
(manner = stop).
a. Bilabials (stop with lips):
b. Alveolars (tongue to alveolar ridge):
c. Velars (tongue to velum):
2. Fricatives: Interrupt air flow to create turbulence
(manner = turbulence).
a. Labiodental (lips to teeth):
b. Dental (tongue to teeth): / / “then” (voiced), / / “thin” (voiceless)
c. Alveolar (tongue to alveolar ridge):
d. Palatal (tongue to palate): / / “azure” (voiced), / / “sure” (voiceless)
e. Glottal (constrict vocal cords):
3. Affricates: A stop released to a fricative (manner
= stop -> turbulence): / / “jug” (voiced), / / “chug” (voiceless)
4. Nasals: Sound comes through the nasal passages
a. Bilabial (stop with lips):
b. Alveolar (tongue to alveolar ridge):
c. Velar (tongue to velum): / / “bring”
5. Liquids: Partial obstruction, no stoppage, no turbulence
(manner = modify).
a. Alveolar (tongue to alveolar ridge):
b. Palatal (tongue to palate):
6. Glides (semivowels): Glide into a vowel
(manner = glide).
a. Bilabial (with lips):
b. Palatal (tongue to palate):
7. Additional manners not phonemic in English (but phonemic in
other languages):
a. Aspiration: A release of a puff of air as the sound
is produced. Try holding your hand to your mouth and saying “pin”
and “spin.” Which /p/ is aspirated?
b. Labialization (lip rounding): Rounding the lips to produce
the sound. Try saying “table” and “twin.” Which is labialized?
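One way to make “voicing + manner + place” concrete is to store each
consonant as a record with those three fields and then ask which
fields differ between two sounds. The sketch below is in Python. The
symbol-to-feature assignments are standard textbook values supplied
for illustration (the symbols themselves didn't survive in these
notes), only a handful of English consonants are included, and the
names Consonant and differing_dimensions are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    voiced: bool   # voicing: True = voiced, False = voiceless
    manner: str    # how the air flow is disrupted
    place: str     # where the air flow is disrupted


# A small, illustrative subset of English consonants.
CONSONANTS = {
    "p": Consonant(False, "stop", "bilabial"),
    "b": Consonant(True, "stop", "bilabial"),
    "t": Consonant(False, "stop", "alveolar"),
    "d": Consonant(True, "stop", "alveolar"),
    "k": Consonant(False, "stop", "velar"),
    "g": Consonant(True, "stop", "velar"),
    "f": Consonant(False, "fricative", "labiodental"),
    "v": Consonant(True, "fricative", "labiodental"),
    "s": Consonant(False, "fricative", "alveolar"),
    "z": Consonant(True, "fricative", "alveolar"),
    "m": Consonant(True, "nasal", "bilabial"),
    "n": Consonant(True, "nasal", "alveolar"),
}


def differing_dimensions(a, b):
    """Return the names of the dimensions on which two consonants differ."""
    x, y = CONSONANTS[a], CONSONANTS[b]
    return [name for name in ("voiced", "manner", "place")
            if getattr(x, name) != getattr(y, name)]


print(differing_dimensions("p", "b"))  # ['voiced']
print(differing_dimensions("b", "m"))  # ['manner']
print(differing_dimensions("p", "g"))  # ['voiced', 'place']
```

Seen this way, the “pit” vs. “bit” contrast is a change on exactly
one dimension (voicing), with manner and place held constant.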
B. Vowels: Vowels are harder to classify because they're
more continuous. 3 manners:
1. Orals: The two dimensions used both involve tongue position.
One dimension is the part of the tongue that's raised (front, center,
back). The other dimension is how high it's raised (high, medium, low).
NOTE: Some phonetic symbols couldn't be drawn here. I left
them blank.
/ / “bet”
/ / “bird”
/ / “sofa”
/ / “but”
/ / “bought”
/ / “palm”
As you say the sounds down a column, you should feel your tongue come
down. As you say the sounds across a row, you should feel the raised
part of your tongue go from front to back.
2. Nasals: Vowel sounds come out through the nasal passages.
Except to say that these are used in French, we won't talk about these.
3. Rounding: Rounding your lips as you produce a vowel
can change its characteristics. English doesn't use rounding as a
phonemic distinction.
C. Suprasegmentals: In addition to consonants and vowels,
you have a class of phonemic sound patterns that are added to other
sounds.
1. Stress: Say “black*bird” vs. “blackbird*” (* = stress).
One is a particular bird (the black one), and the other is a kind of
bird. Which is which?
2. Length: The length of a phoneme can be meaningful.
English doesn't use length.
3. Tone contour: High pitch vs. low pitch can be meaningful.
Chinese uses tone contour.
IV. Acoustic phonetics. The other way to describe speech is to
analyze the speech signal. We're no longer interested in describing
how it's produced. Instead, we want to know what is produced.
A. Primary methodology: Spectrogram: Plot the frequencies
(of a sound) against time, with intensity shown as darkness. How does
it work? Imagine a long row of tuning forks, each responding to a
particular pitch. No two forks are alike, but the differences between
each one are very small. Arrange these in order from highest to lowest
pitch. Then, hook an electrode to each that sends a charge when it
vibrates. Hook a pen to the other end of the electrode. When you pass
a sound over the forks, each fork will only vibrate if its pitch is in
the sound. So, there will only be marks on the paper corresponding to
the pitches in the sound. This produces a spectrogram: a recording of
the pitch (frequency) components of a sound. They look like this
(it says “this is me talking”):
Some important points about spectrograms:
1. The sounds you make are complex. In other words, they're
composed of many different frequencies. The spectrogram breaks them
into these frequencies. Loosely speaking, each dark band represents
a frequency. Each band is called a formant. These are numbered
from bottom to top. So, F1 (first formant) is the lowest, then F2,
and so on.
Formant transitions are places where there's a sharp rise or fall in
a formant. Generally, these correspond to consonants. A steady
state is a place where there is little change in a formant. These
generally correspond to vowels.
2. The darker the band is, the higher the intensity (loudness)
of that sound.
3. As you go from left to right you can see how the sounds change
over time.
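The row-of-tuning-forks picture can be imitated in a few lines of
Python. This is only a sketch under stated assumptions: each “fork” is
simulated by measuring how strongly one frequency is present in a
short slice of the signal, the test signal is a made-up pair of pure
tones rather than speech, and the sample rate, fork frequencies, and
frame length are arbitrary illustrative choices.

```python
import math


def fork_response(frame, freq_hz, sample_rate):
    """How strongly one 'tuning fork' at freq_hz responds to a slice of sound."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / sample_rate)
             for n, s in enumerate(frame))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / sample_rate)
             for n, s in enumerate(frame))
    return math.hypot(re, im) / len(frame)


def spectrogram(signal, sample_rate, fork_freqs, frame_len):
    """Rows are successive time slices; columns are the forks' responses."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [[fork_response(f, hz, sample_rate) for hz in fork_freqs]
            for f in frames]


sample_rate = 8000  # samples per second (illustrative)
# Made-up test sound: 0.1 s of a 500 Hz tone, then 0.1 s of a 1500 Hz tone.
signal = ([math.sin(2 * math.pi * 500 * n / sample_rate) for n in range(800)]
          + [math.sin(2 * math.pi * 1500 * n / sample_rate) for n in range(800)])
forks = [250, 500, 750, 1000, 1250, 1500, 1750, 2000]  # fork pitches in Hz

spec = spectrogram(signal, sample_rate, forks, frame_len=160)
# For each time slice, which fork "vibrated" the most?
loudest = [forks[row.index(max(row))] for row in spec]
print(loudest)  # [500, 500, 500, 500, 500, 1500, 1500, 1500, 1500, 1500]
```

Reading the result from left to right gives the change over time, and
the size of each response plays the role of the darkness of a band.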
Now that we know how to read these, we can look at some topics in
speech perception.
V. Topics in speech perception. This is intended
to introduce you to some of the biggest phenomena in the study of
speech perception.
A. Problems in perception:
1. Parallel transmission: Phonemes aren't produced one
after the other. Instead, you produce them simultaneously.
It's a function of using your mouth to make sounds. You might want
to produce each distinctly, but if your tongue is making a /d/ just
prior to an /i/ it's in a different position than when it's making a
/d/ just prior to an /e/.
The mechanics of getting your tongue and mouth into position while
speaking at the normal rate mean sounds get mashed up. A way to think
about it: Imagine a bunch of painted Easter eggs on a conveyor belt.
Each egg is a particular phoneme in what you plan to say. Now, run
these eggs under a big roller that smashes them. This is what your
mouth does when it produces the sounds. Try and separate one egg
from the other after the roller. This is the task of the person
listening to you. Let me illustrate with an example:
This is the spectrogram for “bag.” The sounds are /b/, /ae/,
and /g/. If I want to change from “bag” to “gag,” I'd have to replace
the first two thirds (the /ae/ is modified by the /b/; if I don't
change it then it won't sound like an /ae/ after a /g/). The last two
thirds would need replacing to go from “bag” to “bad.” To go from “bag”
to “big”, you'd have to start over.
The problem is: Given that phonemes are not distinct, how does
your brain make any sense out of speech?
2. Context conditioned variation: The phoneme is different
(physically) depending on what's around it. Look at the /d/ in “di”
and “du.” The second formant in “di” is totally different from “du”,
but hearers perceive both as having a /d/ at the beginning.
The question is: How can two totally different physical stimuli
be classed as the same thing? This problem is also called “lack of
invariance.” In order to classify sounds they need to be constant
(invariant). Since they're not, you suffer from lack of invariance.
3. Categorical perception: People seem to perceive speech
sounds categorically. To illustrate, imagine a smooth transition
from “ba” to “da.” The change is in the second formant (its
transition becomes less pronounced). Around step 3, people don't hear
a sound between “ba” and “da.” Instead, they hear “ba”, “ba”, “ba”,
“da”, “da”, “da.” It's like at some point you cross a boundary and go
all at once from “ba” to “da.”
A picture of this goes here. I'll include it in the handout.
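The contrast between a smooth physical change and an abrupt perceived
change can be sketched in Python. This is an illustration, not a
fitted model: the six steps, the boundary at 3.5, and the steepness
value are assumptions chosen so the output matches the “ba”, “ba”,
“ba”, “da”, “da”, “da” pattern described above.

```python
import math


def prob_da(step, boundary=3.5, steepness=4.0):
    """Hypothetical chance that a listener labels this step "da".

    The stimulus changes by the same amount at every step, but the
    labeling probability jumps sharply near the boundary.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (step - boundary)))


steps = [1, 2, 3, 4, 5, 6]  # equal physical steps from "ba" to "da"
labels = ["da" if prob_da(s) > 0.5 else "ba" for s in steps]

# Neighboring steps that receive different labels: these are the only
# pairs a purely categorical listener would hear as different.
crossings = [(a, b) for a, b in zip(steps, steps[1:])
             if labels[a - 1] != labels[b - 1]]

print(labels)     # ['ba', 'ba', 'ba', 'da', 'da', 'da']
print(crossings)  # [(3, 4)]
```

Every neighboring pair is the same physical distance apart, yet only
the pair that straddles the boundary gets two different labels.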
The problem is that if you clip these sounds to just the first few
ms, people can easily perceive a continuous variation. How is it
categorical when it sounds like speech, but continuous when it
doesn't? In other words, how can the same system perceive in two
different ways?
These effects are what led many researchers to argue that speech
perception requires a special module that deals with only speech. When
the sounds are speech-like, you get categorical perception from the
module. When they aren't, you don't because the module isn't involved.
B. Speech phenomena: These are general speech perception
findings. I've included them because they hint at the mechanism the
brain uses to comprehend speech. They're also effects that any theory
of speech perception has to include.
1. Prosodic influences on perception:
a. Stress: Stress can be used to disambiguate sentences:
(1) Because Bill left the room seemed empty.
When reading this sentence, most people “garden-path.” The analogy
is to walking in a garden. Some paths look like promising ways to go,
but then leave you in a dead-end. Analyzing this sentence works the
same way. You analyze it one way, and don't realize it's a mistake
until “seemed.” However, if you stress “Bill” when speaking, there
won't be a problem. Where does stress have its effect on the analysis
of the sentence?
b. Rate: When people change their rate of speaking, the
changes in the phonemes produced are not identical. For instance,
if you try to talk more slowly some phonemes are stretched out more
than others. But, perceptually speaking, the changes are all the same.
Why aren't differential changes in rate perceived?
2. Semantic and syntactic factors: So far, we've discussed
speech from the bottom up (the stimulus to understanding). But, there
are a lot of top-down influences on speech perception (understanding
influencing what is perceived):
a. Context effects: People are better able to identify
speech sounds in grammatically correct sentences than in sentences that
are not grammatically correct. How can grammar influence perception?
b. Phonemic restoration: If you replace a phoneme with
a cough, people will still hear it. What mechanism is responsible
for filling in the gap? (Check the language software on my web
page for a demonstration.)
c. Phonemic restoration in context: If you delete a speech
sound and change the context around it, people replace it with whatever
sound fits the context. (Check the language software on my web
page for a demonstration.)
d. Mispronunciation detection: The more you know about
a topic area the harder it is to detect mispronunciations. It seems
contrary to expectation since you should have more capacity to spend
listening for errors if you don't have to listen as carefully to
what's being said. Why does it go the wrong way?
Back to Langston's Psychology of Language