Praat Speech


Welcome to the Monthly Mystery Spectrogram webzone. These pages are Rob Hagiwara's professional web-space. For personal musings, please see Rob's blog.

Advanced speech analysis tools II: Praat and more. Judging from mentions spotted on the Internet, Praat (Dutch for 'talk'), created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences, University of Amsterdam, is currently among the most popular of free, downloadable speech analysis software packages. To run a Praat script, go to the Control menu in the Praat objects window and select New Praat script. Then pull up the code for the desired script by clicking on one of the links below. Copy all the code there (e.g. highlight and Ctrl-C on a PC) and paste it into the new untitled script window. This is my Moby Dick of Praat scripts: a wizard-style GUI to let you alter formant structure of natural speech sounds. The goal of this script is to alter the formant structure of a single word to either make it more like another pre-existing word, or to simply alter it free-form. PRAAT is software for the analysis of speech (www.praat.org). Below you will find a script that automatically detects syllable nuclei in order to measure speech rate without the need of a.

This is the How To page of the mystery spectrogram webzone. Contents for this page:

How do I read a spectrogram?

May 2009: fixed broken rollovers, miscellaneous text cleaning

September 2007: updated menus and navigation

May 2006: Commentary about stuff that I plan to change, or would like some input on, is interspersed throughout this version of the page, in this goofy text. Depending on your browser, I think this is rendered in colo(u)r. General stuff changing throughout:

~~Fix Unicode IPA symbol calls to uniform style (hex?)~~
~~Re-do all figures with better original recordings and cleaner spectrograms~~
~~Rollover formant markings for all figures~~
~~Incorporate other rollover information (nasal poles/zeroes, duration) into text~~
Clean up text
Separate sections onto separate pages for faster loading(? maybe vowels, obstruents and sonorant consonants? or vowels, consonants, and allophonic/prosodic stuff?)
Beef up allophony section to include flapping, glottal stops and glottalization, nasality (nasal vowels and nasal taps?), and ...?

How do I read a spectrogram?

The same way you get to Carnegie Hall: practice, practice, practice!

First, read the chapter on acoustic analysis in Ladefoged's A Course in Phonetics, or better yet take a course based on Ladefoged's Elements of Acoustic Phonetics or Johnson's Acoustic and Auditory Phonetics. Or you can just read this summary, but bear in mind there's going to be a lot left out, especially in the 'why' realm. Then (as usual) learn by doing!

The goal of this page is to provide just enough basic information for the novice to begin, perhaps with some guidance, the process of decoding the monthly mystery spectrogram. This page is not intended to be the last word in spectrographic analysis in general, nor even the last word on spectrogram reading. However, reasoning your way through a mystery spectrogram is very instructive, especially in relating acoustic events with (presumed) articulatory ones. That is, in relating physical sounds with speech production.

If you're reading this, I assume you are familiar with basic articulatory phonetics, phonetic transcription, the International Phonetic Alphabet, and the surface phonology of 'general' North American English (i.e. phonemes and basic contrasts, and major allophonic variation such as vowel nasalization, nasal place assimilation, and so forth). I try to keep in mind that I have an international audience, but there are some details I just take to be 'given' for English. Someday if we do spectrograms of other languages, we'll have to adjust.

I really recommend that beginners find someone to discuss spectrographic issues with. If you're doing spectrograms as part of a class, form a study group. If you're a 'civilian', form a club. Or something. I'm toying with the idea of starting a Yahoo group or something for us to do some discussions as a 'community'. Strong opinions anyone? Unfortunately, I don't have time to answer in detail every e-mail I receive about specific spectrograms or sounds or features, but if you have a general question or suggestions, please feel free to contact me.

Please note: My style sheet calls for this page to be rendered in either Victor Gaultney's Gentium font, or in SIL's SILDoulosIPAUnicode. These fonts are (in my opinion) the best available freeware fonts for IPA-ing in Unicode for the web. Please see my list of currently supported fonts for justification and links to download these fonts.

So what is a spectrogram anyway?

A sound spectrogram (or sonogram) is a visual representation of an acoustic signal. To oversimplify things a fair amount, a Fast Fourier transform is applied to an electronically recorded sound. This analysis essentially separates the frequencies and amplitudes of its component simplex waves. The result can then be displayed visually, with degrees of amplitude (represented light-to-dark, as in white=no energy, black=lots of energy), at various frequencies (usually on the vertical axis) by time (horizontal).

Depending on the size of the Fourier analysis window, different levels of frequency/time resolution are achieved. A long window resolves frequency at the expense of time; the result is a narrow band spectrogram, which reveals individual harmonics (component frequencies), but smears together adjacent 'moments'. If a short analysis window is used, adjacent harmonics are smeared together, but with better time resolution. The result is a wide band spectrogram in which individual pitch periods appear as vertical lines (or striations), with formant structure. Generally, wide band spectrograms are used in spectrogram reading because they give us more information about what's going on in the vocal tract, for reasons which should become clear as we go.
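The window-length trade-off can be made concrete with a little arithmetic. What follows is only a sketch, not how any particular spectrogram program does it: a bare-bones short-time Fourier analysis in Python with NumPy. The sample rate, the two window lengths (40 ms standing in for 'narrow band', 5 ms for 'wide band') and the synthetic 100 Hz test tone are all assumptions picked for illustration.

```python
import numpy as np

def bin_spacing_hz(sample_rate_hz, window_ms):
    """Spacing between FFT frequency bins for a given analysis window length."""
    window_samples = int(round(sample_rate_hz * window_ms / 1000.0))
    return sample_rate_hz / window_samples

def magnitude_spectrogram(signal, sample_rate_hz, window_ms, hop_ms=1.0):
    """Rows are time frames, columns are frequency bins (magnitudes)."""
    n = int(round(sample_rate_hz * window_ms / 1000.0))
    hop = max(1, int(round(sample_rate_hz * hop_ms / 1000.0)))
    window = np.hanning(n)  # taper each frame to reduce spectral leakage
    frames = [signal[start:start + n] * window
              for start in range(0, len(signal) - n + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# Assumed test signal: half a second of ten equal harmonics of 100 Hz.
SAMPLE_RATE_HZ = 16000
t = np.arange(SAMPLE_RATE_HZ // 2) / SAMPLE_RATE_HZ
tone = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 11))

narrow = magnitude_spectrogram(tone, SAMPLE_RATE_HZ, window_ms=40)  # long window
wide = magnitude_spectrogram(tone, SAMPLE_RATE_HZ, window_ms=5)     # short window

print("narrow band bin spacing:", bin_spacing_hz(SAMPLE_RATE_HZ, 40), "Hz")
print("wide band bin spacing:  ", bin_spacing_hz(SAMPLE_RATE_HZ, 5), "Hz")
```

With these assumed settings the long window gives bins 25 Hz apart, fine enough to separate harmonics that sit 100 Hz apart, while the short window gives bins 200 Hz apart, so neighbouring harmonics land in the same bin and smear together. The price of the long window is that each frame averages over 40 ms of signal instead of 5 ms.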

Sources and filters

We often talk about speech in terms of source-filter theory. Put simply, we can view the vocal tract like a musical instrument. There's a part that actually makes sound (e.g. the string, the reed, or the vocal folds), and the part that 'shapes' the sound (e.g. the body of the violin, the horn of the clarinet, or the supralaryngeal articulators). In speech, this source of sound is provided primarily by the vibration of the vocal folds. From a mathematical standpoint, vocal fold vibration is complex, consisting of both a fundamental frequency and harmonics. Because the harmonics always occur as integral multiples of the fundamental (x1, x2, x3, etc., a phenomenon worked out mathematically by Fourier, hence 'Fourier's Theorem' and 'Fourier Transform'), it turns out that the sensation of pitch of voice is correlated to both the fundamental frequency and the distance between harmonics.

The point is that the vocal source isn't just one frequency, but many frequencies ranging from the fundamental all the way up to infinity, in principle, in integral multiples. Just as white light is many frequencies of light all mixed up together, so is the vocal source a spectrum of acoustic energy, going from low frequencies (the fundamental) to high frequencies. In principle, there's some energy at all frequencies (although unless you're talking about an integral multiple of the fundamental, the amount will be zero).
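If you want to see the 'integral multiples' idea in numbers, here is a minimal sketch in Python with NumPy. It builds an idealized source by adding sine waves at 1x, 2x, 3x... a fundamental and then asks the Fourier transform where the energy is. The 120 Hz fundamental, the five harmonics and the 1/k amplitude fall-off are assumptions for illustration only; a real glottal source is messier.

```python
import numpy as np

def harmonic_source(f0_hz, n_harmonics, sample_rate_hz=16000, duration_s=1.0):
    """Sum of sine waves at f0, 2*f0, 3*f0, ... with amplitudes falling as 1/k."""
    t = np.arange(int(sample_rate_hz * duration_s)) / sample_rate_hz
    signal = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        signal += np.sin(2 * np.pi * k * f0_hz * t) / k  # assumed roll-off
    return signal

def strongest_components_hz(signal, sample_rate_hz, n_components):
    """Frequencies (Hz) of the n strongest FFT bins, in ascending order."""
    spectrum = np.abs(np.fft.rfft(signal))
    strongest_bins = np.argsort(spectrum)[-n_components:]
    return sorted(int(b) * sample_rate_hz / len(signal) for b in strongest_bins)

source = harmonic_source(f0_hz=120, n_harmonics=5)
peaks = strongest_components_hz(source, 16000, 5)
print(peaks)           # the frequencies where the energy sits
print(np.diff(peaks))  # the spacing between neighbouring harmonics
```

The energy turns up at 120, 240, 360, 480 and 600 Hz and nowhere in between, and the spacing between neighbours equals the fundamental. That is why you can read pitch off a narrow band spectrogram either from the lowest harmonic or from the distance between any two adjacent ones.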

The energy provided by the source is then filtered or shaped by the body of the instrument. In essence, the filter sifts the energy of some harmonics out (or at least down) while boosting others. The analogy to light again is apt. If you pass a white light through a red filter, you end up removing (or lessening) the energy at the blue end of the spectrum, while leaving the red end of the spectrum untouched. Depending on the filter, you might pass a band of energy in the red end and a band of energy in the green band, and something else. The 'color' of light that results will be different depending on which frequencies exactly get passed, and which ones get filtered.

In speech, these different tonal qualities change depending on vocal tract configuration. What makes an [i] sound like an [i] is not something to do with the source, but the shape of the filter, boosting some frequencies and damping others, depending on the shape of the vocal tract. So the 'quality' of the vowel depends on the frequencies being passed through the acoustic filter (the vocal tract), just as the 'color' of light depends on the frequencies being passed through the light filter.
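Here is a toy version of that idea in Python with NumPy: the same harmonic source multiplied by two different filters. Everything specific in it is an assumption for illustration. The bell-shaped bumps are a stand-in for real vocal tract resonances, the 100 Hz bandwidth and 1/k source roll-off are arbitrary, and the F1/F2 values are the often-quoted Peterson and Barney (1952) adult male averages rather than measurements of my own voice.

```python
import numpy as np

def resonance_gain(freq_hz, formants_hz, bandwidth_hz=100.0):
    """Toy filter: one bell-shaped bump per formant (assumed shape, not a tract model)."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    gain = np.zeros_like(freq_hz)
    for formant in formants_hz:
        gain += 1.0 / (1.0 + ((freq_hz - formant) / (bandwidth_hz / 2.0)) ** 2)
    return gain

def filtered_harmonics(f0_hz, formants_hz, n_harmonics=40):
    """Harmonic frequencies and their output amplitudes (source times filter)."""
    k = np.arange(1, n_harmonics + 1)
    freqs = k * float(f0_hz)
    source_amplitude = 1.0 / k  # assumed source roll-off
    return freqs, source_amplitude * resonance_gain(freqs, formants_hz)

def strongest_harmonic_hz(f0_hz, formants_hz):
    """Frequency of the harmonic that comes out of the filter strongest."""
    freqs, amps = filtered_harmonics(f0_hz, formants_hz)
    return float(freqs[np.argmax(amps)])

# Assumed F1 and F2 values in Hz (textbook averages, see lead-in).
FORMANTS_I = (270.0, 2290.0)   # [i]
FORMANTS_AH = (730.0, 1090.0)  # [ɑ]

print(strongest_harmonic_hz(100, FORMANTS_I))   # same source, [i]-shaped filter
print(strongest_harmonic_hz(100, FORMANTS_AH))  # same source, [ɑ]-shaped filter
print(strongest_harmonic_hz(150, FORMANTS_I))   # different pitch, same [i] filter
```

With a 100 Hz source the strongest harmonic comes out at 300 Hz through the [i] filter and at 700 Hz through the [ɑ] filter: same source, different filter, different result. Raising the pitch to 150 Hz changes which harmonics exist, but the strongest one is still the one nearest the first resonance.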

So, we can manipulate source characteristics (the relative frequency and amplitude of the fundamental, and some properties of some of the harmonics) at the larynx independently of filter characteristics (vocal tract shape). Figure 1 is a spectrogram of me saying [ i ɑ i ɑ ] (i.e. 'ee ah ee ah') continuously on a steady pitch. On the left, a wide band spectrogram shows the formants (darker bands running horizontally across the spectrogram) changing rapidly as my vocal tract moves between vowel configurations. (Take a moment to notice that the wide band spectrogram is striated, and the horizontal formants are 'overlaid' over the basic pattern of vertical striations.) On the right, a narrow band spectrogram reveals that the harmonics (the component frequencies provided by the source) are steady, i.e. the pitch throughout is flat. Because some harmonics are stronger than others at any given moment, you can make out the formant structure even in the narrow band spectrogram. The filter function (the formant structure) is superimposed over the source structure.

Figure 1. Wide band (left) and narrow band (right) spectrograms, illustrating changing vowel quality with level pitch.

The other side of the source-filter coin is that you can vary the pitch (source) while keeping the same filter. Figure 2 shows wide and narrow band spectrograms of me going [aː], but wildly moving my voice up and down. The formants stay steady in the wide band spectrogram, but the spacing between the harmonics changes as the pitch does. (Harmonics are always evenly spaced, so the higher the fundamental frequency, i.e. the pitch of my voice, the further apart the harmonics will be.)

Figure 2. Wide band (left) and narrow band (right) spectrograms of me saying [aː], but with wild pitch changes.

A word on sources

I like to divide the kinds of sources in speech into three categories: periodic voicing (or vibration of the vocal folds), non-voicing (which most people don't consider, but I like to distinguish it from my third category), and aperiodic noise (which results from turbulent airflow).

Voicing is represented on a wide band spectrogram by vertical striations, especially in the lowest frequencies. Each vertical 'line' represents a single pulse of the vocal folds, a single puff of air moving through the glottis. We sometimes refer to a 'voicing bar', i.e. a row of striated energy in the very low frequencies, corresponding to the energy in the first and second harmonics (typically the strongest harmonics in speech). For men, this is about 100-150 Hz, for women it can be anywhere between 150-250 Hz, and of course there's lots of variation both within and between individuals. In a narrow band spectrogram, voicing results in harmonics, with again the lowest one or two being the strongest.

Non-voicing is basically silence, and doesn't show up as anything in a spectrogram. So while there isn't a lot going on during silence that we can see in a spectrogram, we can still tell the difference between voiced sounds (with a striated voicing bar) and voiceless sounds (without). And usually there's still air moving through the vocal tract, which can provide an alternative source of acoustic energy, via turbulence or 'noise'.

On the other hand, it's worth distinguishing several glottal states that lead to non-voicing. Typically, active devoicing results from vocal fold abduction. The vocal folds are held wide apart and thus movement of air through the glottis doesn't cause the folds to vibrate. If the vocal folds are tightly adducted (brought together in the midline) and stiffened, the result is no air movement through the glottis, due to glottal closure. Ideally, this is how a 'glottal stop' is produced. Finally, the vocal folds may be in 'voicing position', loosely adducted and relatively slack. But if there is insufficient pressure below the glottis (or too much above the glottis) the air movement through the glottis won't be enough to drive vibration, and passive devoicing occurs.

Noise is random (rather than striated or harmonically organized) energy, and usually results from friction. In speech this friction is of two types. There's the turbulence generated by the air as it moves past the walls of the vocal tract, usually called 'channel frication'. This is just 'drag', resistance to the free flow of air. If the air is blown against (instead of across) an object, you get even more turbulence, which we sometimes call 'obstacle frication'. For instance, when we make an [s], a jet of air is blown against the front teeth; the sudden displacement results in a lot of turbulence, and therefore noise. In spectrograms, noise is 'snowy'. The energy is placed in frequency and amplitude more randomly rather than being organized neatly into striations or clear bands. (Not to say there aren't or can't be bands. They just usually don't have 'edges' to the degree that formants do. Or may.)

We'll return to voicing and voicelessness below, after we deal with vowels.

So what's the deal with formants?

A formant is a dark band on a wide band spectrogram, which corresponds to a vocal tract resonance. Technically, it represents a set of adjacent harmonics which are boosted by a resonance in some part of the vocal tract. Thus, different vocal tract shapes will produce different formant patterns, regardless of what the source is doing. Consider the spectrograms in Figure 3, which represent the simplex vowels of American English (at least in my voice). In the top row are 'beed, bid, bade, bed, bad' (i.e. [bid] [bɪd] [beɪd] [bɛd] [bæd]). Notice that as the vowels get lower in the 'vowel space', the first formant (formants are numbered from the bottom up) goes up. In the bottom row, the vowels rise from 'bod' to 'booed': the F1 starts relatively high and goes down, indicating that the vowels start low and move toward high. The first formant correlates (inversely) to height (or directly to openness) of the vocal tract.

Now look at the next formant, F2. Notice that the back, round vowels have a very low F2. Notice that the vowel with the highest F2 is [i], which is the frontmost of the front vowels. F2 corresponds to backness and/or rounding, with fronter/unround vowels having higher F2s than backer/rounder vowels. It's actually much more complicated than that, but that will do for the beginner. If you're picky about facts or the math, take a class in acoustic phonetics.

Figure 3. Wide band spectrograms of the vowels of American English in a /b__d/ context.
Top row, left to right: [i, ɪ, eɪ, ɛ, æ]. Bottom row, left to right: [ɑ, ɔ, o, ʊ, u].

There are a variety of studies showing various acoustic correlates of vowel quality, among them formant frequency, formant movement, and vowel duration. Formant frequency (and movement) are probably the most important. So we can plot vowels in an F1xF2 vowel space, where F1 corresponds (inversely) to height, and F2 corresponds (inversely) to backness, and we'll end up with something like the standard 'articulatory' vowel space.
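To make the F1xF2 space a bit more tangible, here is a small Python sketch that treats each vowel as a point and finds the nearest reference vowel to a measured pair of formants. The reference numbers are not from my spectrograms: they are the widely quoted average values for adult male speakers from Peterson and Barney (1952), used here as an assumption, and plain distance in Hz is a crude measure chosen only to keep the example short.

```python
import math

# Assumed reference values (F1, F2) in Hz: commonly cited Peterson & Barney
# (1952) averages for adult male speakers. Check the original before relying on them.
REFERENCE_VOWELS = {
    "i": (270, 2290),
    "ɪ": (390, 1990),
    "ɛ": (530, 1840),
    "æ": (660, 1720),
    "ɑ": (730, 1090),
    "ɔ": (570, 840),
    "ʊ": (440, 1020),
    "u": (300, 870),
}

def nearest_vowel(f1_hz, f2_hz):
    """Reference vowel closest to the measured point (straight-line distance in Hz)."""
    return min(REFERENCE_VOWELS,
               key=lambda v: math.dist((f1_hz, f2_hz), REFERENCE_VOWELS[v]))

def ordered_by_height():
    """Lowest F1 first, i.e. highest vowel first, since F1 runs inversely to height."""
    return sorted(REFERENCE_VOWELS, key=lambda v: REFERENCE_VOWELS[v][0])

def ordered_by_frontness():
    """Highest F2 first, i.e. frontest vowel first."""
    return sorted(REFERENCE_VOWELS, key=lambda v: -REFERENCE_VOWELS[v][1])

print(nearest_vowel(280, 2250))   # low F1, very high F2
print(nearest_vowel(700, 1100))   # high F1, lowish F2
print(ordered_by_height())
print(ordered_by_frontness())
```

A low F1 with a very high F2 lands on [i], and a high F1 with a lowish F2 lands on [ɑ]. Sorting by F1 puts the high vowels first and the low vowels last, and sorting by F2 puts the front vowels ahead of the back ones, which is the articulatory vowel chart recovered from two numbers per vowel.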

Note that some of the vowels in Figure 3 ([eɪ] and [ʊ] especially) show more movement during the vowel (beyond just the transitions). Whether that makes them diphthongs (or should be represented like diphthongs) I'll leave for somebody else to argue. But before we get too far, what would you imagine an [aɪ] or [aʊ] diphthong would look like?

It's worth pointing out now that all the formants show consonant transitions at the edges. Remember that the frequency of any given formant has to do with the size and shape of the vocal tract: as the vocal tract changes shape, so do the formants change frequency. So the way the formants move into and out of consonant closures and vowel 'targets' is an important source of information about how the articulators are moving.

Manner (and place) of articulation

Plosives (oral stops) involve a total occlusion of the vocal tract, and thus a 'complete' filter, i.e. no resonances being contributed by the vocal tract. The result is a period of silence in the spectrogram, known as a 'gap'. A voiced plosive may have a low-frequency voicing bar of striations, usually thought of as the sound of voicing being transmitted through the flesh of the vocal tract. However, due to passive devoicing, it may not. And due to perseverative voicing even a 'voiceless' plosive may show some vibration as the pressures equalize and before the vocal folds fully separate. But let's not get lost in too many details.

Generally we can think about the English plosives as occurring at three places of articulation: at the lips, behind the incisors, and at the velum (with some room to play around each). The bilabial plosives, [p] and [b], are articulated with the lower lip pressed against the upper lip. The coronal plosives [t,d] are made with the tongue blade pressing against the alveolar ridge (or thereabouts). [k] and [g] are described as 'dorsal' (meaning 'articulated with the tongue body') and 'velar' (meaning 'articulated against or toward the velum'), depending on your point of view. (I tend to use 'dorsal' and 'velar' interchangeably, which is very bad. I use 'coronal' because it's more accurate than 'alveolar', in the sense that everybody uses their tongue blade (if not the apex) for [t,d], but not everybody uses only their alveolar ridge.)

That controversy aside, the thing to remember is that during a closure, there's no useful sound coming at you; there's basically silence. So while the gap tells you it's a plosive, the transitions into and out of the closure (i.e. in the surrounding vowels) are going to be the best source of information about place of articulation. Figure 4 contains spectrograms of me saying 'bab', 'dad' and 'gag'.

There's no voicing during the initial closure of any of these plosives, confirming what your teachers have always told you: 'voiced' plosives in English aren't always fully voiced during closure. Then suddenly, there's a burst of energy and the voicing begins, goes for a couple hundred milliseconds or so, followed by an abrupt loss of energy in the upper frequencies (above 400 Hz or so), followed by another burst of energy, and some noise. The first burst of energy is the release of the initial plosive. Notice the formants move or change following the burst, hold more or less steady during the middle of the vowel, and then move again into the following consonant. We know there's a closure because of the cessation of energy at most frequencies. The little blob of energy at the bottom is voicing, only transmitted through flesh rather than resonating in the vocal tract. Look closely, and you'll see that it's striated, but very weak. The final burst is the release of the final plosive, and the last bit of noise is basically just residual stuff echoing around the vocal tract.

Take a look at those formant transitions out of and into each plosive. Notice how the transitions in the F2 of 'bab' point down (i.e. the formant rises out of the plosive and falls into it again), where the F2 of 'gag' points up? Notice how in 'gag' the F2 and F3 start out and end close together? Notice how the F3 of 'dad' points slightly up at the plosives? Notice how the F1 always starts low, rises into the vowel, and then falls again?

Okay, these aren't necessarily the best examples, but basically, labials have downward pointing transitions (usually all visible formants, but especially F2 and F3), dorsals tend to have F2 and F3 transitions that 'pinch' together (hence 'velar pinch'), and the F3 of coronals tends to point upward. The direction any transition points obviously is going to depend on the position of the formant for the vowel, so F2 of [t,d] might go up or down. A lot of people say coronal transitions point to about 1700 or 1800 Hz, but that's going to depend a lot on speaker-individual factors. Generally, I think of coronal F2 transitions as pointing upward unless the F2 of the vowel is particularly high.
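Those rules of thumb can be written down as a toy decision procedure. The Python sketch below is only that, a toy: it compares F2 and F3 at the edge of the closure with their values in the middle of the vowel and applies the three heuristics in the paragraph above. The function name, the 500 Hz 'pinch' threshold and the example formant values are all hypothetical, invented for illustration, and real spectrograms will break these rules regularly.

```python
def guess_place(f2_edge, f3_edge, f2_vowel, f3_vowel, pinch_hz=500.0):
    """Guess plosive place from formant values (Hz) at the closure edge vs. mid-vowel.

    'Pointing down' means the formant is lower at the consonant edge than in
    the vowel. pinch_hz is an assumed threshold, not a measured constant.
    """
    gap_at_edge = f3_edge - f2_edge
    gap_in_vowel = f3_vowel - f2_vowel
    # Velar pinch: F2 and F3 come close together at the closure.
    if gap_at_edge < pinch_hz and gap_at_edge < gap_in_vowel:
        return "dorsal"
    # Labials: F2 and F3 both point down toward the closure.
    if f2_edge < f2_vowel and f3_edge < f3_vowel:
        return "labial"
    # Coronals: F3 points up toward the closure.
    if f3_edge > f3_vowel:
        return "coronal"
    return "unclear"

# Hypothetical measurements around a vowel with F2 = 1700 Hz and F3 = 2500 Hz.
print(guess_place(1400, 2300, 1700, 2500))  # both formants dip at the edge
print(guess_place(2100, 2400, 1700, 2500))  # F2 and F3 converge at the edge
print(guess_place(1750, 2700, 1700, 2500))  # F3 rises at the edge
print(guess_place(1800, 2400, 1700, 2500))  # none of the patterns fits
```

The pinch test is checked first because a rising F2 meeting a falling F3 would otherwise be ambiguous. The 'unclear' outcome is there on purpose: when the transitions don't fit any pattern, the burst and the lexical context have to do the work instead.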

Another thing to notice is the burst energy. Notice that the bursts for 'dad' are darker (stronger) than the others. Notice also that they get darker in the higher frequencies than the lower. The energy of the bursts in 'gag' is concentrated in the F2/F3 region, and less in the higher frequencies. The burst of [b] is sort of broad: across all frequencies, but concentrated in the lower frequencies, if anywhere. So bursts and transitions also give you information about place.

Figure 4 also illustrates that in initial position, phonemic /b, d, g/ tend to surface with no voicing during the closure, but a short voice onset time, i.e. as unaspirated [p, t, k]. In final position, they tend to surface as voiced, although there's room for variation here too.

Fricatives

Frankly, fricatives are not my favorite. They're acoustically and aerodynamically complex, not to mention phonologically and phonetically volatile. There's not a lot you can say about them without getting way too complicated, but I'll try.

Fricatives, by definition, involve an occlusion or obstruction in the vocal tract great enough to produce noise (frication). Frication noise is generated in two ways, either by blowing air against an object (obstacle frication) or moving air through a narrow channel into a relatively more open space (channel frication). In both cases, turbulence is created, but in the second case, it's turbulence caused by sudden 'freedom' to move sideways (Keith Johnson uses the terrific analogy of a road suddenly widening from two to four lanes, with a lot of sideways movement into the extra space), as opposed to air crashing around itself having bounced off an obstacle (Keith's freeway analogy of a road narrowing from four lanes to two works here, but I don't really want to think about serious sibilance in this respect....)

Sibilant fricatives involve a jet of air directed against the teeth. While there is some (channel) turbulence, the greater proportion of actual noise is created by bouncing the jet of air against the upper teeth. The result is very high amplitude noise. Non-sibilant fricatives are more likely 'pure' channel fricatives, particularly bilabial and labiodental fricatives, where there's not a lot of stuff in front to bounce the air off of.

In Figure 5, there are spectrograms of the fricatives, extracted from a nonce word ('uffah', 'ussah', etc.).

Figure 5. Top row, left to right: [f, θ, s, ʃ]. Bottom row, left to right: [v, ð, z, ʒ].

Let's start with the sibilants 's' and 'sh', in the upper right of Figure 5. They are by far the loudest fricatives. The darkest part of [s] noise is off the top of the spectrograms, even though these spectrograms have a greater frequency range than the others on this page. [s] is centered (darkest) above 8000 Hz. The postalveolar 'sh', on the other hand, while almost as dark, has most of its energy concentrated in the F3-F4 range. Often, [s]s will have noise at all frequencies, where, as here, the noise for [ʃ] seems to drop off drastically below the peak (i.e. there's sometimes no noise below 1500 or 2000 Hz). [z] and [ʒ] are distinguished from their voiceless counterparts by a) lesser amplitude of frication, b) shorter duration of frication and c) a voicing bar across the bottom. (Remember, however, that a lot of underlyingly voiced fricatives in English have voiceless allophones. What other cues are there to underlying voicing? Discuss.) Take a good look at the voicing bar through the fricatives in the bottom row. You may never see a fully voiced fricative from me again.

It's worth noting that F2 transitions are greater and higher with [ʃ] than with [s], and I seem to depress F4 slightly in [ʃ], but I don't know how consistent these markers are.

Labiodental and (inter)dental (nonsibilant) fricatives are notoriously difficult to distinguish, since they're made at about the same place in the vocal tract (i.e. the upper teeth), but with different active articulators. Having established (in a mystery spectrogram) that a fricative isn't loud enough to be a sibilant, you can sometimes tell from transitions whether it is labiodental or interdental: labiodentals will have labial-looking transitions, interdentals might have slightly more coronal-looking transitions. But that's poor consolation. Often underlying labiodental and interdental fricatives don't have a lot of noise in the spectrogram at all, looking more like approximants. Sometimes they lenite into approximants, or fortisize to stoppy-looking things. I hate fricatives.

Before moving on, we need to talk about [h]. [h] is always described as a glottal fricative, but since we know about channels and such, it's not clear where the noise actually comes from. Aspiration noise, which is also [h]-like, is produced by moving a whole lot of air through a very open glottis. I heard a paper once where they described the spectrum of [h]-noise as 'epiglottal', implying that the air is being directed at the epiglottis as an obstacle. Generally speaking, we don't think of the vocal cords moving together to form a 'channel' in [h], although breathy-voicing and voiced [h]s in English (as many intervocalic [h]s are produced) may be produced this way. So I don't know. What I do know about [h]s is that the noise is produced far enough back in the vocal tract that it excites all the forward cavities, so it's a lot like voicing in that respect. It's common to see 'formants' excited by noise rather than harmonics in spectrograms of [h]. Certainly, the noise will be concentrated in the formant regions. Compare the spectrograms in Figure 6.

Notice how different the frication looks in each spectrogram. In 'hee', the noise is concentrated in F2, F3 and higher, with very little in the 1000 Hz range. In 'ha', in which F1 and F2 straddle 1000 Hz, the [h] noise is right down there. In 'who', there is a lot less amplitude to the noise between 2000 and 3000 Hz, but around F2 (around 1000 Hz) and lower, there's a great deal. You can even see F2 really clearly in the [h] of 'who'. So that's [h]. Don't ask me. It's not very common in my spectrograms....

Nasal stops

Nasals have some formant structure, but are better identified by the relative 'zeroes' or areas of little or no spectral energy. In Figure 7, the final nasals have identifiable formants that are lesser in amplitude than in the vowel, and the regions between them are blank. Nasality on vowels can result in broadening of the formant bandwidths (fuzzying the edges), and the introduction of zeroes in the vowel filter function. Nasals can be tough, and I hope to get someone who knows more about them than I do to say something else useful about them. You can sometimes tell from the frequency of the nasal formant and zero what the place of articulation was, but it's usually easier to watch the formant transitions. (This is particularly true of initial nasals; final nasals I usually don't worry about: if you can figure out the rest of the word, there are only three possible nasals it could end with.) (Actually, being loose with the amount of information you actually have before you start trying to fit words to the spectrogram is one of the tricks to the whole operation.)

Figure 7. Spectrograms of 'dinner', 'dimmer', 'dinger'.

The real trick to recognizing nasal stops is a) formant structure, but b) relatively lower-than-vowel amplitude. Place of articulation can be determined by looking at the formant transitions (they are stops, after all), and sometimes, if you know the voice well, the formant/zero structure itself. Comparing the spectrograms above, we can see that 'dinger' (far right) has an F2/F3 'pinch': the high F2 of [ɪ] moves up and seems to merge with the F3. In the nasal itself, the pole (nasal formant) is up in the neutral F3 region. 'Dinner' (middle) has a pole about 1500 Hz and a zero (a region of low amplitude) below it until you get down to about 500 Hz again. The pole for [m] in 'dimmer' is lower, closer to 1000 Hz, but there's still a zero between it and what we might call F1. Note also that the transitions moving into the [m] of 'dimmer' are all sharply down-pointing, even in the higher formants, a very strong clue to labiality, if you're lucky enough to see it.

Approximants

In case you're not familiar with the term (generally attributed to Ladefoged's Phonetic Study of West African Languages or as modified in Catford's Fundamental Problems in Phonetics), the approximants are non-vowel oral sonorants. In English, this amounts to /l, r, w, j/. They are characterized by formant structure (like vowels), but constrictions of about the degree of high vowels or slightly closer. Generally there's no friction associated with them, but the underlying approximants can have fricative allophones, just as fricative phonemes can occasionally have frictionless (i.e. approximant) allophones.

Canonically, the English approximants are those consonants which have obvious vowel allophones. The classic examples are the [j-i] pair and the [w-u] pair. I have argued that [ɹ] is basically vowel-like in structure, i.e. that syllabic /r/ is the most basic allophone, but there are those who disagree. Syllabic [l]s are all at least plausibly derived from underlying consonants, but I'm guessing that'll change in the next hundred years.

Figure 8. Spectrograms of 'ball', 'bar', 'bough', 'buy'.

In Figure 8, the approximants are presented in coda/final position, where the formant transitions are easiest to discern. Note that in all four words, the F1 is mid-to-high, indicating a more open constriction than with a typical high vowel. For /l/, the F2 is quite low, indicating a back tongue position, the velarization of 'dark l' in English. The F3, on the other hand, is very high, higher than one ever sees unless the F2 is pushing it up out of the way. In 'bar', the F3 comes way down, which is characteristic of [ɹ] in English. Compare the position of the F3 in 'bar' with that in 'bough' and 'buy', where the F3 is relatively unaffected by the constriction.

In 'bough', the F2 is very low, as the tongue position is relatively back and the lips are relatively rounded. Note that this has no effect on F3, so let it be known that lip rounding has minimal effect on F3. Really. The next reviewer who brings up lip rounding without having some data to back it up is going to get it between the eyes. It's worth noting that the nuclear part of the diphthong is fronter (as indicated by the F2 frequency in the first half of the diphthong) in [aʊ] than in [aɪ]. In 'buy', the offglide has a clearly fronting (rising) F2.

Common allophonic variation

One of the absolutely characteristic features of American English is'flapping'. This is when an underlying /t/ (and sometimes /d/), is repaced bysomething which sounds a lot like a tapped /r/ in languages with tapped /r/s. Irefer the reader to Susan Banner-Inouye's M.A. and Ph.D. theses on thephonological and phonetic interpretations of flappy/tappy things in general. Butthe easiest thing to do is compare them. The spectrograms in Figure 9 are of me reading 'a toe', 'a doe' and 'otto', with an aspirated /t/, voiced /d/ and a flap respectively.

Figure 9. Spectrograms of 'a toe', 'a doe' and 'otto'.

Note that for both proper plosives, there's a longish period of relative silence (with a voicing bar in the case of /d/), on the order of 100 ms. The actual length varies a lot, but notice how short the 'closure' of the flapped case is in comparison. It's just a slight 'interruption' of the normal flow, a momentary thing, not something that looks very forceful or controlled. It doesn't even really have any transitions of its own. The interruption is something on the order of three pulses long, between 10 and 30 ms. That's basically the biggest thing. Sometimes they're longer, sometimes they're voiceless (occasionally even aspirated), but basically a flap will always be significantly shorter than a corresponding plosive.
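
If you want to check the arithmetic behind 'three pulses, between 10 and 30 ms', here is a minimal sketch in Python. Each glottal pulse lasts one period of the fundamental, so three pulses last three periods. The function name and the f0 values are my own illustrative choices, not measurements from Figure 9; they just show which pitch range corresponds to that duration range.

```python
def pulses_to_ms(n_pulses, f0_hz):
    """Duration in milliseconds of n_pulses glottal pulses at a
    fundamental frequency of f0_hz (one pulse = one period = 1/f0 s)."""
    return 1000.0 * n_pulses / f0_hz


# Illustrative f0 values only (assumed, not taken from the recordings).
for f0 in (100, 150, 300):
    print(f"{f0} Hz: 3 pulses = {pulses_to_ms(3, f0):.1f} ms")
```

So a three-pulse interruption is about 30 ms at 100 Hz and about 10 ms at 300 Hz, which is well short of the roughly 100 ms closure of the proper plosives.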

Okay, so let's turn back to the proper plosives. Notice the aspiration following the /t/, and the short VOT following the /d/. Note the dying-off voicing during the /d/ closure, presumably due to a build up of supralaryngeal pressure. (Frankly, we're lucky to get any real voicing during the closure at all.)

(Other big allophonic categories I want to cover are nasalized vowels and rhoticized vowels, but I'm wondering how important they are at this level. Remember that this is a primer, not the be-all and end-all work on spectrogram reading. Also worth doing is some prosodic stuff, pitch and duration, amplitude and that kind of thing, as it relates to finding word and phrase boundaries in spectrogram reading. Comments?)

Is that it?

Well, obviously not. But it should be enough to get you started reading the monthly mystery spectrogram. We could go on and on about various things, but that's not the point right now. Remember, identify the features you can, try to guess some words, hypothesize, and then see if you can use your hypotheses to fill in some of the features you're unsure about. Do some lexical access, try some phrases, and see how well you do. Reading spectrograms, like transcription and so many other things, can be taught in a short time, but takes a long time and experience to learn. But then that's why we're here, right?

Robert Hagiwara, Ph.D.
Dept. of Linguistics
University of Manitoba
Winnipeg, Manitoba
CANADA R3T 5V5

Introduction

Up to now, we have assumed that all sounds are equal. In this chapter, I finally single out speech as being the most equal, though I may refer to music every now and then.

One of the goals of this chapter is for you to become familiar with acoustic representations, such as that of a recording of me pronouncing the vowels [ieaou]:

I open it in Praat to the following screen; you can download it yourself from here if you want to follow along at home:

Fig. 15 The author’s pronunciation of [ieaou].

This image stacks two graphs one on top of the other.

Whenever you are confronted with a graph, the first thing that you should do is look at the labels on its axes. The graphs above have two of them, a horizontal one which is usually called “x” and a vertical one which is usually called “y”.

Questions

What is the x axis? What is its range?

I tried to articulate each vowel for the same length of time. How long is this?

Praat has set the cursor (red line) at the middle of the recording by default. How long is each half?

What are the two y axes? What is the range of the first? What about the second?

Why does the last vowel in the top graph look different from the others?

You should be able to figure out most of the answers. I’ll give you a hand with the two hardest ones.

The top graph is extremely unhelpful, since its three measurements, -0.1416, 0, and 0.145 are not labeled with any unit. They are measures of a sound’s amplitude. In theory, this should be pressure, whose unit is the pascal, though I have not found any mention of it in the Praat documentation. Praat normalizes this amplitude tracing to its highest and lowest measures – here nearly 1 and -1. These extremes indicate the levels at which the equipment used to record the sound and the equipment used to play it back reach their limits of fidelity, beyond which the sound will be distorted.
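
To make the idea of reading off a waveform's extremes concrete, here is a minimal sketch in Python rather than Praat. It builds a short synthetic tone, finds its lowest and highest sample values, and checks whether any sample reaches the assumed limits of -1 and 1. The tone, its peak of 0.145, the sample rate and the function names are all made up for illustration; this is not the recording above, and it is not how Praat itself computes anything.

```python
import math


def synth_tone(freq_hz=220.0, peak=0.145, duration_s=0.05, rate_hz=44100):
    """Return a list of samples for a sine tone (illustrative values only)."""
    n = int(duration_s * rate_hz)
    return [peak * math.sin(2 * math.pi * freq_hz * i / rate_hz)
            for i in range(n)]


def amplitude_extremes(samples):
    """Lowest and highest sample value, like the numbers beside a waveform."""
    return min(samples), max(samples)


samples = synth_tone()
lo, hi = amplitude_extremes(samples)

# Assumption: samples are stored on a scale from -1 to 1, so a sample at
# either limit suggests the signal may have been clipped (distorted).
clipped = any(abs(s) >= 1.0 for s in samples)

print(f"lowest {lo:.4f}, highest {hi:.4f}, clipped: {clipped}")
```

A tone this quiet stays far from the limits, so nothing is flagged as clipped; a recording whose extremes sit right at -1 and 1 would deserve a closer look.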

The last vowel has a lesser ‘footprint’ because we tend to reduce our speech towards the end of an utterance, presumably to save effort. Thus the amplitude tracing at the top gradually shrinks, while the frequency trace underneath gets gradually fainter.

By the way, a graph with two axes is often called two dimensional, under the assumption that each axis measures a different dimension.