
At some earlier point in your life, you probably played the sound game of repeating one word over and over again until it lost its meaning altogether. Though we rational adults rarely sit around doing that, the game does indicate some interesting ideas about speech. We use speech every day and, for the most part, we take it for granted. But when we think of words as sounds with a range of amplitudes and frequencies, then we might wonder what speech sounds actually are.

Speech consists of complex acoustical waveforms that rapidly vary in frequency and amplitude. On one level, speech signals can be thought of as mere signals with temporal, spectral and dynamic properties. We know, however, that there is much more to speech than this, for these acoustic signals carry semantic meanings that embody the subtleties and richness of language. These sounds allow us to express ourselves, communicate with each other and share ideas and feelings. Sounds ranging from the words of Shakespeare to a clothing ad droning on the radio to an intimate conversation with a friend all arrive at our ears as small pressure fluctuations in the air. These fluctuations are detected and converted into nerve impulses. The brain's inscrutable mechanisms are then able to attach meaning to these impulses. By understanding something about the nature of speech sounds, we can begin to understand how the sound systems we design and install can affect these complex waveforms and the intelligibility of the broadcast message or announcement.

THE PARTS OF SPEECH

Let's start by taking a look at some typical speech sounds and waveforms. Figure 1 shows the waveform of the word speech. We can see that it effectively consists of three distinct components separated in time. First is the \s\ sound followed by the \pē\ and finally the \ch\. Even from these simple time traces we can see that the three components are quite different.

As an experiment, try saying the word speech slowly out loud. (You may want to try this in a private place to avoid any strange glances.) As you do, you can feel that different parts of your vocal tract and mouth are required to make the different sounds. The \s\ sound, for example, is made by opening the mouth and expelling air through the cavity made by the lips and tongue. The \pē\ sound is made by closing the lips for the \p\ and vibrating the vocal cords for the \ē\. The \ch\ comes from air being expelled from the mouth while manipulated by the tongue and lips. You can feel firsthand with this brief experiment that speech sounds are not produced in just one place in the body but in a combination of areas from the lungs to the throat to the tongue, teeth, palate and lips.

Try saying the word speech again, but this time pinch your nose closed first. It sounds very different, doesn't it? Let's add the nasal passages and sinuses to the body parts that affect and help generate speech. (Okay. The experiments are over. You can come out of the closet now and carry on reading the rest of this piece normally.)

WHAT DOES SPEECH LOOK LIKE?

A visual dissection and analysis of the word speech can be quite illuminating as we progress toward some conclusions about the best parameters for speech amplification systems. The \s\ sound is a high-frequency sound, as is the \ch\, but the \pē\ has a much broader spectrum. Figure 1 contains a speech spectrograph that clearly shows that the \s\ has components at around 3 and 7 kHz while the \ch\ is slightly broader, with the first band of frequencies at around 1.8 to 3 kHz and a second band at 5 kHz. The \pē\ covers the range from around 150 Hz to 3.5 kHz.

Figure 2 shows the waveforms for three spoken letters A, E and S. (The letters were said fairly slowly so that we could better see the waveforms.) The A lasts for around 400 ms; the E, 300 ms; and the S about 500 ms. The S is formed by two component sounds: the short \e\ and the \s\. The \ā\ and the \ē\ sounds, as vowels, are produced by the vocal tract and vibration of the vocal cords. They therefore consist of a series of resonant notes or sounds. This can partly be seen by the regular structure of the waveform; however, a better view of this can be obtained by converting to the frequency domain by means of a fast Fourier transform.
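As a rough illustration of what converting to the frequency domain does, the sketch below analyzes a synthetic, vowel-like test tone instead of a real recording. The sample rate, length and harmonic amplitudes are assumptions chosen so that each component falls exactly on an analysis bin, and a plain discrete Fourier transform stands in for the fast Fourier transform (it yields the same spectrum, only more slowly) so that nothing beyond the Python standard library is needed.

```python
import cmath
import math

SAMPLE_RATE = 2000   # Hz (assumed for this sketch)
NUM_SAMPLES = 400    # 0.2 s of signal, giving 5 Hz wide analysis bins
# Hypothetical harmonic series: frequency in Hz -> relative amplitude
HARMONICS = {125: 1.0, 250: 0.8, 375: 0.6, 500: 0.4}


def synth_vowel():
    """Build a steady test tone from the assumed harmonic series."""
    return [
        sum(a * math.sin(2 * math.pi * f * i / SAMPLE_RATE)
            for f, a in HARMONICS.items())
        for i in range(NUM_SAMPLES)
    ]


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc) / n)
    return mags


def strongest_frequencies(samples, sample_rate, count):
    """Frequencies (Hz) of the `count` largest spectral peaks, ascending."""
    mags = dft_magnitudes(samples)
    bin_width = sample_rate / len(samples)
    ranked = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
    return sorted(k * bin_width for k in ranked[:count])


print(strongest_frequencies(synth_vowel(), SAMPLE_RATE, 4))
```

The time trace of such a signal only hints at its regular structure, whereas the spectrum separates it into discrete peaks. Real speech would need windowing and a proper FFT routine, which this sketch leaves out.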

Figure 3 shows a 3-D frequency spectrum for the sound of the A. The figure clearly shows the sound to be made up from a series of resonant peaks with the fundamental at around 125 Hz and subsequent harmonics (formants) at 250, 375 and 500 Hz. Over the short segment of the sound shown, the character of the sound hardly changes, with the amplitude and spectral peaks remaining essentially constant. The figure also clearly shows that the sound is not a continuous spectrum. Vowels are termed voiced sounds since the vocal cords vibrate in order to produce them.

Conversely, Figure 2 shows the spectrograph of the \s\ sound. It looks quite different. First, the energy is centered at a much higher frequency with the main peak around 7.5 kHz. Second, the spectrum is effectively continuous, not a series of discrete resonances. What is not clear from the figure is that this high-frequency, unvoiced consonant sound has considerably lower acoustic energy than the lower-frequency, voiced vowel sounds.

Figure 2 also includes a color spectrograph for the whole AES phrase. The figure combines the effects of time, frequency and energy by displaying time along the horizontal axis, frequency along the vertical axis and amplitude in color. Red represents the greatest amplitude (loudest) and dark blue shows the lowest amplitude (quietest). The colored striations show the resonant formant voice frequencies. Up in the far top right, the \s\ sound is seen. The high frequency and lower amplitude of this sound are immediately apparent. In fact, these high-frequency consonant components are some 20 to 30 dB lower in amplitude than the vowel formants. The implication for sound system design is that a bandwidth of at least 8 to 10 kHz is required in order to preserve these important speech components.

Another important point that can be understood by studying spectrographs is that the lowest-frequency sounds rarely contain the most energy. Often the second, third or fourth formants provide this. The spectrographs also show how the pitch or frequency of speech changes over the short term, as talkers either emphasize particular sounds or include inflection within their speech. A small indication of this can be seen in the far left-hand side of the plot about halfway between 400 and 800 Hz. Here the higher order formants can be seen to exhibit a decrease in frequency immediately after the \a\ sound begins.

WHERE DOES INTELLIGIBILITY LIVE?

These discoveries about speech are interesting for a number of reasons, and even more importantly, they have a practical use. When designing what one hopes will be a good sound system, one might wonder what the frequency spectrum is for speech over the long term, as during a several-minute talk where the speaker uses inflection and emphasis. Well, it peaks at around 200 to 400 Hz and rolls off at around 6 dB per octave above about 1 kHz. Figure 4 presents the spectrum in terms of octave bands. Here we can see that the maximum energy is in the 250Hz band with only a little less at 500 Hz. These lower-frequency bands roughly correspond to vowel sounds, whereas the higher-frequency consonant sounds congregate around the 2 and 4kHz regions, as we've seen in our prior examples. (The \i\ vowel sound is one exception with most of its energy above 2 kHz. The \s\ sound and many sibilants occur in the 4 and 8kHz bands.)

As a rule in western languages, vowels provide the power of the voice (high energy at low frequencies) while the less powerful consonants provide intelligibility. It is interesting to note that there is around a 27 to 28dB difference in the power between the strongest vowel phoneme and the weakest consonant sound, which corresponds to a range of around 500:1. Studies demonstrate that the 125 and 250Hz bands provide little intelligibility, and the 500Hz band only provides around 12% intelligibility. However, these lower frequencies are important for talker recognition and the overall rhythm of the speech. The 2kHz band contributes the most intelligibility. Together with the 4kHz band, it provides some 57% of it; and although research has given varied results, the importance of the 2kHz band is universally acknowledged. The importance of ensuring that a sound system can adequately reproduce and radiate the 2 and 4kHz bands can be readily understood.
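The link between a decibel difference and a power ratio is simple arithmetic: a difference of N dB corresponds to a power ratio of 10^(N/10). The short sketch below (the function names are illustrative) checks the figures quoted for the vowel-to-consonant range.

```python
import math


def db_to_power_ratio(db):
    """Power ratio corresponding to a level difference in decibels."""
    return 10 ** (db / 10)


def power_ratio_to_db(ratio):
    """Level difference in decibels corresponding to a power ratio."""
    return 10 * math.log10(ratio)


print(round(db_to_power_ratio(27)))      # 501
print(round(db_to_power_ratio(28)))      # 631
print(round(power_ratio_to_db(500), 1))  # 27.0
```

So a 27 dB difference is almost exactly 500:1 in power, and 28 dB is a little over 600:1.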

An interesting experiment was carried out by a Dr. Warren in 1995. He produced two widely separated and very narrow bands of speech, one centered at 370 Hz and one at 6 kHz. Subjects would try to identify words after hearing only one of these bands. The 370Hz band produced only 0.9% correct word scores; and the 6kHz band, 10.4%. However, when combined together, the score did not merely add up to 11.3% but leapt up to 27.8%. This shows that separate bands, with small intelligibility contributions in isolation, contribute significantly to intelligibility when presented in conjunction with complementary bands. This result highlights one of the prejudgments our current, simple models of intelligibility measurement need to overcome. Intelligibility is more complicated than we think. Although we can see where its frequencies are concentrated, we have to acknowledge that it extends over the entire audible range and is subtler than we now suspect.

STRIVING FOR CLARITY

To be intelligible, speech signals need to be able to withstand the degradations of the local acoustic environment and, if reinforced or amplified, the transmission through the sound system. There are many factors that can affect speech intelligibility.

PRIMARY FACTORS
Building or System Issues
  • Room reverberation time (RT60)
  • Volume, size and shape of the space
  • Distance from the listener to a loudspeaker
  • Directivity of the loudspeaker
  • The number of loudspeakers operating within the space
  • The direct-to-reverberant ratio (This is directly dependent upon the previous five factors)*
  • Loudness and signal-to-noise ratio
  • Sound system bandwidth and frequency response
  • Intelligibility
Human Factors
  • Talker enunciation/rate of delivery
  • Listener acuity
SECONDARY FACTORS
Building or System Issues
  • System distortion (harmonic or intermodulation)
  • System equalization
  • Uniformity of coverage
  • Presence of very early reflections (less than 2 ms)
  • Sound focusing or presence of late or isolated higher-level reflections (greater than 70 ms)
  • Direction of sound arriving at the listener
  • Direction of any interfering noise
Human Factors
  • Gender of speaker
  • Vocabulary and context of speech information
  • Talker microphone technique

It is not possible to describe the above effects in detail (you will have to wait for the book to come out when I retire), but let's look at the two most common factors to get a feel for the problem. We'll indirectly cover some of the others as well.

In order to be intelligible, received speech needs to be loud enough to be comfortably heard and to overcome the effects of any local background noise. Over the years, a number of rules of thumb have been developed to gauge the required level. Interestingly, at 0 dB SNR, speech can be well understood. For those of you familiar with RaSTI and STI, this approximates to a value of 0.5, which is roughly equivalent to 10% Alcons. This is not a particularly desirable target and would be judged as rather unsatisfactory under most conditions; however, it is surprising how many passenger aircraft P.A. systems are set up to about this level. An SNR of at least 6 dBA is required for most situations, and 10 dBA is an appropriate goal. Minor improvement can be obtained above 15 dBA, but the law of diminishing returns sets in so that no improvement generally results above 25 dBA.

The situation significantly depends on the spectrum of the interfering noise. Ideally, a spectrum analysis should be carried out and the speech signal compared to it (illustrated in Figure 5). The upper figure shows an example in which the speech signal is clearly greater than the noise over the complete frequency range, resulting in good intelligibility. The lower curve depicts a situation where noise masks the speech signal at high frequencies and will, therefore, be deleterious to intelligibility as the high-frequency consonant information is lost. Octave-band analysis of a space can help predict to a good degree of accuracy just how intelligible speech will be (assuming no other detrimental influences such as reverberation, poor speaker enunciation, etc.).
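The band-by-band comparison can be sketched in a few lines. Every level below is a hypothetical value made up for illustration (not read from Figure 5), and the 0 dB threshold for calling a band "masked" is likewise an assumption; a real assessment would use measured octave-band levels.

```python
OCTAVE_BANDS_HZ = [125, 250, 500, 1000, 2000, 4000, 8000]

# Hypothetical octave-band levels in dB, for illustration only.
SPEECH_DB = {125: 55, 250: 62, 500: 60, 1000: 55, 2000: 49, 4000: 43, 8000: 37}
HISS_NOISE_DB = {125: 40, 250: 42, 500: 44, 1000: 46, 2000: 48, 4000: 50, 8000: 50}


def band_snr(speech_db, noise_db):
    """Signal-to-noise ratio (dB) in each octave band."""
    return {band: speech_db[band] - noise_db[band] for band in OCTAVE_BANDS_HZ}


def masked_bands(speech_db, noise_db, min_snr_db=0.0):
    """Bands where the speech level falls below the noise by the given margin."""
    snr = band_snr(speech_db, noise_db)
    return [band for band in OCTAVE_BANDS_HZ if snr[band] < min_snr_db]


print(band_snr(SPEECH_DB, HISS_NOISE_DB))
print(masked_bands(SPEECH_DB, HISS_NOISE_DB))  # [4000, 8000]
```

With these made-up numbers the broadband levels look comfortable, yet the 4 and 8 kHz bands, where much of the consonant information sits, are buried in the noise.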

Determining the SNR of a speech signal might be thought of as a relatively trivial matter, but it is, in fact, rather complex. Normal speech incorporates a wide dynamic range, in both the short and long term. Figure 6 helps explain this. It shows sound pressure level over a 1-minute speech extract.

As the plot shows, the second-by-second SPL varies substantially. The speech peaks are around 75 to 85 dBA, but at times the level drops to 50 dBA. A statistical level analysis reveals the mean value of the signal to be just 63 dB; and the energy content of the signal as described by the Leq is 69.3 dB. The L10 (a quasi-equivalent average max level) is 73.6 dB, whereas the maximum level was 84.3 dB. In practice, the L10 seems to give a good correlation to the subjective loudness and perception of the signal, but this requires sophisticated instrumentation to measure. Leq facilities are appearing on many meters, and as this does the averaging for the operator, there is a move in some quarters to adopt it as the standard measure (although it will effectively under-measure currently established values).
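To see why these statistics differ, here is a minimal sketch that computes them from a list of short-term level readings. The readings are hypothetical, not the data behind Figure 6, and the exceedance level uses a simple ranking that is only one of several conventions a meter might apply.

```python
import math


def arithmetic_mean_db(levels_db):
    """Plain average of the dB readings."""
    return sum(levels_db) / len(levels_db)


def leq_db(levels_db):
    """Equivalent continuous level: average the energies, then convert back to dB."""
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)


def exceedance_level_db(levels_db, percent):
    """Level exceeded for `percent` of the readings (L10 when percent is 10)."""
    ranked = sorted(levels_db, reverse=True)
    index = min(len(ranked) - 1, int(len(ranked) * percent / 100))
    return ranked[index]


# Hypothetical second-by-second readings in dB, for illustration only.
readings = [50, 60, 60, 65, 65, 70, 70, 75, 80, 85]

print(arithmetic_mean_db(readings))
print(round(leq_db(readings), 1))
print(exceedance_level_db(readings, 10))
print(max(readings))
```

Because the Leq averages energy and not decibels, the loudest moments dominate it, so it sits above the arithmetic mean but below the L10 and the maximum, the same ordering as in the measured extract.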

The effect of reverberation can be seen in Figure 7. The old standby phrase, “One, two,” is used as the test signal. The significant distortion of the waveform is clearly observable; however, speech with this degree of reverberant distortion (a reverberation time of 2.4 seconds) is still intelligible. In this case it was measured at 0.45 RaSTI, or the equivalent of 15% Alcons.

These two examples demonstrate the robustness of a typical speech signal. While we can come close to measuring the effects of simple distortions on potential intelligibility, adding in other items from the extensive list of factors above causes the metrics to fall apart immediately. This just goes to show what a long way we still have to go before we can truly measure the potential intelligibility of a sound system. And it demonstrates just how mysterious and clever the human brain really is to be able to compensate for these anti-intelligibility effects so much better than our most advanced tools.


Peter Mapp is principal of Peter Mapp Associates, an acoustic consultancy practice based in Colchester, England. Peter is S&VC's sound reinforcement consultant and can be contacted at petermapp@btinternet.com.


CONSONANT PHONEMES: ANCHORS OF INTELLIGIBILITY

English and American speech is composed of about 42 basic sounds or phonemes. They are classified by where their sounds are made in the vocal tract or mouth. Most linguists reckon between 17 and 21 distinct vowel sounds, all created by air passing over the vocal cords, shaped by the mouth.

Consonants come in several varieties based on which parts of the mouth (lips, tongue, teeth, palate) and throat (soft palate, uvula, larynx, pharynx) are used and how the expelled air is manipulated by them.

Approximants, or semivowels, are speech sounds midway between a vowel and a consonant: the \w\ in “won”, the \l\ in “like”, the \r\ in “red”, and the \y\ in “yes.” In these phonemes, there is more constriction in the vocal tract than for the vowels, but less than the other consonant categories below.

Nasals are created when airflow is blocked completely at some point in the oral tract, but the simultaneous lowering of the velum allows a weak flow of energy to pass through the nose: \m\ as in “me”, \n\ as in “new”, and \ng\ as in “sing.”

Fricatives are weak or strong friction noises produced when the articulators are close enough together to cause turbulence in the airflow. Sub-classifications are: sibilants (produced through the teeth), \s\ and \z\; labiodental (produced with the teeth on the lip), \f\ and \v\; dental (produced with the tongue on the teeth), \th\ as in “thing” and \th\ as in “the”; and glottal (produced at the glottis, in the larynx), \h\. Retroflex fricatives are \sh\ as in “ship” and \zh\ as in “azure.”

Plosives. English has six bursts or explosive sounds produced by complete closure of the vocal tract followed by a rapid release of the closure: \p\, \t\, \k\, \b\, \d\, \g\. \P\ is considered bilabial (using both lips); \t\ is alveolar, produced by touching the tongue to the ridge just behind the upper teeth; and \k\ and \g\ are velar, produced with the back of the tongue against the soft palate.

Affricates. English has two affricates, plosives released with frication: the \ch\ sounds of “church” and the \j\ and \dge\ of “judge.”


* Strictly speaking, a more complex characteristic than the simple D/R ratio should be used. Better correlation with perceived intelligibility is obtained by using the ratio of the direct sound and early reflected energy to late reflected sound energy and reverberation. This may be termed C50 or C35 depending upon the split time used to delineate between the useful and deleterious sound arrivals.
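As a sketch of how such an early-to-late ratio is computed, the example below splits the energy of an impulse response at 50 ms and expresses the ratio in decibels. The impulse response is a synthetic, smooth exponential decay with no distinct direct sound or discrete reflections, and the sample rate and reverberation times are assumed values; measured responses would behave less tidily.

```python
import math


def clarity_db(impulse_response, sample_rate, split_ms=50.0):
    """Ratio (dB) of energy arriving before the split time to energy after it."""
    split = int(round(sample_rate * split_ms / 1000))
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    if late == 0:
        raise ValueError("no energy after the split time")
    return 10 * math.log10(early / late)


def synthetic_decay(rt60_s, sample_rate=1000):
    """Idealized impulse response whose level falls 60 dB in rt60_s seconds."""
    decay_rate = 3 * math.log(10) / rt60_s  # amplitude envelope exponent
    num_samples = int(3 * rt60_s * sample_rate)
    return [math.exp(-decay_rate * i / sample_rate) for i in range(num_samples)]


for rt60 in (0.5, 2.4):  # assumed reverberation times in seconds
    c50 = clarity_db(synthetic_decay(rt60), 1000)
    print(rt60, round(c50, 1))
```

A positive result means most of the energy arrives within the first 50 ms; the longer decay pushes the figure negative, since more of the energy arrives late. Changing `split_ms` to 35 gives the C35 variant.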


