Publish date: Sep 1, 1998 12:00 PM, David E. Blackmer

There is much controversy about how we might move forward towards higher-quality reproduction of sound. The compact disc standard assumes that there is no useful information beyond 20 kHz and therefore includes a brick-wall filter just above 20 kHz. Many listeners hear a great difference when audio signals band-limited to 20 kHz are compared to wideband signals. A number of digital systems that sample audio signals at 96 kHz and above with up to 24 bits of quantization have been proposed.

Many engineers have been trained to believe that human hearing receives no meaningful input from frequency components above 20 kHz. I have read many irate letters from such engineers who insist that information above 20 kHz is clearly useless, and that any attempt to include such information in studio signals is deceptive, wasteful and foolish. They assert further that any right-minded audio engineer should know that 20 kHz has been acknowledged as an absolute limitation for decades. Those of us who are convinced that there is critically important audio information to at least 40 kHz are viewed as misguided.

So what's going on? We must look at the mechanisms involved in hearing and attempt to understand them. Through that understanding we can then develop a model of the incredible capabilities of the transduction and analysis systems in human audition and thereby work towards better standards for audio system design.

When viewed from an evolutionary standpoint, human hearing has become what it is because it is a survival tool. The human auditory sense is effective at extracting every possible detail from the world around us so that we and our ancestors might avoid danger, find food, communicate, enjoy the sounds of nature and appreciate the beauty of music. Human hearing is generally, I believe, misunderstood to be primarily a frequency-analysis system. The prevalent model of human hearing presumes that auditory perception is based on the brain's interpretation of the outputs of a frequency-analysis system, which is essentially a wide dynamic range comb filter wherein the intensity of each frequency component is transmitted to the brain. This comb filter is certainly an important part of our sound-analysis system. Each frequency zone is tuned sharply with a negative mechanical resistance system. Furthermore, the tuning Q of each filter element is adjusted in accordance with commands sent back to the cochlea by a series of pre-analysis centers (the cochlear nuclei) near the brain stem. A number of fast transmission rate nerve fibers connect the output of each hair cell to these cochlear nuclei. The human ability to interpret frequency information is amazing; however, clearly something is going on that cannot be explained entirely in terms of our ability to hear tones.

What started me on my quest to understand the capabilities of human hearing beyond 20 kHz was an incident in the late 1980s. I had just acquired an MLSSA system and was comparing the sound and response of a group of high-quality dome tweeters. The best of these had virtually identical frequency response to 20 kHz, yet they sounded different. When I looked closely at their response beyond 20 kHz they were visibly different. The metal dome tweeters had an irregular picket fence of peaks and valleys in their amplitude response above 20 kHz. The silk dome tweeters exhibited a smooth fall off above 20 kHz. The metal dome sounded harsh compared to the silk dome. Admittedly, I cannot hear tones even to 20 kHz, yet the difference was audible. Rather than denying what I clearly heard, I started looking for other explanations, and I have found surprising information hidden in the literature about human hearing.

The inner ear is a complex device with incredible details in its construction. Acoustical pressure waves are converted into nerve pulses in the inner ear, specifically in the cochlea, which is a liquid-filled spiral tube. The acoustic signal is received by the tympanic membrane where it is converted to mechanical forces that are transmitted to the oval window and then into the cochlea where the pressure waves pass along the basilar membrane. This basilar membrane is an acoustically active transmission device. Along the basilar membrane are rows of two different types of hair cells, usually referred to as inner and outer. The inner hair cells clearly relate to the frequency-analysis system described above. Only about 3,000 of the 15,000 hair cells on the basilar membrane are involved in transducing frequency information using the outputs of this travelling wave filter.

The outer hair cells clearly do something else, but what? There are about 12,000 outer hair cells arranged in three or four rows, four times as many as inner hair cells. Only about 20% of the total available nerve paths, however, connect them to the brain. Outer hair cells are interconnected by nerve fibers in a distributed network. This array seems to act as a waveform analyzer, a low-frequency transducer and a command center for the fast muscle fibers (actin) that amplify and sharpen the travelling waves passing along the basilar membrane, thereby producing the comb filter. It also has the ability to extract information and transmit it to the analysis centers in the olivary complex and then on to the cortex of the brain where conscious awareness of sonic patterns takes place. The information from the outer hair cells, which seems to be more related to waveform than frequency, is certainly correlated with the frequency domain and other information in the brain to produce the auditory sense.

Our auditory analysis system is extraordinarily sensitive to boundaries (any significant initial or final event or point of change). One result of this boundary-detection process is the heightened awareness of the initial sound in a complex series of sounds such as a reverberant sound field. This initial sound component is responsible for most of our sense of content, meaning and frequency balance in a complex signal. The human auditory system is evidently sensitive to impulse information embedded in the tones. My suspicion is that this sense is behind what is commonly referred to as "air" in the high-end literature. It probably also relates to what we think of as texture and timbre, that which gives each sound its distinctive individual character. Whatever we call it, impulse information is an important part of how we hear.

All output signals from the cochlea are transmitted on nerve fibers as pulse rate and pulse position modulated signals. These signals are used to transduce information about frequency, intensity, waveform, rate of change and time. The lower frequencies are transduced to nerve impulses in the auditory system in a surprising way. Hair cell output for the lower frequencies is transmitted primarily as groups of pulses that correspond strongly to the positive half of the acoustic pressure wave with few if any pulses being transmitted during the negative half of the pressure wave. Effectively, these nerve fibers transmit on the positive half wave only. This situation exists up to somewhat above 1 kHz with discernible half-wave peaks riding on top of the auditory nerve signal being clearly visible to at least 5 kHz. There is a sharp boundary at the beginning and end of each positive pressure pulse group, approximately at the central axis of the pressure wave. This pulse-group transduction with sharp boundaries at the axis is one of the important mechanisms that accounts for the time resolution of the human ear. In 1929, von Békésy published a measurement of human sound position acuity, which translates to a time resolution of better than 10 µs between the ears. Nordmark, in a 1976 article, concluded that the interaural resolution is better than 2 µs; interaural time resolution at 250 Hz is said to be about 10 µs, which translates to better than 1 degree of phase at this frequency.
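
The conversion between interaural time and phase is a line of arithmetic. The sketch below assumes the 250 Hz figure is 10 microseconds, the reading that agrees with the stated "better than 1 degree"; the function name is illustrative only.

```python
def delay_to_phase_degrees(delay_s: float, freq_hz: float) -> float:
    """Phase shift, in degrees, that a time delay represents at one frequency."""
    # One full period (1 / freq_hz seconds) corresponds to 360 degrees.
    return 360.0 * freq_hz * delay_s


# Assumed values: 10 microseconds of interaural delay at 250 Hz.
phase = delay_to_phase_degrees(10e-6, 250.0)
print(f"10 us at 250 Hz is {phase:.2f} degrees of phase")
```

The period at 250 Hz is 4 ms, so 10 microseconds is 1/400 of a cycle, or 0.9 degrees.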

The human hearing system uses waveform and frequency to analyze signals. It is important to maintain accurate waveform up to the highest frequency region with accurate reproduction of details down to 5 µs to 10 µs. The accuracy of low-frequency details is equally important. We find many low-frequency sounds, such as drums, take on a remarkable strength and emotional impact when waveform is exactly reproduced. Please notice the exceptional drum sounds on the Dead Can Dance album Into the Labyrinth. The drum sound seems to have a very low fundamental, maybe about 20 Hz. We sampled the bitstream from this sound and found that the first positive waveform had twice the period of the subsequent 40 Hz waveform. Apparently, one half cycle of 20 Hz was enough to cause the entire sound to seem to have a 20 Hz fundamental.

The human auditory system, both inner and outer hair cells, can analyze hundreds of nearly simultaneous sound components, identifying the source location, frequency, time, intensity and transient events in each of these sounds simultaneously, and it can spatially map these sounds with awareness of each sound source, its position, character, timbre, loudness and all other identification labels that we can attach to sonic sources and events. I believe that this sound quality information includes waveform, embedded transient identification and high-frequency component identification to at least 40 kHz (even if you cannot hear these frequencies in isolated form). To meet the requirements of human auditory perception, a sound system must cover the frequency range of about 15 Hz to at least 40 kHz (some say 80 kHz or more) with more than 120 dB dynamic range to handle transient peaks properly, with a transient time accuracy of a few microseconds at high frequencies and with 1 degree or 2 degrees of phase accuracy down to 30 Hz. This standard is beyond the capabilities of modern systems, but it is important that we understand the degradation of perceived sound quality resulting from compromises made in today's sound-delivery systems. The transducers are the most obvious problem areas, but the storage systems and all the electronics and interconnections are important, too.
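
As a rough cross-check of these targets against digital formats, the sketch below uses two textbook relations: the sampling theorem (sample rate of at least twice the bandwidth) and the signal-to-noise ratio of an ideal uniform quantizer driven by a full-scale sine, 6.02 N + 1.76 dB for N bits. These are idealizations that ignore dither, filter roll-off and converter noise, and the function names are my own.

```python
import math


def ideal_quantizer_snr_db(bits: int) -> float:
    """SNR of an ideal N-bit uniform quantizer with a full-scale sine input."""
    return 6.02 * bits + 1.76


def min_bits_for_dynamic_range(range_db: float) -> int:
    """Smallest word length whose ideal SNR reaches the requested range."""
    return math.ceil((range_db - 1.76) / 6.02)


def min_sample_rate_hz(bandwidth_hz: float) -> float:
    """Theoretical minimum sample rate for a given audio bandwidth."""
    return 2.0 * bandwidth_hz


print(f"16-bit ideal SNR: {ideal_quantizer_snr_db(16):.2f} dB")
print(f"24-bit ideal SNR: {ideal_quantizer_snr_db(24):.2f} dB")
print(f"bits needed for 120 dB: {min_bits_for_dynamic_range(120.0)}")
print(f"minimum rate for 40 kHz bandwidth: {min_sample_rate_hz(40_000.0):.0f} Hz")
```

Under these assumptions a 120 dB range needs at least 20 bits and a 40 kHz bandwidth needs a sample rate above 80 kHz, before any margin for practical filters.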

Mics are the first link in the audio chain, translating the pressure waves in the air into electrical signals. Many of today's mics are not accurate, and few have accurate frequency response over the entire 15 Hz to 40 kHz range. In most mics, the active acoustic device is a diaphragm that receives the acoustical waves, and like a drum head, it will ring when struck. To make matters worse, the pickup capsule is usually housed in a cage with many internal resonances and reflections, further coloring the sound. Directional mics, because they achieve directionality by sampling the sound at multiple points, are by nature less accurate than omnidirectional mics. The ringing, reflections and multiple paths to the diaphragm add up to excess phase. These mics smear the signal in the time domain.
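
The effect of a second path to the diaphragm can be illustrated with a toy impulse response. The sketch below is not a model of any real mic: it assumes a 96 kHz sample rate and a single reflection at half amplitude arriving four samples (about 42 microseconds) after the direct sound, then takes a plain discrete Fourier transform to obtain the amplitude response.

```python
import cmath
import math


def magnitude_response_db(impulse, n_fft):
    """Magnitude in dB of the DFT of an impulse response, bins 0..n_fft/2."""
    padded = list(impulse) + [0.0] * (n_fft - len(impulse))
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        # Guard against log of zero at a perfect null.
        mags.append(20.0 * math.log10(max(abs(acc), 1e-12)))
    return mags


SAMPLE_RATE = 96_000               # Hz (assumed)
N_FFT = 96                         # gives 1 kHz per frequency bin
BIN_HZ = SAMPLE_RATE // N_FFT

direct_only = [1.0]                          # ideal single-path pickup
with_reflection = [1.0, 0.0, 0.0, 0.0, 0.5]  # extra path, 4 samples late

flat = magnitude_response_db(direct_only, N_FFT)
combed = magnitude_response_db(with_reflection, N_FFT)

print(f"single path ripple: {max(flat) - min(flat):.1f} dB")
print(f"two path ripple:    {max(combed) - min(combed):.1f} dB")
print(f"level at {12 * BIN_HZ} Hz: {combed[12]:.1f} dB")
```

With these assumed values the single-path response is flat, while the second path produces a comb of peaks and notches with the first notch at 12 kHz and about 9.5 dB between peak and notch. A longer impulse response and an uneven amplitude response are two views of the same defect.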

At Earthworks, we have learned after many measurements and careful listening that the true impulse response of mics is a better indicator of sound quality than frequency amplitude response. Mics with long and asymmetrical impulse performance will be more colored than those with short impulse tails. To illustrate this point, we have carefully recorded a variety of sources using two different omnidirectional mics (Earthworks QTC1 and another well-known model), both of which have flat frequency response to 40 kHz within +/-1 dB. (See Figure 1.) When played back on high-quality loudspeakers, the sounds of these two mics are quite different. When played back on loudspeakers with nearly perfect impulse and step response, the sounds of the two mics vary even more widely. The only meaningful and identifiable difference between these two mics is their impulse response. We have developed a system for deriving a mic's frequency response from its impulse response. After numerous comparisons between the results of our impulse conversion and the results of the more common substitution method, we are convinced of the validity of this as a primary standard. You will see several examples of this in Figure 2. Viewing the waveform as impulse response is better for interpreting high-frequency information. Low-frequency information is more easily understood from inspecting the step function response, which is the mathematical integral of impulse response. Both curves contain all information about frequency and time response within the limits imposed by the time window, the sampling processes and noise.

The electronics in high-quality sound systems must also be exceptional. Distortion and transient intermodulation should be held to a few parts per million in each amplification stage, especially in systems with many amps in each chain.
In the internal circuit design of audio amps, it is especially important to separate the signal reference point in each stage from the power supply return currents, which are usually terribly nonlinear. Difference input circuits on each stage should extract the true signal from the previous stage in the amp. Any overall feedback must reference from the output terminals and compare directly to the input terminals to prevent admixture of ground grunge and crosstalk with the signal. Failure to observe these rules results in "transistor sound." Transistors can be used in a manner resulting in arbitrarily low distortion, intermodulation, power supply noise coupling and whatever other errors we can name, and can therefore deliver perceptual perfection in audio signal amplification. (I use perceptual perfection to mean a system or component so excellent that it has no error perceptible to the best human hearing.) My current design objective on amps is to have all harmonic distortion, including 19 kHz and 20 kHz twin-tone intermodulation products, below one part per million and to have a weighted noise at least 130 dB below maximum sine wave output. I assume that a signal can go through many such amps in a system with no detectable degradation in signal quality.
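
For readers more used to decibels or percent, the parts-per-million targets convert as follows. This is a minimal sketch that assumes "parts per million" is an amplitude ratio relative to the signal, and that distortion from cascaded stages adds in phase (a pessimistic worst case). The function names are illustrative.

```python
import math


def ppm_to_db(ppm: float) -> float:
    """Amplitude ratio in parts per million expressed in dB re the signal."""
    return 20.0 * math.log10(ppm * 1e-6)


def ppm_to_percent(ppm: float) -> float:
    """Amplitude ratio in parts per million expressed as a percentage."""
    return ppm * 1e-4


def worst_case_chain_ppm(stage_ppm: float, stages: int) -> float:
    """Total distortion if every stage's error adds in phase (assumption)."""
    return stage_ppm * stages


print(f"1 ppm = {ppm_to_db(1.0):.0f} dB = {ppm_to_percent(1.0):.4f} %")
chain = worst_case_chain_ppm(1.0, 10)
print(f"10 stages at 1 ppm, worst case: {chain:.0f} ppm = {ppm_to_db(chain):.0f} dB")
```

One part per million is 120 dB below the signal, or 0.0001%, so even ten such stages adding in the worst way stay near 100 dB down.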

Many audio signal sources have extremely high transient peaks, often as high as 20 dB above the level read on a volume indicator. It is important to have some adequate measurement tool in an audio amplification system to measure peaks and to determine that they are being handled appropriately. Many of the available peak-reading meters do not read true instantaneous peak levels but respond to something closer to a 300 µs to 1 ms averaged peak approximation. All system components, including power amps and loudspeakers, should be designed to reproduce the original peaks accurately. Recording systems truncate peaks beyond their capability. Analog tape recorders often have a smooth compression of peaks, which is often regarded as less damaging to the sound. Many recordists even like this peak clipping and use it intentionally. Most digital recorders have a brick-wall effect in which any excess peaks are squared off with disastrous effects on tweeters and listeners' ears.
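
How far an averaging meter can under-read a short transient is easy to estimate. The sketch below is a toy, not a model of any particular meter: it assumes a 100 kHz sample rate, a steady signal at 0.1 of full scale, a 100 microsecond burst at full scale, and simple rectangular averaging windows of 300 microseconds and 1 millisecond.

```python
import math

SAMPLE_RATE = 100_000  # Hz (assumed), so one sample lasts 10 microseconds


def true_peak(samples):
    """Largest instantaneous absolute sample value."""
    return max(abs(s) for s in samples)


def averaged_peak(samples, window):
    """Largest mean of |x| over any run of `window` consecutive samples."""
    rectified = [abs(s) for s in samples]
    running = sum(rectified[:window])
    best = running
    for i in range(window, len(rectified)):
        running += rectified[i] - rectified[i - window]  # slide the window
        best = max(best, running)
    return best / window


def under_read_db(samples, window_s):
    """How many dB the averaged reading falls below the true peak."""
    window = round(window_s * SAMPLE_RATE)
    return 20.0 * math.log10(true_peak(samples) / averaged_peak(samples, window))


# Steady level of 0.1 with a 10-sample (100 microsecond) full-scale burst.
signal = [0.1] * 500 + [1.0] * 10 + [0.1] * 500

for window_s in (300e-6, 1e-3):
    print(f"{window_s * 1e6:.0f} us window under-reads by "
          f"{under_read_db(signal, window_s):.1f} dB")
```

For this assumed burst the 300 microsecond window reads about 8 dB low and the 1 ms window about 14 dB low, which is the kind of gap that lets clipping go unnoticed.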

Compressors and limiters are often used to reduce peaks that would otherwise be beyond the capability of the system. Such units with RMS level detectors usually sound better than those with average or quasi-peak detectors. Also, be careful to select signal processors for low distortion. If they are well designed, distortion will be low when no gain change is required. Distortion during compression will be almost entirely third harmonic distortion, which is not easily detected by the ear and is usually reasonably acceptable when audible. A look at the specifications of some of the highly rated, super-high-end, no-feedback, vacuum-tube power amps reveals how much distortion is acceptable (or even preferable) to some well-heeled audiophiles. All connections between different parts of the electrical system must be designed to eliminate noise and signal errors due to power line ground currents, AC magnetic fields, RF pickup, crosstalk and dielectric absorption effects of poor wire insulation. This is critical.

Loudspeakers are the other end of the audio system in that they convert the electrical signal back into pressure waves in the air. Loudspeakers are usually less accurate than mics. Too many of our common sound systems are below the capabilities of today's technology. Listen to cinema sound, for example. Enormous improvement has been made in the delivery of high-quality digital sound to the theater, but cinema loudspeakers are almost always horn loudspeakers. Horn loudspeakers can be quite good except when they must also possess constant directivity, which is usually achieved by adding a sharp discontinuity in the internal horn profile. Such loudspeakers often have so many nearly equal level internal reflections that no current DSP system could adequately correct their sound.

To make matters worse, the powers that be have decreed that any good theater must be equalized, which all too often means placing one or more test mics at ear level in representative seats in the auditorium and adjusting for flat response with a 1/3-octave EQ and matching analyzer. So what's wrong with this? When we listen in a reverberant space, we selectively give a strong preference to first arrival in judging sound quality. The reverberant sound field arrives later and is perceived to be less important. It is therefore beneficial to achieve good frequency response in first arrival sound. Errors in reverberant sound are accepted as normal unless there are bad standing waves in the listening space, which must be corrected with appropriate physical acoustic treatment. You cannot solve room acoustics problems with filters. Time-windowed analyzers, such as MLSSA, SMAART or TEF, should be used for all tuning of listening spaces. If a time-windowed analyzer is not available, the measurement mic should be placed nearer to the active loudspeaker, optimally between one third and one half the distance from the loudspeaker to the mid-audience seat chosen as a reference point so that the direct loudspeaker sound dominates in the mic input signal.
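
Two bits of arithmetic sit behind this advice. The sketch below assumes an idealized point source in free field (direct level falls 6 dB per doubling of distance), a reverberant level that is roughly the same everywhere in the room, and a speed of sound of 343 m/s. It describes the general idea only, not the algorithm of any particular analyzer, and the distances are made-up examples.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, roughly at room temperature (assumed)


def direct_gain_db(reference_distance_m, mic_distance_m):
    """Rise in direct-sound level when the mic moves closer (inverse-distance law)."""
    return 20.0 * math.log10(reference_distance_m / mic_distance_m)


def reflection_free_window_s(direct_path_m, reflected_path_m):
    """Time between the direct arrival and the first reflection."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_SOUND


def lowest_resolved_frequency_hz(window_s):
    """Approximate lower limit of a measurement gated to `window_s` seconds."""
    return 1.0 / window_s


seat = 12.0  # metres from loudspeaker to the reference seat (example value)
for fraction in (0.5, 1.0 / 3.0):
    gain = direct_gain_db(seat, seat * fraction)
    print(f"mic at {fraction:.2f} of the distance: direct sound up {gain:.1f} dB")

window = reflection_free_window_s(4.0, 7.43)  # example path lengths in metres
print(f"window {window * 1e3:.1f} ms, usable down to about "
      f"{lowest_resolved_frequency_hz(window):.0f} Hz")
```

Under these assumptions, moving the mic to one half or one third of the distance raises the direct sound by about 6 dB or 9.5 dB relative to the reverberant field, and a 10 ms reflection-free window limits a gated measurement to roughly 100 Hz and above.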

There has been a dramatic improvement in the tools available to measure systems and components. Many manufacturers have improved their amps, recorders and transducers. Those of us involved in equipment design and/or sound system design must learn both to recognize and reward excellence from our suppliers and to deliver excellent sound to our customers. I suspect that truly excellent sound, perhaps even perceptual perfection, especially in large spaces, must await the development of a high-accuracy, high-power, direct-radiating 40 kHz tweeter system with inherently good impulse response integrated into a system that gives good impulse and step function response over the entire listening area.

I have heard that the Victor Talking Machine Company ran ads in the 1920s in which Enrico Caruso was quoted as saying that the Victrola was so good that its sound was indistinguishable from his own voice live. In the 1970s, Acoustic Research ran similar ads (with considerably more justification) about live vs. recorded string quartets. We have come a long way since then, but can we achieve perceptual perfection? As a point of reference, you should assemble a test system with both mics and loudspeakers having excellent impulse and step response, hence nearly perfect frequency response, together with low-distortion amps. Test it as a sound-reinforcement system and/or studio monitoring system with both voice and music sources. You, the performers and the audience will be amazed at the result.