How Acoustic Echo Cancelling Really Works
3/02/2006 5:26 AM Eastern
If you've ever been in an electronic meeting, class, or other activity involving multiple rooms, you've undoubtedly experienced acoustic echo — you hear your own voice coming back to you through the audio system, sometimes with significant delay. The problem can range from a simple annoyance to a full-fledged meeting killer. That's why controlling echo is such a major consideration in most conferencing applications.
Fortunately, electronic devices known as “acoustic echo cancellers,” or AECs, are readily available. Because there's some extremely complex electronic signal processing behind these devices, including proprietary algorithms used to compute speech models, understanding their operation can seem quite complicated. Let's break down how echo cancelling really works to help clear up any confusion.
Technical overview
There are many types of echo at play in AV systems, including in-room reverberation, telephone line echo, and acoustic echo.
“Reverberation” is caused by room acoustics and refers to sounds that bounce off hard surfaces in a room, creating multiple reflections of the original audio signal. The more audio reflections measured in a room, the more reverberant it is. Highly reverberant rooms can reduce speech intelligibility both in the local room and at the far-end sites, and they have a direct effect on acoustic echo.
Sometimes confused with acoustic echo, “telephone line echo” refers to reflections of speech via transmission lines, such as telephone lines, and is usually caused by impedance mismatches in the transmission line. The most common example of line echo is the sidetone you hear when you talk into a telephone handset. Your telephone intentionally adds some sidetone to give you feedback that the telephone is working properly. Line echoes are removed with a line echo canceller, or LEC, which uses digital signal processing (DSP) circuitry similar to an AEC to remove the reflected speech on the line. LECs are relatively simple devices because the nature of the echo isn't very complex. You'll typically find LECs in telephone equipment, including telephone hybrids used in conferencing applications.
“Acoustic echo” heard in the remote room is defined as acoustic pickup of the remote audio in the local room and the re-transmission of that audio back to the remote room. Commonly found in situations involving open microphones and loudspeakers, such as full-duplex speakerphones, meeting rooms, or classrooms, it only occurs when two or more sites are communicating. In order for acoustic echo to occur, a microphone must pick up audio from a loudspeaker and then send that audio back to its originating site. This can be direct (straight path) acoustic pickup or pickup after one or more reflections off hard surfaces. Even if there are no hard surfaces (i.e. a “dead” room), acoustic echo can still occur via straight path pickup.
As shown in Fig. 1, audio received from the remote end (Room A) appears on loudspeakers in the local room (Room B). This audio is then picked up by microphones in the room, sometimes after bouncing around the room several times, and is then retransmitted back to the “far” end (Room A). It's important to note that acoustic echo heard by the far end is caused by conditions in the “near” or “local” room, but it only affects the participants in the far room — the ones hearing their own voices echoed back to them. In real-world applications, each room in a conferencing network will create acoustic echo that can affect all of the other rooms.
Figure 1: Acoustic echo between two rooms.
An AEC uses digital signal processing to identify audio entering the acoustic space of a local room via codec, phone line, or other connection. Before reaching the room's loudspeakers, this audio is used as a reference signal in the AEC. The device then uses this information to subtract out signals that match this “reference” signal from the audio signal picked up by the microphones. The resulting signal is then transmitted back to the remote site. The amount of attenuation is measured in dB and is referred to either as echo return loss enhancement (ERLE) or simply as AEC performance.
The higher the number in decibels, the more echo is removed from the outbound signal. While this sounds relatively simple, in reality it's much more challenging. The AEC must be able to do the following: adequately remove the echo without adversely impacting the local talker's audio that is to be transmitted to the remote site; identify and then remove multiple acoustic reflections of the reference signal; identify and then remove delayed versions of the reference signal; and quickly “adapt” to a room and then re-adapt when the room conditions invariably change as people move around or volume levels change.
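To make the ERLE figure concrete, here is a minimal Python sketch of how it could be computed from two sample buffers: the echo as picked up by the microphone and the residual left after cancellation. The function name and the sample values are hypothetical and chosen only for illustration.

```python
import math

def erle_db(mic_echo, residual):
    """Echo return loss enhancement in dB: echo power before vs. after the AEC."""
    power_in = sum(s * s for s in mic_echo) / len(mic_echo)
    power_out = sum(s * s for s in residual) / len(residual)
    return 10.0 * math.log10(power_in / power_out)

# Hypothetical samples: the AEC leaves 1% of the echo amplitude behind.
echo = [0.5, -0.4, 0.3, -0.2]
residual = [0.01 * s for s in echo]
print(round(erle_db(echo, residual), 1))  # 40.0
```

A residual at 1% of the original amplitude is a 10,000-to-1 power ratio, which is where the 40 dB comes from.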
As you can see in Fig. 2, by placing an AEC in the local room (Room B), audio received from the far room (Room A) isn't included in the signal being sent back to Room A from Room B.
Several companies, including Polycom, ClearOne, Tandberg, Sony, Biamp, and Symetrix, manufacture AECs for conferencing applications. Some are standalone devices; others are incorporated into audio- and videoconferencing systems. However, AEC performance and features vary from manufacturer to manufacturer, and products should be carefully compared by the consultant or integrator to ensure the appropriate match to the end-user's requirements, which are determined by room size, number of microphones, and acoustic characteristics.
Most AECs operate by means of an adaptive digital filter, most commonly a least mean square (LMS) filter. This is a “self-learning” filter that uses one or more DSPs to continuously adapt its output based on varying input. Fig. 3 shows a simplified block diagram of an AEC. Audio entering the system via the codec or other device, along with local room “program sources” (such as a DVD player), is digitized and then sampled by the DSP. This sampled audio becomes the reference signal used by the LMS filter. Audio that is to be sent to the far-end location is also digitized and then compared to the reference signal. The filter analyzes the two signals, estimates the echo, and uses its estimate to filter out echoes of the reference signal, creating the final outbound signal. The filter continually repeats this process, improving the echo estimation with each repetition and eventually achieving the proper amount of ERLE for the room. However, because the reference signal (normally human speech) is always changing, the filter must continually update its echo estimate to “keep up” with the incoming audio.
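The adaptive loop described above can be sketched in a few lines of Python. This is a textbook LMS update under simplifying assumptions, not any manufacturer's algorithm: the room response, step size `mu`, tap count, noise level, and the use of random noise as a stand-in for far-end audio are all hypothetical, and no local talker is present.

```python
import math
import random

def lms_cancel(reference, mic, taps=8, mu=0.05):
    """Return the outbound signal after subtracting the estimated echo."""
    w = [0.0] * taps          # current estimate of the room's echo path
    history = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, history))
        e = d - echo_estimate  # residual that would be sent to the far end
        # LMS update: nudge each tap in the direction that reduces the error.
        w = [wi + mu * e * xi for wi, xi in zip(w, history)]
        out.append(e)
    return out

def power(signal):
    return sum(s * s for s in signal) / len(signal)

# Hypothetical room: a direct path plus two weaker, later reflections.
room = [0.0, 0.6, 0.0, 0.0, 0.3, 0.0, -0.15, 0.0]
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo = [sum(room[k] * reference[n - k] for k in range(len(room)) if n >= k)
        for n in range(len(reference))]
mic = [s + rng.gauss(0.0, 0.001) for s in echo]  # echo plus faint room noise

outbound = lms_cancel(reference, mic)
early = 10 * math.log10(power(mic[:200]) / power(outbound[:200]))
late = 10 * math.log10(power(mic[-200:]) / power(outbound[-200:]))
print(f"ERLE over first 200 samples: {early:.1f} dB, last 200: {late:.1f} dB")
```

The attenuation over the last samples should be far higher than over the first, which is the "improving with each repetition" behavior the text describes. Real speech is much harder to adapt on than the noise used here.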
Once audio leaves the loudspeaker(s), it can be directly picked up by the microphone(s) or bounced around the room several times via walls and other hard surfaces before being picked up. Because sound travels so much more slowly than electronic signals (which essentially travel at the speed of light), direct and bounced signals will arrive at the microphone(s) at different times. As a result, the reference (incoming) audio may enter the LMS filter many milliseconds before the filter “sees” the same audio coming back. Therefore, echo cancellers provide a specification of “tail time,” which is the length of time the reference signal will be compared with the signal received via microphones and used for the echo estimation. If an AEC has a “tail time” shorter than the length of the longest echo, echoes will still be heard at the originating site.
A good formula for calculating the tail time needed by the AEC is:
T = (N + 1) x d / c
where:
- T is the tail length of the echo canceller in seconds (multiply by 1,000 for milliseconds)
- N is the number of reflections cancelled
- d is the longest distance between walls in meters or feet
- c is the speed of sound (343 meters per second or 1,125 feet per second at room temperature), in units matching d.
This equation assumes that both the microphone and the speaker are mounted on the same surface (which will give you the worst case in terms of the number of reflections that will need to be cancelled). In that case, N must be an odd integer because the even reflections travel away from the microphone.
For example, consider a 10- by 20- by 30-foot conference room with very reflective surfaces that requires five echoes to be cancelled. In such a room, a tail time of 6 x 30 / 1,125 = 160 ms would be needed. Fig. 4 shows how those reflections travel across the room.
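The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and the 16 kHz sample rate used to convert the tail time into a filter length is an assumption for illustration, not a figure from any product.

```python
SPEED_OF_SOUND_FT_PER_S = 1125.0  # at room temperature, per the formula above

def tail_time_ms(reflections, longest_dimension, speed_of_sound=SPEED_OF_SOUND_FT_PER_S):
    """Tail time in milliseconds: T = (N + 1) x d / c, converted from seconds."""
    return 1000.0 * (reflections + 1) * longest_dimension / speed_of_sound

def filter_taps(tail_ms, sample_rate_hz):
    """Number of filter taps needed to span the tail at a given sample rate."""
    return round(tail_ms * sample_rate_hz / 1000.0)

tail = tail_time_ms(reflections=5, longest_dimension=30)
print(tail)                      # 160.0
print(filter_taps(tail, 16000))  # 2560 taps at an assumed 16 kHz sample rate
```

The second figure hints at why long tail times are costly: every extra millisecond of tail adds filter taps that have to be updated on every sample.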
In basic terms, the more reverberant the room, the more times the audio will bounce around before being picked up by a microphone. This directly affects tail time as the AEC must be able to identify these multiple reflections of the same signal.
Another major consideration in choosing an AEC is its “convergence rate,” or the speed at which the AEC adapts to room conditions and effectively removes echo. When first powered on, an adaptive filter has no information on the echoes that will be occurring in the room. Each time the filter runs through its process of comparing the reference signal with the signal appearing at the microphones, it learns more about the echo and is able to subtract more of the echo from the signal. The AEC is fully converged when the error at estimating echo (also known as prediction error) falls below a certain threshold. Convergence rates are measured in decibels per second. A convergence rate of 40 dB/second means the AEC will attenuate echoes by at least 40 dB within one second of receiving a new reference signal.
AEC convergence rates can often be measured using test tones or noise signals. Because a test tone is a steady-state signal, the AEC is able to converge very quickly using this measurement method. However, some systems are able to detect these test tones and simply attenuate the test signals rather than adapting, which makes convergence appear faster than it really is. A far more accurate convergence rate specification would be one determined under real-world conditions, i.e., with human speech. The real-world convergence rate is usually half that of the convergence to a steady-state signal. In other words, an echo canceller that converges at 40 dB/second on a steady-state tone may only converge at 20 dB/second on speech.
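As a rough illustration of what these figures mean in practice, the Python sketch below estimates how long an AEC would take to reach a target attenuation. It assumes attenuation grows linearly in dB over time, which is a simplification, and the function name and the 40 dB target are hypothetical.

```python
def seconds_to_converge(target_erle_db, rate_db_per_s):
    """Time to reach a target attenuation, assuming a constant dB/second rate."""
    return target_erle_db / rate_db_per_s

tone_rate = 40.0              # dB/second, as specified on a steady-state tone
speech_rate = tone_rate / 2   # rule of thumb from the text: about half on speech

print(seconds_to_converge(40.0, tone_rate))    # 1.0
print(seconds_to_converge(40.0, speech_rate))  # 2.0
```

By this estimate, a canceller specified on test tones could take twice as long to settle on real speech as its data sheet suggests.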
When DSP-based AECs were first introduced in the late 1980s, they used a burst of noise — sometimes quite lengthy — to give the LMS filter a jump-start in modeling the room (also known as “training” the AEC). Because the noise was completely predictable, assessing the number of echoes and their various delays was relatively easy. This method of converging to the room had two problems. First, it was extremely annoying to anyone who happened to be in the room when this “training” was taking place, so people would stay out of the room until the noise went away. The second problem occurred when people entered the room — the acoustic characteristics suddenly changed with the introduction of the bodies to the room. The AEC was no longer trained for the room's conditions. Users would find it necessary to “re-train” the system periodically during a meeting.
As technology progressed, and DSP algorithms improved, the training noise was eliminated by most manufacturers. Today, most AECs ramp up the echo cancellation by adapting on the remote talkers' speech over a few seconds' time. AECs may also use some level of attenuation on signals (approaching half duplex operation) to keep echoes at a minimum until the echo canceller converges well. AECs maintain their filter information from the prior time they were used, so echo is scarcely noticeable at the start of a meeting unless room conditions change considerably.
Manufacturers have also had to address another real-world problem in keeping the AEC converged. Participants in a meeting just won't sit still! As people get up and move around the room, the echo paths change. Therefore, the AEC can't be content with converging just one time — it must continually converge to keep bursts of echo from interfering with the meeting.
The number of microphones in the room can also present a challenge to the AEC. Multiple microphones, normally used in conjunction with automatic microphone mixers, provide better pickup of talkers and help control the overall sound of the room via gating and other functions. However, they also pick up audio reflections at different times. If the room is large enough and reverberant enough, and if the mix of all of the microphones is being fed into a single AEC, those multiple audio reflections may not be recognized by the AEC as matching the reference signal. As a result, some residual echo will be heard at the remote location.
Two manufacturers, Polycom and ClearOne, have resolved this situation by providing automatic microphone mixers that incorporate discrete AEC circuits on each of multiple microphone channels (up to 64). Biamp has two-channel AEC modules that can be added to its mixing system, and Symetrix offers a module for its DSP system that allows the user to incorporate up to eight channels of AEC. Individual channel echo cancelling has been demonstrated to provide more consistent performance in acoustically harsh conditions. However, for systems using fewer than four microphones or systems placed in rooms that have excellent acoustical treatment, a single-channel AEC behind an automatic microphone mixer can often provide satisfactory performance.
Figure 2: Acoustic echo cancellation between rooms.
When selecting an AEC for an installation, it's helpful to know how it performs under doubletalk conditions (two or more sites talking at the same time), how it deals with background noise in the room, and how much acoustical gain the system will handle before it breaks into feedback.
Some AECs don't deal well with doubletalk. An AEC can be in one of four different states, depending on which party is talking:
- Local party quiet, remote party quiet -> idle
- Local party quiet, remote party talking -> receive
- Local party talking, remote party quiet -> transmit
- Local party talking, remote party talking -> doubletalk
The most challenging state for the AEC is the doubletalk state, where both the local and remote parties are talking and trying to be heard. If the AEC isn't converged, there's the possibility of echo being sent back to the remote talkers. If the AEC compensates for this residual echo by attenuating the local audio to eliminate the echo, then the local talkers may not be heard by the remote talkers. There's usually a tradeoff between the amount of full-duplex operation desired and the amount of residual echo (perceived loudness and duration) that's acceptable.
Most echo cancellers only adapt while in receive mode — when just the remote talkers are talking, and the local talkers are quiet. This allows the AEC to adapt very quickly.
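The four states and the adapt-only-in-receive rule can be sketched as a small Python state classifier. This assumes the hard part, deciding whether each party is actually talking, has already been done and is supplied as two flags. All names here are hypothetical.

```python
from enum import Enum

class AecState(Enum):
    IDLE = "idle"
    RECEIVE = "receive"
    TRANSMIT = "transmit"
    DOUBLETALK = "doubletalk"

def classify(local_talking, remote_talking):
    """Map the two talker-activity flags onto the four AEC states."""
    if local_talking and remote_talking:
        return AecState.DOUBLETALK
    if remote_talking:
        return AecState.RECEIVE
    if local_talking:
        return AecState.TRANSMIT
    return AecState.IDLE

def should_adapt(state):
    """Adapt only when the remote party alone is talking."""
    return state is AecState.RECEIVE

for local, remote in [(False, False), (False, True), (True, False), (True, True)]:
    state = classify(local, remote)
    print(local, remote, state.value, should_adapt(state))
```

In a real product the activity flags are themselves estimates, because the microphone signal always contains some echo of the remote talker.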
While the four states above accurately describe the state of the echo canceller, it's important that the DSP algorithms accurately determine which state they're in. Choosing the wrong state could cause the echo canceller to diverge (increasing the echo to the remote parties) or to attenuate audio that should be sent to the remote parties (degrading the performance of the system). For best performance, choose an AEC that doesn't drop into half-duplex mode during doubletalk.
Another factor that affects AEC performance is the level of background noise in the room. An AEC needs to have a reasonably high signal-to-noise ratio (SNR) to function correctly. In order for speech to be intelligible from site to site, signals must be at least 25 dB louder than noise. There are many sources of noise in a meeting room, but the most notorious are HVAC systems and computer fans. If a microphone is placed near either of these noise sources, the noise will overpower any other audio that might be picked up by the microphone.
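The 25 dB guideline can be checked with a short Python sketch that compares speech and noise levels at a microphone. The RMS values are hypothetical linear amplitudes, and the function names are illustrative.

```python
import math

MIN_SNR_DB = 25.0  # minimum speech-to-noise margin cited in the text

def snr_db(speech_rms, noise_rms):
    """Signal-to-noise ratio in dB from RMS amplitudes."""
    return 20.0 * math.log10(speech_rms / noise_rms)

def meets_guideline(speech_rms, noise_rms):
    return snr_db(speech_rms, noise_rms) >= MIN_SNR_DB

# Hypothetical levels: the same talker with a quiet and a noisier HVAC system.
print(round(snr_db(0.2, 0.01), 1), meets_guideline(0.2, 0.01))
print(round(snr_db(0.2, 0.02), 1), meets_guideline(0.2, 0.02))
```

In this example, doubling the noise amplitude costs about 6 dB of margin, enough to drop a passing room below the guideline.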
Figure 3: A simplified block diagram of an AEC.
In order to improve the quality of audio being sent from room to room (and improve AEC performance), several AEC manufacturers are now incorporating noise reduction algorithms (also called noise cancellation) in their products. These algorithms identify and significantly reduce steady-state signals that are picked up by microphones. Because noise reduction techniques vary from manufacturer to manufacturer, it's important to make sure that the noise reduction algorithm operates cleanly. It must not remove or distort the desired signal (audio from the local room talkers). It must also remove noise during idle periods in speech as well as active speech with no audible transitions between those states. A noise reducer that works well can make a huge difference in audio quality. Unfortunately, because these algorithms remove noise picked up by the microphones in the room so the noisy signal isn't transmitted to the remote party, the noise reducer does nothing to help the people in the local room who are stuck with the noise sources.
In noisy rooms, or in rooms where participants may have difficulty hearing, it may be necessary to increase the volume of the sound system so that local participants will be able to hear far sites more easily. Higher gain in the sound system makes conditions much more likely for acoustic coupling of loudspeakers and microphones, and can also make it more difficult for the AEC to perform, especially if the received audio (echoes) is “louder” at the AEC than local speech. If the gain becomes too high, the AEC will be overpowered. The state machine may make poor decisions, such as thinking the system is in transmit mode when it should be in receive mode, and the system will allow acoustic echo to be sent to the remote participants. Some AECs are designed with a high gain-before-feedback specification. A higher acoustic gain specification on the AEC will make system design easier (loudspeaker-microphone placement will not be as critical) and will make the system more robust in real-world conditions.
Figure 4: Longest reflection path from speaker to microphone for five reflections.
Elaine Jones is principal of Elaine Jones Associates, a marketing/PR firm based in Salt Lake City. She has more than 20 years of experience working with companies that manufacture acoustic echo cancellation products. She can be reached at email@example.com.