Audience dual-microphone processor
Speech Strategy News, February 2010

Audience dual-microphone speech processor used in Google Nexus phone
Audience announces new version and adoption by AT&T for mobile phones

A common objection to the use of speech recognition in mobile phones is that there are too many situations in which one can't speak, whether because of the setting or because of background noise. The usability of speech can be expanded to more situations with directionality and noise suppression, for example by using dual microphones that selectively emphasize speech coming from one direction (or the dominant voice). Google appears to have made a commitment to speech as a major feature of its mobile phones.

A feature seldom noted in articles on the Google Nexus mobile phone is its dual microphone. Beyond that, according to iSuppli, which took the phone apart to study its parts, the phone uses a noise suppression chip from Audience (SSN, March 2007, p. 34). iSuppli also noted the dual microphone in the Motorola Droid, based on the Android platform, but apparently did not find an Audience chip.

On January 26, Audience announced it would provide noise suppression solutions for AT&T wireless devices. The Audience technology will be introduced across a selection of AT&T mobile handsets from various equipment manufacturers in the first half of 2010, the company indicated. Jeff Bradley, Senior Vice President, Devices, AT&T Mobility and Consumer Markets, said, "Audience provides a unique and industry-leading solution to an ongoing challenge for mobile users—background noise—and allows clear and successful communication in nearly any environment."

Microphone placement in the Nexus

The two microphones in the Nexus are placed separately: one at the bottom of the phone for speaking and one on the back of the phone for selectively detecting background noise. (See figure below, adapted from the Google Nexus user manual.) The phone also comes with a headset that has a microphone on the remote control on the cord; presumably, this mic also works with the noise-cancellation microphone.

[Figure: Google Nexus phone with two microphones (adapted from a Google drawing)]

Next-generation Audience processor

Audience announced in January the newest version of its chip, the A1026, which is the version used in the Nexus, according to an Audience spokesperson. The use of two microphones appears to be a key aspect of the Audience processing (see figure following). Audience characterizes its chip as using the same kinds of processing as the human auditory system to "uniquely identify the primary voice in conversation and eliminate surrounding noise." It also automatically adjusts voice volume and equalization during calls to adapt to local noise interference.

The technology can improve the quality of person-to-person voice communications, but it could also make speech recognition usable in more situations. For example, one should be able to speak more quietly in settings where speaking might disturb others and to use speech recognition more successfully in noisy environments. People are good at picking voices out of a crowd, and Audience claims to use these same principles.

The A1026 is a single mixed-signal system-on-chip (SoC). The latest version features improved noise-cancellation performance and a more compact, energy-efficient design, the company said. The chip includes a low-power, high-performance custom Digital Signal Processor (DSP) core and delivers approximately 30% power reduction from previous generations in standard talk mode. The next-generation A1026 will be available in mobile handsets as early as this quarter.
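Audience does not disclose the details of its processing, but the benefit of a second, noise-facing microphone can be suggested with a generic technique. The short Python sketch below applies simple spectral subtraction, treating the rear microphone as a rough noise reference; the function name, parameters, and 8 kHz rate are illustrative assumptions, not anything published by Audience.

```python
# Minimal sketch of generic two-microphone noise suppression (spectral
# subtraction with a noise-reference mic). Illustrative only; this is NOT
# Audience's algorithm.
import numpy as np
from scipy.signal import stft, istft

def two_mic_suppress(front, rear, fs=8000, alpha=1.0, floor=0.05):
    """front: mic near the mouth (voice + noise); rear: noise-facing mic."""
    _, _, F = stft(front, fs=fs, nperseg=256)            # time-frequency view of the voice mic
    _, _, R = stft(rear, fs=fs, nperseg=256)             # time-frequency view of the noise mic
    mag, phase = np.abs(F), np.angle(F)
    noise_est = alpha * np.abs(R)                        # rear mic as a rough noise estimate
    cleaned = np.maximum(mag - noise_est, floor * mag)   # subtract, keep a spectral floor
    _, out = istft(cleaned * np.exp(1j * phase), fs=fs, nperseg=256)
    return out
```

Even a crude scheme like this suppresses sound that reaches both microphones at similar levels while largely preserving the talker's voice, which is strongest at the bottom microphone.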
Underlying technology—the "cocktail party" effect

As noted, Audience indicates that its technology is modeled after the human hearing system, from the cochlea to the brainstem to the thalamus and cortex. Researchers have studied the human talent for picking out a voice under the name Computational Auditory Scene Analysis (CASA): the grouping and processing of complex mixtures of sound. The Audience Voice Processor receives a complex mixture of sounds that often overlap at any given frequency and organizes it into individual sources, in the same way people actually hear sounds. Regardless of whether the noise is local to the caller or comes in remotely over the mobile network, the Audience Voice Processor uses several grouping cues to separate the mixture of sound by source instantaneously, suppressing the noise and delivering the voice of interest clearly.

A well-known illustration of CASA is the so-called cocktail party effect: at a busy party, one is able to follow a conversation even though other voices and background music are present. When two or more natural sounds occur at once, all the components of the simultaneously active sounds are received at the same time by the listener's ears. This presents the auditory system with a problem: which frequency components should be grouped together and treated as parts of the same sound?

[Figure: The Audience processor can use two microphones (Courtesy: Audience)]

The grouping principles underlying CASA can be broadly categorized into sequential grouping cues (those that operate across time) and simultaneous grouping cues (those that operate across frequency). In addition, schemas (learned patterns) play an important role. The job of CASA is to group incoming sensory information into an accurate mental representation of the environmental sounds, whether a sound source is steady and constant or transitory and moving.

In the next phase of the process, cues are computed for the sound components to support grouping and stream separation. The cues include pitch, spatial location, and onset time, among others. As an example, consider pitch. The harmonics generated by a pitched sound source form distinct frequency patterns and therefore provide a useful way to separate one sound from another; a male voice and a female voice, for example, can usually be separated by pitch.

When two microphones are available in the system, one of the most powerful grouping cues is spatial location. The sound arrives at the two microphones at different times and, depending on microphone placement, possibly at different amplitudes. Processing can determine the direction from which a sound is coming and its distance from each of the microphones, so a sound source can be identified as a noise source by its displaced location relative to the two microphones.

Another grouping cue is common onset/offset time. Frequency components from a single sound source often start and/or stop at the same time, so when a group of frequency components arrives at the ear together, it is usually an indication that they have come from the same source. These cues are then associated with the raw Fast Cochlea Transform data as acoustic tags that are used in the subsequent grouping process.
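As a concrete illustration of the spatial-location cue, the minimal sketch below estimates the inter-microphone time difference of the dominant source by cross-correlation. The Audience processor computes its cues on Fast Cochlea Transform data, so this is only an analogy; the function name and sample rate are assumptions.

```python
# Minimal sketch of one grouping cue: inter-microphone time difference,
# estimated by cross-correlating the two channels. Illustrative only.
import numpy as np

def time_difference(mic_a, mic_b, fs=8000):
    """Delay (in seconds) of the dominant source at mic_a relative to mic_b."""
    corr = np.correlate(mic_a, mic_b, mode="full")   # cross-correlation over all lags
    lag = np.argmax(corr) - (len(mic_b) - 1)         # lag (in samples) at the correlation peak
    return lag / fs
```

Components whose estimated delay matches the talker's direction can be tagged as part of the voice stream; components arriving with a noticeably different delay become candidates for a noise stream.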
The grouping process performs a type of clustering operation: sound components with common or similar attributes are associated into a single auditory stream, while sound components with sufficiently dissimilar attributes are assigned to different auditory streams. Ultimately, the streams are tracked through time and associated with persistent or recurring sound sources in the auditory environment.

A selector process then allows the separated auditory sound sources to be prioritized and selected as appropriate for the given application. In telephony applications, the primary voice of interest is selected, and the other auditory sources are eliminated or suppressed. Finally, the Inverse Fast Cochlea Transform process converts the Fast Cochlea Transform data back into reconstructed, cleaned-up digital audio, which is then converted back to an analog signal.
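The Fast Cochlea Transform itself is proprietary, but the select-and-reconstruct step can be sketched with an ordinary short-time Fourier transform standing in for it. In the hypothetical code below, voice_mask is assumed to come from the grouping stage, marking the time-frequency components assigned to the primary voice; everything else is attenuated before the inverse transform.

```python
# Minimal sketch of the select-and-reconstruct idea, with an STFT as a
# stand-in for the Fast Cochlea Transform. Illustrative only.
import numpy as np
from scipy.signal import stft, istft

def select_and_reconstruct(audio, voice_mask, fs=8000):
    """voice_mask: boolean array (frequencies x frames) marking the primary-voice components."""
    _, _, Z = stft(audio, fs=fs, nperseg=256)        # forward transform (stand-in for the FCT)
    Z_clean = np.where(voice_mask, Z, 0.05 * Z)      # keep voice components, strongly attenuate the rest
    _, out = istft(Z_clean, fs=fs, nperseg=256)      # inverse transform back to a cleaned waveform
    return out
```

Attenuating rather than zeroing the rejected components (the 0.05 factor here) is a common way to reduce the audible artifacts that hard masking can introduce.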