Audience dual-microphone processor
Speech Strategy News
February 2010
Audience dual-microphone speech processor used in Google Nexus phone
Audience announces new version and adoption by AT&T for mobile phones
A common objection to the use of speech
recognition in mobile phones is that there are too
many occasions when one can't speak, whether
because of social circumstances or background
noise. The usability of
speech can be expanded to more situations if
there is directionality and noise suppression,
e.g., by the use of dual microphones that can
help selectively emphasize speech coming from
one direction (or the dominant voice).
Google seems to have made a commitment
to speech as a major feature of its mobile
phones. A feature seldom noted in articles on the
Google Nexus mobile phone is its dual
microphone. Beyond that, according to iSuppli,
which took the phone apart to study its parts, the
phone uses a noise suppression chip from
Audience (SSN, March 2007, p. 34). iSuppli
also noted the dual microphone in the Motorola
Droid based on the Android platform, but
apparently did not find an Audience chip.
On January 26, Audience announced it would
provide noise suppression solutions for AT&T
wireless devices. The Audience technology will
be introduced across a selection of AT&T
mobile handsets from various equipment
manufacturers in the first half of 2010, the
company indicated. Jeff Bradley, Senior Vice
President, Devices, AT&T Mobility and
Consumer Markets, said, “Audience provides a
unique and industry-leading solution to an
ongoing challenge for mobile users—
background noise—and allows clear and
successful communication in nearly any
environment.”
Microphone placement in the Nexus
The two microphones in the Nexus are placed
separately, one at the bottom of the phone for
speaking and one on the back of the phone for
selectively detecting background noise. (See
figure below, adapted from the Google Nexus
user manual.) The phone also comes with a
headset that has a microphone on the cord on the
remote control; presumably, this mic also works
in conjunction with the phone's noise-cancellation microphone.
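
Audience does not disclose how the two signals are combined. As a rough illustration only, a classical way to exploit a dedicated noise-facing microphone is an adaptive noise canceller; the Python sketch below uses a simple normalized LMS filter, and the function and parameter names are hypothetical, not Audience's algorithm.

import numpy as np

def nlms_noise_canceller(primary, reference, taps=64, mu=0.5, eps=1e-8):
    """Classical two-microphone adaptive noise canceller (NLMS sketch).

    primary:   1-D numpy array from the bottom (speech) microphone
    reference: 1-D numpy array from the rear (noise) microphone
    Returns primary with the component correlated with the
    reference adaptively subtracted.
    """
    w = np.zeros(taps)                        # adaptive filter weights
    out = np.zeros_like(primary, dtype=float)
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]       # most recent reference samples
        y = w @ x                             # estimated noise leaking into primary
        e = primary[n] - y                    # cleaned speech sample (error signal)
        w += mu * e * x / (x @ x + eps)       # normalized LMS weight update
        out[n] = e
    return out
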
Next-generation Audience processor
Audience announced in January the newest
version of its chip, the A1026, which is the
version used in the Nexus, according to an
Audience spokesperson. The use of two
microphones seems to be a key aspect of the
Audience processing (see figure following).
Audience characterizes its chip as using the
same kinds of processing as the human auditory
system to “uniquely identify the primary voice
in conversation and eliminate surrounding
noise.” It also automatically adjusts voice
volume and equalization during calls to adapt to
local noise interference. The technology can
improve the quality of person-to-person voice
communications, but also could make speech
recognition usable in more situations. For
example, one should be able to speak more
quietly in situations where speaking might
disturb others and use speech recognition more
successfully in noisy environments. People are
good at picking voices out of a crowd, and
Audience claims to use these same principles.
The A1026 is a single, mixed-signal system-on-chip (SoC). The latest version features
improved noise-cancellation performance and a
more compact and energy-efficient design, the
company said. The A1026 chip includes a low-power, high-performance custom Digital Signal
Processor (DSP) core. The A1026 delivers
approximately 30% power reduction from
previous generations in standard talk mode. The
next-generation A1026 chip will be available in
mobile handsets as early as this quarter.
Google Nexus phone with two microphones (adapted from a Google drawing)
Underlying technology—the “cocktail party”
effect
As noted, Audience indicates that its
technology is modeled after the human hearing
system—from the cochlea to the brainstem to
the thalamus and cortex. Researchers have
studied the human talent for picking out voices as
Computational Auditory Scene Analysis
(CASA), the grouping and processing of
complex mixtures of sound.
The Audience Voice Processor receives a
complex mixture of sounds that often overlap at
any given frequency, and organizes it into
individual sources, in the same way people
actually hear sounds. Regardless of whether the
noise is local to the caller, or remote over the
mobile network, the Audience Voice Processor
uses several grouping cues to group the mixture
of sound by source instantaneously, suppressing
the noise and delivering the voice of interest
clearly.
A well-known illustration of CASA is the so-called cocktail party effect; at a busy party, one
is able to follow a conversation even though
other voices and background music are present.
When two or more natural sounds occur at once,
all the components of the simultaneously active
sounds are received at the same time by the ears
of listeners. This presents their auditory systems
with a problem: Which frequency components
should be grouped together and treated as parts
of the same sound?
The Audience processor can use two microphones (Courtesy: Audience)
The grouping principles underlying CASA can be broadly categorized into sequential grouping
cues (those that operate across time) and simultaneous grouping cues (those that operate across
frequency). In addition, schemas (learned patterns) play an important role. The job of CASA is to
group incoming sensory information to form an accurate mental representation of the environmental
sounds. CASA must handle sound sources that are steady and constant as well as those that are transitory and moving.
In the next phase of the process, the cues of sound components are computed for grouping and
stream separation. The cues include pitch, spatial location, and onset time, among others.
As an example, consider pitch. The harmonics generated from a pitched sound source form
distinct frequency patterns, and as a result are a useful cue for separating one sound from
another. For example, a male voice and a female voice can usually be separated using pitch.
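
Audience has not published its grouping algorithms, but pitch-based grouping can be illustrated with a short Python sketch: each detected spectral peak is assigned to the candidate fundamental whose harmonic series it fits best. The candidate pitches, tolerance, and peak list below are invented for illustration.

import numpy as np

def group_by_pitch(peak_freqs, f0_candidates, tol=0.03):
    """Assign spectral peaks to the candidate fundamental (f0)
    whose harmonic series they fit best; a crude pitch-grouping cue.

    peak_freqs:    frequencies (Hz) of detected spectral peaks
    f0_candidates: candidate fundamentals, e.g. one per talker
    tol:           allowed relative deviation from an exact harmonic
    Returns a dict mapping each f0 to the peaks grouped with it.
    """
    groups = {f0: [] for f0 in f0_candidates}
    for f in peak_freqs:
        best_f0, best_err = None, tol
        for f0 in f0_candidates:
            harmonic = round(f / f0)            # nearest harmonic number
            if harmonic < 1:
                continue
            err = abs(f - harmonic * f0) / (harmonic * f0)
            if err < best_err:
                best_f0, best_err = f0, err
        if best_f0 is not None:
            groups[best_f0].append(f)
    return groups

# Example: a 120 Hz (male) and a 220 Hz (female) fundamental
peaks = [120.0, 240.0, 220.0, 362.0, 440.0, 660.0]
print(group_by_pitch(peaks, [120.0, 220.0]))
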
When two microphones are available in the system, one of the most powerful grouping cues is
spatial location. The sound arrives at the two microphones at different times, and depending on
microphone placement, possibly at different amplitudes. Processing can locate the direction from
which a sound is coming and its distance to each of the microphones. A sound source can be
identified as a noise source given its displaced location in relation to the two microphones.
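
The standard way to estimate such an inter-microphone time difference (though not necessarily the A1026's method) is to find the lag that maximizes the cross-correlation of the two signals. A minimal Python sketch with synthetic data:

import numpy as np

def estimate_tdoa(mic1, mic2, fs, max_delay_s=0.001):
    """Estimate the time-difference-of-arrival between two microphones
    as the lag that maximizes their cross-correlation.
    Negative result: the sound reaches mic1 first."""
    max_lag = int(max_delay_s * fs)
    corr = np.correlate(mic1, mic2, mode="full")
    center = len(mic2) - 1                    # index of lag 0 in 'full' output
    lags = np.arange(-max_lag, max_lag + 1)
    best = lags[np.argmax(corr[center - max_lag:center + max_lag + 1])]
    return best / fs

# Example: the same signal arrives 5 samples later at the second mic
fs = 16000
sig = np.random.randn(1024)
mic1, mic2 = sig, np.roll(sig, 5)
print(estimate_tdoa(mic1, mic2, fs) * fs)     # prints about -5 (samples)
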
Another grouping cue is common onset/offset time. Frequency components from a single sound
source often start and/or stop at the same time. When a group of frequency components arrive at the
ear at the same time, it is usually an indication that they have come from the same source. These
cues are then associated with the raw Fast Cochlea Transform data as acoustic tags that are used in
the subsequent grouping process.
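
A toy version of the common-onset cue, again only a sketch and not Audience's code, tags frequency bands whose energy rises at nearly the same moment and groups them:

import numpy as np

def group_by_onset(band_envelopes, fs_frames, rise=6.0, window=0.02):
    """Group frequency bands whose energy onsets occur together.

    band_envelopes: (bands, frames) numpy array of per-band magnitudes
    fs_frames:      frame rate in Hz
    rise:           magnitude ratio that counts as an onset
    window:         bands whose onsets differ by less than this many
                    seconds are tagged as one group
    """
    onsets = {}
    for b, env in enumerate(band_envelopes):
        jumps = np.where(env[1:] > rise * (env[:-1] + 1e-9))[0]
        if len(jumps):
            onsets[b] = (jumps[0] + 1) / fs_frames   # first onset time
    groups = []
    for b, t in sorted(onsets.items(), key=lambda kv: kv[1]):
        if groups and t - groups[-1][-1][1] < window:
            groups[-1].append((b, t))
        else:
            groups.append([(b, t)])
    return groups   # lists of (band, onset_time) sharing a common onset
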
The grouping process performs a type of clustering operation such that sound components with
common or similar attributes may be mutually associated into a single auditory stream, and sound
components with sufficiently dissimilar attributes are associated with different auditory streams.
Ultimately, the streams are tracked through time and associated with persistent or recurring sound
sources in the auditory environment.
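
In Python, that clustering step can be caricatured as a greedy pass that merges components whose cue vectors (pitch, location, onset, and so on) lie close together; the threshold and scaling here are invented for illustration:

import numpy as np

def cluster_components(attributes, threshold=1.0):
    """Greedy clustering of sound components into auditory streams.

    attributes: (n_components, n_cues) numpy array, one row per
                component, with cue values normalized to comparable
                scales. A component within `threshold` of a stream's
                centroid joins it; otherwise it starts a new stream.
    """
    streams = []                  # list of [centroid, member indices]
    for i, a in enumerate(attributes):
        for s in streams:
            if np.linalg.norm(a - s[0]) < threshold:
                s[1].append(i)
                s[0][:] = np.mean(attributes[s[1]], axis=0)  # update centroid
                break
        else:
            streams.append([a.copy(), [i]])
    return [members for _, members in streams]
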
A selector process then allows the separated auditory sound sources to be prioritized and selected
as appropriate for the given application. In telephony applications, the primary voice of interest is
selected, and the other auditory sources are eliminated or suppressed. The Inverse Fast Cochlea
Transform process converts the Fast Cochlea Transform data back into reconstructed, cleaned-up
digital audio, which is then converted back to an analog signal.
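
The Fast Cochlea Transform is proprietary, but the same select-suppress-reconstruct step is often implemented with a short-time Fourier transform and a time-frequency mask. A minimal sketch using scipy, with the mask assumed to come from a grouping stage like the one described above:

import numpy as np
from scipy.signal import stft, istft

def suppress_with_mask(audio, fs, voice_mask):
    """Keep the time-frequency cells tagged as the voice of interest,
    attenuate everything else, and reconstruct the waveform.

    voice_mask: boolean (freqs, frames) array, True where the primary
                voice dominates (hypothetical grouping-stage output).
    """
    f, t, Z = stft(audio, fs=fs, nperseg=512)
    Z *= np.where(voice_mask, 1.0, 0.1)   # suppress non-voice cells by 20 dB
    _, clean = istft(Z, fs=fs, nperseg=512)
    return clean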