Information Extraction from Speech in Stress Situations: Application to Medical Supervision in a Smart House
E. CASTELLI - D. ISTRATE
([email protected], [email protected])
Laboratoire CLIPS-IMAG, BP53 38041 Grenoble Cedex 9
V. RIALLE - N. NOURY
([email protected], [email protected])
Laboratoire TIMC-IMAG, Faculté de Médecine de Grenoble, 38706 La Tronche Cedex
Abstract
In order to improve patients' living conditions and to reduce hospitalization costs, the medical community is showing growing interest in telemonitoring techniques. These will allow elderly or at-risk patients to remain at home while benefiting from remote, automated and assisted medical supervision capable of alerting the emergency services in case of an abnormal situation (faintness) for rapid intervention.
In collaboration with the TIMC-IMAG laboratory, we are developing a telemonitoring system in a dwelling equipped with physiological sensors, sensors of the person's position, and microphones. The originality of our approach consists in replacing video camera surveillance by microphones recording the sounds (speech or noise) in the apartment. Thanks to the sound information coupled with the information given by the other sensors, this multichannel sound acquisition system will allow us to determine whether or not the person is in a distress situation. We describe the position of the sensors in the apartment and the practical solution chosen for the acquisition system, and we present the recording of a corpus of everyday-life and distress situations.
1. Introduction
Medicine is increasingly interested in new information technologies and proposes a new form of "remote" medicine, or telemedicine. Telemedicine is presented as a significant reform of medical care because it improves the response time of specialists to emergencies: they can be informed about a medical emergency as soon as the first symptoms appear and can intervene without delay. Telemedicine also allows a significant reduction of public-health costs by avoiding the hospitalization of patients for long periods of time. Telemedicine consists in associating electronic monitoring techniques with computer "intelligence" and with the speed of telecommunications (established either through network or radio connections). The system we are working on is designed for the surveillance of elderly persons who can benefit from telemedicine. This technique allows them to remain in communication with a medical center in charge of analyzing the information gathered by the telemonitoring system and of intervening if needed (Noury 1992: 1176). Its main goal is to detect serious accidents such as falls or faintness (which can be characterized by a long idle period of the signals) at any place in the apartment (Rialle 1999: 345).
We noted that elderly people have difficulty accepting monitoring by video camera, because they consider that being constantly recorded is a violation of their privacy. The originality of our approach consists in replacing the video camera by a multichannel sound acquisition system in charge of analyzing, in real time, the sound environment of the apartment in order to detect abnormal noises (falls of objects or of the patient), calls for help, or moans. These could characterize a distress situation in the habitat and indicate to the emergency services that they have to intervene.
2. Presentation of the Telemonitoring System
The habitat used for our experiments is a 30 m² apartment located in the TIMC laboratory buildings, at the Michallon hospital in Grenoble.
The patient wears a set of sensors which give information about his activity: vertical position (standing) or horizontal position (lying), and sudden movement (falling). Infra-red localization sensors are installed in each room of the apartment in order to establish where the person is at any moment. These sensors communicate with the acquisition system by radio waves and by a CAN bus. The activity sensors are controlled by a PC running monitoring software programmed in Java.
The sound sensors are 8 microphones, whose positions are given in Figure 1. An acoustic antenna composed of 4 microphones monitors both the living room and the bedroom, in order to localize the patient inside these two rooms and also to analyze the sounds. As the other rooms are much smaller, one microphone per room is sufficient. The microphones used are omnidirectional condenser microphones, of small size and low cost. A signal conditioning card, consisting of an amplifier and an anti-aliasing filter, is associated with each microphone. The acquisition system consists of a National Instruments PCI-6034E multi-channel acquisition card installed in a second computer. The acquisition is made at a sampling rate of 16 kHz, a frequency commonly used in speech applications.
We programmed all the software which controls the acquisition under National Instruments LabWindows/CVI. After digitization, the sound data are saved in real time in a temporary file on the hard disk of the host PC. The two computers are connected by a conventional IP network. The general outline of the acquisition system is presented in Figure 2.
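To make the storage step concrete, the short Python sketch below writes one block of 8-channel, 16 kHz samples to a temporary WAV file on the host PC. It is only an illustration: the use of Python instead of LabWindows/CVI, the block length and the file name are our own assumptions, not details of the actual acquisition software.

# Minimal sketch (assumption: Python stand-in for the LabWindows/CVI code) of
# saving one block of 8-channel, 16 kHz audio to a temporary WAV file.
import wave
import numpy as np

SAMPLE_RATE = 16000   # 16 kHz, the sampling rate of the acquisition card
N_CHANNELS = 8        # microphones M1 to M8
BLOCK_SECONDS = 1.0   # hypothetical buffer length

def save_block(block, path="tmp_acquisition.wav"):
    """Write one block of samples, shape (n_samples, N_CHANNELS), values in [-1, 1]."""
    pcm = (np.clip(block, -1.0, 1.0) * 32767).astype(np.int16)   # 16-bit PCM
    with wave.open(path, "wb") as wav:
        wav.setnchannels(N_CHANNELS)
        wav.setsampwidth(2)               # 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm.tobytes())    # channels are interleaved sample by sample

if __name__ == "__main__":
    n = int(SAMPLE_RATE * BLOCK_SECONDS)
    # Random data standing in for the output of the acquisition card.
    save_block(np.random.uniform(-0.1, 0.1, size=(n, N_CHANNELS)))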
3. Acoustic Localization of the Person
The exact geometrical location of the person in the apartment is important, both to establish whether a suspect noise indicates a distress situation or not, and to indicate the location of the person to the emergency services in the event of an alarm. To locate the person, we analyze the information from the infra-red sensors and from the sound sensors. Combining the data from these two sources is necessary to guarantee a better characterization of the patient's situation and to avoid false alarms.
The 4 microphones M1 to M4, located in the small rooms of the apartment (hall, toilet, shower-room and kitchen), are sufficient to estimate the location of the noise or speech source by comparing the sound level received at each microphone. For the larger spaces (the living room and the bedroom), we use a wall-mounted acoustic antenna consisting of 4 microphones forming a 320x170 mm rectangle. After evaluating several shapes (square, T, parallelogram, cross), we chose for our acoustic antenna the shape which identifies most accurately the spatial position of the sound source for the needs of our application (Teissier 1998: 252). Using this antenna and a 16 kHz sampling rate, we obtain a precision of 0.2 m for each of the space coordinates x, y, z, which is sufficient for our application.
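As a rough illustration of how the antenna signals can be turned into a position, the Python sketch below estimates the time delay of arrival between two antenna microphones by cross-correlation, from which a source-to-microphone path difference follows. The search window and the function names are our assumptions; the actual delay time calculation module (see Figure 2) may use a different estimator.

# Minimal sketch (assumption) of delay-of-arrival estimation between two antenna
# microphones by cross-correlation, at the 16 kHz sampling rate.
import numpy as np

SAMPLE_RATE = 16000
SOUND_SPEED = 343.0   # m/s at room temperature

def estimate_delay(x1, x2, max_delay_s=0.002):
    """Delay (in seconds) of x2 relative to x1, searched within +/- max_delay_s."""
    max_lag = int(max_delay_s * SAMPLE_RATE)
    corr = np.correlate(x2, x1, mode="full")     # lags from -(N-1) to +(N-1)
    center = len(x1) - 1                         # index of zero lag
    window = corr[center - max_lag:center + max_lag + 1]
    best_lag = int(np.argmax(window)) - max_lag
    return best_lag / SAMPLE_RATE

def path_difference(delay_s):
    """Difference of the source-to-microphone distances implied by the delay (meters)."""
    return delay_s * SOUND_SPEED

# The pairwise path differences between the 4 antenna microphones constrain the
# (x, y, z) position of the sound source.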
4. Modules of Signal Preprocessing
Because the rooms of the apartment have an important and variable influence on the signals collected by the microphones and degrade them, we must apply to each sound channel a preprocessing module which cancels the acoustic reverberation, in order to obtain comparable signals. We use the CMS method (Cepstral Mean Subtraction) (Bees 1991: 977), (Chuicci 1999: 822) to reduce the reverberation.
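To illustrate the CMS principle named above, the sketch below removes the long-term cepstral mean, which captures the quasi-stationary room and channel response, from a sequence of analysis frames. Frame length, hop and FFT size are our assumptions, not values taken from the system.

# Minimal sketch (assumption) of Cepstral Mean Subtraction (CMS).
import numpy as np

def cepstra(frames, n_fft=512):
    """Real cepstrum of each frame; `frames` has shape (n_frames, frame_length)."""
    windowed = frames * np.hanning(frames.shape[1])
    log_mag = np.log(np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) + 1e-10)
    return np.fft.irfft(log_mag, n=n_fft, axis=1)

def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames (the CMS step)."""
    c = cepstra(frames)
    return c - c.mean(axis=0, keepdims=True)

# Usage (hypothetical framing): split one microphone channel into 32 ms frames
# with a 16 ms hop at 16 kHz, then feed the CMS-normalized cepstra to the later
# classification modules.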
The acoustic sources in the apartment are usually complex, made of several simultaneous sounds (speech and noise). Our system therefore includes a second signal preprocessing module, which processes the signals coming from the acoustic antenna. Its theoretical basis is a method of blind separation of sound sources (Nguyen 1993: 30), (Charkani 1996: 25). The method makes no hypothesis on the nature of the sound sources, which are assumed to be independent. The unknown mixture of the unknown sources is collected by 2 microphones of the antenna. A recursive neuromimetic network with self-adapting filters performs the separation of the sources. We obtain a gain of approximately 10 to 15 dB on the signal-to-noise ratio.
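For readers unfamiliar with recursive separation networks, the sketch below shows the structure of a simple two-output feedback network whose cross-coupling weights are adapted by a nonlinear decorrelation rule (a Hérault-Jutten style update). It is only a structural illustration under our own assumptions: the module described above uses self-adapting filters, not single scalar weights, so that convolutive (reverberant) mixtures can be separated.

# Minimal sketch (assumption) of a recursive two-source separation network:
# each output is its mixture minus a cross-feedback of the other output, and the
# cross weights are adapted to decorrelate nonlinear functions of the outputs.
import numpy as np

def separate_two_sources(x1, x2, mu=1e-3, n_sweeps=5):
    """Return estimated sources (y1, y2) from the two microphone mixtures (x1, x2)."""
    w12, w21 = 0.0, 0.0                       # scalar cross-coupling weights
    y1, y2 = np.zeros_like(x1), np.zeros_like(x2)
    y1[0], y2[0] = x1[0], x2[0]
    for _ in range(n_sweeps):                 # repeated sweeps over the signal block
        for n in range(1, len(x1)):
            y1[n] = x1[n] - w12 * y2[n - 1]
            y2[n] = x2[n] - w21 * y1[n - 1]
            # Herault-Jutten style update: push the outputs toward independence.
            w12 += mu * np.tanh(y1[n]) * y2[n - 1]
            w21 += mu * np.tanh(y2[n]) * y1[n - 1]
    return y1, y2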
5. Corpus
In order to validate our data-fusion analysis strategies, we recorded a multichannel corpus containing: a simultaneous and synchronised recording of the 8 signals (WAV files), the geometrical coordinates of the person, the position information from the infrared sensors, and the data from the sensors worn by the patient. Files are recorded according to the SAM norm used for speech corpora.
The corpus is made up of a set of 20 scripts reproducing a sequence of events, either voluntary actions of the patient inside the apartment, or unexpected events which could characterize an abnormal or distress situation. The voluntary actions are movements of the patient inside the apartment and everyday gestures, producing natural speech signals completed by other usual sounds (radio, telephone ring, dish noise, door noise, etc.). The unexpected events are abnormal noises (falling, glass breaking) and abnormal speech signals such as moans, cries and calls for help. Every script is recorded 4 times. For this first study, we limit the number of subjects to 5 persons. Two scripts written in XML are presented in Figure 3.
The first step before recording the scripts of the corpus is the calibration of the sound recording system. A 1 kHz rectangular signal is reproduced by a loudspeaker placed 1 meter from each microphone. The gain of each sound channel is then adjusted to obtain the same signal amplitude on every channel.
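A simple way to compute such gain corrections (our own sketch, not necessarily the procedure actually followed) is to measure the RMS level of the 1 kHz calibration signal on each channel and rescale every channel to a common reference:

# Minimal sketch (assumption) of channel gain calibration from the 1 kHz signal.
import numpy as np

def calibration_gains(calib_signals):
    """`calib_signals` has shape (n_channels, n_samples); returns one gain per channel."""
    rms = np.sqrt(np.mean(calib_signals ** 2, axis=1))
    reference = rms.max()                  # align every channel to the loudest one
    return reference / rms

# Usage: gains = calibration_gains(recorded_calibration)
#        equalized = recorded_audio * gains[:, None]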
After recording, the corpus is analysed: every event of the script (its nature and time) is characterised, and the speech is labelled following the usual procedures used for speech corpora.
Figure 4 shows the four signals recorded by the acoustic antenna (c1 to c4), the geometrical coordinates of the person and the response of the infra-red sensors, while the person moves from the bedroom to the living room (0 s to 0.75 s) and speaks while moving. The table in the figure gives the numbers of the infra-red sensors corresponding to each room.
6. Conclusions and Perspectives
6.1. The Use of the Corpus
The second part of our study will be the use of the corpus to develop a sound signal characterisation module. It has two goals: first, it has to establish whether the audio signal is a speech signal or a noise; then it has to characterise the noises (everyday-life noises or abnormal noises, e.g. falling) and the speech sounds (normal, cry for help, or moans). This module is about to be implemented and validated.
To distinguish between speech and noise, we shall compare the periodic characteristics of the speech signal (such as the pitch) with the spectral characteristics of the noises (broad spectrum and impulsive behaviour). The characterization of noises will be made by a noise recognition system based on a classical HMM method (Rabiner 1989: 260). Speech characterization (normal or stressed) will be inspired by speaker characterization methods (Besacier 1998: 723), (Sayoud 2000: 224). If the sounds produced by the speaker do not match normal speech, the system will consider that the speaker produces abnormal speech such as calls, cries or moans.
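As an illustration of the periodicity criterion mentioned above, the sketch below scores a frame by the strength of its normalized autocorrelation peak in the human pitch range. The pitch range, threshold and decision rule are our assumptions, since the characterization module is still being developed.

# Minimal sketch (assumption) of a speech/noise decision based on periodicity:
# a frame whose normalized autocorrelation has a strong peak in the pitch range
# (here 80-400 Hz) is treated as speech, otherwise as noise.
import numpy as np

SAMPLE_RATE = 16000
F0_MIN, F0_MAX = 80, 400          # plausible pitch range in Hz

def periodicity(frame):
    """Peak of the normalized autocorrelation within the pitch lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]
    lag_min = SAMPLE_RATE // F0_MAX   # 40 samples
    lag_max = SAMPLE_RATE // F0_MIN   # 200 samples
    return float(ac[lag_min:lag_max].max())

def is_speech(frame, threshold=0.4):
    """Hypothetical decision rule: strong periodicity means speech."""
    return periodicity(frame) > threshold

# Frames classified as noise would then go to the HMM-based noise recognizer,
# and speech frames to the normal/stressed speech classifier.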
If the speech is found to be normal, a word-spotting recognition system will recognize about 20 words chosen from a specialized vocabulary of calls for help (help, "aïe!", etc.), in order to make it easier for the emergency services to decide whether to intervene. However, the system will not perform continuous speech recognition with a large vocabulary, as we consider that the semantic content of the phrases uttered by the speaker in normal situations belongs to private life and should not be recorded.
6.2. Final Conclusion
We shall exploit the corpus of everyday-life and distress situations in order to continue our research. We shall study and validate a set of monitoring strategies based on data fusion, combining the information from the activity sensors with that obtained by analysis of the sound environment.
This study is part of the RESIDE-HIS project, financed by the IMAG Institute in collaboration with the TIMC laboratory.
7. References
Bees D., Blostein M., Kabal P., 1991, Reverberant speech enhancement using cepstral processing, ICASSP, 977-980
Besacier L., Bonastre J.F., 1998, Système d'élagage temps-fréquence pour l'identification du locuteur, XXIIèmes Journées d'Etudes sur la Parole, Martigny, 720-730
Charkani El Hassani A.N., 1996, Séparation auto-adaptative de sources pour des mélanges convolutifs. Application à la téléphonie mains-libres dans les voitures, Thèse INP Grenoble
Chuicchi S., Piazza F., 1999, V-Stereo system with acoustic echo cancellation, EURASIP Conference, Krakow, 822-830
Michaeland S.B. et al., 1997, A Closed-Form Location Estimator for Use with Room Environment Microphone Arrays, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 5, No. 1, 45-50
Nguyen Thi H.L., 1993, Séparation aveugle de sources à bande large dans un mélange convolutif, application au rehaussement de la parole, Thèse INP Grenoble
Noury N. et al., 1992, A telematic system tool for home health care, Int. Conf. IEEE-EMBS, Paris, 3/7, 1175-1177
Ogawa M. et al., 1998, Fully automated biosignal acquisition in daily routine through 1 month, Int. Conf. IEEE-EMBS, Hong-Kong, 1947-1950
Omologo M., Svaizer P., 1997, Use of the Crosspower-Spectrum Phase in Acoustic Event Location, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 5, No. 3, 288-292
Rabiner L.R., 1989, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, 257-286
Rialle V. et al., 1999, A smart room for hospitalised elderly people: essay of modeling and first steps of an experiment, Technology and Health Care, Vol. 7, 343-357
Sayoud H., Ouamour S., Kernouat N., Selmane M.K., 2000, Reconnaissance automatique du locuteur en milieu bruité – cas de la SOSM, XXIIIèmes Journées d'Etude sur la Parole, Aussois, 220-232
Teissier P., Berthommier F., 1998, Switching of Azimuth and Elevation for Speaker Audio-Localisation, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 3, No. 3, 250-256
Williams G. et al., 1998, A smart fall and activity monitor for telecare application, Int. Conf. IEEE-EMBS, Hong-Kong, 1151-1154
[Figure 1 shows the floor plan of the apartment (living room, bedroom, hall, toilet, shower-room, kitchen and technical room) with the positions of the microphones M1-M8, the acoustic antenna in the bedroom/living-room area, the infrared sensors (IS) and the door contact (DC1).]
Figure 1 – Position of the sensors inside the apartment
[Figure 2 is a block diagram of the telemonitoring system: the microphones M1-M8 and the acoustic antenna, each with its signal conditioning board, feed the PCI 6034E acquisition card of the master PC; the infrared sensors and the door contact reach the slave PC through the CAN bus interface; the master PC chains echo cancellation and/or blind separation of sources, delay time calculation, geometrical coordinates calculation and data analysis, performs system monitoring, and produces the WAV files, the log file, the data base and the alarm; the two PCs exchange a synchronization signal over the Ethernet network (TCP/IP).]
Figure 2 – Diagram of the telemonitoring system
<Script no. 1>
  <description> Normal situation (no alarm detected) </description>
  <time> 0 </time> <Position> Kitchen </Position> <Action> Particular dish noises </Action>
  <time> Event 1 </time> <Position> Living room </Position> <Action> Phone ring </Action>
  <time> Event 2 </time> <Action> Person moving from kitchen to living room </Action>
  <time> Event 3 </time> <Position> Living room </Position> <Action> Person speaking </Action>
</Script no. 1>

<Script no. 2>
  <description> Distress situation (alarm detected) </description>
  <time> 0 </time> <Position> Bedroom </Position> <Action> Person moving inside the room (gets out of bed) </Action>
  <time> Event 1 </time> <Action> Person moving from bedroom to living room </Action>
  <time> Event 2 </time> <Position> Living room </Position> <Action> Noise of a person falling </Action>
  <time> Event 3 </time> <Position> Living room </Position> <Action> Moans </Action>
  <time> Event 4 </time> <Position> Living room </Position> <Action> Long silence </Action>
  <ALARM> Alerting the emergency services </ALARM>
</Script no. 2>
Figure 3 – Two examples of scripts
[Figure 4 shows, over roughly 7 seconds, the four signals recorded by the acoustic antenna (c1 to c4), the position of the person inside the apartment and the response of the infrared sensors. The accompanying table gives the correspondence between sensor numbers and rooms: 0 - Living room; 1 - Bedroom; 2 - Hall; 3 - Toilet; 4 - Shower; 5 - Kitchen; 6 - Door contact.]
Figure 4 – Example of the signals recorded