Information extraction from speech in Stress Situation
Information extraction from speech in Stress Situation. Application to the Medical Supervision in a Smart House

E. CASTELLI - D. ISTRATE ([email protected], [email protected])
Laboratoire CLIPS-IMAG, BP 53, 38041 Grenoble Cedex 9

V. RIALLE - N. NOURY ([email protected], [email protected])
Laboratoire TIMC-IMAG, Faculté de Médecine de Grenoble, 38706 La Tronche Cedex

Abstract

In order to improve patients' living conditions and to reduce hospitalization costs, the medical community is increasingly interested in telemonitoring techniques. These will allow elderly people or at-risk patients to stay at home while benefiting from automated, assisted remote medical supervision, able to alert the emergency services for rapid intervention in case of an abnormal situation (faintness). In collaboration with the TIMC-IMAG laboratory, we are developing a telemonitoring system in a habitat equipped with physiological sensors, sensors giving the position of the person, and microphones. The originality of our approach consists in replacing video-camera surveillance with microphones recording the sounds (speech or noise) in the apartment. Thanks to the sound information coupled with the data provided by the other sensors, this multichannel sound acquisition system will allow us to determine whether or not the person is in a distress situation. We describe the position of the sensors in the apartment and the practical solution chosen for the acquisition system, and we present the recording of a corpus of everyday-life and distress situations.

1. Introduction

Medicine is increasingly interested in new information technologies and proposes a new form of "remote" medicine, or telemedicine. Telemedicine is announced as a significant reform of medical care because it improves the response time of specialists facing emergencies.
They can be informed about a medical emergency and react as soon as the first symptoms appear, intervening without loss of time. Telemedicine also allows a significant reduction of public-health costs by avoiding the hospitalization of patients for long periods. Telemedicine consists in combining electronic monitoring techniques with computer "intelligence" and with the speed of telecommunications (established either through network or radio connections). The system we are working on is designed for the surveillance of elderly persons who can benefit from telemedicine. This technique allows them to remain in communication with a medical center in charge of analyzing the information gathered by the telemonitoring system and of intervening if needed (Noury 1992: 1176). Its main goal is to detect serious accidents such as falls or faintness (which can be characterized by a long idle period of the signals) at any place in the apartment (Rialle 1999: 345). We noted that the elderly had difficulties in accepting monitoring by video camera, because they consider the constant recording a violation of their privacy. The originality of our approach consists in replacing the video camera with a multichannel sound acquisition system which analyzes the sound environment of the apartment in real time, in order to detect abnormal noises (falls of objects or of the patient), calls for help, or moans. These could characterize a distress situation in the habitat and could indicate to the emergency services that they have to intervene.

2. Presentation of the Telemonitoring System

The habitat we used for the experiments is a 30 m2 apartment situated in the TIMC laboratory buildings, at the Michalon hospital in Grenoble. The patient carries a set of sensors which give information about his activity: vertical position (standing), horizontal position (lying), and sudden moves (falling).
Localization sensors using infrared radiation are installed in each part of the apartment in order to establish where the person is at any moment. These sensors communicate with the acquisition system by radio waves and through a CAN bus. The activity sensors are controlled by a PC running monitoring software written in Java. The sound sensors are 8 microphones, whose positions are given in Figure 1. An acoustic antenna composed of 4 microphones monitors both the living room and the bedroom, in order to localize the patient inside these two rooms and to analyze the sounds. As the other rooms are much smaller, one microphone per room is sufficient. The microphones used are omnidirectional condenser microphones, small and low-cost. A signal conditioning card, consisting of an amplifier and an anti-aliasing filter, is associated with each microphone. The acquisition system is built around a National Instruments PCI 6034E multichannel acquisition card installed in a second computer. The acquisition is made at a sampling rate of 16 kHz, a frequency commonly used in speech applications. We wrote the entire acquisition control software under LabWindows/CVI from National Instruments. After digitization, the sound data is saved in real time in a temporary file on the hard disk of the host PC. The two computers are connected by a conventional IP network. The general outline of the acquisition system is presented in Figure 2.

3. Acoustic Localization of the Person

The exact geometrical location of the person in the apartment is important, both to establish whether a recorded suspect noise indicates distress or not, and to indicate the location of the person to the emergency services in the event of an alarm. To locate the person, we analyze the information from the infrared sensors and the sound sensors.
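As a minimal illustration of this recording chain (not the authors' LabWindows/CVI software), the sketch below writes a block of 8-channel, 16-bit samples to a WAV file at the 16 kHz rate used by the system; the synthetic tone merely stands in for real microphone data.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000   # 16 kHz, as used by the acquisition system
N_CHANNELS = 8        # one channel per microphone M1..M8

def save_block(path, frames):
    """Write a block of interleaved 16-bit, 8-channel samples to a WAV file.

    `frames` is a sequence of 8-tuples, one tuple per sampling instant.
    """
    with wave.open(path, "wb") as wav:
        wav.setnchannels(N_CHANNELS)
        wav.setsampwidth(2)             # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(b"".join(struct.pack("<8h", *f) for f in frames))

# Example: 100 ms of a synthetic 1 kHz tone, identical on every channel.
tone = [tuple([int(10000 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE))] * N_CHANNELS)
        for t in range(SAMPLE_RATE // 10)]
save_block("block.wav", tone)
```

In the real system such blocks would be appended continuously to the temporary file rather than written once.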
The information obtained by analyzing the data from these two sources is necessary to guarantee a better characterization of the patient's situation and to avoid false alarms. The 4 microphones M1 to M4, located in the small parts of the apartment (hall, toilet, shower room and kitchen), are sufficient to locate the noise or speech source by comparing the sound level at each microphone. In contrast, for the larger spaces (the living room and the bedroom), we use a wall-mounted acoustic antenna consisting of 4 microphones forming a 320x170 mm rectangle. After evaluating several shapes (square, T, parallelogram, cross), we chose for our acoustic antenna the shape which identifies the spatial position of the sound source most accurately for the needs of our application (Teissier 1998: 252). Using our antenna and a 16 kHz sampling rate, we obtain a precision of 0.2 m on each of the space coordinates x, y, z, which is sufficient for our application.

4. Signal Preprocessing Modules

Because the rooms of the apartment have a strong and variable influence on the signals collected by the microphones and degrade them, we must apply to each sound channel a preprocessing module that cancels acoustic reverberation, in order to obtain comparable signals. We use the CMS (Cepstral Mean Subtraction) method (Bees 1991: 977), (Chuicchi 1999: 822) to eliminate the reverberation. The acoustic sources in the apartment are usually complex, made up of several sounds (speech and noise). Our system therefore includes a second preprocessing module, which processes the signals coming from the acoustic antenna. Its theoretical basis is a method of blind separation of sound sources (Nguyen 1993: 30), (Charkani 1996: 25). The method makes no hypothesis on the nature of the sound sources, which are considered to be independent. The unknown mixture of the unknown sources is collected by 2 microphones of the antenna.
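A minimal sketch of the CMS principle, assuming windowed audio frames held in a NumPy array: the real cepstrum is computed per frame, and the long-term mean cepstrum is subtracted, which cancels a convolutive distortion (such as a fixed room response) common to all frames. This illustrates the idea only, not the authors' implementation.

```python
import numpy as np

def cepstral_mean_subtraction(frames):
    """Cancel a convolutive component (channel/reverberation) common to all frames.

    frames: 2-D array (n_frames, frame_len) of windowed audio samples.
    Returns the per-frame real cepstra with the mean cepstrum removed.
    """
    spectra = np.fft.rfft(frames, axis=1)
    log_mag = np.log(np.abs(spectra) + 1e-10)              # small floor avoids log(0)
    cepstra = np.fft.irfft(log_mag, n=frames.shape[1], axis=1)
    return cepstra - cepstra.mean(axis=0)                  # subtract the mean cepstrum

# Example on deterministic pseudo-random "audio": 50 frames of 256 samples.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 256))
normalized = cepstral_mean_subtraction(frames)
```

In the log-cepstral domain a convolution becomes an addition, which is why subtracting the mean removes the stationary room contribution.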
A recursive neuromimetic network with a self-adapting filter performs the separation of the sources. We obtain a gain of approximately 10 to 15 dB in signal-to-noise ratio.

5. Corpus

In order to validate our data-fusion analysis strategies, we recorded a multichannel corpus containing: a simultaneous and synchronized recording of the 8 signals (WAV files), the geometrical coordinates of the person, position information from the infrared sensors, and data from the sensors carried by the patient. Files are recorded according to the SAM norm used for speech corpora. The corpus is made up of a set of 20 scripts reproducing a string of events, either voluntary actions of the patient inside the apartment, or unexpected events which could characterize an abnormal or distress situation. The voluntary actions are moves of the patient inside the apartment and everyday gestures producing natural speech signals, completed by other usual sounds (radio, telephone ring, dish noise, door noise, etc.). The unexpected events are abnormal noises (falling, glass breaking) and abnormal speech signals such as moans, cries and calls for help. Every script is recorded 4 times. For this first study, we limit the number of subjects to 5 persons. Two scripts written in XML are presented in Figure 3. The first step before recording the scripts of the corpus is the calibration of the sound recording system. A 1 kHz rectangular signal is reproduced by a loudspeaker placed 1 meter from each microphone. The gains of every sound channel are then adjusted to obtain the same signal amplitude. After recording, the corpus is analyzed: every event of the script (nature and time) is characterized, and the speech is labelled following the usual procedures used for speech corpora. In Figure 4 we show the four signals recorded by the acoustic antenna (c1 to c4), the geometrical coordinates of the person, and the response of the infrared sensors.
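The gain-adjustment step can be sketched as follows: measure the RMS level of the calibration tone on each channel and derive the multiplicative gain that brings every channel to a common target level. The channel amplitudes and the target level below are hypothetical, and a pure tone stands in for the actual calibration signal.

```python
import math

def rms(samples):
    """Root-mean-square level of one channel's calibration recording."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def calibration_gains(recordings, target_rms=1000.0):
    """One multiplicative gain per channel, so each channel reaches target_rms."""
    return [target_rms / rms(channel) for channel in recordings]

# Example: three channels picked up the same 1 kHz calibration tone at
# unequal levels (100 ms at 16 kHz; amplitudes are hypothetical).
tone = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(1600)]
recordings = [[a * s for s in tone] for a in (500.0, 1000.0, 2000.0)]
gains = calibration_gains(recordings)
equalized = [[g * s for s in ch] for g, ch in zip(gains, recordings)]
levels = [round(rms(ch)) for ch in equalized]   # all channels now at the target level
```

The quieter the channel, the larger its gain, so after calibration a given sound level produces comparable amplitudes on every channel.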
The person moves from the bedroom to the living room (0 s to 0.75 s), speaking while moving. In the table we give the numbers of the infrared sensors corresponding to each room.

6. Conclusions and Perspectives

6.1. The Use of the Corpus

The second part of our study will use the corpus to develop a sound-signal characterization module. It has two goals: first, it has to establish whether the audio signal is speech or noise; then it has to characterize the noises (everyday-life noises or abnormal noises, e.g. falling) and the speech sounds (normal, cries for help or moans). This module is currently being implemented and validated. To distinguish between speech and noise, we shall compare the periodic characteristics of the speech signal (such as the pitch) with the spectral characteristics of the noises (wide spectrum and impulsive character). The characterization of noises will be made by a noise recognition system based on a classical HMM method (Rabiner 1989: 260). Speech characterization (normal or stressed) will be inspired by speaker characterization methods (Besacier 1998: 723), (Sayoud 2000: 224). If the sounds produced by the speaker do not match normal speech, the system will consider that the speaker is producing abnormal speech, such as calls, cries or moans. If speech is found to be normal, a word-spotting recognition system will recognize about 20 words chosen from a specialized vocabulary of help calls (help, aïe!, etc.), in order to make it easier for the emergency services to decide to intervene. However, the recognition system will not perform large-vocabulary continuous speech recognition, as we consider that the semantic content of the phrases uttered by the speaker in normal situations is private-life information and should not be recorded.

6.2. Final Conclusion

We shall exploit the corpus of everyday-life and distress situations in order to continue our research.
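As a toy illustration of this periodicity cue (not the module under development), the sketch below takes the peak of the normalized autocorrelation over the pitch lag range at a 16 kHz sampling rate: a voiced-like tone scores close to 1, while broadband noise stays low. The lag bounds and threshold are illustrative assumptions.

```python
import math
import random

def periodicity(signal, min_lag=40, max_lag=320):
    """Peak of the normalized autocorrelation in the pitch lag range.

    At 16 kHz, lags of 40..320 samples cover pitch frequencies of roughly
    50..400 Hz. Periodic (voiced) signals score near 1, noise stays low.
    """
    energy = sum(s * s for s in signal)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        acf = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        best = max(best, acf / energy)
    return best

# A 200 Hz tone (voiced-like) versus deterministic pseudo-random noise.
tone = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(1600)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(1600)]
tone_score, noise_score = periodicity(tone), periodicity(noise)
```

A simple decision rule would compare the score to a threshold (say 0.5) before passing the segment to the noise or speech branch.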
We shall study and validate a set of monitoring strategies based on the analysis of fused data, combining the information from the activity sensors with that obtained by sound environment analysis. This study is part of the RESIDE-HIS project, financed by the IMAG Institute in collaboration with the TIMC laboratory.

7. References

Bees D., Blostein M. and Kabal P., 1991, Reverberant speech enhancement using cepstral processing, ICASSP, 977-980
Besacier L., Bonastre J.F., 1998, Système d'élagage temps-fréquence pour l'identification du locuteur, XXIIèmes Journées d'Etudes sur la Parole, Martigny, 720-730
Charkani El Hassani A.N., 1996, Séparation auto-adaptative de sources pour des mélanges convolutifs. Application à la téléphonie mains-libres dans les voitures, Thèse INP Grenoble
Chuicchi S., Piazza F., 1999, V-Stereo system with acoustic echo cancellation, EURASIP Conference, Krakow, 822-830
Michaeland S.B. et al., 1997, A Closed-Form Location Estimator for Use with Room Environment Microphone Arrays, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 5, No. 1, 45-50
Nguyen Thi H.L., 1993, Séparation aveugle de sources à bande large dans un mélange convolutif, application au rehaussement de la parole, Thèse INP Grenoble
Noury N. et al., 1992, A telematic system tool for home health care, Int. Conf. IEEE-EMBS, Paris, 3/7, 1175-1177
Ogawa M. et al., 1998, Fully automated biosignal acquisition in daily routine through 1 month, Int. Conf. IEEE-EMBS, Hong Kong, 1947-1950
Omologo M., Svaizer P., 1997, Use of the Crosspower-Spectrum Phase in Acoustic Event Location, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 5, No. 3, 288-292
Rabiner L.R., 1989, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, 257-286
Rialle V.
et al., 1999, A smart room for hospitalised elderly people: essay of modeling and first steps of an experiment, Technology and Health Care, Vol. 7, 343-357
Sayoud H., Ouamour S., Kernouat N., Selmane M.K., 2000, Reconnaissance automatique du locuteur en milieu bruité – cas de la SOSM, XXIIIèmes Journées d'Etude sur la Parole, Aussois, 220-232
Teissier P., Berthommier F., 1998, Switching of Azimuth and Elevation for Speaker Audio-Localisation, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 3, No. 3, 250-256
Williams G. et al., 1998, A smart fall and activity monitor for telecare application, Int. Conf. IEEE-EMBS, Hong Kong, 1151-1154

Figure 1 – Position of the sensors inside the apartment (M1-M8: microphones; IS1-IS4: infrared sensors; DC1: door contact; the acoustic antenna of 4 microphones covers the living room and the bedroom; the other rooms are the kitchen, shower room, toilet, hall and technical room)

Figure 2 – Diagram of the telemonitoring system (microphones M1-M8, each with a signal conditioning board, and the acoustic antenna feed the PCI 6034E acquisition card in the master PC; the infrared sensors and door contact reach the master PC through a slave PC and the CAN bus; the processing chain comprises echo cancellation and/or blind separation of sources, delay-time calculation, geometrical-coordinate calculation, a data analyzer and system monitoring with alarm generation; outputs are WAV files, a database and a log file; the two PCs are synchronized over the Ethernet network via TCP/IP)

<Script no. 1>
<description> Normal situation (no alarm detected) </description>
<time> 0 </time> <Position> Kitchen </Position> <Action> Particular dish noises </Action>
<time> Event 1 </time> <Position> Living room </Position> <Action> Phone ring </Action>
<time> Event 2 </time> <Action> Person moving from kitchen to living room </Action>
<time> Event 3 </time> <Position> Living room </Position> <Action> Person speaking </Action>
</Script no. 1>

<Script no. 2>
<description> Distress situation (alarm detected) </description>
<time> 0 </time> <Position> Bedroom </Position> <Action> Person moving inside the room (gets out of bed) </Action>
<time> Event 1 </time> <Action> Person moving from bedroom to living room </Action>
<time> Event 2 </time> <Position> Living room </Position> <Action> Noise of a person falling </Action>
<time> Event 3 </time> <Position> Living room </Position> <Action> Moans </Action>
<time> Event 4 </time> <Position> Living room </Position> <Action> Long silence </Action>
<ALARM> Alerting the emergency services </ALARM>
</Script no. 2>

Figure 3 – Two examples of scripts

Figure 4 – Example of the signals recorded: the four signals of the acoustic antenna (c1 to c4) and the position of the person inside the apartment over time (0 s to 7 s). The room codes used for the infrared-sensor trace are:

No. | Room
0 | Living room
1 | Bedroom
2 | Hall
3 | Toilet
4 | Shower
5 | Kitchen
6 | Door contact