Social signal processing - ISIR

Université Pierre et Marie Curie Paris 6
Institut des Systèmes Intelligents et de Robotique
Habilitation à Diriger des Recherches
Specialty: "Engineering Sciences"
by
Mohamed CHETOUANI
Social signal processing and personal robotics:
signals, communicative acts and behaviors
Defended on 8 December 2011 before a jury composed of
Dr. Rachid Alami, LAAS-CNRS (Reviewer)
Pr. Gaël Richard, Institut Telecom - LTCI-CNRS (Reviewer)
Dr. Alessandro Vinciarelli, IDIAP-University of Glasgow (Reviewer)
Pr. Nick Campbell, Trinity College (Examiner)
Pr. Philippe Bidaud, UPMC-CNRS (Examiner)
Pr. Jean-Luc Zarader, UPMC-CNRS (Examiner)
Pr. David Cohen, UPMC-CNRS (Invited member)
Acknowledgements
The work presented in this habilitation thesis was carried out with students I supervised, without whom none of this would have been possible: Fabien Ringeval, Ammar Mahdhaoui, Catherine Saint-Georges, Cong Zong, Consuelo Granata, Emilie Delaherche and Jade Le Maître, not forgetting all the interns and visitors.
I also wish to thank those who welcomed me and allowed me to develop my research activities with freedom, trust and respect: Jean-Luc Zarader, Maurice Milgram and Philippe Bidaud. Their advice was useful and greatly appreciated.
My activities would surely not have taken the same direction, still less the same scale, without the support of David Cohen, whom I particularly wish to thank. I thus had the honor and good fortune of benefiting from a very rich application setting within the child and adolescent psychiatry department of the Pitié-Salpêtrière hospital. My warmest thanks go to all the department staff for welcoming us, to the patients and their families for their voluntary commitment to research, and to Filippo Muratori who, through his scientific generosity, allowed us to undertake research on invaluable databases.
I wish to associate with these thanks the colleagues who accompanied this adventure: Catherine Achard, Kévin Bailly, Laurence Chaby, Xavier Clady, Nizar Ouarti and Monique Plaza. My thanks also go to all the members of the Institut des Systèmes Intelligents et de Robotique for having created a rich and rigorous working environment.
It is with great pleasure that I thank the reviewers of my thesis for the endorsement they kindly gave to my work: Rachid Alami, Gaël Richard and Alessandro Vinciarelli. My thanks also go to Nick Campbell for doing me the honor of taking part in my jury.
Much of my work found its inspiration in the research of more experienced colleagues whom I met at conferences, summer schools or visits: Rachid Alami, Gérard Bailly, Nick Campbell, Gérard Chollet, Thierry Dutoit, Anna Esposito, Marcos Faundez-Zanuy, Bjorn Granstrom, David House, Amir Hussain, Eric Keller, Catherine Pelachaud, Bjorn Schuller and Alessandro Vinciarelli, as well as all the partners of the collaborative projects (COST 277 and 2102, ROBADOM, Michelangelo...).
Finally, these acknowledgements would not be complete without mentioning my family, who were directly affected by my research activities. This work is therefore dedicated to all the members of my family for the boundless support they have shown.
Summary
The work presented in this document concerns the characterization, detection and analysis of the social component of the signals exchanged between a human and a partner (human, robot or virtual agent). The proposed models are grounded in an emerging field: social signal processing. Methodologically, our work covers the stages of analysis, characterization and prediction of social signals, relying on statistical models from signal processing, pattern recognition and machine learning. We have proposed, and tried to promote, a specific area: the processing of atypical social signals. The idea is to bring together, in processing and modeling, knowledge from signal processing, machine learning, psychology and psychiatry. The theoretical stakes (e.g. models of the dynamics of the exchanged signals), applied stakes (e.g. differential diagnosis) and societal stakes (e.g. design of assistive systems) are numerous. Our contributions concern the characterization of the speech signal (identity and affect), the dynamics of human communication (interactional synchrony) and social intelligence. The models developed and the results obtained make it possible to define a research program on the analysis, modeling and prediction of the multi-modal and dynamic components of social interaction.
Keywords: feature extraction, statistical modeling, social signal processing, personal robotics, learning, interactional synchrony, multi-modal coordination, engagement, emotion.
Abstract
The work presented in this document concerns the characterization, detection and analysis of the social components of signals exchanged between a human and a partner (human, robot or virtual agent). The proposed models are rooted in an emerging field: social signal processing. From a methodological point of view, our work covers the analysis, characterization and prediction of social signals based on statistical models from signal processing, pattern recognition and machine learning. We have proposed, and tried to promote, a specific area: atypical social signal processing. The idea is to bring together, in processing and modeling, knowledge from signal processing, machine learning, psychology and psychiatry. The theoretical issues (e.g. modeling the dynamics of human communication), application issues (e.g. differential diagnosis) and societal issues (e.g. design of assistive devices) are numerous. Our contributions focus on the characterization of speech signals (identity and affect), the dynamics of human communication (interactional synchrony) and social intelligence. The models developed and the results obtained make it possible to define a research agenda on the analysis, modeling and prediction of the multi-modal and dynamic components of social interaction.
Keywords: feature extraction, statistical modeling, social signal processing, personal robotics, learning, interactional synchrony, multi-modal coordination, engagement, emotion.
Table of contents
List of figures
List of tables
General introduction
    Context and motivations
    Social signal processing
    Positioning and research themes
1 Characterization of speech signals: from the signal to the social message
    1.1 Context
    1.2 Encoding of information in speech
        1.2.1 Information conveyed
        1.2.2 Automatic characterization of speech signals
    1.3 Speaker information
        1.3.1 Prediction residual
        1.3.2 Taking the nature of the residual into account
        1.3.3 Results
        1.3.4 Conclusion
    1.4 Non-verbal information
        1.4.1 Characterization of temporal and integrative dimensions: acoustic anchors
        1.4.2 Dynamics of the speech signal: rhythm
    1.5 Emotions in children with communication disorders
        1.5.1 Grammatical function of prosody
        1.5.2 Emotional function
    1.6 Learning for the characterization of speech signals in realistic situations
        1.6.1 Motherese
        1.6.2 Classification of natural and spontaneous data
        1.6.3 The semi-supervised learning problem
        1.6.4 Multi-view co-training
        1.6.5 Results
    1.7 General discussion
2 Dynamics of human communication
    2.1 Context
    2.2 Interactional synchrony
        2.2.1 Definitions
        2.2.2 Implications for child development
        2.2.3 Implications for social interactions in adults
        2.2.4 Implications for interactive robotics
        2.2.5 Automatic characterization of synchrony
    2.3 Integrative modeling of synchrony
        2.3.1 Early signs of autism: a study of home movies
        2.3.2 Computational modeling of synchrony
        2.3.3 Interpretation of the results
        2.3.4 Limits of methods based on behavior annotation
    2.4 Multi-modal coordination: from signal to interpretation
        2.4.1 Synchrony and multi-modal integration
        2.4.2 From non-verbal cues to the degree of coordination
        2.4.3 Limits of methods based solely on low-level information
    2.5 General discussion
3 Social intelligence for personal robotics
    3.1 Context
    3.2 Dynamics of human-robot interaction
        3.2.1 Definitions
        3.2.2 Automatic characterization of engagement
    3.3 Non-verbal supports of the dynamics of an interaction
        3.3.1 Phatic communication
        3.3.2 Modeling of communication
    3.4 Characterization of the degree of engagement
        3.4.1 Interlocutor detection
        3.4.2 From self-talk to an engagement metric
    3.5 Assistive robotics
        3.5.1 Multi-modal interface
        3.5.2 Engagement in a physical interaction
    3.6 General discussion
Research project
    Dynamics of communication
    Interfaces and social intelligence
    From clinical investigation to computational social sciences
Curriculum vitæ
Selected articles
    Investigation on LP-Residual Representations For Speaker Identification
    Time-scale feature extractions for emotional speech characterization
    Automatic intonation recognition for the prosodic assessment of language impaired children
    Supervised and semi-supervised infant-directed speech classification for parent-infant interaction analysis
    Do parents recognize autistic deviant behavior long before diagnosis? Taking into account interaction using computational methods
    Generating Robot/Agent Backchannels During a Storytelling Experiment
Bibliography
List of figures
1 Social signal processing: exploitation of non-verbal cues (figure adapted from [Vinciarelli et al., 2009])
1.1 Automatic recognition of speaker roles
1.2 Process of encoding information in speech. Figure taken from [Ringeval, 2011], originally adapted from [Fujisaki, 2004]
1.3 Temporal and frequency processing applied to the residual r
1.4 Voiced segments of a speech signal: role of duration in prominence
1.5 Identification of analysis units by alignment of "emotional states" (Viterbi)
1.6 Diversity of the acoustic and rhythmic anchors of speech
1.7 System for detecting pseudo-phonemes in a speech signal
1.8 Comparison of manual vs. automatic phonetic segmentation of a sentence from the TIMIT corpus
1.9 Evolution of vowel and consonant duration measures across emotions
1.10 Cascade of filters used by [Tilsen and Johnson, 2008] to extract the rhythmic envelope of a speech signal
1.11 Example of a rhythmic envelope extracted from a speech signal
1.12 Principle of the dynamic characterization of prosodic components including speech rhythm (figure taken from [Ringeval, 2011])
1.13 Low-frequency analysis of rhythm: rhythmic envelope and its Fourier transform
1.14 Contribution of the rhythm metrics
1.15 Variations of the measures from conventional (a), mixed (b) and non-conventional (c) rhythm models across emotion categories; (d) Plutchik's wheel of emotions
1.16 Intonation profiles according to the pitch contour
1.17 Strategy for recognizing intonation contours
1.18 Feature space formed by the rhythm metrics
1.19 Distribution of the data across the three semesters studied
1.20 Example of motherese annotation
1.21 ROC curves describing motherese detection performance: Comb1 (k-nn classifier), Comb2 (GMM classifier)
1.22 Classification performance with different amounts of labeled training data
2.1 Early signs of autism as a function of age and of the main developmental axes [Saint-Georges, 2011]
2.2 Automatic analysis of parent-infant interaction
2.3 Developmental representation of the main interaction modes of the infant who will later develop autism [Saint-Georges, 2011]
2.4 Developmental representation of the main interaction modes of the infant who will later develop autism [Saint-Georges, 2011]
2.5 Stages of the system for automatic characterization of synchrony [Delaherche and Chetouani, 2010]
2.6 Experimental configurations
2.7 Parameters of interactional synchrony
2.8 Principle of social signal recognition
2.9 Degree of coordination perceived by judges as a function of inter-judge agreement (mean weighted kappa), measured over all dyads and questionnaire items [Delaherche and Chetouani, 2011a]
3.1 Principle of the model for generating non-verbal feedback
3.2 Principle of interlocutor detection: on-view + on-talk
3.3 Maintaining eye contact: implementation on the Jazz robot
3.4 Triadic situation: interaction between patient, stimulation exercise and therapist/robot (ROBADOM project [Chetouani et al., 2010])
3.5 Description of the proposed system for evaluating the degree of engagement
3.6 Principle of the multi-modal interface deployed in a service robot for elderly people
3.7 Illustrations of work exploiting physiological signals
3.8 Characterization of physiological signals based on data fission
3.9 Trajectory of the center of mass
List of tables
1.1 Correspondence between sessions and microphones
1.2 Speaker identification performance under various conditions (enrollment during session M1) with temporal, frequency and mixed models
1.3 Main characteristics of the speech corpora studied
1.4 Comparison of pseudo-phoneme detection results on various corpora (in %)
1.5 Pseudo-phoneme detection performance by production style on the Berlin corpus (in %)
1.6 Overlap rate (in %) of "p-centers" with the other types of acoustic anchors of speech (mean and standard deviation)
1.7 Emotion recognition scores (in %) on the Berlin corpus: effect of normalizing the a priori information. The relative importance of the segments is given in parentheses (αi, cf. equation 1.14)
1.8 Emotion recognition scores (in %) on the Berlin corpus
1.9 Summary of the characteristics of conventional speech rhythm metrics (taken from [Ringeval, 2011])
1.10 Summary of the characteristics of non-conventional speech rhythm metrics (taken from [Ringeval, 2011])
1.11 Number of sentences available per analysis group for the intonation contour imitation task
1.12 Intonation recognition performance (%): static modeling, dynamic modeling and their fusion for DT subjects
1.13 Analysis of the contributions of the static and dynamic approaches to the characterization of intonation in children
1.14 Number of breath groups available for the analysis of the spontaneous affective speech production task
1.15 Co-training algorithm
1.16 Automatic co-training algorithm for motherese classification
2.1 Correlation between the classifier output (probability) and the coordination degree rating score
3.1 Amount of self-talk and of speech addressed to the system
3.2 Recognition scores (10-fold cross-validation)
3.3 Estimation of the interaction effort (degree of engagement)
3.4 Classification scores from four physiological signals
3.5 Estimation of temporal gait parameters
This thesis presents the research I have carried out since my doctoral thesis, defended in December 2004. It was conducted at the Laboratoire des Instruments et Systèmes d'Ile-de-France, first as an ATER (temporary teaching and research assistant, 2004-2005) and then as a Maître de Conférences (associate professor) from September 2005. Part of the work presented here was carried out during invited stays at the Polytechnic University of Mataró (Spain) and the University of Stirling (Scotland) (spring/summer 2005). Since 2007, I have conducted my research at the Institut des Systèmes Intelligents et de Robotique (ISIR, UMR 7222).
Context
Research on computational models for analyzing human-centered interaction has accelerated considerably in recent years. Interaction may be directed toward other human partners but also toward machines (computers, virtual agents, robots). The purpose of these computational models is the automatic characterization of the signals exchanged with the human during the interaction. Several approaches are currently pursued for analyzing and understanding interaction. One comes from cognitive psychology and focuses on emotion [Picard, 1997]. The main idea of this approach, also called affective computing, is that the perception of others' emotions relies on stereotyped signals (facial expressions, prosody, gestures, ...). Another approach, from linguistics, addresses the semantics of communicative signals [Argyle, 1987; Kendon et al., 1975]. More recently, a new research field has been proposed for the study of interaction: Social Signal Processing (SSP) [Pentland, 2007]. Social signal processing focuses on the analysis of social signals by measuring the amplitude, frequency and duration of prosody, head movements or gestures. It differs from the other approaches in that it addresses non-linguistic and mostly unconscious signals. As we will see in the rest of this document, social signal processing aims to predict behaviors or attitudes (agreement, interest, attention, ...) through the automatic analysis of non-verbal signals.
The analysis of verbal and non-verbal communication signals is at the heart of the methodologies developed for interpreting interactive situations [Kendon et al., 1975; Picard, 1997; Vinciarelli et al., 2008]. Automatic speech processing (Traitement Automatique de la Parole, TAP) provides fundamental tools for analyzing and understanding the verbal component of interaction. Non-verbal signals, whose expression differs across modalities, require a specific methodology. Vinciarelli et al. [2008] identify five cues for characterizing non-verbal signals: physical appearance, gestures and posture, face and eye behavior, vocal behavior, and behavior in space and the environment. Combining these cues conveys various kinds of information such as emotion or intention. Non-verbal signals play a fundamental role in managing interaction and in conveying relational messages (dominance, persuasion, intention, etc.).
The work described in this manuscript exploits certain non-verbal cues to analyze interaction in various situations with a human, a virtual agent or a robot, most often in contexts of clinical investigation and assistance. In line with our doctoral work, we focused on the speech signal, with the particularity of addressing the social message conveyed rather than the linguistic content. The non-verbal cues associated with the speech signal include silent pauses, vocalizations (filled pauses, laughter, crying, etc.), speaking styles (emotion, intention, etc.) and speaking turns. Since communication is by nature multi-modal and dynamic, we sought to characterize the multi-modal signals exchanged with humans.
The objectives of this work are multiple and mainly aim to:
- Improve the understanding of social interactions: emotional and intermodal processes.
- Automatically detect relevant (information-bearing) non-verbal cues during social interactions.
- Develop interactive, social and multi-modal systems to assist people with impairments.
Social signal processing combined with personal robotics, in realistic situations and most often with impaired partners (people with autism, elderly people with or without cognitive disorders), forms the common thread of our research.
Social signal processing
Definitions
Social signals are defined as communicative or informative signals that, directly or indirectly, provide information about "social facts" such as emotions or social relationships. When the sender attaches a meaning to these signals, they are considered communicative; if the receiver of these signals also attaches a meaning to them, they are considered informative. The automatic analysis of these signals is identified as a key challenge for the interpretation and synthesis of social behaviors, and is the subject of an emerging field called Social Signal Processing (SSP)¹ [Pentland, 2007; Vinciarelli et al., 2009].
Social signal processing is an interdisciplinary research field that consists in analyzing, interpreting and predicting social interactions. It aims to study another facet of intelligence called social intelligence, which is reflected in the human ability to successfully predict the mental state of others (theory of mind) and to attribute intentions and emotions to them. A deficit in these abilities characterizes certain pathologies (e.g. autism). Understanding and managing social signals are fundamental steps of social intelligence. One difficulty lies in the diverse forms that social signals can take, with a recognized predominance of non-verbal cues (cf. Figure 1).
The analysis and synthesis of signals with methods from signal processing and computer science in the broad sense are at the heart of social signal processing. Early work in the field showed that social signals, generally described as identifiable by experts in psychology, can now be captured with standard sensors such as microphones and cameras, and interpreted with techniques from signal processing, statistics and machine learning [Pentland, 2004, 2007; Vinciarelli et al., 2009]. Synthesis is not left behind: techniques from computer graphics and human-machine interaction now make it possible to build "realistic" embodied conversational agents (agents communicants animés, ACA) with so-called "natural" and, above all, social behaviors [Swartout et al., 2006; Pelachaud, 2009].
¹ http://sspnet.eu/
Fig. 1: Social signal processing: exploitation of non-verbal cues (figure adapted from [Vinciarelli et al., 2009])
Emergence of a field
Within the nascent social signal processing community, a consensus is emerging on the still exploratory nature of the field. Nevertheless, the characterization of social signals already makes it possible to address applications as diverse as the development of multi-modal dialogue systems or the analysis of phone user profiles. This section presents the current state of this emerging field through a few examples judged significant.
Working definition
The very definition of the phenomena addressed by social signal processing is still being refined. In the introduction, we mentioned a so-called working definition: "A social signal is a communicative or informative signal that, directly or indirectly, provides information about "social facts" such as emotions or social relationships". This definition, proposed by the SSPNet network of excellence², places communication with others at the center of modeling and processing.
Pentland [2008] proposes "honest signals", whose initial definition is: "behaviors that are sufficiently hard to fake that they can form the basis for a reliable channel of communication". These signals can of course have a social component and have been used to predict behaviors in various situations, such as predicting the outcome of a negotiation [Curhan and Pentland, 2007].
² http://sspnet.eu/
The interactive or social component of processing brings a different angle of analysis. For example, social emotions differ from so-called individual emotions. The latter most often correspond to primary emotions such as sadness or joy; they are considered individual because they are not directed toward others. Social emotions, by contrast, are intended to produce an effect on the partner, as is the case with admiration or compassion. They play a regulating role in interaction and are often expressed through so-called non-prototypical emotions. We will return in this document to the value of this distinction and to the stakes of defining social signals.
Selected significant work
Predicting human behavior is one of the challenges of social signal processing. As with any challenge, it is legitimate to study its limits, characterized here by predictability. Song et al. [2010] studied the predictability of human behavior using users' mobile phones as a source of information (geolocation). Their methodology for modeling behavior is based on entropy measures combined with statistical models of geolocation data [Song et al., 2010]. On a set of 50,000 users, the authors show that the predictability of mobility saturates at 93% for all regular trips of more than 10 km. This result shows that, beneath an apparently random surface, behaviors are characterized by a certain regularity.
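To make the notion of entropy-based regularity concrete, here is a minimal sketch of two entropy measures computed on a location trace. It is an illustration only, not the procedure of Song et al. [2010], which also accounts for temporal order and converts entropy into a predictability bound. The function names and the toy trace are assumptions.

```python
import math
from collections import Counter

def random_entropy(locations):
    """Entropy in bits if every distinct location were equally likely."""
    return math.log2(len(set(locations)))

def uncorrelated_entropy(locations):
    """Shannon entropy in bits of the visit frequencies, ignoring temporal order."""
    counts = Counter(locations)
    total = len(locations)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical location trace for one user (labels are illustrative only).
trace = ["home"] * 8 + ["work"] * 6 + ["gym"] * 2
s_rand = random_entropy(trace)
s_unc = uncorrelated_entropy(trace)
print(f"random entropy: {s_rand:.3f} bits, frequency-based entropy: {s_unc:.3f} bits")
```

The frequency-based entropy is lower than the random entropy as soon as some places are visited more often than others, which is the kind of regularity the paragraph above refers to.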
Statistical models have their foundations in signal processing, pattern recognition and machine learning. These models exploit regularity to extract high-level information. The examples presented in this section are intended to illustrate the methodologies used to characterize social signals and to identify the major challenges of this research field.
Human activity
Eagle and Pentland [2009] proposed representing the structure of daily behaviors through a principal component analysis of geolocation data. The eigenvectors of this analysis are called "Eigenbehaviors", by reference to Eigenfaces (face image processing). An individual's daily activities are explained by the principal components. The first n axes of the analysis characterize repetitive behaviors (being at home, at work...). The remaining axes describe behaviors that are less precise and, above all, less typical of the individual. This fine-grained analysis of behaviors can also be extended to groups of individuals. Pentland thus proposed a research field called "reality mining", which Technology Review magazine listed among the ten technologies that will change the way we live.
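The idea of an eigenbehavior can be sketched on toy data. The example below is a simplified illustration, not the implementation of Eagle and Pentland [2009]: each day is encoded as a hypothetical binary vector (1 = at work, 0 = elsewhere, over six time slots), and the leading eigenvector of the covariance matrix is obtained by power iteration. All names and data are assumptions.

```python
def mean_center(rows):
    """Subtract the per-slot mean from every day vector."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of the centered day vectors."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: converges to the eigenvector of largest eigenvalue."""
    d = len(matrix)
    v = [1.0 / (k + 1) for k in range(d)]  # deterministic starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Hypothetical week: five working days followed by two days off.
weekday = [0, 1, 1, 1, 1, 0]
weekend = [0, 0, 0, 0, 0, 0]
days = [list(weekday) for _ in range(5)] + [list(weekend) for _ in range(2)]

centered = mean_center(days)
eigenbehavior = leading_eigenvector(covariance(centered))
# Projection of each day on the first eigenbehavior.
scores = [sum(c * e for c, e in zip(day, eigenbehavior)) for day in centered]
print(eigenbehavior, scores)
```

In this toy case the first eigenbehavior is concentrated on the four working slots, and the projection separates working days from days off.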
The main idea of reality mining is to exploit the regularity of behaviors. A similar idea is found in data mining. Latent semantic analysis offers a model that is often considered relevant for document mining. The factors that bring the analysis of human behavior closer to that of documents are (1) a data representation that is often fairly simple (histograms, bag of words...), (2) a decomposition that allows a high-level semantic interpretation and, for some models, (3) the combination of supervised and unsupervised methods and (4) a probabilistic decomposition that allows a so-called soft decomposition. In Chapter 2, we present several contributions to the analysis and discovery of regular structures in human behavior, on longitudinal data (several months) and from an interactive perspective (mutual influence of the partners).
Approaches inspired by data mining use decomposition methods to define a semantic space in which behaviors can be identified and predicted. One of the challenges is to propose methods that combine structure identification, often based on unsupervised approaches, with prediction based on supervised approaches. Farrahi and Gatica-Perez [2010] recently combined these approaches elegantly using probabilistic topic models (LDA, Latent Dirichlet Allocation).
Quite simply, vocal activity...
Interaction is intrinsically tied to verbal production, and vocal activity reflects the strategies of each participant. [Campbell, 2010] offers an interesting analysis of interactive situations (dialogue, meeting) based on vocal activity, without identifying the linguistic content. Campbell suggests that the synchrony of the vocal activity of several speakers can be used to characterize the interaction: relationships between speakers, proposal phases, agreement, interest... Vocal activity reflects not only when an individual takes the floor but also the moments of silence (including pauses), which play a role in coordination. In a multimodal context, Campbell also showed that an individual's vocal activity is correlated with the quantity of movement [Campbell, 2008, 2010]. The role of the participants in an interaction directly influences the dynamics of vocal activity (speaking turns). Vinciarelli [2009] exploits the temporal proximity of interventions to build a Social Affiliation Network. This network is a graph with two types of nodes, representing actors and events. The actors are the participants and the events are defined, in that work, as time windows of uniform duration. An actor's participation in an event links that actor to the event. The result is a simple representation of the relationships between participants: each individual is represented by a vector that counts his or her participations in each event. Characterizing the temporal proximity of interventions has been applied successfully to role recognition and to the characterization of social groups [Vinciarelli, 2009]. Modeling the temporal dynamics of social signals is a step common to a large number of problems in social signal processing. Chapters 1 and 2 present some aspects of the temporal characterization of social signals for the interpretation of human behavior.
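The actor/event representation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system of Vinciarelli [2009]: speaker turns are assumed to be given as hypothetical (speaker, start, end) triples in seconds, events are uniform windows, and a turn counts as one participation in every window it overlaps.

```python
import math

def affiliation_vectors(turns, window, duration):
    """Return {speaker: [participation count per uniform time window]}."""
    n_events = math.ceil(duration / window)
    vectors = {}
    for speaker, start, end in turns:
        vec = vectors.setdefault(speaker, [0] * n_events)
        first = int(start // window)
        last = min(math.ceil(end / window), n_events)
        for event in range(first, last):
            vec[event] += 1  # the turn overlaps this window
    return vectors

def co_participation(u, v):
    """Dot product: a simple proximity measure between two actors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 30-second recording segmented into speaker turns.
turns = [("A", 0, 4), ("B", 4, 12), ("A", 12, 18), ("C", 18, 20), ("B", 21, 29)]
vectors = affiliation_vectors(turns, window=10, duration=30)
print(vectors)
print(co_participation(vectors["A"], vectors["B"]))
```

Features derived from such vectors (together with speaking time) could then be passed to a classifier for role recognition; the window length is a free parameter of this sketch.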
Sensitive agent
In the context of interaction with a virtual agent or a robot, the dynamics of the exchanges depend on the interpretation of the social signals emitted by the human, but also on the agent's ability to produce adequate responses.
An interesting theoretical framework for studying these dynamics is the synthesis of a so-called active listener (Active Listening). The task is to develop an agent that is attentive to the signals emitted by the human partner. Attention is expressed by producing feedback (e.g. head nods, vocalizations). The approach generally followed is to learn communication rules from interactive situations involving only humans (data-driven approach). The computational models are enriched with knowledge from pragmatics (verbal and non-verbal adjustments). Extracting low-level information (acoustic and prosodic cues, head movements or gaze direction) is a required step in modeling the interaction [Morency et al., 2008; Al Moubayed et al., 2009].
One of the challenges is to define learning methods capable of (1) extracting the relevant signals, which often includes a feature selection phase, and (2) generalizing to complex interactive situations. Indeed, even though pragmatics and the psychology of interaction offer a usable knowledge base (one that should be used), research aimed at learning the dynamics of an interaction shows that identifying the role of the signals is not simple. Moreover, models learned from face-to-face situations involving only humans tend to reflect behaviors that are individual and often stereotyped because of the experimental context, which further limits the agent's ability to interact in new situations.
As an example, the SEMAINE project produced an integrated system based on an embodied conversational agent (ECA) whose distinctive feature is the real-time adaptation of the dialogue and of the ECA's state according to the perceived social signals [Schroder et al., 2011]. One of the major challenges is the design of interactive systems "enriched" with the ability to perceive and adapt to social signals, thereby allowing better regulation (duration, quality, fluidity...).
Pathologies, psychological states, psychiatry
As we will see in this manuscript, processing "atypical" social signals, which most often result from pathologies, offers a privileged framework by focusing on the search for differential markers. Cohn [2010] reviews current work on automatic face processing for the objective assessment of social, psychological or pathological states. Facial expression analysis methods have thus been used to characterize mother-infant emotional synchrony, to detect pain and to estimate the severity of depression. Cohn [2010] rightly stresses the potential of these methods for clinical research.
In the general context of understanding child development, Meltzoff et al. [2009] proposed a theoretical and experimental framework, called Social Learning, that brings together psychology, neuroscience, education science and machine learning. Social Learning aims to introduce a social component into models (computational or otherwise). The authors identify three required skills: imitation, joint attention and empathy. Imitation requires observation and most often allows faster learning than an approach based only on individual discovery. Joint attention allows information to be shared and attention to be focused on the relevant elements of the interaction (e.g. key moment, object). Empathy and social emotions regulate interactive acts. These components are put into perspective with a convergence of different disciplines in order to improve knowledge about social learning. Note that robots are used as an investigation tool, by exploiting their learning capabilities and, more generally, their agency [Meltzoff et al., 2010]. The development of computational methods for studying and analyzing interaction, particularly in pathology, is directly influenced by fields such as psychology and neuroscience. Throughout our research we have tried to promote this convergence, as shown by the work presented in this manuscript.
Theoretical and application issues
Social signal processing offers new perspectives for the automatic analysis and synthesis of behaviors by proposing innovative research problems focused on interaction. By placing the social context at the center of the studies, social signal processing advances knowledge about interaction (e.g. individual vs. social emotions). As noted in [Vinciarelli et al., 2011], the very notion of behavior has changed in recent years: the problem, initially centered on the detection of simple actions (walking, gesture), now focuses on the social, affective and, more generally, psychological components of actions. Research on the detection, tracking or recognition of simple actions of course remains relevant, but in many cases the social context makes it possible to resolve ambiguities and to improve overall interpretation performance.
Both the theoretical and the application issues have been described in a recent literature review [Vinciarelli et al., 2011]. They correspond to the study of social intelligence, for both the analysis and the synthesis of behaviors in interactive contexts. The applications mentioned include enriched indexing (emotions, laughter), latest-generation phones (geolocation, conversation analysis), mediated interaction (videoconferencing), marketing (user profiles), virtual worlds (e.g. Second Life), embodied conversational agents and social robotics.
Our work addresses a subset of social signal processing, namely the analysis and interpretation of signals and behaviors. The ambiguity of social signals, and the subjectivity of evaluation that often comes with it, make characterization complex. On this point, we will see that clinical research offers a singular framework that encourages a rigorous approach. The proposed models for analyzing and interpreting social signals exploit the regularity of human signals and behaviors. This regularity appears when human behaviors are placed in context: temporal, spatial and also social. Vinciarelli et al. [2011] mention three important challenges of the field: multimodality, fusion and context.
Multimodality
Since communication is multimodal, exploiting several sources of information should make it possible to refine the processing. However, understanding the individual contributions of the signals to the success of multimodal interaction is identified as a major open problem.
Fusion
From a methodological point of view, fusion is central to multimodal signal processing. The problems are similar to those of other fields such as biometrics [Faundez-Zanuy, 2005], and require studying the fusion level (e.g. feature space, decision...), the existing correlations (e.g. complementarity, redundancy) [Kuncheva, 2004] and also the time scales (e.g. analysis window, contingency) [Chetouani et al., 2009d].
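The two fusion levels mentioned above can be contrasted in a few lines. This is a generic sketch, not a method taken from the cited works: feature-level (early) fusion is shown as a plain concatenation, and decision-level (late) fusion as a weighted average of per-modality class posteriors. The modality names, weights and values are hypothetical.

```python
def early_fusion(audio_features, visual_features):
    """Feature-level fusion: concatenate the per-modality feature vectors."""
    return list(audio_features) + list(visual_features)

def late_fusion(posteriors_per_modality, weights):
    """Decision-level fusion: weighted average of class posteriors."""
    total = sum(weights)
    n_classes = len(posteriors_per_modality[0])
    return [
        sum(w * p[c] for w, p in zip(weights, posteriors_per_modality)) / total
        for c in range(n_classes)
    ]

# Hypothetical outputs of two single-modality classifiers over two classes.
audio_posterior = [0.6, 0.4]
visual_posterior = [0.2, 0.8]
fused = late_fusion([audio_posterior, visual_posterior], weights=[1.0, 1.0])
decision = max(range(len(fused)), key=lambda c: fused[c])

joint_vector = early_fusion([0.1, 0.3], [0.7])
print(fused, decision, joint_vector)
```

Early fusion lets a single classifier model cross-modal correlations, whereas late fusion keeps the modalities independent and is easier to apply when their time scales differ.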
Context
Because of the nature of social signals, analysis and interpretation often depend on the context. One of the major open problems in social signal processing is to move from a W4 context (where, what, when, who), which deals only with apparent information, to a W5+ context (where, what, when, who, why, how). This level of information is required for the analysis of behaviors and of psychological and cognitive states. Taking such a level of context into account in interactive systems is at the heart of social intelligence.
Positioning and research themes
Atypical social signal processing
Our research activities address the analysis, characterization, recognition and modeling of social signals and behaviors. The richness and complexity of communication signals and behaviors call for models that are non-linear, adaptive and contextualized (person, environment, task, cognitive/affective state...). Our work has been motivated by the understanding of the fundamental mechanisms of communication, most often expressed through the exchange of signals. We have paid particular attention to the alterations of these exchanges that manifest as communication disorders (autism, mild cognitive impairment).
Social signal processing is, by nature, linked to psychology, and our research activities have led us to build ties with psychology and psychiatry. We have therefore proposed, or at least tried to promote, a specific field: atypical social signal processing.
Our collaborations with clinicians have greatly helped refine our view of social signal processing. This document aims to present this view, which is oriented toward the characterization, analysis and prediction of social signals and behaviors.
Positioning
From a methodological point of view, my work covers important stages of social signal processing (machine sensing and understanding of social behaviors) and of personal robotics (socially assistive robotics), relying on statistical methods from signal processing, pattern recognition and machine learning. My contributions are grouped here into three areas considered representative of the work carried out:
1. Characterization of the social components of the speech signal (since 2005). A number of open problems remain here, in particular the extraction of features that reflect the social message and their classification in varied contexts: human-robot interaction, and interaction with impaired or naive users.
2. Dynamics of human communication (since 2007). The scientific challenges lie in recognizing, modeling and predicting human non-verbal behaviors, which are by nature non-linear, dynamic and asynchronous.
3. Social intelligence for personal robotics (since 2008). Recognizing and managing complex social signals such as speaking turns, engagement, joint attention... form the fundamental elements of social intelligence. Giving robotic systems this facet of intelligence makes it possible to explore new aspects of personal robotics by allowing continuous adaptation to the behaviors of non-expert and/or impaired people.
Chapter 1
Characterization of speech signals: from the signal to the social message
1.1 Context
Speech, the social signal par excellence, conveys information needed to establish and regulate interactions and therefore plays a leading role in interactive systems. The research carried out by the automatic speech processing community now makes it possible to characterize and extract rich information that goes beyond simple transcription. Indeed, the speech signal conveys many kinds of information: phonetic content, language, speaker identity, non-linguistic vocalizations, affective states... Rich transcription aims to extract information (metadata) that complements what a speech recognition system extracts. Knowing these metadata has a structuring effect (e.g. information retrieval, document summarization).
The social component of the extracted metadata is what distinguishes social signal processing from indexing. To better situate the problems of this research field, let us return to the outline of an example presented in the introduction of this manuscript, which uses vocal activity for role recognition [Vinciarelli, 2009]. The processing consists in 1) extracting the speaking time of each speaker, 2) building a social affiliation network and 3) categorizing features extracted from the network (temporal proximity, speaking time) with a classifier (see Figure 1.1).
Rich transcription, as currently addressed in the NIST campaigns (Rich Transcription, then TREC Video Retrieval Evaluation) [Smeaton et al., 2006], would deal with extracting speaker information (speaker segmentation). Social signal processing aims to extract higher-level information, namely the role of a participant; the identity of the speakers is an input to the problem.
Fig. 1.1: Automatic recognition of speaker roles. [Block diagram with the following blocks: audio stream; segmentation of the audio stream; extraction of the different social groups; role recognition; roles; feature extraction.]
The added value of social signal processing lies, at least in this work, in the development of an automatic processing methodology that incorporates knowledge from the humanities and social sciences to model interaction: social affiliation networks [Vinciarelli, 2009]. Of course, the very effectiveness of this methodology requires robust and accurate detection of the signals that characterize human activity (speaker segmentation).
The work presented in this chapter concerns the characterization of the social component of the speech signal. This is a fundamental and difficult problem: determining, from a signal, information as diverse as the identity of the speaker, his or her affective and pathological state, or his or her intention. The following sections introduce the problem and justify our scientific positioning.
1.2 Encoding of information in speech
1.2.1 Information conveyed
The speech signal carries multiple kinds of information, which can be divided into three categories: 1) linguistic, 2) para-linguistic and 3) extra-linguistic. Transmitting linguistic information is considered the primary purpose of speech. Linguistic information is of several types (e.g. phonemes, words) and is represented by a finite, discrete set of symbols. Para-linguistic information modifies and adds elements that are useful for understanding and interpreting the message. It can be added deliberately by the speaker to modify or enrich the linguistic information, and it characterizes the modality of the sentence or the speaker's intention. The speaker's affective state and attitude are factors likely to increase the variability of the speech signal. The more personal the interaction, the more dominant para-linguistic information becomes, and the more it requires specific characterization. Individual traits of the speaker, such as gender, age or personality, affect the extra-linguistic information.

Fig. 1.2: Process by which information is encoded in speech. Figure taken from [Ringeval, 2011] and originally adapted from [Fujisaki, 2004]. [Block diagram. Input information: linguistic (lexical, syntactic, semantic, pragmatic), para-linguistic (intentional, attitudinal, stylistic), extra-linguistic (physical, emotional). Rules and constraints: grammar rules, prosody rules, physiological constraints, physical constraints. Stages: message organization, sentence organization, motor command generation, speech sound production. Output: segmental and suprasegmental features of speech.]

Figure 1.2 recalls the complexity of the encoding process and how blurred the boundaries between the categories are. For example, in the model proposed by [Fujisaki, 2004], emotional information is considered non-linguistic, whereas other definitions categorize it as para-linguistic [Campbell, 2007; Keller, 2004].

Automatic characterization of the speech signal requires studying and exploiting linguistic, para-linguistic and extra-linguistic information. The convergence between the characterization of linguistic information and of speaker identity is one of the factors that fostered the emergence of rich transcription. In the same vein, after a doctoral thesis mainly devoted to characterizing the linguistic content of the speech signal, I turned to para-linguistic information. This chapter summarizes the intellectual path followed, from the characterization of speaker information to that of affective information.
1.2.2 Automatic characterization of speech signals
The different kinds of information conveyed by the speech signal (e.g. linguistic content, identity, affective state...) introduce variability. Depending on the application (e.g. speech, speaker or emotion recognition), automatic processing aims to identify one of these sources of variability. To do so, recognition systems are generally built around three stages: 1) acquisition and pre-processing, 2) extraction of features (or parameters) and 3) classification.
In pattern recognition, the distinction between feature extractor and classifier is not so clear-cut. [Duda et al., 2000] thus explain that, in a Bayesian framework, the simplest classification operation can be reduced to the "max" function, provided that a reliable estimate of the probabilities is available:
$$x \in C_i \iff p(C_i \mid x) > p(C_j \mid x) \quad \forall j \neq i \qquad (1.1)$$
where $p(C_i \mid x)$ is the posterior probability that the correct class is $C_i$ when $x$ is observed.
The probabilities then form the feature vector, and the comparison operator is the classifier. This formalization is one of the foundations of hybrid approaches such as neuro-Markovian models (probability estimation by a neural network) and, in a more integrated way, of the TRAPs (TempoRAl Patterns) proposed by Hermansky and Sharma [1999].
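The decision rule of equation (1.1) can be written directly as an argmax over posteriors. The sketch below assumes that class priors and class-conditional likelihoods for one observation are already available (in a hybrid system they would come from a trained estimator); the numerical values are hypothetical.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: p(C_i | x) = p(x | C_i) P(C_i) / sum_j p(x | C_j) P(C_j)."""
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

def bayes_decide(post):
    """The 'max' classifier: index of the class with the largest posterior."""
    return max(range(len(post)), key=lambda i: post[i])

# Hypothetical three-class problem for one observation x.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.4]  # p(x | C_i), assumed given
post = posteriors(priors, likelihoods)
chosen = bayes_decide(post)
print(post, chosen)
```

Here the vector of posteriors plays the role of the feature vector and the argmax plays the role of the classifier, as described above.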
Problem statement
The automatic characterization of signals, as we define it in our work, requires advanced models for extracting and classifying features. The justifications for this approach are the following:
Feature extraction: Located upstream of the recognition phases and often directly confronted with raw sensor signals, feature extraction makes it possible to take the nature of the signals into account (spectral content, noise, linearity, stationarity...) while producing a compact representation of them. The methods developed, at least in speech processing, rely on signal processing techniques.
Classification: The current issues of interactive systems, both theoretical and applied, make it necessary to enrich the extracted information with metadata (speaker identity, psychological states...). Pattern recognition, and most often machine learning, play a leading role in modeling and classifying the features. Indeed, many identical features can be used for different tasks: acoustic features for speech and speaker recognition, prosodic features for emotions and intentions... In this setting, classification acts as the extraction of higher-level information.
Formally, feature extraction consists in determining a function $F$ and applying it to a speech signal $s$ in order to extract a set of $N$ features $f$:
$$f = F(s) \qquad (1.2)$$
Depending on the approach, the function $F$ can be defined by signal processing techniques (e.g. Linear Predictive Coding), by learning (e.g. TRAPS, Neural Predictive Coding, see the thesis work [Chetouani, 2004]), by projection (e.g. principal component analysis) or by selection (e.g. Sequential Feature Selection).
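As one concrete instance of a function F defined by signal processing, the sketch below computes linear prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. It is a textbook illustration in plain Python, not the feature extractor used in this work; the function names and the synthetic test signal are assumptions.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame."""
    n = len(signal)
    return [sum(signal[t] * signal[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations; returns [1, a_1, ..., a_order] such that
    s[t] is predicted by -(a_1 s[t-1] + ... + a_order s[t-order])."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def lpc_features(frame, order):
    """F(s): map a frame to its `order` prediction coefficients."""
    return levinson_durbin(autocorrelation(frame, order), order)[1:]

# Synthetic first-order signal s[t] = 0.9 * s[t-1] (illustrative only).
frame = [0.9 ** t for t in range(200)]
features = lpc_features(frame, order=2)
print(features)
```

On this synthetic frame the first coefficient is close to -0.9 and the second close to 0, which matches the way the signal was generated.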
The scientific challenges generally identified concern reducing 1) the complexity of the extractor and 2) the dimension of the representation; for classification, the aim is to control 3) generalization and 4) modeling and/or discrimination power.
Classification also reduces the dimension of the data by associating with each feature set a membership in one or more classes. Classification consists in assigning to the vector $f$ the class $C_i$, where $i \in \{1, ..., N\}$:
$$C_i = C(f) \qquad (1.3)$$
where $C$ is the classification function, whose instantiation varies with the methods used: direct, structural, statistical... Among the relevant algorithms, k-nearest neighbors, neural networks, Gaussian mixture models and support vector machines are the most popular.
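Among the classifiers listed, k-nearest neighbors is the simplest to write down as an instance of the function C of equation (1.3). The following is a minimal sketch with Euclidean distance and majority voting; the two-dimensional feature vectors and the class labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign to `query` the majority label among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors with illustrative labels.
train = [
    ((0.0, 0.0), "neutral"), ((0.0, 1.0), "neutral"), ((1.0, 0.0), "neutral"),
    ((5.0, 5.0), "angry"), ((5.0, 6.0), "angry"), ((6.0, 5.0), "angry"),
]
label_a = knn_classify(train, (0.5, 0.5))
label_b = knn_classify(train, (5.5, 5.5))
print(label_a, label_b)
```

The value of k and the distance measure are the main design choices; both would normally be selected on held-out data.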
Beyond the choice of algorithm, the performance of recognition systems depends on parameter optimization (e.g. model selection), on how performance is evaluated (e.g. n-fold cross-validation, bootstrap) and on the databases used (e.g. variability). When processing communication signals, building databases is a major difficulty and requires a rigorous approach so as not to bias the performance. The methods used generally rely on manual annotation of corpora produced by actors (prototypical and often exaggerated emotions) or extracted from realistic scenes [Devillers et al., 2005]. Non-acted data are difficult to collect and often subject to ethical constraints. In affective computing, Schuller et al. [2010] propose an interesting study of the reproducibility of emotion recognition performance when data from different corpora are mixed. Processing social signals produced in realistic contexts is one of the greatest current challenges of interaction.
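The n-fold cross-validation mentioned above amounts to partitioning the sample indices so that each sample is used exactly once for testing. The sketch below shows only this partitioning step, with an interleaved assignment of samples to folds; it is an assumption-laden illustration and does not group samples by speaker or corpus, which a rigorous protocol may require.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]  # interleaved folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

A recognition rate would then be computed on each test fold and averaged over the folds.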
Scientific positioning
Our research on the characterization of speech signals is motivated by the extraction of the social component of the speech signal. This component includes the identity of the speaker as well as affective and communicative states.
Our contributions fall within the field of feature extraction and concern more precisely:
The statistical nature of speech signals: the contribution of a model suited to the complexity of the signal (Gaussian vs. non-Gaussian, stationarity...).
The temporal units of analysis: instead of optimizing a given feature vector, we chose an approach that optimizes and multiplies the anchor points (when should features be extracted?).
The dynamics of speech: rhythm is a complex notion that is under-exploited in the characterization of affective speech. We proposed a set of methods for taking rhythm into account in recognition systems.
The subjectivity of social signal annotation: collecting and annotating natural data are necessary but complex tasks, because they are time-consuming and, above all, require experts who are not always available. Moreover, annotated data are not always reliable. Semi-supervised learning makes it possible to take advantage of the data already annotated and to strengthen the prediction capabilities of classifiers.
We addressed the characterization of speech signals through two applications: speaker recognition (statistical nature of the signals) and affective speech recognition (anchors and dynamic models). Our research currently focuses on the latter application in a context tied to clinical investigation: differential diagnosis, early signs, assessment... The experimental side of our work led us to deal with diverse kinds of data: acted, imitated, natural and spontaneous.
The following sections present our contributions to the characterization of the social components of the speech signal.
1.3 Speaker information
Two approaches are generally followed to characterize speaker information; they differ in the level of abstraction of the speech-signal components: high-level and low-level features [Reynolds et al., 2003]. The low-level component conveys information about the structure of the vocal tract, whereas the high-level component expresses behavioural information such as prosody, phonetics, conversation structure, etc. The temporal dimension is a fundamental difference between the two components. Low-level information is estimated from short analysis windows (<30 ms), whereas high-level information often requires durations beyond one second.
Techniques for characterizing speaker information are largely dominated by low-level approaches such as MFCC (Mel Frequency Cepstral Coefficients) and LPCC (Linear Predictive Cepstral Coefficients) coding, generally combined with auxiliary features such as signal energy or dynamic parameters (velocity ∆ and acceleration ∆∆). However, several initiatives seek to challenge this supremacy. The European action COST 277 on non-linear speech processing, coordinated by Marcos Faundez-Zanuy (EUPMT) from 2001 to 2005, is one of these initiatives aimed at questioning traditional speech-processing methods [Faundez-Zanuy et al., 2005]. Note that the discussions held within the action led to reformulating the question of non-linear processing into a more ambitious one about proposing non-conventional methods [Chetouani et al., 2009b]. The activities of this European action gave rise to the ISCA NOLISP (Non-Linear Speech Processing) conference, whose first event, NOLISP'03, was organized by Frédéric Bimbot (IRISA); there, during my PhD, I met many people who have, directly or indirectly, influenced my research. In 2007, I was the main organizer of NOLISP'07 in Paris [Chetouani et al., 2009c], which was the first event held after the end of the action and therefore without dedicated funding. A community thus formed, making possible the organization of NOLISP'09 (University of Vic, Spain) and NOLISP'11 (University Las Palmas de Gran Canaria).
After my doctoral thesis, my contributions in this field focused on exploiting higher-order statistics in feature-extraction models. Justifications for such an approach are given in [Chetouani et al., 2009a] and build on the work of Kubin [1995]. These justifications concern the dilemma between non-linear and non-Gaussian modelling [Little, 2011].
1.3.1 Prediction residual
It is generally accepted that the speech signal results from the excitation of the vocal tract by a source (periodic or not), which forms the basis of the source-filter model. In linear prediction analysis, the vocal tract is modelled by a linear predictive filter, LPC (Linear Predictive Coding), and the excitation by the prediction residual. The analysis consists in estimating the LPC coefficients by minimizing the prediction error. The sample $\hat{s}(k)$ is estimated as a linear combination of the p previous samples [Atal and Hanauer, 1971]:
\[ \hat{s}(k) = -\sum_{i=1}^{p} a_i \, s(k-i) \tag{1.4} \]
The LPC coefficients $a_i$ are related to the vocal tract and hence to a partial characterization of the speaker's individuality. LPCC (Linear Predictive Cepstral Coefficients), derived from the LPC coefficients, are used successfully in speaker recognition. The parameter p (filter order) plays an important role in the modelling, because it affects both the performance of the predictor and the dimension of the feature vector used by the classifier.
In linear prediction analysis, the residual is obtained as the error between the current sample $s(k)$ and the predicted sample $\hat{s}(k)$:
\[ r(k) = s(k) - \hat{s}(k) \tag{1.5} \]
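Equations (1.4) and (1.5) can be illustrated with a minimal NumPy sketch. It assumes the autocorrelation method for estimating the coefficients, which the text does not specify; the function names `lpc_coefficients` and `prediction_residual` and the toy signal are hypothetical and chosen only for illustration.

```python
import numpy as np

def lpc_coefficients(s, p):
    """Estimate a_1..a_p of equation (1.4) with the autocorrelation method (an assumption)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    # Autocorrelation at lags 0..p
    r = np.array([np.dot(s[: n - lag], s[lag:]) for lag in range(p + 1)])
    # Toeplitz normal equations R a = -r[1:], the sign following equation (1.4)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, -r[1:])

def prediction_residual(s, a):
    """Equation (1.5): r(k) = s(k) - s_hat(k) = s(k) + sum_i a_i s(k-i), zero initial conditions."""
    s = np.asarray(s, dtype=float)
    return np.convolve(s, np.concatenate(([1.0], a)))[: len(s)]

# Toy signal: impulse response of a known, stable all-pole filter of order 2
true_a = np.array([-1.2, 0.8])
s = np.zeros(400)
s[0] = 1.0
for k in range(1, len(s)):
    s[k] = -sum(true_a[i] * s[k - 1 - i] for i in range(2) if k - 1 - i >= 0)

a_hat = lpc_coefficients(s, p=2)
res = prediction_residual(s, a_hat)
```

On this toy signal the estimated coefficients match the filter that generated it, and the residual reduces to the single excitation impulse, which is the sense in which the residual "represents the excitation".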
In theory, the residual is decorrelated from the speech signal and should represent the excitation. The residual is assumed to convey speaker-dependent information. Several researchers have proposed exploiting this signal to improve automatic recognition systems [Thévenaz and Hugli, 1995; Faundez-Zanuy, 2007; Yegnanarayana et al., 2001; Mahadeva Prasanna et al., 2006; Zheng et al., 2007; Chetouani et al., 2009a]. The methods generally used exploit the theoretical orthogonality between the models (LPC and residual modelling) [Thévenaz and Hugli, 1995]. Mahadeva Prasanna et al. [2006] use an auto-associative neural network to characterize the residual and show that relatively high performance can be reached using only features extracted from the residual.
1.3.2 Taking the nature of the residual into account
Many studies have stressed the importance of a finer modelling of the residual. Faundez-Zanuy [2007], for example, proposes a non-linear prediction analysis (predictive neural network). The work of Thyssen et al. [1994] and Tao et al. [2004] highlights the non-linear character of the residual. Thyssen et al. [1994] show that a cascade of linear predictors is needed to remove every linear component from the residual. Our approach consists in exploiting and complementing the state-of-the-art feature extractors (MFCC and LPCC) with a suitable characterization of the residual.
Contributions  We started from the observation that linear predictive analysis is based on second-order statistics (covariance, autocorrelation), which by definition are not suited to modelling non-Gaussian processes. Designing non-linear predictive models is one possible route [Mahadeva Prasanna et al., 2006] (cf. thesis work: Neural Predictive Coding [Chetouani, 2004]). We subsequently opted for another approach: rather than calling the whole feature-extraction stage (LPCC or MFCC) into question, we propose complementary features. Our approach treats linear predictive analysis as a signal-decomposition step that yields (1) a component modelled by the LPC filter and (2) the residual, which also carries information (cf. equation 1.5). The filter order p, the chosen algorithms and the noise are all factors that influence the nature of the residual and therefore the modelling required.
The model we proposed follows this approach and consists in modelling the residual with linear predictors. As figure 1.3 illustrates, we exploited two models based on second- and third-order statistics. The first approach uses a predictive model of the speech residual, along the lines of LPC coding. An auto-regressive model is thus estimated:
\[ \hat{r}(k) = -\sum_{i=1}^{\rho} \alpha_i \, r(k-i) \tag{1.6} \]
where r and ρ denote the residual and the filter order, respectively. As with LPC coding, the coefficients $\alpha_i$ are not used directly by the classification stage. We use a cepstral derivation similar to the one that yields the LPCC parameters (LPC->LPCC). The resulting parameters are named R-SOS-LPCC because they come from a model of the residual based on second-order statistics, followed by a derivation in the cepstral domain.
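The LPC->LPCC step can be sketched with the classical cepstral recursion for an all-pole model 1/A(z), with A(z) = 1 + Σ a_i z^{-i}, which is the sign convention of equations (1.4) and (1.6). Whether the author's implementation uses exactly this recursion is an assumption; the function name `lpc_to_cepstrum` is hypothetical and the gain term c_0 is left out.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1/A(z), A(z) = 1 + sum_i a_i z^-i.

    Recursion: c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}, with a_n = 0 for n > p.
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                acc += k * c[k - 1] * a[n - k - 1]
        c[n - 1] = -a_n - acc / n
    return c

# Single-pole check: 1 / (1 - 0.5 z^-1) has cepstrum c_n = 0.5**n / n
c = lpc_to_cepstrum([-0.5], n_ceps=3)
```

Applied to the coefficients α_i of equation (1.6) instead of the a_i of equation (1.4), the same recursion would give residual-based cepstral features in the spirit of R-SOS-LPCC.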
The second model exploits higher-order statistics (cf. figure 1.3). The coefficients of a linear predictor are estimated by solving the Yule-Walker equations [Atal and Hanauer, 1971], which link the coefficients $a_i$ to the autocorrelation matrix R (second-order statistic). An extension to higher orders is possible provided the statistics concerned are estimated under constraints. Estimating the parameters of an AR model from a third-order cumulant, denoted C, was proposed by Paliwal and Sondhi [1991] and leads to the following Yule-Walker equations:
\[ \sum_{i=0}^{p} a_i \, C_i(l, m) = 0 \tag{1.7} \]
with the constraints 1 ≤ l ≤ p, 0 ≤ m ≤ l.
Fig. 1.3: Temporal and spectral processing applied to the residual r. [Block diagram: speech signal → linear prediction analysis → prediction residual; the residual feeds three branches: second-order statistical model analysis → R-SOS-LPCC coefficients, third-order statistical model analysis → R-HOS-LPCC coefficients, and frequency analysis → R-PDSS coefficients.]
The third-order cumulant of a signal s is defined by:
\[ C_i(l, m) = \sum_{v=p+1}^{V} s_{v-i} \, s_{v-l} \, s_{v-m} \tag{1.8} \]
where V is the size of the analysis window.
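Equations (1.7) and (1.8) can be turned into a small numerical sketch. The system has more equations than unknowns once a_0 is fixed to 1, so the sketch solves it in the least-squares sense; that choice, the convention a_0 = 1 and the function names are assumptions made for illustration and are not taken from Paliwal and Sondhi [1991].

```python
import numpy as np

def third_order_cumulant(s, i, l, m, p):
    """C_i(l, m) of equation (1.8); v runs over samples p+1..V (1-based)."""
    s = np.asarray(s, dtype=float)
    v = np.arange(p, len(s))  # 0-based positions of samples p+1..V
    return np.sum(s[v - i] * s[v - l] * s[v - m])

def ar_from_third_order_cumulants(s, p):
    """Solve sum_{i=0..p} a_i C_i(l, m) = 0 with a_0 = 1 (assumed), by least squares."""
    rows, rhs = [], []
    for l in range(1, p + 1):
        for m in range(0, l + 1):  # constraints 1 <= l <= p, 0 <= m <= l
            rows.append([third_order_cumulant(s, i, l, m, p) for i in range(1, p + 1)])
            rhs.append(-third_order_cumulant(s, 0, l, m, p))
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return a

# Toy signal obeying s(k) = 1.2 s(k-1) - 0.8 s(k-2) exactly for k >= 2
s = np.zeros(400)
s[0], s[1] = 1.0, 1.2
for k in range(2, len(s)):
    s[k] = 1.2 * s[k - 1] - 0.8 * s[k - 2]

a_hos = ar_from_third_order_cumulants(s, p=2)
```

On this noise-free toy signal the third-order equations recover the same coefficients (-1.2, 0.8) as a second-order analysis would; the two estimators are only expected to differ on real, non-Gaussian data.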
Following this formulation of the problem, a standard algorithm for estimating the coefficients of the AR model is used [Paliwal and Sondhi, 1991]. Our contribution consisted in applying this model not to the speech signal s (equation 1.4) but to the residual r (equation 1.6). The parameters obtained through a cepstral transformation are called R-HOS-LPCC (cf. figure 1.3). A second contribution consisted in carrying out a frequency analysis of the residual using a spectral flatness measure:
\[ \text{R-PDSS}(i) = 1 - \frac{\left( \prod_{n=L_i}^{H_i} S(n) \right)^{1/N_i}}{\frac{1}{N_i} \sum_{n=L_i}^{H_i} S(n)} \tag{1.9} \]
where S(n) is the power spectral density estimated in specific frequency bands (filter bank with a linear or non-linear distribution of centre frequencies). The principle of the analysis is described in [Chetouani et al., 2009a] (figure 1.3).
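Equation (1.9) is one minus the ratio of the geometric mean to the arithmetic mean of the power spectral density in band i. The sketch below assumes that L_i and H_i are the first and last spectral bins of the band and that N_i is the number of bins it contains; these readings, the function name `r_pdss` and the small floor `eps` are assumptions, since the text does not define them.

```python
import numpy as np

def r_pdss(psd, bands, eps=1e-12):
    """1 - spectral flatness per band; bands is a list of inclusive (L_i, H_i) bin indices."""
    psd = np.asarray(psd, dtype=float)
    out = []
    for lo, hi in bands:
        band = np.maximum(psd[lo : hi + 1], eps)  # floor avoids log(0)
        geo_mean = np.exp(np.mean(np.log(band)))  # (prod S(n))^(1/N_i), computed in the log domain
        arith_mean = np.mean(band)                # (1/N_i) * sum S(n)
        out.append(1.0 - geo_mean / arith_mean)
    return np.array(out)

flat = np.ones(8)                       # perfectly flat band
peaky = np.array([1e-6] * 7 + [1.0])    # energy concentrated in one bin
values = r_pdss(np.concatenate([flat, peaky]), bands=[(0, 7), (8, 15)])
```

A flat band gives a value of 0 and a band dominated by a single peak gives a value close to 1, so the measure grows with the amount of spectral structure left in the residual.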
1.3.3 Results
The performance of all the proposed parameters (R-SOS-LPCC, R-HOS-LPCC and R-PDSS) is evaluated on a speaker identification task. This task comprises the following steps:
1. Enrollment phase: training of one statistical model (GMM) per speaker.
2. Identification phase: detection of one speaker among N speakers.
The experiments are described in [Chetouani et al., 2009a] for the evaluation on two databases, Gaudi (N=49) and NTIMIT (N=630), and in [Monte-Moreno et al., 2009] for the implementation of transmission-channel compensation and information-fusion methods, for both speaker identification and verification. In what follows we present only the results obtained on the Gaudi database.
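The two phases of the identification task can be sketched as follows. To stay self-contained, the sketch replaces the GMM by a single diagonal-covariance Gaussian per speaker (that is, a GMM with one component), which is a deliberate simplification and not the configuration used in the experiments; the speaker names and the synthetic feature vectors are hypothetical.

```python
import numpy as np

def enroll(features):
    """Enrollment: fit one diagonal Gaussian (a 1-component GMM) to a speaker's frames."""
    x = np.asarray(features, dtype=float)
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def log_likelihood(features, model):
    """Total log-likelihood of all frames under a diagonal Gaussian model."""
    mean, var = model
    x = np.asarray(features, dtype=float)
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def identify(features, models):
    """Identification: return the enrolled speaker whose model scores the frames highest."""
    scores = {spk: log_likelihood(features, m) for spk, m in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
train = {
    "spk_a": rng.normal(0.0, 1.0, (200, 4)),  # synthetic stand-ins for cepstral frames
    "spk_b": rng.normal(3.0, 1.0, (200, 4)),
}
models = {spk: enroll(x) for spk, x in train.items()}
test_utterance = rng.normal(3.0, 1.0, (50, 4))
predicted = identify(test_utterance, models)
```

Combining feature sets, as in the LPCC + R-PDSS rows of the results, would amount to concatenating or fusing the corresponding frame-level descriptions before or after this scoring step.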
The value of the Gaudi database [Ortega-Garcia et al., 2000] lies not in the number of speakers (only 49) but in the wide range of possible configurations: language, intervals between acquisition sessions, and microphone type. We had access to this database during our stay at the Polytechnic University of Mataro, as part of the COST 277 programme. The corpus consists of:
− 49 speakers.
− 4 sessions with different tasks: reading of isolated digits, connected digits and text, as well as conversational speech.
− For each session, recordings are made in Catalan and in Spanish with three different microphones: MIC1 (unidirectional lavalier microphone at ≈ 10 cm from the speaker), MIC2 (cardioid at ≈ 30 cm) and MIC3 (headset microphone).
The correspondence between sessions and microphones is given in table 1.1.
Tab. 1.1: Correspondence between sessions and microphones.
Ref.         M1    M2    M3    M4    M5    M6
Session      1     1     2     2     3     3
Microphone   MIC1  MIC2  MIC1  MIC2  MIC1  MIC3
One statistical model (GMM) per speaker is estimated on the enrollment session M1. Table 1.2 gathers the experimental results. Taken individually, the state-of-the-art methods (MFCC and LPCC) obtain the best scores. Note that the models based solely on the residual (R-SOS-LPCC, R-HOS-LPCC and R-PDSS) reach non-negligible performance. The main value of our contribution lies in combining the traditional parameters with those derived from the residual, particularly when training and test configurations differ (M2 to M6): different languages, different microphones and a longer interval between sessions. Other results published in [Chetouani et al., 2009a; Monte-Moreno et al., 2009] illustrate the importance and complementarity of residual modelling.
Tab. 1.2: Speaker identification performance under various conditions (enrollment on session M1) with temporal, frequency and mixed models.
Modelling   Parameter            M1     M2     M3     M4      M5     M6     Avg.
Temporal    LPCC                 94.78  73.7   74.60  66.213  55.33  52.15  69.46
Temporal    R-SOS-LPCC           87.98  63.72  60.32  59.18   44.45  43.99  59.94
Temporal    R-HOS-LPCC           83.45  55.33  57.14  50.79   42.40  33.10  53.70
Temporal    LPCC + R-SOS-LPCC    97.5   81.86  79.82  71.43   56.92  62.81  75.05
Temporal    LPCC + R-HOS-LPCC    97.96  80.04  80.04  70.521  58.05  59.64  74.37
Freq.       MFCC                 97.50  76.64  78.23  72.34   57.59  62.36  74.11
Freq.       R-PDSS               82.09  59.86  62.36  60.99   45.35  42.18  58.80
Mixed       LPCC + R-PDSS        99.77  82.54  85.26  83.22   66.43  67.35  80.76
1.3.4 Conclusion
The main value of this study was to show the gain in robustness (language, transmission channel and interval between sessions) brought by a feature-extraction method that supplements traditional modelling with a statistical characterization of the residual. To obtain this result, we proposed an approach geared towards taking the nature of the speech signal into account:
− The residual as a carrier of speaker-identity information. We showed that the linear prediction residual carries speaker-dependent information. Since the source-filter model is open to debate, we proposed not to call the whole characterization into question but to exploit the unmodelled information: temporal and spectral processing of the residual.
− Statistical modelling of the residual. Within the framework of non-linear speech processing, we proposed to formalize the characterization problem by introducing a predictive model based on higher-order statistics and a spectral method exploiting a flatness measure.
For the sake of consistency, we chose not to present other work carried out in collaboration with the former Signal team of LISIF, in particular the work led by Christophe Charbuillet (former PhD student of the team, whose thesis was directed and supervised by Bruno Gas and Jean-Luc Zarader) during the NIST and ESTER campaigns [Charbuillet et al., 2009].
1.4 Non-verbal information
The speech encoding process described earlier (cf. figure 1.2) is multi-dimensional and involves different time scales. The segmental component of the speech signal is generally characterized over short windows (2-3 pitch periods: 10-30 ms). The supra-segmental component has a longer duration spanning several pitch periods (100-300 ms). Although they mostly rely on a segmental approach, some speaker recognition systems exploit supra-segmental information [Mahadeva Prasanna et al., 2006; Reynolds et al., 2003]. Supra-segmental information is exploited successfully in emotion recognition [Schuller et al., 2007a].
Automatic emotion processing offers a very particular setting for studying the temporal dimension of speech signals. One specificity of this field is that most systems use a single type of temporal support (voiced segments) to extract measures of different natures (e.g. acoustic, prosodic), which are gathered into a single feature vector. The standard unit is the speaker turn (speaker turn level), and processing consists in applying a set of functionals (statistics) to parameters such as the fundamental frequency, the energy or the formants of the speech signal, and then using static classifiers (e.g. SVM). This approach assumes that the emotional state does not vary during a speaker's turn. Although it has proved effective, other temporal units exist that aim to exploit the dynamic aspect of emotions.
The search for an elementary unit that carries affect is one of the current challenges of automatic emotion processing [Schuller et al., 2011]. Batliner et al. [2010] call such units "ememes", by analogy with phonemes and morphemes. The definition of temporal units follows two approaches formalized in [Chetouani et al., 2009d], which we take up below. The first is based on knowledge about the speech signal (Data-driven units) and the other on learning techniques (Machine learning based units).
Data-driven units  This approach seeks to exploit various kinds of knowledge about speech signals to define units of analysis. Voiced segments serve as reference units. The nature of these segments (characterized by the presence of a fundamental frequency) motivates their use in automatic emotion-processing pipelines [Picard, 1997; Shami and Verhelst, 2007].
Fig. 1.4: Voiced segments of a speech signal: role of duration in prominence.
The emotion recognition strategy based on segmentation into voiced regions consists in combining a set of local decisions [Shami and Verhelst, 2007; Vlasenko et al., 2007; Schuller et al., 2007b]. In a probabilistic framework, the decision process for the utterance $U_x$ requires estimating, for each unit $F_{x_i}$, the posterior probability of membership in an emotion class $C_m$; the final decision relies on a weighted fusion of these N probabilities (N voiced segments):
\[ P(C_m | U_x) = \frac{1}{N} \sum_{x_i=1}^{N} P(C_m | F_{x_i}) \tag{1.10} \]
The maximum a posteriori (MAP) principle is generally used for decision making. Shami and Verhelst [2007] propose to integrate information on the duration of these units ($\mathrm{length}(F_{x_i})$):
\[ P(C_m | U_x) = \sum_{x_i=1}^{N} P(C_m | F_{x_i}) \times \mathrm{length}(F_{x_i}) \tag{1.11} \]
The SBA (Segment Based Approach) thus weights each probability in proportion to the duration of the voiced segment from which it is extracted. The main value of this approach is that it introduces a notion of prominence of the units, here quantified by duration (cf. figure 1.4). Experimental results show the value of this approach for characterizing short utterances [Shami and Verhelst, 2007].
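The difference between the uniform fusion of equation (1.10) and the duration-weighted fusion of equation (1.11) can be seen on a toy example. The posterior values and segment durations below are invented for illustration, and the function names are hypothetical.

```python
import numpy as np

def fuse_mean(posteriors):
    """Equation (1.10): average of the segment-level posteriors P(C_m | F_xi)."""
    return np.mean(posteriors, axis=0)

def fuse_duration_weighted(posteriors, durations):
    """Equation (1.11): each segment posterior is weighted by the segment duration."""
    return np.sum(posteriors * durations[:, None], axis=0)

# Three voiced segments, two emotion classes (rows: segments, columns: classes)
posteriors = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.2, 0.8]])
durations = np.array([0.40, 0.05, 0.05])  # seconds; one long segment, two short ones

decision_mean = int(np.argmax(fuse_mean(posteriors)))                           # MAP decision
decision_sba = int(np.argmax(fuse_duration_weighted(posteriors, durations)))    # MAP decision
```

With uniform fusion the two short segments outvote the long one (class 1), whereas the duration weighting lets the long, prominent segment drive the decision (class 0).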
Fabien Ringeval's doctoral thesis proposes acoustic and prosodic anchors that reflect prominence in particular. A study of the notion of prominence also appears in Chapter 3, in the context of regulating human-robot interaction.
Learning-defined units  A simple approach, consistent with the procedures of automatic speech processing (speech or speaker recognition), consists in using analysis windows of arbitrary duration (<30 ms) combined with MFCC-type coding and statistical modelling (GMM, HMM). As we will see below, this approach can prove effective for recognizing emotional states in realistic databases, notably in combination with a prosodic characterization.
As with phonetic alignment in speech recognition, for example, a temporal unit (chunk) with emotional valence is identified by hidden Markov models previously trained on emotional categories (cf. figure 1.5). In [Schuller et al., 2007b], segmentation with the Viterbi algorithm identifies homogeneous segments that reflect emotional dynamics. These units achieve better performance than units obtained by syllabic segmentation. One of the reasons put forward by the authors is that segment duration, which is longer for chunks, affects the characterization.
Fig. 1.5: Identification of units of analysis by alignment of "emotional states" (Viterbi).
Contributions  Segmentation into units of analysis is the first stage of emotion recognition systems and therefore directly affects performance. We started from the observation that defining temporal units that carry emotional information involves several factors:
- The variability of the duration and nature of the unit: the segmentations obtained by alignment of "emotional states" encompass voiced regions and often adjacent unvoiced regions that may extend beyond the syllable. This results in variability of the segments obtained (nature and duration). Despite its appeal, an alignment-based approach depends on training hidden Markov models on previously defined emotional classes.
- The prominence of the unit: the non-homogeneous character of the units shows that the emotional valence of a speech segment is expressed in a complex way that a simple temporal segmentation (e.g. voiced vs. unvoiced) cannot fully reflect. Clavel et al. [2008], for example, highlighted the potential of unvoiced segments for recognizing realistic emotions.
The originality of our approach rests on the notion of anchor (e.g. voiced, unvoiced, vowels, consonants...). The idea is to exploit the emotional valence carried by each anchor (local characterization and decision) and then, by information fusion, to infer a decision on the speaker turn. Another original aspect of the approach is the introduction of the notion of anchor dynamics, characterized by speech rhythm (in a generalized version).
1.4.1 Characterization of temporal and integrative dimensions: acoustic anchors
Fabien Ringeval's thesis [Ringeval, 2011] dealt with defining and exploiting acoustic anchors of speech for affective speech recognition. The natural anchors of speech are numerous: voiced, unvoiced, vowel, consonant, syllable, etc. (cf. figure 1.6), and they constitute carriers of information. Vowels and syllables are examples of anchors associated with phonation. Voiced regions and rhythm are traits defined by speech perception. One contribution of Fabien Ringeval's thesis was to propose a pseudo-phonetic approach for characterizing emotional speech. This approach rests on the observation that phonemes do not all contribute equally to the affective message [Leinonen et al., 1997; Pereira and Watson, 1998; Lee et al., 2004; Schuller et al., 2007b]. In an experiment that consisted in training emotion HMMs for five phonetic categories (vowel, semi-vowel, nasal, stop, fricative), Lee et al. [2004] show that vowels are robust carriers in the transmission of affect. Differentiated approaches that exploit several information carriers (e.g. segmental, supra-segmental) have been applied successfully to emotion recognition [Vlasenko et al., 2007; Kim et al., 2007].
Fig. 1.6: Diversity of the acoustic and rhythmic anchors of speech.
The approach developed in Fabien Ringeval's thesis consists in (1) detecting a set of anchors, (2) extracting features and then making a local decision, and finally (3) fusing the different contributions to reach a decision on the speaker turn (or the sentence).
Automatic anchor detection
Segmentation into voiced and unvoiced regions provides a first set of anchors whose fusion proves robust for recognizing spontaneous emotions [Clavel et al., 2008; Mahdhaoui et al., 2008]. In [Ringeval and Chetouani, 2008; Ringeval, 2011], we chose to focus on two types of anchors, phonetic (vowel) and rhythmic (p-centre), with the aim of a robust and language-independent detection. A speech recognition system could be used to detect vowels. This solution nevertheless raises several problems: (1) emotion affects the production of phonemes and often defeats traditional recognition systems; (2) adapting to different contexts (language, recording conditions, laughter, crying...) is an additional difficulty. We therefore opted for signal-processing techniques. These techniques require no a priori knowledge and are considered better suited to the diversity of situations analysed in our work.
Pseudo-phonemes  The approach adopted consists in extracting quasi-stationary segments and then categorizing them as vowel, consonant or non-speech (cf. figure 1.7). Segmenting the speech signal into "quasi-stationary" regions with the DFB (Divergence Forward Backward) algorithm [Andre-Obrecht, 1988] yields pseudo-phonemes whose temporal supports carry information that can be exploited for speech recognition [Andre-Obrecht and Jacob, 1997] or language recognition [Rouas et al., 2005]. The speech segments (voice activity detector) are then categorized as vowel or non-vowel. The latter are considered, by default, to be consonants.
Fig. 1.7: System for detecting pseudo-phonemes in a speech signal. [Block diagram with the following blocks: speech signal, DFB segmentation, detection, fusion, extraction of MFSC coefficients, computation of the REC function, localization of vocalic nuclei, pseudo-vowels and pseudo-consonants.]
Detection of the vocalic nucleus relies on the REC (Reduced Energy Cumulating) function, whose purpose is to measure the match between the spectral distribution of a speech frame and the formant structure specific to vocalic segments:
\[ \mathrm{REC}(k) = \frac{E_{BF}(k)}{E_T(k)} \sum_{i=1}^{24} \alpha_i \left| E_i(k) - \bar{E}(k) \right| \tag{1.12} \]
where k is the speech frame index, $E_i$ the energy contained in filter no. i, $\bar{E}$ the mean energy across all Mel filters, $E_{BF}$ the mean energy contained in the filters with frequency below 1 kHz, and $E_T$ the total energy. $\alpha_i$ is the weight assigned to filter i.
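The REC function of equation (1.12) can be sketched directly from its definition. The sketch assumes that the Mel filter-bank energies are already available as a frames × filters array, that a boolean mask marks the filters below 1 kHz, and that all weights α_i equal 1 by default; the function name `rec` and the 4-filter toy input are hypothetical (the equation itself sums over 24 filters).

```python
import numpy as np

def rec(mel_energies, low_band, alpha=None):
    """REC(k) for every frame k, following equation (1.12).

    mel_energies: array (n_frames, n_filters) of filter energies E_i(k), assumed non-zero in total
    low_band:     boolean mask (n_filters,) selecting the filters below 1 kHz
    alpha:        per-filter weights alpha_i (defaults to 1 for every filter)
    """
    E = np.asarray(mel_energies, dtype=float)
    if alpha is None:
        alpha = np.ones(E.shape[1])
    e_mean = E.mean(axis=1, keepdims=True)   # mean energy over all filters
    e_low = E[:, low_band].mean(axis=1)      # E_BF: mean energy of the low-frequency filters
    e_total = E.sum(axis=1)                  # E_T: total energy of the frame
    return (e_low / e_total) * np.sum(alpha * np.abs(E - e_mean), axis=1)

E = np.array([[1.0, 1.0, 1.0, 1.0],    # flat spectrum: no formant-like structure
              [4.0, 0.0, 0.0, 0.0]])   # energy concentrated in a low-frequency filter
low_band = np.array([True, True, False, False])
values = rec(E, low_band)
```

The flat frame scores 0 and the frame with a strong low-frequency peak scores 3, which matches the intent of the measure: it is large when the spectrum is both structured and dominated by low frequencies, as in vowels.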
Vowel-region detection is based on thresholding (1) the maxima of the REC curve and (2) the autocorrelation coefficient of the signal (quasi-periodic region) [Ringeval, 2011]. Detection performance can thus be optimized by adjusting the thresholding step. We chose to present only the results of a generic detector (same threshold for all corpora). An example of pseudo-phoneme detection is given in figure 1.8. One particular feature of the detection method is that the maxima reflect prominent vowel regions (in terms of spectral characteristics according to the REC function).
Fig. 1.8: Comparison of manual vs. automatic phonetic segmentation of a sentence taken from the TIMIT corpus.
We evaluated pseudo-phoneme detection performance on a set of databases briefly described in table 1.3, totalling more than 700k phonemes (400k consonants and 300k vowels). The evaluation relies on the VER (Vowel Error Rate) metric, which combines two types of error, namely missed detections $N_{nondet}$ and insertions $N_{ins}$:
\[ \mathrm{VER} = 100 \cdot \frac{N_{nondet} + N_{ins}}{N_{voy}} \;\% \tag{1.13} \]
where $N_{voy}$ is the total number of vowels (transcription).
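Equation (1.13) is simple enough to check by hand, but a short sketch makes one property explicit: because insertions are counted in the numerator, the rate is not bounded by 100%. The counts below are invented for illustration and the function name `error_rate` is hypothetical.

```python
def error_rate(n_missed, n_inserted, n_reference):
    """Equation (1.13): 100 * (N_nondet + N_ins) / N_voy, in percent."""
    return 100.0 * (n_missed + n_inserted) / n_reference

ver = error_rate(n_missed=10, n_inserted=5, n_reference=100)
over = error_rate(n_missed=20, n_inserted=90, n_reference=100)
```

The first call gives 15%, the second 110%: a detector that over-segments and therefore inserts many spurious units can exceed 100%, which is how a rate computed in the same way can reach values above 100 on some corpora.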
Performance is reported in tables 1.4 and 1.5, with a mean error rate below 30%. The DFB algorithm over-segments consonants, which results in very high consonant detection error rates (Consonant Error Detection, computed in the same way as the VER of equation 1.13). However, this does not call the differentiated approach into question, because all segments are used in the final decision. Moreover, the approach we propose relies on the prominences of the signal, whose support involves the vocalic nucleus, which is detected much better.
Detection performance obviously depends on the quality of the recordings (studio or telephone) but also on affect (table 1.4). The experimental results show that the degradations introduced
Tab. 1.3: Main characteristics of the speech corpora studied

                  TIMIT        NTIMIT       Berlin       Bute-TMI     Aholab
Speech            Read         Read         Affective    Affective    Affective
Quality           Good         Telephone    Good         Good         Good
Classes of info.  8 dialect    8 dialect    7 affective  8 affective  7 affective
                  regions      regions      styles       styles       styles
Phonemes          52           52           37           27           35
                  20 V / 32 C  20 V / 32 C  16 V / 21 C  8 V / 19 C   5 V / 30 C
Speakers          630          630          10           37           1
Sentences         10           10           10           3            702
Duration          ≈ 6 h        ≈ 6 h        ≈ 25 min     ≈ 38 min     ≈ 8 h
Tab. 1.4: Comparison of pseudo-phoneme detection results on the different corpora (in %)

Rate  TIMIT  NTIMIT  Berlin  Bute-TMI  Aholab
VER   20.3   26.9    29.0    32.3      24.6
CER   106    69.9    75.0    70.2      79.6
by these two sources are of the same order. We explored the impact of the different affective production styles. Only the results obtained on the Berlin corpus are reported in this document (cf. Table 1.5)2. The VER varies widely, from 14.3% for "Joy" to 38.2% for "Fear". The effect of emotions on speech production is undeniable, but above all it differs from one state to another [Leinonen et al., 1997; Pereira and Watson, 1998; Lee et al., 2004]. Note also that the error rate is high for the "Neutral" style. Similar results are obtained on other databases [Ringeval, 2011]; one of the reasons put forward is that, in acted emotional speech, the very production of the neutral style is open to question. This ambiguity of the neutral style is often encountered in related fields (e.g. biometrics).
Tab. 1.5: Pseudo-phoneme detection performance by production style on the Berlin corpus (in %)

Rate  Anger  Fear  Joy   Sadness  Disgust  Boredom  Neutral
VER   28.6   38.2  14.3  35.6     26.4     37.0     35.7
CER   74.4   70.9  100   66.3     72.6     75.5     75.0
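As a minimal sketch of equation 1.13, the error rate can be computed from three counts. The function name and the example counts are hypothetical and are not taken from the experiments; the same function is used for consonants, since the CER is computed in the same way as the VER.

```python
def error_rate(n_missed: int, n_inserted: int, n_reference: int) -> float:
    """Error rate in percent (equation 1.13): missed detections plus
    insertions, relative to the number of reference units."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return 100.0 * (n_missed + n_inserted) / n_reference


# Hypothetical counts: 1000 reference vowels, 120 missed, 83 inserted.
ver = error_rate(120, 83, 1000)

# Hypothetical over-segmentation case: many insertions, few misses.
# Because insertions are counted in the numerator, the rate is not
# bounded by 100 %.
cer = error_rate(50, 1010, 1000)

print(f"VER = {ver:.1f} %, CER = {cer:.1f} %")
```

Since insertions enter the numerator, heavy over-segmentation is one way a rate above 100% can arise.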
We proposed to study and exploit the affective variability of speech by taking the temporal component of anchors into account. To this end, we studied anchor duration (cf. Figure 1.9) and the rhythmic component of speech, which is described in section 1.4.2. Figure 1.9 shows a feature space formed by the duration measures of vowels and consonants. This space makes it possible to discriminate certain emotions. The separability of the emotional categories is greater for the AHOLAB database, a result fully explained by the number of speakers (a single one, cf. Table 1.3). We will return to feature spaces based on the temporal dimension of speech and propose a link with dimensional models of emotion.

2 Further experiments are described in [Ringeval, 2011].

Fig. 1.9: Vowel and consonant duration measures across emotions, for the Berlin corpus (left) and the Aholab corpus (right). The position of each cross in the duration space gives the mean values; its height and width give the standard deviations (figure from [RIN08c]).

The perception centre
A method was recently proposed to automatically extract the rhythmic envelope of a speech signal [Tilsen and Johnson, 2008]. This method gives access to the perception centre of speech, called the "p-centre", which represents the perceived rhythmic instants. The rhythmic envelope is extracted with a set of filters inspired by human perception (cf. Figure 1.10). The resulting signal characterises rhythmic prominences (cf. Figure 1.11). Fabien Ringeval proposed defining anchors whose support is the p-centre. Three levels of adaptive thresholding of the rhythmic envelope are introduced: level 1 (1/3 of the maximum amplitude), level 2 (1/4) and level 3 (1/6). By this definition, rhythmic prominence directly determines the number of anchors.
We explored the phonetic correlates of the anchors defined on the basis of the "p-centre". The study examined the overlap between the
[Figure: speech signal -> band-pass filter (700-1300 Hz) -> low-pass filter (Fc = 10 Hz) -> downsampling (Fe = 80 Hz) -> weighting by a Tukey window -> normalisation -> rhythmic envelope]
Fig. 1.10: Cascade of filters used by [Tilsen and Johnson, 2008] to extract the rhythmic envelope of a speech signal. A Butterworth band-pass filter is first applied to the speech signal over 700-1300 Hz, the band identified as that of the "p-centre" [Cummins and Port, 1998]. The envelope is then extracted with a fourth-order Butterworth low-pass filter (10 Hz cut-off), downsampled to 80 Hz, corrected by 45 ms for the phase delay introduced by the filters, weighted by a Tukey window (r = 0.1) and normalised by its mean.

Fig. 1.11: Example of a rhythmic envelope extracted from a speech signal.
phonetic anchors obtained from the transcription (reference) and those obtained from the automatic detections described earlier. Table 1.6 gathers the results of the study (see [Ringeval, 2011] for further results). The results show that vowels and voiced segments are the main supports of rhythmic anchors, whatever the perception threshold. Consonantal and unvoiced segments contribute very little to rhythmic anchors. This experimental result shows that anchors defined by thresholding the rhythmic envelope potentially carry affective information, as voiced and vowel anchors do. Nevertheless, as the overlap with detected vowels shows, "p-centre" anchors capture different information, which is exploited later in the design of a multi-anchor recognition system. These anchors are intended to be generic and applicable to non-linguistic vocal productions (e.g. laughter, crying), which are often present in natural and spontaneous corpora such as those studied in this manuscript (parent-infant interaction, children with communication disorders, robotics).
Tab. 1.6: Overlap rate (in %) of the "p-centres" with the other types of acoustic speech anchor, mean (standard deviation). VREF/CREF: reference vowels/consonants from the transcription; VDET/CDET: detected vowels/consonants; SegVOI/SegNVOI: voiced/unvoiced segments.

Corpus  Threshold  VREF     CREF     VDET     CDET     SegVOI   SegNVOI
TIMIT   1/3        74 (15)  14 (10)  64 (15)  28 (14)  75 (23)  2 (3)
        1/4        74 (11)  17 (9)   62 (12)  32 (13)  79 (19)  2 (3)
        1/6        72 (9)   21 (8)   59 (11)  38 (11)  82 (16)  4 (5)
Berlin  1/3        68 (17)  19 (14)  64 (15)  28 (14)  70 (22)  4 (7)
        1/4        65 (15)  24 (14)  62 (13)  33 (12)  74 (18)  6 (7)
        1/6        61 (13)  30 (12)  58 (10)  38 (10)  75 (16)  9 (8)
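The three thresholding levels of the rhythmic envelope (1/3, 1/4 and 1/6 of its maximum amplitude) can be illustrated with a short sketch. Treating each contiguous run of samples at or above the threshold as one "p-centre" anchor is an assumption made here for illustration, and the envelope values are made up; the actual anchor definition is detailed in [Ringeval, 2011].

```python
def threshold_anchors(envelope, fraction):
    """Return (start, end) sample indices (end exclusive) of the runs
    where the envelope is >= fraction * max(envelope)."""
    if not envelope:
        return []
    level = fraction * max(envelope)
    anchors, start = [], None
    for i, value in enumerate(envelope):
        if value >= level and start is None:
            start = i                      # a run begins
        elif value < level and start is not None:
            anchors.append((start, i))     # the run ends before sample i
            start = None
    if start is not None:                  # run still open at the end
        anchors.append((start, len(envelope)))
    return anchors


# Made-up rhythmic envelope (arbitrary units, e.g. sampled at 80 Hz).
envelope = [0.0, 0.15, 0.9, 1.2, 0.8, 0.25, 0.1,
            0.35, 0.5, 0.28, 0.05, 0.22, 0.05]

levels = {1: 1 / 3, 2: 1 / 4, 3: 1 / 6}
anchors = {lvl: threshold_anchors(envelope, frac)
           for lvl, frac in levels.items()}

for lvl in sorted(anchors):
    print(f"level {lvl}: {anchors[lvl]}")
```

Lowering the threshold widens the existing anchors and lets weaker prominences through, so both the extent and the number of anchors depend on the level chosen.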
Emotion recognition: multi-anchor system
We sought to validate the concept of anchor by using it in an emotion recognition task. Anchor detection is combined with an acoustic feature extraction step (MFCC). Following the approach defined by [Vlasenko et al., 2007; Shami and Verhelst, 2007], a posteriori probabilities are estimated for each anchor. The final decision is based on the fusion of these probabilities.
In our experiments on the Berlin corpus, the most relevant anchors for extracting acoustic features correlated with affect were voiced segments (recognition score 66%), vowels from the transcription (65%), pseudo-vowels (62%) and level-3 "p-centres" (62%). Depending on the evaluation protocol, the normalisation and the database (number of speakers, acted or non-acted data), the differences between these scores may or may not be significant. Fusing these anchors within a probabilistic framework makes it possible to investigate their individual contributions:

\[ P(C_m \mid U_x) = \alpha\, P(C_m \mid AC_i) + (1 - \alpha)\, P(C_m \mid AC_j) \qquad (1.14) \]

where P(C_m | AC_i) is the a posteriori probability of recognising class C_m from anchor AC_i. This is a generalisation of the SBA approach defined by [Shami and Verhelst, 2007] (cf. equation 1.11).
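The fusion rule of equation 1.14 can be sketched as a weighted sum of two posterior distributions followed by an arg-max decision. The class names, the posterior values and the function names below are hypothetical; reading a weight pair such as 9/1 as α = 0.9 is an assumption of this sketch.

```python
def fuse(post_i, post_j, alpha):
    """Equation 1.14: alpha * P(C|AC_i) + (1 - alpha) * P(C|AC_j),
    evaluated for every class C."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {c: alpha * post_i[c] + (1.0 - alpha) * post_j[c]
            for c in post_i}


def decide(post_i, post_j, alpha):
    """Return the class with the highest fused posterior."""
    fused = fuse(post_i, post_j, alpha)
    return max(fused, key=fused.get)


# Hypothetical posteriors from two anchors for one utterance.
voiced = {"anger": 0.6, "joy": 0.3, "sadness": 0.1}
unvoiced = {"anger": 0.2, "joy": 0.5, "sadness": 0.3}

label_9_1 = decide(voiced, unvoiced, 0.9)    # voiced anchor dominates
label_0_10 = decide(voiced, unvoiced, 0.0)   # unvoiced anchor only

# A simple grid over alpha (steps of 0.1) shows where the decision flips.
grid = {a / 10: decide(voiced, unvoiced, a / 10) for a in range(11)}
print(label_9_1, label_0_10, grid)
```

Sweeping α in this way and keeping the value that maximises the recognition score on held-out data is one simple way to read off the relative importance of each anchor.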
Normalisation strategies for the acoustic parameters (often based on the z-score) directly affect the performance of recognition systems. We studied several variants: (1) Raw: no normalisation of the data; (2) Z-all: normalisation without a priori information; (3) Z-gender: normalisation by speaker gender; (4) Z-speaker: normalisation by speaker; and (5) Z-sentence: normalisation by sentence. Beyond improving the scores, normalisations that introduce a priori information make it possible to analyse the impact of speaker identity, speaker gender and the phonetic content of the sentence.
Performance was evaluated in a rigorous experimental framework under two conditions, (1) class independence (stratified cross-validation) and (2) speaker independence, with several types of classifier (k-NN, GMM and SVM). Only the results of the different normalisations in stratified cross-validation are reported in this document (Table 1.7). Speaker identity and gender are pieces of information that improve the performance of recognition systems, which reflects the individual nature of affect.
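The normalisation variants differ only in the group over which the mean and standard deviation are estimated. The following sketch illustrates this with a single hypothetical feature; the function name, the use of the population standard deviation and the example values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_normalise(values, groups=None):
    """z-score each value with the statistics of its own group.

    groups=None puts every value in one global group (no a priori
    information); passing speaker, gender or sentence labels gives the
    corresponding group-wise normalisation."""
    if groups is None:
        groups = [None] * len(values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    stats = {}
    for g, vs in by_group.items():
        sd = pstdev(vs)
        stats[g] = (mean(vs), sd if sd > 0 else 1.0)  # avoid dividing by 0
    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, groups)]


# Hypothetical mean-F0 values (Hz) for four utterances from two speakers.
f0 = [110.0, 130.0, 210.0, 250.0]
speakers = ["m1", "m1", "f1", "f1"]

z_all = z_normalise(f0)                 # global statistics
z_speaker = z_normalise(f0, speakers)   # per-speaker statistics

print(z_all)
print(z_speaker)
```

With global statistics the two speakers remain separated by their average pitch, whereas per-speaker statistics remove that offset and leave only the within-speaker variation.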
Tab. 1.7: Emotion recognition scores (in %) on the Berlin corpus: effect of normalisations using a priori information. The relative importance of the segments is given in parentheses (αi, cf. equation 1.14).

Anchor fusion                    Raw       Z-all      Z-gender  Z-speaker  Z-sentence
voiced / unvoiced                63 (9/1)  63 (9/1)   64 (9/1)  68 (9/1)   62 (10/0)
vowel / consonant                61 (5/5)  62 (5/5)   62 (7/3)  68 (7/3)   60 (10/0)
pseudo-vowel / pseudo-consonant  62 (1/8)  61 (8/2)   62 (6/4)  67 (10/0)  60 (10/0)
vowel / consonant phonemes       59 (2/8)  59 (0/10)  59 (7/3)  65 (4/6)   60 (6/4)
Table 1.8 gives the recognition scores per emotion: "Anger" and "Sadness" are recognised much better than "Joy" or "Boredom". A detailed analysis of the scores reveals large differences in performance depending on the anchor.
The experiments show that the anchors proposed in Fabien Ringeval's thesis are supports of the affective message, each contributing differently (cf. Table 1.8). This work, carried out on a corpus of acted emotions, is extended to the analysis of natural and spontaneous data from children with communication disorders (cf. section 1.5). Incorporating information tied to pseudo-phonetic and rhythmic anchors is one of the routes followed for a robust characterisation of the affective message that also yields units interpretable by, in particular, speech therapists and psycholinguists.

Tab. 1.8: Emotion recognition scores (in %) on the Berlin corpus

Anchor fusion                    Anger  Fear  Boredom  Disgust  Joy  Neutral  Sadness
voiced / unvoiced                79     74    59       74       61   48       81
vowel / consonant                79     74    65       46       59   49       87
pseudo-vowel / pseudo-consonant  79     68    52       63       62   52       87
vowel / consonant phonemes       87     53    51       78       59   43       92
level-3 "p-centres"              81     56    35       67       54   56       94
1.4.2 Dynamics of the speech signal: rhythm
Prosody plays a fundamental role in the processing of affective speech (cf. Figure 1.2). Figure 1.12 recalls the perceptual and physical components of prosody. Pitch, perceived loudness and voice quality are the most studied components, through their physical correlates: fundamental frequency, sound pressure and formants [Batliner et al., 2011]. Robust extraction of these signals is a challenge in itself.
State-of-the-art emotion recognition systems generally rely on a large number of acoustic and prosodic descriptors (LLDs: Low Level Descriptors), to which a set of functionals or statistics is applied (e.g. max, kurtosis, interquartile range, relative position of the minimum) [Schuller et al., 2009, 2007a]. The result is a super-vector of hundreds or even thousands of parameters, whose dimension is usually reduced by a feature selection step.
Our work on the characterisation of affective speech has focused on the temporal dimension of acoustic and/or perceptual events, which is generally under-exploited in recognition systems. The SBA approach explicitly includes anchor duration in the decision (cf. equation 1.11). Duration, however, is a complex and subjective phenomenon, as research in psychoacoustics points out [Zwicker and H., 1990]. We proposed to go beyond the integration of temporal information in the fusion by investigating the dynamics of anchors (cf. Figure 1.12). The study of these dynamics relies on the characterisation of rhythm.

Fig. 1.12: Principle of the dynamic characterisation of prosodic components, including speech rhythm (figure from [Ringeval, 2011])

Rhythm is defined by the alternation or repetition of events spaced in time. The nature of these events is often diverse. Chapter 3 will return to the study of rhythm applied to biological signals (heart rate, breathing or walking) in the context of interactive robotics. The rhythmic component of the speech signal is undeniable [Cummins, 2002, 2008; Tilsen and Johnson, 2008]. Pellegrino [2011] states that rhythm is associated with the temporal organisation of speech and results from the interaction of:
- the nature of the constituents considered (here, the anchors),
- the alternation between more or less prominent constituents,
- the regularity model for grouping constituents into longer units.
Starting from the notion of anchors, we proposed several methods that model the interaction between the nature, alternation and duration of anchors.
Conventional methods
Conventional methods exploit the distribution of the durations of vocalic and intervocalic segments. Ramus et al. [1999] proposed a rhythm measure based on the percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (∆C). The temporal variability of pairs of successive phonetic intervals underlies the measures proposed by [Grabe and Low, 2002]: rPVI and nPVI.
Several other conventional methods are reviewed in Fabien Ringeval's thesis. Their characteristics are gathered in Table 1.9. The metrics proposed in the literature do not capture the interaction between the nature and the duration of anchors because they are too global (statistics over the whole speech turn). One paradox of global measures is that reversing the temporal order of the events does not change the results.
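The conventional metrics named above can be sketched in a few lines. The formulas follow the usual definitions of %V and ∆C [Ramus et al., 1999] and of rPVI and nPVI [Grabe and Low, 2002]; the function names, the use of the population standard deviation for ∆C and the interval durations are assumptions made for illustration.

```python
from statistics import mean, pstdev


def percent_v(vowels, consonants):
    """%V: share of total duration taken by vocalic intervals, in %."""
    return 100.0 * sum(vowels) / (sum(vowels) + sum(consonants))


def delta_c(consonants):
    """Delta-C: standard deviation of consonantal interval durations."""
    return pstdev(consonants)


def rpvi(d):
    """Raw PVI: mean absolute difference between successive intervals."""
    return mean(abs(a - b) for a, b in zip(d, d[1:]))


def npvi(d):
    """Normalised PVI: each difference is divided by the mean of the
    pair, which compensates for speech rate; expressed in %."""
    return 100.0 * mean(abs(a - b) / ((a + b) / 2.0)
                        for a, b in zip(d, d[1:]))


# Made-up interval durations in seconds.
vowels = [0.08, 0.12, 0.10, 0.20]
consonants = [0.06, 0.05, 0.09, 0.07]

forward = (percent_v(vowels, consonants), delta_c(consonants),
           rpvi(vowels), npvi(vowels))
backward = (percent_v(vowels[::-1], consonants[::-1]),
            delta_c(consonants[::-1]),
            rpvi(vowels[::-1]), npvi(vowels[::-1]))

print(forward)
print(backward)
```

Playing the utterance backwards leaves all four values unchanged, which illustrates the paradox noted above: these statistics do not encode the direction in which the events unfold.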
Tab. 1.9: Summary of the characteristics of conventional speech-rhythm metrics (from [Ringeval, 2011])

%V, ∆C (percent V and delta C)
  Computation:  proportion and standard deviation of vocalic and consonantal intervals
  Scope:        reduction phenomena; global measure
  Advantage(s): very simple to implement; needs at least 2 units for %V
  Drawback(s):  depends on speech rate; ignores local phenomena

Varco (coefficient of variation)
  Computation:  ratio of the standard deviation to the mean of the intervals
  Scope:        reduction phenomena; global measure
  Advantage(s): takes statistical moments of order 1 and 2 into account
  Drawback(s):  same drawbacks as %V and ∆C

Mean periodicity
  Computation:  standard deviation of the circular statistical distribution of the intervals
  Scope:        compensation phenomena; global measure
  Advantage(s): brings out irregularities in the duration intervals
  Drawback(s):  needs at least 3 units

RR (rhythm ratio)
  Computation:  duration ratio between two consecutive intervals
  Scope:        variations within the intervals; local measure
  Advantage(s): studies the short-term sequencing of intervals
  Drawback(s):  same as %V and ∆C; needs at least 3 units

rPVI (raw pairwise variability index)
  Computation:  absolute difference between two consecutive intervals
  Scope:        same as RR; local measure
  Advantage(s): same as RR
  Drawback(s):  depends on speech rate; needs at least 3 units

nPVI (normalized pairwise variability index)
  Computation:  same as rPVI, with normalisation by speech rate
  Scope:        same as RR; local measure
  Advantage(s): same as RR; takes speech rate into account
  Drawback(s):  same as RR
Contributions to the characterisation of rhythm
Speech rhythm reflects the temporal structure of anchors. Fabien Ringeval's thesis focused on the development of rhythm models. Four models were proposed: (1) spectral features (entropy, mean frequency and centroid) of the Fourier transform of the "p-centre" rhythmic envelope of the speech signal [Tilsen and Johnson, 2008]; (2) estimation of the instantaneous amplitude and frequency of the intervals between anchors using the Hilbert-Huang transform (HHT); (3) a PVI-derived computation that quantifies the change in the coefficient of variation (Varco); and (4) characterisation of the dynamics and nature of anchors by the Hotelling distance.
Table 1.10 describes the models studied, termed non-conventional because they go beyond the global information extracted by the metrics of [Ramus et al., 1999] and [Grabe and Low, 2002].
Tab. 1.10: Summary of the characteristics of non-conventional speech-rhythm metrics (from [Ringeval, 2011])

Sonority measure
  Computation:  Kullback-Leibler divergence on spectral coefficients, between consecutive frames
  Scope:        short-term variations in the spectrum of the speech signal
  Advantages:   does not require a vowel / consonant segmentation
  Drawbacks:    ignores the production context; depends on the segmental frame

Low-frequency Fourier analysis
  Computation:  spectral entropy, centroid and mean frequency computed on the Fourier transform of the "p-centre" signal
  Scope:        long-term variations in the envelope of the "p-centre" signal
  Advantages:   same as above; takes perceptual aspects of rhythm into account
  Drawbacks:    ignores local phenomena

Instantaneous frequency and amplitude
  Computation:  HHT on signals built from the duration intervals separating a given unit
  Scope:        instantaneous envelope and frequency of a given unit
  Advantages:   provides many values to describe the envelope and frequency of the rhythm
  Drawbacks:    needs at least 3 units; computationally expensive estimation

Prosodic variability
  Computation:  rPVI computed on the coefficient of variation of a prosodic LLD, with normalisation by speech rate
  Scope:        variations in the dispersion of an LLD across pairs of consecutive units
  Advantages:   integrates the information of one prosodic LLD; takes speech rate into account
  Drawbacks:    needs at least 3 units and the presence of a prosodic LLD

Prosodic distance
  Computation:  Hotelling distance across the LLDs, between pairs of consecutive units; normalised by speech rate and includes the correlations between LLDs
  Scope:        variations in the distribution of the LLDs across pairs of consecutive units
  Advantages:   integrates the information of all prosodic LLDs; takes speech rate and inter-correlations into account
  Drawbacks:    same as above, and may require several LLDs depending on the configuration

Low-frequency analysis of the rhythmic envelope
This method relies on a frequency analysis of the rhythmic envelope extracted with the method proposed by Tilsen and Johnson [2008] (cf. section 1.4.1). The analysis extracts global information from the spectrum: entropy, centroid and mean frequency (cf. Figure 1.13). The aim is to analyse the variability of the rhythmic envelope of the speech signal.
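Two of the global spectral descriptors named above, the centroid and the entropy, can be sketched with a plain discrete Fourier transform. The function names, the removal of the DC component, the normalisation of the power spectrum into a probability distribution and the synthetic 4 Hz envelope are assumptions made for illustration; the mean frequency is left out because its exact definition is not given here.

```python
import cmath
import math


def power_spectrum(x, fs):
    """One-sided DFT power spectrum of x (mean removed, DC bin dropped).
    Returns (frequencies in Hz, powers)."""
    n = len(x)
    m = sum(x) / n
    centred = [v - m for v in x]
    freqs, powers = [], []
    for k in range(1, n // 2 + 1):
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centred))
        freqs.append(k * fs / n)
        powers.append(abs(s) ** 2)
    return freqs, powers


def spectral_descriptors(x, fs):
    """Spectral centroid (Hz) and spectral entropy (bits) of x."""
    freqs, powers = power_spectrum(x, fs)
    total = sum(powers)
    if total == 0:
        raise ValueError("signal is constant: spectrum is empty")
    p = [v / total for v in powers]          # spectrum as a distribution
    centroid = sum(f * w for f, w in zip(freqs, p))
    entropy = -sum(w * math.log2(w) for w in p if w > 0)
    return centroid, entropy


# Synthetic envelope: 2 s at 80 Hz with a single 4 Hz modulation.
fs = 80.0
env = [1.0 + 0.5 * math.sin(2 * math.pi * 4.0 * t / fs) for t in range(160)]
centroid, entropy = spectral_descriptors(env, fs)

print(f"centroid = {centroid:.3f} Hz, entropy = {entropy:.6f} bits")
```

A perfectly regular envelope concentrates its energy in one bin, so the centroid sits at the modulation rate and the entropy is close to zero; a more variable rhythm spreads the energy and raises the entropy.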
!
!"#$%&'( ! " # )*$ +%,% &)*'!
!
-.$/0("#$( !
!
!
*( (&)* 1')*
!
1 ( )*
*$+23("0(,4%/(""( !
*( (&)* 1'&)* % '
!
( )* %
1
[26]
!
!
!
!
!
"#$%!&%' Figure du haut !"#$%&'(")*+,-$./0"1'##0"2)3./0&40#"$##/0"56/&"#$%&'("50"7')8(0 ; Figure du
bas : spectre fréquentiel du signal rythmique.
Analyse basses fréquences du rythme : enveloppe rythmique et sa transformée de Fourier
choisi cette valeur 78/)"./60((0"#8$+"0&"'448)5"'904 le plus faible intervalle de durée qui puisse
être présent dans nos données, ce qui est le cas des ancrages phonétiques et 58&+"(6'-7($+/50"
fréquentielle varie
de 1Hz à 16Hzinstantanées
[DRU94]57. La valeur
de méthode
Fe que nous
choisidu
corresFréquence
et amplitude
Cette
estavons
inspirée
pond donc
au doublegénéralement
du plus grand appelé
écartement
fréquentiel
être àobservé
entre
rythme
cardiaque
signal
R-R enpouvant
référence
la durée
dedeux
l'intervalle
entre les i.e.,
ondes
R du
ECG. alors
We proposed generating a signal, called SUI (Speech Unit Interval), from the durations of the intervals separating consecutive segments that share the same acoustic anchor (vowel, consonant, "p-centre", ...). Because the number of segments in a given speech turn is generally small (<10), we resampled the SUI signals (Fe = 32 Hz), taking into account published data, in particular on syllabic rate (ranging from 1 Hz to 16 Hz)³. The technical details are presented in [Ringeval, 2011; Ringeval and Chetouani, 2009].
The non-stationary nature of SUI signals, due to phonetic and rhythmic variability, makes the estimation of instantaneous parameters difficult. We opted for advanced signal processing methods based on the Hilbert-Huang transform (THH). The first step of the THH is an empirical mode decomposition: the signal is decomposed into a finite set of individual oscillations called intrinsic mode functions (IMFs). The IMFs are band-limited signals, which makes it possible, in the second step of the THH, to apply the Hilbert transform (TH). The instantaneous amplitude and frequency are thus estimated, yielding an adaptive time-frequency analysis of the SUI signal.

³ Signal re-synthesis experiments make it possible to assess the sampling errors (cf. [Ringeval, 2011])

Fig. 1.13:
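As an illustration of the second step only, the sketch below estimates the instantaneous amplitude and frequency of one band-limited component through its analytic signal. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the empirical mode decomposition is not included, the function names are hypothetical, and a naive DFT is used so that the example needs nothing beyond the standard library.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence, computed with a naive O(n^2) DFT."""
    n = len(x)
    # Forward DFT.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even), double the positive
    # frequencies and zero the negative ones.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0
        elif k < (n + 1) // 2:
            gain = 2.0
        else:
            gain = 0.0
        spectrum[k] *= gain
    # Inverse DFT.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_amplitude_frequency(x, fs):
    """Return (amplitude, frequency in Hz) of a band-limited component.

    `x` stands for one IMF (assumption: already extracted elsewhere),
    `fs` is the sampling frequency, e.g. 32 Hz for the SUI signals.
    """
    z = analytic_signal(x)
    amplitude = [abs(v) for v in z]
    phase = [cmath.phase(v) for v in z]
    frequency = []
    for a, b in zip(phase, phase[1:]):
        # Wrap the phase increment to [-pi, pi) before converting to Hz.
        step = (b - a + math.pi) % (2 * math.pi) - math.pi
        frequency.append(step * fs / (2 * math.pi))
    return amplitude, frequency
```

For a pure oscillation at 4 Hz sampled at 32 Hz, the estimated amplitude stays close to 1 and the instantaneous frequency close to 4 Hz; on real SUI signals the estimates vary over time, which is the point of the analysis.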
Prosodic variability
We proposed extending the PVI (Pairwise Variability Index) metric [Grabe and Low, 2002] by replacing the interval-duration measure with the relative dispersion of prosodic descriptors. Using the coefficient of variation $cv = \sigma / \mu$ (of the fundamental frequency or of the energy) as the descriptor, we obtain the following metric:

$$\text{P-PVI} = \frac{1}{N-1} \sum_{k=1}^{N-1} \frac{d_k \, d_{k+1} \, I_k}{d_k + d_{k+1} + I_k} \, \left| cv_k - cv_{k+1} \right| \qquad (1.15)$$

where $d_k$ is the duration of anchor $k$ and $I_k$ is the interval between anchors $k$ and $k+1$. A notable property of this metric is that it is zero when the dispersions (coefficients of variation) measured on consecutive segments are identical. Otherwise, the difference is weighted by the durations of the anchors and the interval between them. The metric characterizes a form of prominence that, depending on the information processed, may reflect changes in fundamental frequency or in energy.
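Equation (1.15) can be sketched in a few lines. This is only an illustrative sketch: the function and argument names are hypothetical, and the use of the population standard deviation in the coefficient of variation is an assumption, since the text does not specify the estimator.

```python
import statistics


def coefficient_of_variation(values):
    """cv = sigma / mu (population standard deviation: an assumption)."""
    return statistics.pstdev(values) / statistics.mean(values)


def p_pvi(descriptors, durations, intervals):
    """P-PVI of equation (1.15).

    descriptors: one list of f0 (or energy) values per anchor, N anchors
    durations:   d_k, duration of each anchor (length N)
    intervals:   I_k, interval between anchors k and k+1 (length N-1)
    """
    n = len(descriptors)
    if n < 2 or len(durations) != n or len(intervals) != n - 1:
        raise ValueError("need N >= 2 anchors, N durations and N-1 intervals")
    cv = [coefficient_of_variation(v) for v in descriptors]
    total = 0.0
    for k in range(n - 1):
        d_k, d_next, i_k = durations[k], durations[k + 1], intervals[k]
        weight = d_k * d_next * i_k / (d_k + d_next + i_k)
        total += weight * abs(cv[k] - cv[k + 1])
    return total / (n - 1)
```

Two anchors whose values are scaled copies of each other have the same coefficient of variation, so they contribute nothing to the metric, which matches the property stated above.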
Prosodic distance
The main motivation for this metric is to take explicit account of the interaction between fundamental frequency, energy and anchor duration. The metric is based on the Hotelling distance ($T^2$) (HD). The idea is to measure the distance between two anchors, each modeled by a single Gaussian, with a computation similar to the Mahalanobis distance. The Hotelling distance includes a normalization by the durations of the two anchors analyzed. Comparing models rather than features has proved relevant in other automatic speech processing tasks, such as speaker segmentation, where the goal is to find homogeneous speech regions attributed to a single speaker.
It is thus possible to measure the similarity between two consecutive segments (features: f0 and/or energy):

$$HD_{ij} = \frac{d_i \, d_j}{d_i + d_j} \, (\mu_i - \mu_j)^T \, \Sigma_{i \cup j}^{-1} \, (\mu_i - \mu_j) \qquad (1.16)$$

where $i \cup j$ denotes the union of the data from two consecutive anchors $i$ and $j$, $d_i$ and $d_j$ are the respective durations of these anchors, and $\Sigma_{i \cup j}$ is the within covariance matrix estimated over the two anchors.
The metric proposed in [Ringeval, 2011] can also incorporate information related to the interval $I_k$ between the anchors:

$$\text{P-HD}_{ij} = k \, (\mu_i - \mu_j)^T \, \Sigma_{i \cup j}^{-1} \, (\mu_i - \mu_j), \qquad k = \frac{d_i \, d_j \, I_k}{d_i + d_j + I_k} \qquad (1.17)$$

With the Hotelling distance, the distance between two anchors can be measured while taking into account the interaction between fundamental frequency and energy. This interaction is modeled here by the type of covariance matrix (diagonal or full). Because the Hotelling distance compares models, other components of prosody, such as voice quality, can be integrated. It belongs to a multi-dimensional approach to rhythm characterization.
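The two distances can be sketched as follows for the diagonal-covariance case. This is a sketch under assumptions rather than the thesis implementation: the names are hypothetical, and the covariance is estimated here as the per-dimension variance of the pooled frames of both anchors, which is one possible reading of the covariance "estimated over the two anchors".

```python
from statistics import mean, pvariance


def column_means(frames):
    """Mean vector of a list of feature frames, e.g. [f0, energy] per frame."""
    return [mean(column) for column in zip(*frames)]


def hotelling_distance(frames_i, frames_j, d_i, d_j, interval=None):
    """Equation (1.16), or (1.17) when `interval` (I_k) is given.

    Diagonal covariance only. Assumption: the variances are computed on the
    union of the frames of both anchors. A dimension with zero variance
    would raise ZeroDivisionError.
    """
    mu_i = column_means(frames_i)
    mu_j = column_means(frames_j)
    pooled = list(frames_i) + list(frames_j)
    variances = [pvariance(column) for column in zip(*pooled)]
    # Quadratic form (mu_i - mu_j)^T Sigma^-1 (mu_i - mu_j), diagonal Sigma.
    quadratic = sum((a - b) ** 2 / v for a, b, v in zip(mu_i, mu_j, variances))
    if interval is None:
        weight = d_i * d_j / (d_i + d_j)
    else:
        weight = d_i * d_j * interval / (d_i + d_j + interval)
    return weight * quadratic
```

A full covariance matrix, which the text mentions as the alternative, would replace the per-dimension division with a matrix inverse and so capture the f0/energy interaction explicitly.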
Emotion characterization
During Fabien Ringeval's thesis, we carried out a large number of experiments on a variety of databases. Figure 1.14(a) shows the contributions of the prosodic components to the emotion recognition score (Berlin database, stratified cross-validation). The rhythmic component groups all the metrics mentioned above. Rhythm plays a complementary, though not necessarily dominant, role in characterizing emotions; pitch remains the dominant carrier of the affective message.
The metrics introduced in Fabien Ringeval's thesis, described as non-conventional, contribute significantly more to the overall recognition score than traditional methods (%V, ∆C, VarCo, PVI, ...) (figure 1.14(b)). An even more interesting aspect, which we will certainly continue to investigate, is the feature space formed by the rhythm metrics. Figure 1.15 shows the spaces formed by conventional (%V, ∆C), mixed (%V, Fmoy⁴) and non-conventional (Fmoy, A-PHD⁵) metrics. The non-conventional metrics not only separate the emotions but also define a continuum of values between emotion categories. This continuum should be compared with that of Plutchik's wheel (a space of perceptual proximity between emotions). The proposed metrics offer a relevant and novel framework for emotion characterization, particularly for analyzing not individual, prototypical categories (Joy, Fear, ...) but dimensional and continuous descriptions of emotions. This last point is the current challenge in automatic emotion processing [Schuller et al., 2011].

⁴ Mean frequency of the "p-centre" rhythmic envelope
⁵ Hotelling distance with a diagonal covariance matrix
(a) Contribution of the prosodic components (fusion weights) according to the anchors
(b) Contribution of the prosodic components (fusion weights) according to the anchors
Fig. 1.14: Contribution of the rhythm metrics

1.5 Emotions in children with communication disorders
This section presents an application to natural data in the context of the differential diagnosis of children with language disorders. The collaboration started in 2006 with the Department of Child and Adolescent Psychiatry of the Pitié-Salpêtrière hospital (headed by David Cohen) and that of the Necker-Enfants Malades hospital (headed by Bernard Golse). We encountered several difficulties which, with hindsight, stemmed from a mutual lack of knowledge between the two worlds of psychiatry (clinical research) and engineering. We have since come a long way together and have even jointly created an interdisciplinary research group within ISIR called IMI2S (Intégration Multimodale, Interaction et Signal Social). The practice that proved effective in our interdisciplinary work was to form pairs ("binômes") of young researchers (engineering + clinical research): the aim is not to train one researcher in both disciplines but to foster mutual enrichment. As will be seen in the rest of this document, we (David Cohen and I) have promoted this approach to research and, with all the humility this undertaking requires, contributed to our respective research fields.

Fig. 1.15: Variations of the measures derived from conventional (a), mixed (b) and non-conventional (c) rhythm models according to emotion category; (d) Plutchik's wheel of emotions

The work presented in this section has two objectives: (1) to quantify objectively the prosodic and emotional characteristics of the verbal productions of children with communication disorders, and (2) to apply, and above all enrich, the characterization methods by confronting them with natural data, in collaboration with clinicians (speech therapists, psychologists).
Pervasive developmental disorders (troubles envahissants du développement, TED) are often regarded as forming a spectrum of deficits, which makes diagnosis difficult. One of the challenges lies in objectifying the disorders and, above all, in defining the categories more finely. In the absence of specific criteria, pervasive developmental disorders not otherwise specified (TED-NOS) form a default diagnostic category. We set out to enrich knowledge of the autism spectrum through interdisciplinary work on language, prosody and emotions. The following sections briefly describe the work carried out and published in both research fields: automatic speech processing [Ringeval et al., 2011] and clinical research [Demouy et al., 2011].
1.5.1 Grammatical function of prosody
Systems that characterize the affective message in a speech signal rely particularly on the prosodic component. The inability to use the functions of prosody (grammatical, affective or pragmatic) to communicate is a core characteristic of individuals with language, communication and social interaction disorders. Before even designing a system for objectifying emotions in children with communication disorders, it seemed essential to assess and understand the prosodic characteristics of these children.
The collaboration with the psychiatry departments took concrete form in the co-supervision of Julie Demouy during her speech therapy dissertation (2009-2010). The work focused on defining a test allowing (1) the assessment of the children by speech therapists and (2) the automatic collection and analysis of the data, with the initial aim of characterizing the grammatical function of prosody (lexical stress, sentence boundaries).
Recruitment
The children were recruited by our child psychiatrist colleagues in two departments of child and adolescent psychiatry: Hôpital La Pitié-Salpêtrière / UPMC and Hôpital Necker / Université René Descartes. The characteristics of these children are described in detail in [Ringeval, 2011; Ringeval et al., 2011; Demouy et al., 2011]; we recall here only the diagnostic categories of the 35 monolingual (French) subjects aged 6 to 18:
Autistic disorder (troubles autistiques, TA): 12 subjects; 10 boys, 2 girls
Pervasive developmental disorders not otherwise specified (TED-NOS): 10 subjects; 9 boys, 1 girl
Specific language impairment (troubles spécifiques du langage, TSL): 13 subjects; 10 boys, 3 girls
The control group (typical development, DT) comprises 70 monolingual (French) subjects matched for age and gender (ratio: 2 DT subjects for each subject with a communication disorder, TC), recruited at the private school Hermitage (Maisons-Laffitte, Hauts-de-Seine). The subjects received an assessment⁶ of language and communication (cf. [Demouy et al., 2011]).
Intonation contour imitation task
This is a constrained task in which the child is asked to repeat 26 sentences (presented in random order) representing different types of modality (e.g. declarative, interrogative, ...) and four types of intonation profile (cf. figure 1.16). The corpus contains 7 hours of recordings for about 3000 sentences (cf. table 1.11).
Fig. 1.16: Intonation profiles according to the pitch contour

Tab. 1.11: Number of sentences available per analysis group for the intonation contour imitation task
Intonation    DT     TA    TED-NOS   TSL
Descending    580    95    71        103
Falling       578    94    76        104
Floating      291    48    40        52
Rising        432    70    60        78
All           1881   307   247       337
Automatic intonation contour recognition system
A review of the state of the art in intonation recognition [Ringeval et al., 2011] shows that three factors matter for designing an effective system: (1) the time scales of analysis (e.g. speech turn, phonetic content, ...), (2) the prosodic and/or acoustic descriptors selected and (3) the recognition strategy (e.g. information fusion, feature selection, ...).

⁶ Assessment conducted by Julie Demouy in collaboration with the clinical departments
The recognition system analyzes low-level prosodic descriptors (f0, energy, ∆, ∆∆) with two approaches:
Static: functionals (statistics) are applied to the descriptors over a speech turn⁷ to form a supervector of dimension 162. Classification into intonation categories is performed by SVM or k-nearest-neighbor (k-nn) classifiers.
Dynamic: the descriptors are modeled directly by a hidden Markov model.
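The static approach can be illustrated by the sketch below, which applies a handful of functionals to f0, energy and their first and second differences over one speech turn. The set of functionals, the delta computation and all names are assumptions made for the example; the system described above uses a richer set that yields 162 dimensions, whereas this sketch yields 30.

```python
import statistics


def deltas(values):
    """First-order differences, padded with 0.0 to keep the same length."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical subset of functionals (the actual list is not given here).
FUNCTIONALS = {
    "mean": statistics.mean,
    "std": statistics.pstdev,
    "min": min,
    "max": max,
    "range": lambda v: max(v) - min(v),
}


def static_vector(f0, energy):
    """Map the contours of one speech turn to a fixed-length feature dict."""
    streams = {}
    for name, contour in (("f0", f0), ("energy", energy)):
        d1 = deltas(contour)
        d2 = deltas(d1)
        streams[name] = contour
        streams["d_" + name] = d1
        streams["dd_" + name] = d2
    return {
        f"{stream}_{label}": float(func(values))
        for stream, values in streams.items()
        for label, func in FUNCTIONALS.items()
    }
```

Whatever the length of the speech turn, the output has the same dimension, which is what allows a fixed-input classifier such as an SVM or k-nn to be used.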
The static approach provides a global model of prosody, whereas the dynamic approach, by considering a sequence of states, imposes a precise structure on prosody⁸. Analyzing the fusion of the two approaches allows a finer study of the prosodic characteristics of the pathological groups (Q-statistics [Kuncheva, 2004]⁹).
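For reference, the pairwise Q statistic between two classifiers can be computed from their per-example correctness as sketched below. The function name and the input format (one boolean per test example and per classifier) are illustrative assumptions.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q statistic between two classifiers.

    correct_a, correct_b: sequences of booleans, True when the classifier
    labeled the corresponding test example correctly.
    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), with -1 <= Q <= 1.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined for this contingency table")
    return (n11 * n00 - n01 * n10) / denominator
```

A value near 0 indicates that the two classifiers err independently, a positive value that they tend to recognize the same examples, and a negative value that they err on different examples, which is the situation in which fusion is most useful.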
One objective of this work is to propose language- and prosody-based markers for diagnosis. The scores themselves are therefore only of partial interest; the key point is the differentiation between the groups studied. We proposed a methodology for training and testing recognition systems in the context of differential diagnosis. The methodology, shown in figure 1.17, takes the performance of the control group (typical development) as the "target" performance: the models are trained on the control group data and tested on the data of the pathological groups.
The performance of the system in recognizing the intonations produced by the control group (typical development, DT) is given in table 1.12. The least ambiguous intonations (cf. figure 1.16), rising and descending, obtain the best scores. The most relevant result lies in the complementarity of the static and dynamic approaches, which varies with the modality and thus motivates the use of distinct approaches.

⁷ Sentence uttered by the child
⁸ The dynamic approach was developed in collaboration with the speech acoustics laboratory in Budapest (Klara Vicsi). Several student exchanges took place; in particular, I supervised György Szaszak (post-doctoral researcher) during his stay at ISIR in 2009
⁹ −1 ≤ Q ≤ 1; Q = 0: independent classifiers; Q > 0: the classifiers recognize the same objects

Fig. 1.17: Intonation contour recognition strategy

Tab. 1.12: Intonation recognition performance (%): static modeling, dynamic modeling and fusion of the two for the DT subjects
Intonation    Static   Dynamic   Fusion   Q_stat,dyn
Descending    61       55        64       0.17
Falling       55       48        55       0.38
Floating      49       71        72       0.67
Rising        93       95        95       0.27
All           67       64        70       0.42

Other results are presented in [Ringeval et al., 2011]. Here we focus only on the contributions of the two approaches across the groups studied. Table 1.13 shows that the dynamic approach becomes predominant for most of the pathological groups, indicating the need for finer modeling for these groups.

Tab. 1.13: Analysis of the contributions of the static and dynamic approaches to the characterization of intonation in children
Measure       DT     TA     TED-NOS   TSL
Q_stat,dyn    0.42   0.65   0.45      0.55

Note that the TED-NOS group appears to lie in the "middle" of a continuum between typically developing children and children with autistic disorders. We consider this a major result for the qualification and diagnosis of TED-NOS, which is often a default diagnosis. Signal processing and pattern recognition methodologies were thus used for the objective characterization of intonation; they make it possible to propose differential markers between pathological groups and open a path toward individualized care [Demouy et al., 2011].
1.5.2 Emotional function
Spontaneous affective speech production task
The second task of the protocol assesses the children's ability to spontaneously produce sentences using the affective dimensions of prosody. This task was designed in collaboration with the Department of Child and Adolescent Psychiatry of the Pitié-Salpêtrière hospital, at the instigation of our colleague Monique Plaza (CNRS researcher in psychology, ISIR). The task consists in telling a picture-based story containing affective stimuli categorized by the clinical team (speech therapist and psychologist) into four levels of emotional valence: positive, neutral, negative and ambivalent. Table 1.14 summarizes the data collected (≈10 h) and segmented into breath groups (false starts, hesitations, environmental noise and speech unrelated to the task were removed manually by Fabien Ringeval).
Tab. 1.14: Number of breath groups available for the analysis of the spontaneous affective speech production task
Valence     DT     TA    TED-NOS   TSL
Positive    597    99    118       184
Neutral     926    151   126       238
Negative    2050   339   283       535
All         3943   652   586       1048
Recognition system
The prosodic features used cover (1) intonation, (2) intensity, (3) voice quality and (4) rhythm. The detailed analysis of the data and results is given in Fabien Ringeval's thesis. Since the aim of this work is to propose metrics for differential diagnosis, we chose to present here only the results related to the non-conventional rhythm models. Figure 1.18 shows the feature spaces formed by the mixed approaches (∆C, IMF frequency). A major result is that, once again, the proposed methods prove to be relevant markers. Subjects with language disorders (TA and TSL) show very similar values across emotions, which suggests an absence of emotion-specific processing. The results for the control group (DT) reflect a mastery of the dimensional component of affect. TED-NOS subjects tend to overplay emotions (gap between "Neutral" and the positive and negative valences). These results, at least for the TED-NOS subjects, are consistent with the investigations conducted by our colleagues in clinical research [Xavier et al., 2011].
Fig. 1.18: Feature space formed by the rhythm metrics

1.6 Learning for the characterization of speech signals in realistic situations
Within the Motherese project¹⁰, in collaboration with the Department of Child and Adolescent Psychiatry of the Pitié-Salpêtrière hospital (David Cohen) and the University of Pisa (Filippo Muratori), we proposed a detector of emotional speech in natural, spontaneous situations.
The emotional component analyzed is a specific speech register produced by the mother while interacting with her child. This register, called mamanais in French or motherese, undeniably carries a positive emotional valence. One hypothesis of the project concerns the regulating role of motherese in interaction, particularly in children with autistic disorders.
In terms of supervision, part of Ammar Mahdhaoui's thesis was devoted to developing a robust motherese detector. Catherine Saint-Georges, who completed a science thesis between ISIR and the child and adolescent psychiatry department, contributed to the research on the definition of motherese and its involvement in mother-child interaction.

¹⁰ Funded by the Fondation de France
1.6.1 Motherese
Motherese (mamanais) is a universal speech register used not only by mothers but also by fathers and potentially any adult interacting with an infant [Fernald and Kuhl, 1987]. It is characterized by modifications of the linguistic (e.g. simplified vocabulary and syntax), phonetic (e.g. vowel duration, hyper-articulation) and paralinguistic (prosody) components. Our work focused on the latter. From an interdisciplinary perspective, we studied the acoustic and prosodic characteristics of motherese as well as its impact on social interaction. A detailed description of motherese can be found in Catherine Saint-Georges's thesis [Saint-Georges, 2011]. We recently carried out a literature review on motherese [Saint-Georges et al., 2011a] showing its interactive and emotional dimensions. The child's response to this social signal is considered a marker of the dynamics of the interaction and, as such, affects the child's development. Chapter 2 summarizes our research on the dynamics of human communication. From a social signal processing standpoint, motherese is close to social emotions, notably because it is produced in order to create an effect on the partner (here the infant). This distinction is also found in the terminology: motherese is also called infant-directed speech, as opposed to adult-directed speech. The following sections describe the original approach to characterizing and detecting motherese in natural situations presented in Ammar Mahdhaoui's thesis [Mahdhaoui, 2010].
1.6.2 Classification of natural and spontaneous data
Classifying natural and spontaneous data is a complex task because the emotional states are non-prototypical (unlike acted emotions) [Schuller et al., 2011]. These emotional states are generally produced in specific situations and scenarios, as is the case for motherese.
Home movies
Our study uses real data of interaction between parents and their children: recordings, taken from home movies, of parents speaking to their children in Italian. The analysis of such movies is a method used in research on child development because it provides information on the first months and years of infants who will later become autistic. All the home movies were kindly provided by Filippo Muratori of the University of Pisa (Stella Maris Scientific Institute). More information on the home movies is available in [Saint-Georges et al., 2011b]. From a signal standpoint, they are characterized by noise: camera movement, poor quality (most of the recordings were made before 2000), varied situations (e.g. play, bath), household noise, and so on. The longitudinal nature of the study increases the time intervals between the data (cf. figure 1.19).
Fig. 1.19: Distribution of the data over the three semesters studied (x-axis: period in semesters, 1 to 3; y-axis: number of examples, 0 to 250; series: number of motherese examples, number of non-motherese examples, total number of examples)
Under our supervision, Catherine Saint-Georges and Raquel Cassel¹¹ selected and annotated more than a thousand speech segments in two categories: motherese and non-motherese (cf. figure 1.20). The latter category corresponds to situations where parents address their children with non-affective verbal productions (close to adult-directed speech). Inter-rater reliability is good (kappa = 0.82, 95% confidence interval: [0.75-0.90]).
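As a reminder of how such an agreement figure is obtained, the sketch below computes a kappa coefficient for two annotators. It assumes Cohen's kappa, which the text does not name explicitly, and the function name is illustrative; the confidence interval is not computed.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: proportion of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in set(count_a) | set(count_b)) / n ** 2
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance: two annotators who agree on 3 of 4 segments reach only 0.5 here when half of that agreement is expected by chance.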
Classification
From a theoretical standpoint, it should be possible to distinguish the two categories studied (motherese and non-motherese) using prosodic information alone. However, the data acquisition context introduces substantial sources of variability (noise, interval between recordings). The system implemented is described in [Mahdhaoui et al., 2008, 2011]. It combines acoustic (MFCC) and prosodic (statistics applied to f0 and energy) characterizations, which are then used by classifiers (GMM and k-nn). A multi-anchor characterization is used: segmental (30 ms analysis windows) and supra-segmental (voiced regions). The experimental results in motherese detection show the relevance of combining the anchors (cf. figure 1.21).

¹¹ PhD student taking part in the project

Fig. 1.20: Example of motherese annotation
This algorithm is currently used by the clinical team (theses of Catherine Saint-Georges and Raquel Cassel) to detect motherese in the home movies. Obvious limitations have emerged: (1) as with most supervised approaches, the performance of the system depends strongly on the quality and quantity of the training data, and (2) the manual annotation of emotional speech is a subjective task that influences the prediction function of the classifier. We proposed to address both limitations at once with a semi-supervised framework for motherese detection.
1.6.3 Problématique de l'apprentissage semi-supervisé
Fig. 1.21: ROC curves describing motherese detection performance: Comb1 (k-nn classifier), Comb2 (GMM classifier). Combination 1 = P_knn,seg × P_knn,supra; Combination 2 = P_gmm,seg × P_gmm,supra; λ is the weighting coefficient used in the fusion equation for each combination. Source: Int. J. Methods Psychiatr. Res. 20(1): e6–e18 (2011).
Semi-supervised learning provides a formal framework for strengthening the categorization rules of supervised classifiers by combining learning and prediction on labelled and unlabelled data. The most widely used algorithms rely on generative methods (EM: Expectation Maximization), transductive methods (minimizing the error made on unlabelled data), and learning schemes such as self-training and co-training.
Self-training is one of the earliest semi-supervised learning methods. A classifier trained on a small amount of labelled data is used to predict labels for unlabelled data in order to enlarge its own training set. At each iteration, the n examples predicted with the highest confidence are added to the training set, until the pool of unlabelled data is empty. The co-training algorithm [Blum and Mitchell, 1998] is generally regarded as an extension of self-training. It uses at least two classifiers h1 and h2 that differ in the chosen models (GMM, k-nn...) or in the features (acoustic or prosodic, for speech). The principle of the algorithm, recalled in Table 1.15, is the mutual enlargement of the training set. These learning techniques have been applied successfully to natural language processing, document processing and web page classification. Applying them to affective speech processing is not straightforward because, as we have seen, the vast majority of recognition approaches exploit several descriptors (acoustic, prosodic). These descriptors are combined at different levels (feature space, decision space...). In [Mahdhaoui and Chetouani, 2011], we proposed an algorithm suited to these situations that exploits a multiple characterization of the speech signal.
Tab. 1.15: Co-training algorithm
Input
    Set L of labelled data
    Set U of unlabelled data
While U ≠ ∅
    Train a classifier h1 on the set L
    Train a classifier h2 on the set L
    Label p randomly chosen examples of the set U using classifier h1
    Label p randomly chosen examples of the set U using classifier h2
    Add the set T of examples labelled by h1 and h2 to the set L
    Remove T from U
End
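The loop of Table 1.15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid classifier stands in for the GMM or k-nn models mentioned in the text, each view is a single scalar feature, and the names `train_centroids`, `predict` and `co_training` are hypothetical.

```python
import random

def train_centroids(samples, view):
    """Fit a nearest-centroid classifier on one view (feature index) of the labelled set."""
    sums, counts = {}, {}
    for features, label in samples:
        sums[label] = sums.get(label, 0.0) + features[view]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, features, view):
    """Return the label whose centroid is closest on the given view."""
    return min(centroids, key=lambda label: abs(features[view] - centroids[label]))

def co_training(labelled, unlabelled, p=2, seed=0):
    """Grow L by letting h1 and h2 each label p randomly chosen examples of U."""
    rng = random.Random(seed)
    L, U = list(labelled), list(unlabelled)
    while U:
        h1 = train_centroids(L, view=0)   # classifier on the first view
        h2 = train_centroids(L, view=1)   # classifier on the second view
        T = []
        for h, view in ((h1, 0), (h2, 1)):
            for features in rng.sample(U, min(p, len(U))):
                T.append((features, predict(h, features, view)))
                U.remove(features)        # remove T from U
        L.extend(T)                       # add T to L
    return L

# Toy data (assumed): two well-separated classes described by two views.
labelled = [((0.0, 0.1), 1), ((1.0, 0.9), 2)]
unlabelled = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.2), (0.8, 0.7), (0.05, 0.15)]
final = co_training(labelled, unlabelled, p=1)
```

In this toy setting every unlabelled example ends up in L with the label of its nearest class; on real data the quality of the added labels depends on how reliable each view is.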
1.6.4 Multi-view co-training
Multi-view co-training combines the predictions of different classifiers (posterior probabilities) to obtain a single prediction for each test example (cf. Table 1.16). The proposed method is a new form of co-training that simultaneously exploits information fusion and semi-supervised learning. It rests on the idea that several views of the same object can be used to strengthen the prediction rules. Ammar Mahdhaoui's thesis focused on the definition of this algorithm, which led to the following publication: [Mahdhaoui and Chetouani, 2011].
An important element of the algorithm is the step that estimates the confidence of the classification of a given example (cf. Table 1.16). This step relies on an estimate of the classification margin, directly inspired by support vector machines. It is given by the following equation:

\[ \mathrm{margin}_j = \frac{1}{v} \, \frac{\sum_{k=1}^{v} \omega_k \times h_k(C_j \mid z_i^k)}{\sum_{k=1}^{v} \omega_k} \tag{1.18} \]

where $h_k(C_j \mid z_i^k)$ is the posterior probability, estimated by view $k$, that example $z_i$ belongs to class $C_j$; $v$ is the number of views (feature extractor + classifier) used by the algorithm (e.g. MFCC+GMM, f0 statistics + k-nn...); and $\omega_k$ is the weight associated with view $k$.
The margin can be interpreted as a measure of confidence in the prediction. It thus becomes possible to select the examples categorized with the highest confidence (as is done in co-training algorithms). The major advantage of the method is its integrative treatment of all the views, obtained by fusing the predictions of all the classifiers. This fusion is dynamic because the weight associated with each view is re-evaluated by:

\[ \omega_k = \frac{\sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)}{\sum_{k=1}^{v} \sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)} \tag{1.19} \]

The aim is to favour the views that improve categorization in the sense of the margin. The stopping conditions are identical to those of co-training (cf. Table 1.15) and depend on the number of examples remaining in the unlabelled set.
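As a numerical illustration of equations (1.18) and (1.19), the following Python sketch computes the margin of one example and the updated view weights. It assumes that (1.18) carries the 1/v factor in front of the normalized weighted sum; the function names and the toy posterior values are hypothetical.

```python
def margin(view_posteriors, weights):
    """Equation (1.18): weighted sum of the per-view posteriors h_k(C_j | z_i^k),
    normalized by the sum of the weights and by the number of views v."""
    v = len(view_posteriors)
    weighted = sum(w * h for w, h in zip(weights, view_posteriors))
    return weighted / (v * sum(weights))

def update_weights(posteriors_on_T):
    """Equation (1.19): posteriors_on_T[k][i] holds h_k(z_i^k) for example i of T.
    Each view's weight is its share of the total posterior mass over T."""
    per_view = [sum(view) for view in posteriors_on_T]
    total = sum(per_view)
    return [s / total for s in per_view]

# Toy values (assumed): three views, uniform initial weights 1/v.
weights = [1 / 3, 1 / 3, 1 / 3]
m = margin([0.9, 0.6, 0.3], weights)
new_weights = update_weights([[0.9, 0.8], [0.6, 0.4], [0.3, 0.0]])
```

With these values the view that assigns the highest posteriors to the selected examples receives the largest weight, which is the behaviour the text describes as favouring the views that improve categorization.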
Tab. 1.16: Automatic co-training algorithm for motherese classification
Input
    Set L of m labelled examples
    $L = \{(l_1^1, \ldots, l_1^v, y_1), \ldots, (l_m^1, \ldots, l_m^v, y_m)\}$ with $y_i \in \{1, 2\}$ (two-class problem)
    Set U of n unlabelled examples
    $U = \{(x_1^1, \ldots, x_1^v), \ldots, (x_n^1, \ldots, x_n^v)\}$
    v = number of views
Initialization
    $\omega_k$ (classifier weights) = 1/v for all classifiers
While U ≠ ∅
    A. Classify all the examples of the test set U
        For k = 1, 2, ..., v
            1. Train classifier $h_k$ on the set L
            2. Label the examples of the set U using classifier $h_k$
            3. Estimate the probability that each example $x_i$ of U belongs to class $C_j$:
               $p(C_j \mid x_i) = \sum_{k=1}^{v} \omega_k \times h_k(C_j \mid x_i^k)$
            4. Label all the data of the set U
        End for
    B. Update the training set L and the test set U
        $U_j = \{z_1, \ldots, z_{n_j}\}$ are the sets of examples classified as $C_j$
        Estimate the margin of the examples of the set $U_j$
        Take $T_j$ examples of $U_j$ predicted with a confidence above a threshold
        Add the set T to L and remove it from U
    C. Update the weight of each classifier
        $\omega_k = \frac{\sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)}{\sum_{k=1}^{v} \sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)}$
End while
1.6.5 Results
The algorithm was evaluated using 9 views built from MFCCs, prosodic descriptors and perceptual descriptors, combined with different classifiers (k-nn, GMM and SVM). The experimental results are described in detail in [Mahdhaoui and Chetouani, 2011].
The evaluation protocol for the semi-supervised approaches consists in varying the amount of training data while always testing on a different but fixed data set (500 test examples). Figure 1.22 reproduces the final results of [Mahdhaoui and Chetouani, 2011], comparing a supervised approach with semi-supervised methods: self-training, co-training and multi-view co-training (the method proposed in Ammar Mahdhaoui's thesis). A comparison among the semi-supervised methods shows the benefit of bringing information fusion into the learning process itself (multi-view co-training vs co-training and self-training). Moreover, multi-view co-training proves more effective than the supervised method (manual labelling of the training data). One reason is the strengthening of the prediction rules achieved by the cooperative and iterative components of the co-training algorithm. Multi-view co-training offers an elegant and interesting framework for processing subjective data.
Fig. 1.22: Classification performance with different amounts of labelled training data. Axes: accuracy (45 to 80) versus number of annotations (10 to 100). Curves: proposed co-training method; supervised method (GMM-MFCC); self-training method (GMM-MFCC); standard co-training method using the two best classifiers; standard co-training method using all classifiers.
1.7 General discussion
The work presented in this chapter has allowed us to set up a complete approach to the characterization of speech signals, from feature extraction to the recognition of social components. This work is nevertheless not finished. Identifying units of analysis for emotional speech remains an open problem. The optimal unit for classifying a given emotion varies with the individual and the context: from voiced segments to syllables, or even non-phonetic segments defined by acoustic, prosodic or rhythmic prominence. A better understanding of the nature of these segments would improve automatic emotion processing and, more generally, the modelling of affect. Further investigations on realistic corpora combined with dimensional characterizations are needed. With Fabien Ringeval, during his post-doctorate at the University of Munich with Bjorn Schuller, we have begun studying these units. Preliminary results show the value of the proposed units for the dimensional characterization of spontaneous data (corpus of the AVEC campaign, footnote 12).
The convergence of the analysis of emotional units with learning algorithms offers interesting prospects, such as the "ememe". Multi-view semi-supervised learning makes it possible both to strengthen the prediction rules and to understand the role of the units in this strengthening. A largely unexplored direction is the use of units of analysis in incremental learning, allowing a decision to be made well before the end of the speaker's verbal production.
12 http://sspnet.eu/avec2011/
Chapter 2
Dynamics of human communication
2.1 Context
Face-to-face communication is a dynamic process based on the exchange and interpretation of social signals [Morency, 2010]. Mastery of this dynamic affects turn-taking, engagement, joint attention... Modelling it is identified as a major challenge for social signal processing and personal robotics. The approaches proposed focus on analysing the mutual influence of the participants on one another. A successful interaction is characterized by a dynamic adaptation of the interactants' behaviours.
The work presented in this chapter deals with the characterization of this adaptation, called interactional synchrony. This line of research is scientifically rooted in reality mining [Pentland, 2008] and aims to develop tools for analysing and detecting synchrony (and dyssynchrony) in interactions, for a better understanding of human communication. The targeted applications are differential diagnosis (comparison of diagnostic categories in interactive situations) and the development of interactive systems endowed with social intelligence (cf. Chapter 3).
2.2 Interactional synchrony
Defining interactional synchrony is itself a complex task, because it is determined from rich and ambiguous signals (gestures, turn-taking, gaze...). The literature uses many terms: mimicry, social resonance, chameleon effect, etc. Joint attention, empathy, theory of mind, engagement and pragmatics (e.g. turn-taking) are closely related skills that are often necessary for interactional synchrony to appear.
One of the major lines of our research concerns the study and understanding of interpersonal synchrony, through methods for detecting, characterizing and predicting this phenomenon. The theses of Catherine Saint-Georges [Saint-Georges, 2011] and Ammar Mahdhaoui [Mahdhaoui, 2010] highlighted the importance of synchrony in identifying early signs of autism, based on an automatic characterization of mother-infant interactions in home movies (cf. section 2.3). Both theses were carried out within the Motherese project (footnote 1), in collaboration with the University of Pisa. Emilie Delaherche has been working on a thesis under my supervision since 2010, on modelling the dynamics of interaction within the MULTI-STIM project (footnote 2) (cf. section 2.4). This project aims to develop automatic methods for the fine-grained analysis of interaction with a view to differential diagnosis (autistic disorder, PDD-NOS; cf. section 1.5). The methods developed will also be deployed in robotic systems (cf. the MICHELANGELO project, footnote 3).
We have therefore addressed the characterization of synchrony from several angles, leading to a multidisciplinary approach that combines computational methods, social signal processing, developmental psychology and psychiatry [Delaherche et al., 2011]. The following sections specify the definitions and functions attributed to interactional synchrony. We also introduce the problem and present the scientific positioning of our contributions.
2.2.1 Definitions
Bernieri and Rosenthal [1991] define interactional synchrony as: "the degree to which the behaviours in an interaction are non-random, patterned, or synchronized in both form and timing". Social coordination is characterized by (1) behaviour matching and (2) the dynamics of the exchanges.
Behaviour matching takes the form of a similarity in actions, gestures or postures. This is referred to as imitation, congruence or the chameleon effect. Synchrony is generally associated with temporal coordination: adaptation and rhythm of behaviours (which need not be identical). Interactional synchrony also has a cerebral basis [Tognoli et al., 2007; Dumas et al., 2010; Guionnet et al., 2011]. Dumas et al. [2010] showed, for example, a correlation between interactional synchrony and the emergence of synchronization between the partners' brain activities in an imitation task (hand movement).
1 Supported by the Fondation de France
2 Emergence UPMC 2009 project
3 STREP FP7-ICT-2011-7, ICT for Health, Ageing Well, starting on 01/10/2011
According to Harrist and Waugh [2002], the emergence of synchrony requires the following conditions: (1) maintained engagement ("tracking each other"), (2) temporal coordination of activity levels (posture, body movement, facial expressions), (3) contingency, and (4) harmony (each partner perceives the other's state and adapts accordingly). The emergence of synchrony depends on the exchange of information [Oullier et al., 2008].
2.2.2 Implications for child development
Developmental psychology has contributed greatly to the understanding of synchrony. Feldman [2007] describes synchrony as the co-occurrence of behaviours, affective states and biological rhythms between parents and infant. Synchrony begins in prenatal life and continues after birth through interaction.
Interactional synchrony [Delaherche et al., 2011] plays a role in: (1) improving social presence, (2) strengthening the social bond through awareness of the rhythm of interactions, (3) secure attachment, the lack of which directly affects the child's development (e.g. the case of depressed mothers), and (4) language acquisition through exposure to the language of the interaction [Kuhl, 2004; Goldstein et al., 2003].
Conversely, a lack of synchrony has a negative impact on the interaction. Murray and Trevarthen [1985] conducted an experiment in which a time lag was introduced, through a double audio-video circuit, into mother-infant interactions. The infant receives a delayed recording of the mother. Compared with a live interaction, the authors showed that 6-week-old infants are disturbed by the delay in the transmission of maternal behaviour. This disturbance is explained by the infants' ability to detect dyssynchrony. Nadel et al. [1999] proposed a complementary experimental design (live/delayed/live) and observed a loss of interest from the infant in the delayed condition and renewed interest when the interaction became synchronous again.
2.2.3 Implications for social interactions in adults
In adult interaction, interactional synchrony contributes to the regularity (smoothness) of the interaction. Chartrand and Bargh [1999] showed a link between the perceived regularity of an interaction and the degree of imitation between the interactants. Similarly, Lakens [2010] found a relationship between the perception of movement differences and the perception of entitativity (the emergence of a social unit).
In a field close to our own concerns, Ramseyer and Tschacher [2011] used computational methods (quantity of movement, statistical testing) to characterize nonverbal synchrony between a patient and a therapist during psychotherapy sessions. The results of this study show that synchrony was higher in the sessions that patients judged to be richer and more effective in terms of exchanges. Coordination between interactants makes it possible to assess the cohesion of a group [Hung and Gatica-Perez, 2010] and is an indicator of each individual's participation (e.g. role in the interaction) [Vinciarelli, 2009].
2.2.4 Implications for interactive robotics
Drawing on developmental psychology, and more particularly on motherese (cf. section 1.6.1) and motionese (gestures and actions directed at the child), Rolf et al. [2009] proposed a synchrony detection model that exploits multimodal information. This model is used to characterize social learning phases, in the manner of parent-infant interaction. The authors seek to give the iCub robot (footnote 4) the ability to detect synchrony and to learn socially, allowing it to focus on the learning phases during interaction with a human partner. Prepin and Gaussier [2010] proposed a robotic architecture in which synchrony with the human partner acts as a reinforcement signal for learning tasks (e.g. movement of the robot's arms).
Michalowski et al. [2007] designed the Keepon robot (footnote 5), whose mode of communication is based on synchrony. This robot has very few degrees of freedom. Synchrony (or dyssynchrony) regulates engagement with partners (autistic children). In another context, Prepin and Pelachaud [2011] propose modelling synchrony to characterize the exchanges between a virtual agent and its human partner in dialogue tasks.
2.2.5 Automatic characterization of synchrony
Detecting, characterizing and evaluating synchrony is difficult because of the variety of factors involved in its emergence. Studying and analysing synchrony refines our knowledge of the mechanisms governing human communication, in particular its dynamic component.
4 http://www.icub.org/
5 http://beatbots.net/
Problem statement
Non-computational methods for studying synchrony range from the micro-analysis of behaviours to the global perception of synchrony. Annotation methods evaluate the interactants' behaviours in a fine-grained, local manner. These may be micro-behaviours such as movements of the head, eyes or trunk... but also macro-behaviours such as eye or physical contact... Behaviour annotation methods provide rich information. They are generally used in psychopathology for a better understanding of modes of interaction (e.g. autism). Annotation comes with many constraints: the need for several annotators (usually trained beforehand on the coding grids used), inter-rater reliability and validation of the grids, the time required (annotation and analysis), the sheer amount of data... Methods based on a global judgement of the interaction offer an alternative. They are also consistent with the fact that interpersonal synchrony is not perceived through a single signal but through all the verbal and nonverbal signals exchanged during the interaction (multimodal integration).
The development of computational methods for characterizing synchrony is a rapidly growing field. A distinction is generally made between methods that start from already annotated data (e.g. laughter, crying, turn-taking) and those that exploit the signals directly (e.g. quantity of movement, pauses). Messinger et al. [2010] characterize the dynamics of mother-infant interactions by modelling sequences of annotated behaviours: probabilities $p(b_i, m_i, b_{i-1}, m_{i-1})$ estimated by maximum likelihood. The difficulties lie in (1) the often hierarchical nature of behaviours and (2) the variability in the number and type of behaviours to be handled [Magnusson, 2000].
There are many methods that exploit social signals directly, without annotation (see [Delaherche et al., 2011] for a literature review), and a formalization of the problem shows that the steps required are generally the following:
1. Feature extraction (often unimodal): quantification of movement, tracking of characteristic points (e.g. arm, head), posture.
2. Measures: correlation, coherence (syntony), recurrence quantification analysis, phase lag.
3. Significance test: surrogate data, bootstrap.
4a. Parameters: level of synchrony, time lag, leader...
4b. Representation of synchrony: correlation maps, recurrence plots...
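Steps 2 to 4a can be illustrated with a small Python sketch: a lagged Pearson correlation as the measure, shuffled surrogates as the significance test, and the best lag as an output parameter. This is only one possible instantiation under stated assumptions; the function names are hypothetical, the two signals are synthetic (the second is the first delayed by 5 samples), and random shuffling is just the simplest kind of surrogate.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag]; a positive lag means y follows x."""
    if lag >= 0:
        return pearson(x[:len(x) - lag], y[lag:])
    return pearson(x[-lag:], y[:len(y) + lag])

def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: abs(lagged_correlation(x, y, lag)))

def surrogate_p_value(x, y, max_lag, n_surrogates=200, seed=0):
    """Share of shuffled surrogates of y whose peak correlation reaches the observed peak."""
    rng = random.Random(seed)
    observed = abs(lagged_correlation(x, y, best_lag(x, y, max_lag)))
    hits = 0
    for _ in range(n_surrogates):
        shuffled = y[:]
        rng.shuffle(shuffled)  # destroys the temporal structure of y
        peak = abs(lagged_correlation(x, shuffled, best_lag(x, shuffled, max_lag)))
        hits += peak >= observed
    return (hits + 1) / (n_surrogates + 1)

# Synthetic example (assumed): y reproduces x with a delay of 5 samples.
_rng = random.Random(42)
x = [_rng.gauss(0.0, 1.0) for _ in range(120)]
y = [0.0] * 5 + x[:-5]
lag = best_lag(x, y, max_lag=10)
p = surrogate_p_value(x, y, max_lag=10, n_surrogates=50)
```

Here the sign of the best lag indicates which signal leads, which corresponds to the "leader" parameter of step 4a.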
Scientific positioning
A generic formalization of the concept of interactional synchrony requires a dynamic, multimodal and contextualized model of nonverbal signals.
My contributions in this area concern fundamental steps of the characterization:
- Nonverbal cues: Two approaches are proposed for estimating the strength of the link between the signals exchanged by the interactants. The first exploits previously annotated social signals. The dynamics of the exchange sequences are characterized with n-gram models that define interactive patterns. The second method seeks to take into account the nature of the signals exchanged. It is based on extracting and correlating nonverbal cues (prosody, speech rhythm, pauses, quantity of movement, gestures...).
- Multimodal integration: Exploiting multimodality is obviously required in social signal processing, but it raises several problems. The link between modalities (cross-modality) is difficult to characterize. Our approach takes cross-modality into account explicitly. The mathematical properties of non-negative matrix factorization allow an integrative and organizational model of interactive patterns. We proposed a formalization in the form of a synchrony matrix, which makes it possible to use a set of statistical tools for measuring the strength of association (e.g. correlation, synchrony).
- Interpretation of results: The aim is to produce representations that make the exchanges between partners explicit. The non-negativity property and the sparse nature of non-negative matrix factorization proved relevant for this task: decomposition of interactive patterns on the basis of communication strategies, activation of these strategies... By treating the synchrony matrix as a similarity matrix, we were able to introduce a dendrogram representation that makes interpretation easier for non-experts.
- Continuous characterization of the level of synchrony: While emotional signals are beginning to be characterized continuously (dimensional space), this is only very rarely the case for synchrony, dominance or cohesion. Moving from discrete categories (low or high synchrony) to a continuous space is a major challenge for modelling. We have proposed some solutions to this problem.
The work presented in this chapter was carried out within collaborative projects: Fondation de France, MULTI-STIM, and European COST Action 2102 Cross-Modal Analysis of Verbal and Non-Verbal Communication (Prof. Anna Esposito).
2.3 Integrative modelling of synchrony
The theses of Catherine Saint-Georges and Ammar Mahdhaoui deal with the dynamics of interactions within the Motherese project. Once again, this pairing fostered interdisciplinary exchange and research. The theses of Catherine Saint-Georges (doctoral school Cerveau, Cognition et Comportement) and Ammar Mahdhaoui (doctoral school Sciences Mécaniques, Acoustique, Electronique et Robotique) both addressed the analysis and modelling of parent-infant synchrony, but from different angles. This interdisciplinary approach allowed us to propose innovative and effective computational methods in a rewarding application setting: the understanding of atypical interactions.
2.3.1 Early signs of autism: a study of home movies
Investigations of early signs mainly use (1) retrospective questionnaires completed by parents, (2) prospective studies and (3) the study of home movies. Our research exploits information extracted from home movies because they have the advantage of being ecological: parent-infant interactions are studied in a natural environment, under spontaneous conditions and without any experimental apparatus.
In a comprehensive and critical review of studies on home movies, we identified a set of early signs [Saint-Georges et al., 2010]. Figure 2.1 summarizes the results of this study. A first result is the importance of developmental dynamics in the appearance of signs (long term). Most of the signs concern social interaction (e.g. eye contact, facial expressions, joint attention). Catherine Saint-Georges also studied the agreement between these results and those from prospective methods [Saint-Georges, 2011].
Fig. 2.1: Early signs of autism as a function of age and of the main axes of development [Saint-Georges, 2011]. Horizontal axis: age from 0 to 24 months; rows: communication, social behaviours, intersubjectivity, activity. In bold: signs recognized as specific to autism (compared with intellectual disability). N indicates the number of studies.
Studies of the early signs of autism concentrate mostly on the child and only very rarely on the child's immediate environment. The nature of the interaction, that is, the partner's solicitations and responses, is not explicitly taken into account in the proposed analyses, even though it plays a fundamental role in the child's development (e.g. language acquisition [Kuhl, 2004], learning [Meltzoff et al., 2009]). Studying the synchrony of mother-infant behaviours gives access to this information [Feldman, 2007], and hence to a finer understanding of the mechanisms governing interactions.
2.3.2 Computational modelling of synchrony
Starting from the data annotated by our colleagues at the University of Pisa [Muratori et al., 2011], we proposed, in Ammar Mahdhaoui's thesis, a model of synchrony in home movies (cf. Figure 2.2). The corpus used is detailed in [Saint-Georges et al., 2011b]. We briefly recall the composition of the diagnostic categories:
Group 1 (AD): 15 children diagnosed with autism (10 boys / 5 girls).
Group 2 (ID): 12 children with intellectual disability (7 / 5).
Group 3 (TD): 15 typically developing children (9 / 6).
The database consists of a total of 42 movies (each lasting at least 10 minutes) spread over the children's first 3 semesters of life. The behaviours of the infant and the mother were annotated by the University of Pisa using the ICBS grid (Infant Caregiver Behavior Scale) (detailed in [Saint-Georges et al., 2011b]). This grid comprises 29 labels referring to the ability of the infant (BB) to engage in interactions (e.g. orienting towards a person, smiling, vocalizing...) and 8 labels describing the solicitations and stimulations of the parent (CG) (footnote 6) (e.g. touching, vocalizing).
Fig. 2.2: Automatic analysis of parent-infant interaction. Processing chain shown: home movies; multimodal interactions (CB and IB events along a time axis, windows < 3 s); interaction model (n-gram); coding (tf-idf); grouping of signals (NMF); quantitative and statistical analysis. {CG→BB}: interactive patterns from the caregiver (CG) to the infant.
The approach proposed in Ammar Mahdhaoui's thesis is inspired by automatic document analysis (Figure 2.2). It aims to exploit all the annotated signals to characterize interactional synchrony.
Extraction and characterization of interactive patterns
The first step of any study of synchrony is the extraction of interactive patterns ("meta-signals"). To do so, we assumed that the impact of one interactant's behaviour on the partner is limited in time. Windowing the videos yields temporal segments in which the behaviour sequences ({CG→BB} and {BB→CG}) become meaningful. Feldman [2007] also points out the importance of a finite-horizon temporal analysis. A study of the literature on parent-infant synchrony leads us to set the window duration to 3 s (footnote 7) [Feldman, 2007]. We thus obtain a set of meta-behaviours called interactive patterns. For example, the interactive pattern 'Touch#Vsim' means that the parent produced a 'touching' signal and that the child responded with a 'simple vocalization' within a 3 s window.
6 or any other adult partner
7 We will find a similar duration in a study related to interactive robotics (cf. Chapter 3)
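The windowing step can be sketched as follows in Python. The sketch assumes that annotations are available as (time in seconds, actor, label) triples and that a pattern pairs a behaviour with the first behaviour of the partner occurring within the 3 s window; this pairing rule, the function name and the labels 'Voc' and 'Smile' are illustrative assumptions, not the annotation format of the ICBS grid.

```python
def extract_patterns(events, window=3.0):
    """Pair each behaviour with the partner's first response within `window` seconds.

    events: iterable of (time_s, actor, label), actor being "CG" or "BB".
    Returns a list of (direction, pattern) such as ("CG->BB", "Touch#Vsim").
    """
    events = sorted(events)  # chronological order
    patterns = []
    for i, (t, actor, label) in enumerate(events):
        for t2, actor2, label2 in events[i + 1:]:
            if t2 - t > window:      # outside the temporal horizon
                break
            if actor2 != actor:      # first response from the partner
                patterns.append((f"{actor}->{actor2}", f"{label}#{label2}"))
                break
    return patterns

# Toy annotation stream (assumed values).
events = [
    (0.0, "CG", "Touch"),
    (1.2, "BB", "Vsim"),
    (6.0, "CG", "Voc"),
    (10.5, "BB", "Smile"),
    (11.0, "CG", "Touch"),
]
patterns = extract_patterns(events)
```

In this example the caregiver's 'Voc' at 6.0 s produces no pattern because the infant's next behaviour occurs 4.5 s later, outside the window.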
Les schémas interactifs forment un nouvel espace de méta-comportements
orant une nouvelle perspective d'analyse de l'interaction. D'un point de vue
mathématique, toutes les combinaisons de comportements dénis dans la grille
ICBS sont possibles (CG × BB comportements = 8 × 29). Cependant, nos
travaux portent sur des interactions humaines régies par des règles de communication, limitant de ce fait la combinatoire.
Estimating the strength of association between the behaviors observed in the interactions is the second step of our modeling (cf. Figure 2.2). Our approach consisted in estimating n-grams (Markov models of order n−1). For a 3-gram model, we estimate the probability of occurrence of behavior i given the two preceding behaviors (comp_{i−1}, comp_{i−2}): P(comp_i | comp_{i−2}, comp_{i−1}). Within our collaboration with clinicians, we restricted ourselves to bi-grams to keep interpretation simple (a requirement in our model-design phase). Bi-gram modeling lends itself to a graphical representation of the interaction [Saint-Georges et al., 2011b]. The n-gram models are estimated with the maximum-likelihood criterion.
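For bi-grams, the maximum-likelihood estimate reduces to relative frequencies: P(b | a) = count(a, b) / count(a followed by anything). The sketch below illustrates this on toy sequences; the function name `bigram_mle` and the example labels are assumptions, and no smoothing is applied.

```python
from collections import Counter

def bigram_mle(sequences):
    """Maximum-likelihood bi-gram probabilities P(cur | prev) from label sequences."""
    context = Counter()  # how often each label appears as the preceding behavior
    bigram = Counter()   # how often each (prev, cur) pair appears
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return {(p, c): n / context[p] for (p, c), n in bigram.items()}

# Hypothetical behavior sequences, for illustration only.
sequences = [["Touch", "Vsim", "Touch", "Smile"], ["Touch", "Vsim"]]
probs = bigram_mle(sequences)
```

Here 'Touch' is followed by 'Vsim' in two of its three occurrences as a context, so the estimate of P(Vsim | Touch) is 2/3. Pairs that never occur receive no entry, which is the zero-probability issue that smoothing methods address.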
A first validation of this approach consisted of a comparative study of the bi-grams obtained (meta-behaviors) across groups and semesters (S1: 0-6 months, S2: 6-12 months, S3: 12-18 months). Generalized linear mixed models (GLMM) were used with, after analysis of the interactive patterns, quasi-Poisson distributions (details in [Saint-Georges et al., 2011b; Saint-Georges, 2011; Mahdhaoui, 2010]). This study yields results in a form directly interpretable by clinicians (statistical significance), with the advantage of offering an analysis that is both interactive and developmental.
Information integration by non-negative matrix factorization
N-gram modeling and GLMM analyses provide useful indications on the dynamics and relevance of each interactive pattern. However, the mutual influence of the exchanged social signals is not explicitly modeled. An integrative model makes it possible to study the links between these interactive patterns. Another motivation for the integrative approach is the need to estimate all n-grams: an interactive pattern absent from the training set may appear in the test set. Smoothing methods partly alleviate this problem. However, a pathological group may be characterized precisely by the presence or absence of certain interactive patterns, in which case smoothing methods are no longer appropriate because they attenuate the differences. N-gram modeling alone therefore proves only weakly representative of the (developmental) dynamics of parent-infant interaction when the goal is to discover early signs.
We proposed to summarize these interactive patterns by estimating clusters of interactive patterns. The analysis by semester and by group is then carried out not on the patterns themselves but through these clusters. This approach makes it possible to model the links that exist between interactive patterns. Raising the level of abstraction allows the structures of interaction to be compared: the nature and number of clusters may differ. Several clustering approaches can be used: k-means [Mahdhaoui, 2010], neural networks... We opted for a representation that supports both the modeling and the interpretation of these links.
Non-negative matrices
The presence and absence of certain human behaviors in an interactive scene inform on the nature and context of the exchanges. By expressing interactive scenes as non-negative matrices, our aim is to exploit decomposition methods to extract higher-level semantic information (clusters of interactive behaviors). Non-negative matrices have been applied successfully in many fields (source separation, information retrieval, biology...). Non-negative matrix factorization is relevant for interpreting social behaviors [Wu et al., 2008].
Non-negative matrix factorization [Lee and Seung, 1999] is a feature-extraction method that decomposes a non-negative matrix V (of size n × m) into two non-negative matrices W (n × k) and H (k × m):
\begin{equation}
V \approx W H \tag{2.1}
\end{equation}
The rank k of the factorization is the number of latent factors (with (n + m)k < nm) and is generally interpreted as the number of clusters (groups of interactive patterns). The rows or columns of the matrices (H and W) indicate the degree of membership in a given cluster.
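As an illustration of equation (2.1), the sketch below factorizes a small matrix with the multiplicative updates of Lee and Seung for the Frobenius-norm criterion. It is a toy example rather than the configuration used in the thesis: the matrix values, the number of iterations, the random initialization and the small constant `eps` added to the denominators for numerical safety are all assumptions.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate V (n x m, non-negative) by W (n x k) @ H (k x m), both non-negative."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix: rows stand for interactive patterns, columns for scenes (hypothetical values).
V = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 6.0, 2.0]])
W, H = nmf(V, k=2)
error = np.linalg.norm(V - W @ H)   # Frobenius reconstruction error
membership = W.argmax(axis=1)       # dominant latent factor for each row of V
```

Reading the largest entry of each row of W as a cluster assignment is one simple way to interpret the membership degrees mentioned above; other readings (soft membership, thresholds) are equally possible.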
The non-negativity constraint is relevant for the analysis of human behavior because it allows only additive combinations, not subtractive ones as can occur in principal component analysis. In the introduction of this manuscript we recalled the work of Eagle and Pentland [2009] on human behavior; the proposed method follows the same line (reality mining). The representation obtained by non-negative matrix factorization forms a dictionary of interactive patterns (basis vectors). We regard it as an extension of the Eigenbehaviors of Eagle and Pentland [2009].
Non-negative matrix factorization is an optimization process whose key elements are the algorithms and criteria used (e.g. multiplicative updates, Frobenius norm, divergence...), the choice of the rank k, and the initialization. Sparsity constraints are not explicit in the commonly used algorithms but can easily be introduced as constrained optimization. Several strategies are currently proposed in the literature; the details of our work are presented in [Mahdhaoui, 2010; Saint-Georges et al., 2011b].
Representation of interactive patterns
One point we wish to come back to in this manuscript is data pre-processing: the goal is an effective representation of the different interactive patterns (n-grams). We used an approach that is very widely used in document analysis: tf-idf (term frequency - inverse document frequency). Note that this representation has become common in dictionary-based models (e.g. bag-of-words) for the recognition of objects, actions, sounds...
The tf-idf representation applies a statistical weighting function that evaluates the importance of a specific interactive pattern for a given interaction scene. The main idea is that an interactive pattern occurring frequently across scenes may not be discriminative and should therefore receive a lower weight than a rare pattern. The weight increases with the number of occurrences of the pattern in the scene and decreases with its frequency in the corpus as a whole.
For an interactive pattern t_i in scene d_j, we estimate tf_{ij}:
\begin{equation}
tf_{ij} = \frac{n_{ij}}{\sum_{l} n_{lj}} \tag{2.2}
\end{equation}
where n_{ij} is the number of occurrences of the interactive pattern (n-gram) t_i in scene d_j; the denominator normalizes by the occurrences of all interactive patterns in the scene.
The inverse document frequency (idf) measures the overall importance of an interactive pattern (a measure of informativeness). It is defined as the logarithm of the inverse of the proportion of scenes that contain the pattern:
\begin{equation}
idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{2.3}
\end{equation}
where |D| is the total number of scenes in the corpus and |{d : t_i ∈ d}| is the number of scenes in which the interactive pattern t_i appears. The final representation is the product of the two measures: (tf-idf)_{ij} = tf_{ij} × idf_i.
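Equations (2.2) and (2.3) translate directly into a few lines of code. The sketch below is a minimal illustration in which each scene is simply a list of interactive-pattern labels; the function name `tf_idf` and the example scenes are assumptions.

```python
import math
from collections import Counter

def tf_idf(scenes):
    """Return one {pattern: weight} dict per scene, with weight = tf_ij * idf_i."""
    n_scenes = len(scenes)
    doc_freq = Counter()
    for scene in scenes:
        doc_freq.update(set(scene))  # number of scenes containing each pattern
    weights = []
    for scene in scenes:
        counts = Counter(scene)
        total = sum(counts.values())
        weights.append({
            t: (n / total) * math.log(n_scenes / doc_freq[t])
            for t, n in counts.items()
        })
    return weights

# Hypothetical scenes, for illustration only.
scenes = [
    ["Touch#Vsim", "Touch#Vsim", "Voc#Smile"],
    ["Touch#Vsim", "Gaze#Voc"],
]
weights = tf_idf(scenes)
```

In this toy corpus 'Touch#Vsim' appears in every scene, so its idf is log(1) = 0 and its weight vanishes, whereas 'Voc#Smile' appears in a single scene and keeps a positive weight. This is the down-weighting of non-discriminative patterns described above.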
Groups of interactive patterns
After non-negative matrix factorization, the interactive patterns are organized into k clusters for each semester and each group studied. The structure of this organization is central to our work because it provides indications on how parent-infant interactions evolve across semesters and groups. The criteria proposed in the literature for optimizing the number of clusters pursue various objectives: sparsity, cluster separation, semantics... We opted for a simple method based on maximizing cluster homogeneity and separability [Mahdhaoui, 2010; Saint-Georges et al., 2011b].
The originality of our approach to characterizing synchrony lies in comparing clusters of interactive patterns. Normalized Mutual Information (NMI) is a suitable metric for this comparison [Strehl and Ghosh, 2002]. The NMI between two different clustering results ŷ¹ and ŷ² measures the agreement between the two groupings:
\begin{equation}
NMI(\hat{y}^1, \hat{y}^2) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j}^{1,2} \log \left( \frac{n \times n_{i,j}^{1,2}}{n_i^1 \times n_j^2} \right)}{\sqrt{\left( \sum_{i=1}^{k} n_i^1 \log \frac{n_i^1}{n} \right) \left( \sum_{j=1}^{k} n_j^2 \log \frac{n_j^2}{n} \right)}} \tag{2.4}
\end{equation}
where n_i^1 is the number of interactive patterns assigned to cluster c_i by clustering ŷ¹, n_j^2 is the number assigned to cluster c_j by clustering ŷ², n_{i,j}^{1,2} is the number assigned both to c_i by ŷ¹ and to c_j by ŷ², and n is the total number of interactive patterns.
When NMI(ŷ¹, ŷ²) = 1, the clustering results, and hence the organization of the interactive patterns, are considered identical.
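Equation (2.4) can be checked on small examples. The sketch below computes it from two label lists of equal length; the function name `nmi` and the convention of returning 0 when one clustering has a single cluster (where the denominator vanishes) are assumptions of this example.

```python
import math
from collections import Counter

def nmi(labels1, labels2):
    """Normalized mutual information between two clusterings of the same items (eq. 2.4)."""
    n = len(labels1)
    c1 = Counter(labels1)                  # n_i^1
    c2 = Counter(labels2)                  # n_j^2
    joint = Counter(zip(labels1, labels2)) # n_{i,j}^{1,2}
    mutual = sum(nij * math.log(n * nij / (c1[i] * c2[j]))
                 for (i, j), nij in joint.items())
    h1 = sum(ni * math.log(ni / n) for ni in c1.values())
    h2 = sum(nj * math.log(nj / n) for nj in c2.values())
    if h1 == 0 or h2 == 0:
        return 0.0  # assumed convention for a degenerate (single-cluster) result
    return mutual / math.sqrt(h1 * h2)

same_up_to_names = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair of clusterings groups the items identically (only the cluster names differ) and the second pair shares no information, which gives the two ends of the scale, 1 and 0.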
2.3.3 Interpretation of results
Contribution of interactive patterns
Figure 2.3 gathers the results of the analysis of synchronous interactions for autistic infants (those for typically developing children and children with intellectual disability are described in [Mahdhaoui, 2010; Saint-Georges et al., 2011b]). We report the interactive patterns (n-grams) for both directions ({CG→BB} and {BB→CG}) across the children's first three semesters of life. A generalized linear mixed model is used to test the statistical significance of changes in behaviors from one semester to the next.
Fig. 2.3: Developmental representation of the main interaction modes of infants later diagnosed with autism [Saint-Georges, 2011]. Top: adult → infant direction; bottom: infant → adult direction. S1, S2, S3: semesters 1, 2, 3. In parentheses: percentage of the behavior among all interactions of the group for that semester. Arrows indicate stability (→) or a significant change relative to the previous semester: increase (↗) or decrease (↘) (*p < 0.05; **p < 0.01; ***p < 0.001). Red marks a significant difference between the autistic group and the typical group: a behavior in red reflects a difference in the cross-sectional comparison between the two groups at a given semester; an arrow in red reflects a difference in developmental progression (the direction of the arrow differs between the two groups). The comparison with the typical group is described in [Saint-Georges et al., 2011b].
The purpose of this manuscript is not to describe the clinical results of this study. It is nevertheless important to note that characterizing synchrony brings a new point of view and, above all, opens new perspectives in the search for early signs of autism. Traditional studies do not explicitly take into account the role of the interactant (the parent). Our modeling showed that when the interaction is initiated by autistic infants, parents respond normally and adapt to the infant through hyperstimulation from the first semester onward. Moreover, parents of autistic infants maintain stimulation modalities (e.g. touching) that, in typical development, would normally be abandoned in the second and third semesters in favor of more elaborate communication (e.g. word production, gaze).
Contribution of integration by non-negative matrices
Non-negative matrix factorization structures the interactive patterns described above. We compared the structures produced by the factorization across semesters. Our reference remains the group of typically developing infants (TD), which we compare with autistic infants (AD) and infants with intellectual disability (ID). Figure 2.4 presents the results. The methodology clearly shows the non-linear and deviant character of the development of autistic infants relative to typical infants. The development of infants with intellectual disability appears to stabilize while remaining different from that of autistic infants. This result agrees with current knowledge on autism: the deviant character of development is the main difference between children with intellectual disability and autistic children.
To demonstrate the usefulness of our approach, we studied developmental delay (ID) by measuring the normalized mutual information between semester 2 of the TD group and semester 3 of the ID group. The result (NMI(S2_TD, S3_ID) = 0.52) indicates a stronger resemblance between these two semesters than between the same-age semesters (NMI(S3_TD, S3_ID) = 0.47). The difference is explained by a delay in acquiring skills related to social interaction. Further discussion and analyses, with direct implications for autism screening, are reported in Catherine Saint-Georges's thesis [Saint-Georges, 2011].
2.3.4 Limits of methods based on behavior annotation
Reaching reasonable inter-rater reliability requires expert annotators and is time-consuming. It should nevertheless be noted that in psychopathology, for lack of automatic systems and by tradition, a large number of databases are already annotated (home movies, for example).
From a social signal processing standpoint, the very idea of annotating behaviors is open to question. Annotation amounts to restricting the social signals analyzed and casting them as a dictionary. A behavior that appears in an interactive scene without having been defined beforehand cannot easily be integrated into the model. Topic models combining supervised and unsupervised approaches are one possible solution to this problem (e.g. Latent Dirichlet Allocation) [Farrahi and Gatica-Perez, Aug. 2010].
Fig. 2.4: Developmental representation of the main interaction modes of infants later diagnosed with autism [Saint-Georges, 2011]
A major limitation of the approach is that it does not take the nature of the social signals into account. Gestures, vocalizations or smiles are produced by different people in varied contexts. It is nevertheless possible to couple annotation with the characterization of social signals. In Ammar Mahdhaoui's thesis, we proposed to take into account the nature of the mother's vocalizations: motherese detection (cf. section 1.6). Combining speech-signal detection with annotation refines the model. For example, we were able to show that the vocal hyperstimulation (regulation-up) of parents of autistic infants is close in nature to motherese [Saint-Georges, 2011].
2.4 Multi-modal coordination: from signal to interpretation
Within Emilie Delaherche's thesis (MULTI-STIM project), we investigated a different route to the automatic characterization of interactional synchrony. We started from the observation that humans are effective and reliable judges of social coordination [Lakens, 2010; Cappella, 1997], even under degraded conditions. One objective of Emilie Delaherche's thesis is to propose innovative methods for detecting, characterizing and predicting the level of synchrony in interactions (human-human and human-robot) by exploiting the nature of the exchanged signals.
The cues humans use to assess the harmony and coordination of interactants are many. One of the major challenges of social signal processing lies in identifying and automatically detecting non-verbal cues from which the states and behaviors of the interactants can be inferred. Most work exploits low-level information extracted from the audiovisual stream to characterize higher-level information. For example, Hung and Gatica-Perez [2010] showed that the cohesion of meeting participants can be estimated from a set of non-verbal cues: pauses between individual speaking turns, movement during speaking turns, synchrony... Role recognition and dominance detection are widely studied by the community. The proposed approaches exploit information related to the interactant's vocal activity, the proximity of interventions, or the amount of movement during interventions [Vinciarelli, 2009; Worgan and Moore, 2011; Hung et al., 2011; Varni et al., 2009].
One innovation of our approach is the definition of a new space that represents synchrony explicitly. Unlike approaches proposed in the literature [Ramseyer and Tschacher, 2011], we set out to exploit multi-modal signals, paying particular attention to the strength of the links between these signals (cross-modality). Note that this explicit representation of synchrony is consistent with recent work in neuroscience advocating an interactive view of the characterization of brain activity [Dumas et al., 2010].
2.4.1 Synchrony and multi-modal integration
The main design steps of a system for the automatic characterization of synchrony were recalled in section 2.2.5. The system developed within Emilie Delaherche's thesis follows these steps and is described in Figure 2.5.
The system was applied to characterizing the interaction between a therapist and children who are either typically developing or have a pervasive developmental disorder. The interactive situation is a collaborative task consisting in assembling a clown from polystyrene shapes. The task requires successful exchanges of verbal and non-verbal signals during the manipulation of objects. It comes in different forms (cf. Figure 2.6): collaboration through the exchange of verbal and non-verbal signals (speech, gaze, facial expressions) and imitation (in addition to the communication signals, the possibility of seeing what the partner is doing). 21 children took part in this study (developmental age: 4-6 years): 14 typically developing children and 7 children with developmental disorders.
Fig. 2.5: Steps of the system for the automatic characterization of synchrony [Delaherche and Chetouani, 2010]
Fig. 2.6: Experimental configurations. (a) Exchange of verbal and non-verbal information; (b) Imitation
Feature extraction
Most work focuses on visual information [Ramseyer and Tschacher, 2011; Sun et al., 2011]. One original aspect of our approach is to handle cross-modality explicitly. For each speaker, we considered prosody (f0 and energy), pauses, and vocalic energy (rhythmic prominence, section 1.4.1). The visual cues are the quantity of movement and the trajectory of the interactants' hands (with different regions of interest and/or time scales). Technical details are described in [Delaherche and Chetouani, 2010].
Measures of synchrony
The most widely used measure is correlation (the correlation coefficient) [Ramseyer and Tschacher, 2011]. We introduced coherence, which measures the degree of association between two signals in the frequency domain, in order to characterize syntony. From an interaction standpoint, syntony reflects harmony in the exchanges without necessarily implying resemblance (synchrony, imitation). Coherence is a statistical tool that compares the frequency content of the exchanged signals. It thus becomes possible to measure an association between the signals produced by the interactants in specific frequency bands (same movement frequency). Correlation and coherence have the advantage of providing a continuous representation. However, a major problem with these statistical tools is that they can only be estimated empirically, so the estimate depends on the number of samples used.
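The difference between the two measures can be illustrated on synthetic signals. In the sketch below, two partners move at the same 1 Hz rhythm but a quarter period apart: the zero-lag correlation is close to zero while the coherence at 1 Hz is high. Everything here is an assumption made for the example (the 25 Hz sampling rate, the noise level, the segment length, the Welch-style estimator with 50% overlapping Hann windows); it is not the estimator configuration used in the thesis.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

def coherence(x, y, fs, nperseg=128):
    """Magnitude-squared coherence, averaged over 50%-overlapping Hann-windowed segments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    window = np.hanning(nperseg)
    step = nperseg // 2
    pxx = np.zeros(nperseg // 2 + 1)
    pyy = np.zeros(nperseg // 2 + 1)
    pxy = np.zeros(nperseg // 2 + 1, dtype=complex)
    for start in range(0, len(x) - nperseg + 1, step):
        xs = x[start:start + nperseg]
        ys = y[start:start + nperseg]
        fx = np.fft.rfft(window * (xs - xs.mean()))
        fy = np.fft.rfft(window * (ys - ys.mean()))
        pxx += np.abs(fx) ** 2        # auto-spectrum of x
        pyy += np.abs(fy) ** 2        # auto-spectrum of y
        pxy += fx * np.conj(fy)       # cross-spectrum
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, np.abs(pxy) ** 2 / (pxx * pyy + 1e-12)

fs = 25.0                                   # assumed sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 1.0 * t) + 0.5 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 1.0 * t - np.pi / 2) + 0.5 * rng.standard_normal(t.size)

r = pearson(x, y)
freqs, cxy = coherence(x, y, fs)
peak = cxy[np.argmin(np.abs(freqs - 1.0))]  # coherence in the bin closest to 1 Hz
```

The number of segments averaged (here about twenty) controls the reliability of the coherence estimate, which is the dependence on sample size noted above: with a single segment, the estimator returns 1 at every frequency regardless of the signals.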
Other measures, inspired by work on synchrony in EEG signals [Dauwels et al., 2010], were also tried: mutual information and non-linear interdependence. They performed less well and their results are harder to interpret (Emilie Delaherche's master's internship). The robustness and reliability of the estimate are directly tied to the number of samples, which is relatively small because of the finite-horizon analysis (windows of a few seconds).
Assessing the significance of the measures
Bernieri [1988] proposed a relevant paradigm for the subjective evaluation of interactional synchrony. In a parent-infant interaction context, several situations are presented to judges on separate screens: (1) interaction between a mother and her own child, (2) interaction between a mother and an unfamiliar child, and (3) an artificial situation created by projecting films of mothers and children taken from different interactions. The idea is to provide a reference level of synchrony by exploiting pseudo-interactions.
From a computational standpoint, the pseudo-interaction paradigm translates into generating N new interactions by permuting the sequences of the original interaction (surrogate data). A bootstrap method then compares the synchrony scores obtained in the two situations and allows a conclusion on the level of synchrony (z-test) [Ramseyer and Tschacher, 2011; Delaherche and Chetouani, 2010]. Note that, as with most bootstrapping methods, the number of generated samples directly affects the results (the smaller it is, the poorer the statistical model).
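A minimal sketch of this surrogate-data test is given below, under several assumptions: the synchrony score is the mean absolute windowed correlation, surrogates are built by shuffling the windows of one partner's signal, and the z-score compares the observed score with the surrogate distribution. The function names, window length and number of surrogates are illustrative and do not reproduce the cited implementations.

```python
import numpy as np

def windowed_sync(x, y, win):
    """Mean absolute Pearson correlation over non-overlapping windows of length `win`."""
    scores = []
    for s in range(0, len(x) - win + 1, win):
        xs, ys = x[s:s + win], y[s:s + win]
        if xs.std() > 0 and ys.std() > 0:
            scores.append(abs(np.corrcoef(xs, ys)[0, 1]))
    return float(np.mean(scores))

def surrogate_z(x, y, win, n_surrogates=200, seed=0):
    """z-score of the observed synchrony against window-shuffled pseudo-interactions."""
    rng = np.random.default_rng(seed)
    n_win = len(y) // win
    x = x[:n_win * win]
    y_windows = y[:n_win * win].reshape(n_win, win)
    observed = windowed_sync(x, y_windows.ravel(), win)
    null = []
    for _ in range(n_surrogates):
        # Pseudo-interaction: x is paired with y windows taken at other moments.
        shuffled = y_windows[rng.permutation(n_win)].ravel()
        null.append(windowed_sync(x, shuffled, win))
    null = np.asarray(null)
    return (observed - null.mean()) / null.std()

# Synthetic example: y follows x with additive noise (hypothetical data).
rng = np.random.default_rng(42)
x = rng.standard_normal(1500)
y = x + 0.5 * rng.standard_normal(1500)
z = surrogate_z(x, y, win=75)
```

A random permutation can leave some windows in place, which slightly inflates the surrogate scores. With few surrogates, the mean and standard deviation of the null distribution are themselves poorly estimated, which is the sensitivity to the number of generated samples mentioned above.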
The permutation can be applied to the raw data, using windows a few seconds long, or to the set of extracted features (feature space). Our experiments show that the results diverge depending on whether the task is repetitive or not. For a repetitive task, different windows of a few seconds can resemble one another, which reduces the number of windows judged synchronous. Taking a binary decision on the basis of a statistical test does not say enough about the nature of the synchrony. A dimensional (continuous) characterization would give a better grasp of the role of synchrony in social interaction.
Parameters and representation of synchrony
Drawing on the work of Feldman [2003, 2007] on parent-infant interaction, we proposed the following parameters:
- The orientation of synchrony, which corresponds to identifying the leader of the interaction ("cause-and-effect relation").
- The delay between the two partners, defined as the mean time for a change in one partner's behavior to trigger a reaction in the other.
- The degree of synchrony, which corresponds to the number of windows judged synchronous (obtained from the pseudo-interaction paradigm).
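One common way to estimate the first two parameters is a lagged cross-correlation: the lag that maximizes the correlation gives the delay, and its sign indicates which partner leads. The sketch below illustrates this on synthetic data; it is an assumption about how these parameters could be computed, not the procedure of the thesis, and the sampling rate and delay are hypothetical.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation between x[t] and y[t + lag] for lag in [-max_lag, max_lag]."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = []
    for lag in lags:
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        corr.append(np.corrcoef(a, b)[0, 1])
    return lags, np.asarray(corr)

fs = 25                                   # assumed sampling rate (Hz)
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)             # partner x's signal (synthetic)
y = np.roll(x, 10) + 0.3 * rng.standard_normal(1000)  # y repeats x 10 samples later

lags, corr = lagged_correlation(x, y, max_lag=50)
best = int(lags[np.argmax(corr)])
leader = "x" if best > 0 else "y" if best < 0 else "none"
delay_seconds = abs(best) / fs
```

A positive best lag means that y reproduces what x did earlier, so x is read as the leader; here the recovered delay is 10 samples, i.e. 0.4 s at the assumed 25 Hz.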
Within Emilie Delaherche's thesis, we studied multi-modal integration by analyzing the level of synchrony between the different exchanged signals. To do so, we estimate a correlation matrix Rxy, shown in Figure 2.7(a). More generally, and with other measures of association (coherence, mutual information...), the synchrony matrix performs a multi-modal integration that can be estimated dynamically and/or incrementally. The hierarchy among the association strengths is obtained by transforming the synchrony matrix into a dissimilarity matrix (Dxy = 1 − Rxy). A dendrogram then visualizes this hierarchy (Figure 2.7(b)). Spectral clustering methods can also be used to study the structure of such matrices. This work, unpublished, is not presented in this document.
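The passage from a synchrony matrix to a hierarchy can be sketched with a naive agglomerative procedure. The matrix values and feature names below are invented for illustration, and average linkage is an assumption (the text does not state which linkage was used); in practice a library routine such as SciPy's hierarchical clustering would normally replace this loop.

```python
import numpy as np

def agglomerate(D, names):
    """Naive average-linkage clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([D[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]],
                       float(d)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical synchrony (correlation) matrix between four features.
names = ["f0_child", "energy_child", "motion_child", "motion_therapist"]
R = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
D = 1.0 - R                      # dissimilarity, as in Dxy = 1 - Rxy
merges = agglomerate(D, names)
```

The successive merges and their distances are exactly what a dendrogram draws: the two audio features merge first (distance 0.1), then the two motion features (0.2), and the two groups join last (0.8).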
Fig. 2.7: Parameters of interactional synchrony. (a) Synchrony matrix estimated from one interaction; (b) Corresponding dendrogram
2.4.2 From non-verbal cues to the degree of coordination
In a formalization of the automatic analysis of human social behavior, Vinciarelli et al. [2009] identified generic steps: (1) recording the scene, (2) detecting people, (3) extracting behavioral cues from the audio-visual stream and interpreting these cues in terms of social signals, (4) characterizing contextual information and interpreting social behaviors. Many open problems are identified for each of these steps. Our work concentrates on steps 3 and 4. The following sections present the automatic analysis of the level of coordination in an imitation task between a therapist and a child (cf. Figure 2.6(b)). The approach is outlined in Figure 2.8 and requires characterizing and classifying social signals.
Fig. 2.8: Principle of social signal recognition (blocks: multimodal stream, extraction of behavioral cues, interpretation of social signals, social behaviors, interpretation of the context)
Social signals correlated with the degree of coordination
A questionnaire was designed to assess the perceived coordination and regularity of the interaction, based on the work of Bernieri [1988]. Its details are described in [Delaherche and Chetouani, 2011a]. It rates, on a Likert scale (1 to 6), simultaneous movements, rhythm, regularity and resemblance of actions. The perceived degree of coordination was rated by 17 judges. Inter-judge agreement is higher for the extreme degrees: "low coordination" and "high coordination" (cf. Figure 2.9). This kind of C-shaped curve is characteristic of human evaluation of subjective behaviors. Hung and Gatica-Perez [2010] obtain a similar curve in a study of social cohesion in meetings. Relying on annotators' judgments shows that coordination and interactional synchrony do not follow an all-or-nothing law and must be characterized in a continuous (dimensional) space. Moreover, low inter-judge scores are characteristic of subjective evaluation in dimensional approaches [Gunes et al., 2011].
Fig. 2.9: Degree of coordination perceived by the judges as a function of inter-judge agreement (mean weighted kappa), measured over all dyads and questionnaire items [Delaherche and Chetouani, 2011a]. Axes: mean weighted kappa (x, 0.1 to 0.8) and mean evaluation score (y, 1 to 6); one series per item (item1 to item4).
One of the objectives of our work is to identify and detect the cues that can be used to measure the degree of coordination automatically. We proposed several categories of cues extracted semi-automatically8 (described in detail in [Delaherche and Chetouani, 2011a]):
- Speech turn-taking: duration and ratio of pauses over the whole interaction, duration and ratio of a speech turn, overlap rate of the interactants' speech turns.
- Dialogue acts: categorization of the therapist's prompts (type of question, repetition), categorization of the child's answers (adequate, unexpected, inadequate)9.
The cues extracted automatically concern:
- Gestural turn-taking: using hand tracking of the interactants, the idea is to extract a set of features describing the active and passive phases [Delaherche and Chetouani, 2011a].
- Synchrony of movements: we reused the characterization proposed in section 2.4.1, restricted to visual information (quantity of motion).
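As an illustration of the turn-taking category listed above, the sketch below derives pause and overlap statistics from manually annotated speech turns. The input format (speaker, start, end), the function name and the dictionary keys are hypothetical; the exact feature definitions are those of [Delaherche and Chetouani, 2011a].

```python
def turn_taking_features(turns, total_duration):
    """turns: list of (speaker, start, end) tuples in seconds, from manual annotation."""
    # Cut the timeline at every turn boundary, then label each slice.
    bounds = sorted({0.0, float(total_duration)} | {t for _, s, e in turns for t in (s, e)})
    pauses, current_pause, overlap = [], 0.0, 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        active = {spk for spk, s, e in turns if s <= lo and hi <= e}
        if not active:
            current_pause += hi - lo      # extend the running silent stretch
            continue
        if current_pause > 0.0:
            pauses.append(current_pause)  # a pause has just ended
            current_pause = 0.0
        if len(active) > 1:
            overlap += hi - lo            # both interactants speak at once
    if current_pause > 0.0:
        pauses.append(current_pause)
    # Total speaking time per speaker.
    speech = {}
    for spk, s, e in turns:
        speech[spk] = speech.get(spk, 0.0) + (e - s)
    return {
        "pause_ratio": sum(pauses) / total_duration,
        "pause_mean": sum(pauses) / len(pauses) if pauses else 0.0,
        "overlap_ratio": overlap / total_duration,
        "speech_ratio": {spk: d / total_duration for spk, d in speech.items()},
    }
```

The same routine can be applied to gestural activity intervals obtained from hand tracking, which gives the gestural counterpart of these statistics.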
8 Manual annotation followed by extraction of statistics.
9 Work carried out as part of the dissertation of two speech-therapy students, supervised by Monique Plaza.
Results and interpretations
The study consisted in correlating the extracted cues with the questionnaire scores. Detailed results are presented in [Delaherche and Chetouani, 2011a]; only the main results are reported in this document.
The cues characterizing pause duration are positively correlated with the degree of coordination. Interactions judged as weakly coordinated tend to contain more interventions by the therapist, with more questions (categorical and open-ended), reflecting the need to elicit responses. For the children, the shorter the interventions, the higher the coordination: exchanges are then efficient. Typically developing children produce short utterances (backchannels), whereas children with pervasive developmental disorders exhibit echolalia and digressions.
The duration of the therapist's gestural pauses is negatively correlated with the degree of coordination: the therapist has to wait for the child. A uniform variation of pause duration (rhythm) is judged as a sign of good coordination. Large variations reflect local difficulties in carrying out instructions. The child's attention to the instructions is measured by the overlap rate of gestural activity.
Contrary to expectations, several synchrony measures turned out to be negatively correlated with the degree of coordination. To examine this result more closely, we analysed the synchrony delay by increasing the lag between the processed windows [Delaherche and Chetouani, 2011a]. The results showed a change in the trend of the correlation. A first explanation lies in the complexity of the task, whose completion time varies with the child's abilities. This variability has very little influence on the raters' perception of the degree of coordination. The questionnaire relies on a global evaluation, whereas the method proposed in section 2.4.1 relies on a fine-grained analysis of the interaction. One of the difficulties in characterizing synchrony lies in the definition of the analysis window. A second explanation lies in the strategies children follow to manipulate the objects. The extracted parameters are global (quantity of motion, hand tracking) and are not precise enough to characterize high-level information, such as different movements used to accomplish the same task. Knowledge of the manipulated object is a source of information that is clearly missing from our automatic analysis of the degree of synchrony in the imitation task.
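The effect of the lag on a window-based synchrony measure can be illustrated with a small sketch. It computes, for two quantity-of-motion series, the share of analysis windows in which the child's motion, delayed by a given number of samples, correlates with the therapist's motion. The thresholded Pearson correlation and all names are assumptions for illustration; the measure actually used is the one defined in section 2.4.1.

```python
from math import sqrt

def _corr(x, y):
    """Pearson correlation of two equal-length windows (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0.0 or syy == 0.0 else sxy / sqrt(sxx * syy)

def synchrony_ratio(therapist, child, win, lag, threshold=0.5):
    """Share of non-overlapping windows where the lagged child series follows the therapist."""
    hits = total = 0
    for start in range(0, len(therapist) - win - lag + 1, win):
        a = therapist[start:start + win]
        b = child[start + lag:start + lag + win]  # child observed `lag` samples later
        total += 1
        if _corr(a, b) > threshold:
            hits += 1
    return hits / total if total else 0.0

# Toy example: the child reproduces the therapist's motion one sample later.
therapist = [0, 1, 0, 2, 0, 3, 0, 1, 0, 2, 0, 3]
child = [0] + therapist[:-1]
print(synchrony_ratio(therapist, child, win=4, lag=0))  # no window is synchronous at lag 0
print(synchrony_ratio(therapist, child, win=4, lag=1))  # every window is synchronous at lag 1
```

In this toy case the measure is uninformative at lag 0 and becomes maximal once the lag matches the child's delay, which mirrors the change of trend observed when the delay between windows is increased.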
Prediction of the degree of coordination
The automatic evaluation of interactional synchrony is attracting growing interest in the scientific community, with applications ranging from meeting analysis [Hung and Gatica-Perez, 2010] to the design of embodied conversational agents [Swartout et al., 2006; Prepin and Pelachaud, 2011]. Sun et al. [2011] analyse the degree of mimicry in dyads using the correlation between visual cues (quantity of motion, saliency...). Very few studies attempt to predict automatically the degree of coordination, imitation or interactional synchrony. These ambitions are partially found in automatic meeting analysis [Hung and Gatica-Perez, 2010; Vinciarelli, 2009], but not explicitly. Prepin and Gaussier [2010] propose a neural architecture that detects synchrony between the behaviours of the robot and of the human. Synchrony is then used as a reinforcement signal to modify the robot's behaviours.
Relying on human judgement shows that synchrony is a continuous phenomenon. Prediction in a continuous space is one of the major challenges of social signal processing. As is the case for emotions [Gunes and Pantic, 2010; Gunes et al., 2011], a better understanding of the phenomena is a necessary step towards the dimensional modelling of social signals. Our ambition is to propose a continuous and coherent characterization of interactional synchrony.
From discrete to continuous
Categorizing social signals into discrete classes has the merit of "facilitating" classification by allowing traditional methods to be used. Discrete categories are defined by retaining only the examples considered certain (high inter-judge score). For example, Hung and Gatica-Perez [2010] propose a threshold on the inter-judge score (mean weighted kappa > 0.3) to keep only interactions marked by high or low cohesion. A similar approach was proposed in [Delaherche and Chetouani, 2011b].
In this manuscript, we return to a model that allows continuous prediction of the degree of coordination. The required steps are as follows:
1. Training of an SVM classifier on the extreme data, by selecting the interactions marked by high or low coordination (two-class problem).
2. Probabilistic modelling: estimation of a sigmoid at the output of the SVM classifier.
3. Estimation of the degree of coordination of the non-selected interactions (intermediate coordination class): probability estimated by the SVM.
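The three steps can be sketched in a few lines of standard-library Python. This is only an illustration under strong assumptions: a linear SVM trained by full-batch subgradient descent stands in for the SVM actually used, the sigmoid is fitted by plain gradient descent on the log-loss, the leave-one-out procedure is omitted, and all function names and hyper-parameters are hypothetical.

```python
from math import exp

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Step 1: subgradient descent on the regularised hinge loss (labels in {-1, +1})."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample inside the margin: the hinge loss is active
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_score(w, b, x):
    """Signed distance-like score of one sample."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_sigmoid(scores, y, epochs=500, lr=0.1):
    """Step 2: fit P(high | s) = 1 / (1 + exp(-(a*s + c))) on the SVM scores."""
    a, c, n = 1.0, 0.0, len(scores)
    for _ in range(epochs):
        grad_a = grad_c = 0.0
        for s, yi in zip(scores, y):
            p = 1.0 / (1.0 + exp(-(a * s + c)))
            target = 1.0 if yi > 0 else 0.0
            grad_a += (p - target) * s / n
            grad_c += (p - target) / n
        a -= lr * grad_a
        c -= lr * grad_c
    return a, c

def coordination_probability(X_extreme, y_extreme, X_other):
    """Step 3: probability of 'high coordination' for the intermediate interactions."""
    w, b = train_linear_svm(X_extreme, y_extreme)
    a, c = fit_sigmoid([svm_score(w, b, x) for x in X_extreme], y_extreme)
    return [1.0 / (1.0 + exp(-(a * svm_score(w, b, x) + c))) for x in X_other]
```

For a single standardised feature, `coordination_probability([[-1.0], [-0.8], [-0.9], [1.0], [0.8], [0.9]], [-1, -1, -1, 1, 1, 1], [[-0.2], [0.0], [0.3]])` returns three probabilities that increase with the feature value, so that interactions left out of training are placed on a continuous scale between the two extreme classes.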
Leave-one-out cross-validation is used to evaluate performance. The dimensional approach is not compatible with the estimation of a classification score: the notion of discrete category no longer exists. The mean squared error (MSE) is the most widely used metric. We chose correlation instead, which makes it possible to compare data that are not necessarily homogeneous (probability and perceived degree). The properties of evaluation metrics for prediction systems in dimensional spaces are still poorly understood, and the definition of a generic metric remains an open problem [Gunes et al., 2011].
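The reason for preferring correlation can be seen on a toy example: a probability in [0, 1] and a perceived score on the 1-to-6 scale live on different ranges, so the MSE is large even when the two are perfectly related, whereas the correlation is insensitive to this change of scale. The values below are invented for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Invented values: classifier probabilities and perceived degrees (Likert 1 to 6).
probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]
perceived = [1.5, 2.5, 3.5, 4.5, 5.5]

print(pearson_r(probabilities, perceived))  # close to 1: perfectly linear relation
print(mse(probabilities, perceived))        # about 10.28: dominated by the scale mismatch
```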
The prediction results on the data collected in the imitation task (figure 2.6(b)), evaluated by correlation, are presented in table 2.1. The features used are those described in section 2.4.2 and concern the gestures, speech turns and synchrony of the child (Chi) and the therapist (The) [Delaherche and Chetouani, 2011a,b]. The number of gestures produced by the child or by the therapist is an interesting parameter, as it allows a continuous characterization of the degree of coordination (r = 0.9 for the child, r = 0.76 for the therapist; cf. table 2.1). The durations of the child's verbal productions correctly characterize the different degrees of coordination (r = 0.68).
Tab. 2.1: Correlation between the classifier output (probability) and the evaluation score of the degree of coordination.
Feature                       r
Gest_Chi_nb                   0.9***
Gest_The_nb                   0.76***
Spe_Chi_dur_range             0.68***
Spe_Chi_dur_max               0.66**
Spe_Chi_dur_std               0.65**
Cor_Hands_win5_lag0_ratio     0.59**
Spe_Chi_dur_mean              0.59**
Spe_Pau_dur_ratio             0.58**
Spe_Pau_dur_mean              0.58**
Cor_Glob_win5_lag1_ratio      0.56**
Spe_Pau_dur_med               0.56**
Spe_Chi_dur_med               0.56**
Coh_Post_win1_lag2_ratio      0.54*
Spe_Chi_dur_ratio             0.51*
Gest_The_pau_dur_range        0.44*
Gest_The_pau_dur_max          0.44*
* p<0.05, ** p<0.01, *** p<0.001
The experiments reported in this manuscript and detailed in [Delaherche and Chetouani, 2011b] show that the different degrees of coordination can be predicted satisfactorily. This dimensional characterization is not, however, possible for all parameters: for Gest_The_pau_dur_max, r = 0.44 (cf. table 2.1).
2.4.3 Limits of methods based solely on low-level information
Exploiting signals requires a set of processing steps ranging from speech and speaker segmentation to the estimation of quantity of motion. In most situations, recordings are made under well-defined conditions. Even for spontaneous data (therapist-child interaction), the experimental paradigm limits the movements of the interactants and therefore allows a single camera view. Processing uncontrolled data, such as home movies, with signal processing methods is a complex task. The introduction of new-generation sensors, such as the Kinect, is a solution worth considering, and we are currently working on it. Robotics, characterized by sensors that can be set in motion, offers interesting prospects by making it possible to focus perception.
The development of automatic methods for analysing human behaviour relies on the characterization of non-verbal cues. Detecting these cues is itself problematic because they are distributed over different modalities. Integrative models must necessarily take into account the diversity of the temporal dynamics of these cues.
Prediction of social signals in a dimensional space is a major challenge of social signal processing. The proposed method is based on a discriminative classifier (SVM) that is subsequently used for regression. One perspective of this work is to implement suitable learning methods that perform an explicit regression [Nicolaou et al., 2011].
2.5
This chapter has laid the foundations for the study of the dynamics of human communication. The proposed analysis methods made it possible to study interactional synchrony using low-level and high-level information. We discussed the advantages and limits of each of these approaches. An intermediate method, based on learning a dictionary of behaviours from the exchanged signals, appears relevant. Indeed, exploiting the signals directly allows a fine-grained characterization and, above all, takes the nature of the signals into account.
Learning models that reflect the dynamics of human communication remains problematic because of the subjectivity of behaviours. The same issue arises, for example, in the assessment of depression or pain. Assessment is generally carried out with subjective methods: collection of verbal information from the patient, visual scales, assessment by a third party... Identifying metrics of subjectivity is a major challenge for the automatic analysis of human behaviour. The methods used must explicitly take judgement scores into account [Chittaranjan et al., 2011].
The interdisciplinary framework of our research contributes greatly to the understanding and analysis of the role of interactional synchrony. This framework, which we have sought to promote, opens new perspectives for automatic processing methods. Like social learning [Meltzoff et al., 2009], an interdisciplinary formalization of social signals would enable significant progress in many fields, from social signal processing to cognitive science and psychology, with important clinical applications.
Chapter 3
Social intelligence for personal robotics
3.1 Context
Endowing robots with "intelligence" is one of the major challenges of cognitive robotics, whose main purpose is to answer a number of questions related to perception (e.g. object recognition), localization (e.g. where am I?), navigation (e.g. where should I go?), manipulation (e.g. how do I grasp this object?), control / planning (e.g. what should I do now?), learning (e.g. can I do better?), interaction (e.g. how do I communicate?), etc. Investigating these fundamental questions is necessary for the development of personal robotics. The Seventh Framework Programme of the European Community1 identifies cognitive robotics as a structuring axis for several research fields, from signal processing to artificial intelligence, including automatic control, learning... This research is often conducted in an interdisciplinary setting (e.g. neuroscience, cognitive science).
The diversity of the work carried out in cognitive robotics shows how difficult it is to endow systems with autonomy and high-level functionalities. Our work in cognitive robotics builds on the definition proposed by Simon Haykin: the Cognitive Dynamic System. It is a generalization of adaptive models in signal processing. Simon Haykin proposes the following definition: "A Cognitive Dynamic System is a system that processes information over the course of time by performing the following functions:
- Sense the environment;
- Learn from the environment and adapt to its statistical variations;
- Build a predictive model of prescribed aspects of the environment
- And thereby develop rules of behaviour for the execution of prescribed tasks, in the face of environmental uncertainties, efficiently and reliably in a cost-effective manner."
1 http://cordis.europa.eu/fp7/ict/cognition/home_en.html
The targeted applications are numerous: cognitive radio (e.g. dynamic frequency allocation), semantic information retrieval... and cognitive robotics.
The capabilities required for the development of cognitive systems are (1) signal capture and detection, (2) perception (relevance of a representation), (3) planning (anticipating and simulating the future), (4) decision (choosing an action) and (5) action (influencing the world). We thus recover the traditional perception-(decision)-action loops of robotics.
The robot is meant to share its space, the task to be performed and the decision with the human. The robot's capabilities must be defined with explicit consideration of the presence of the human in the interactive loop (e.g. perception, planning, navigation...). Our work lies within this context and aims to endow robots with social intelligence, while focusing on the dynamic aspect of interaction. Social intelligence requires the recognition and management of social signals (e.g. turn-taking, mutual attention...). This line of research extends our work on the characterization of social interaction (Chapters 1 and 2). Robotics, being controllable, makes it possible to develop new methods of investigation [Meltzoff et al., 2010]. Moreover, given our healthcare applications, our ambition is to develop interactive systems to assist people with impairments.
3.2 Dynamics of human-robot interaction
One of the major challenges of interactive robotics is to take the human explicitly into account in the interactive loop. Carrying out a task relies on the ability of the two interactants (the human and the robot) to collaborate (e.g. shared decision-making) and to communicate. The degree to which the task is shared distinguishes the approaches used in interactive robotics. Teleoperation is characterized by a low degree of sharing: decisions are made by the human, at the cost of an increased cognitive load. Autonomous robotics involves little human intervention but often requires advanced modelling of the task (perception, planning, control). Adjustable autonomy, which allows decision-making authority to be transferred dynamically between the robot and the human, offers an interesting framework for studying the dynamics of human-robot interaction. Physical rehabilitation and drone control are situations that call for adjustable autonomy.
Cooperation is defined as simultaneous action towards a common goal. Coordination between the partners is not a necessary condition for carrying out a task. In Chapter 2, we saw that the perceived degree of coordination is not tied to the success of the task. Klein et al. [2004] identify 10 challenges for the development of a human-agent team: (1) establishing a basic contract, (2) modelling the actions and intentions of the other interactants, (3) predictability of behaviours, (4) directability (modifying actions according to the information received), (5) expressing intentions and internal states through signals (e.g. behaviours, actions), (6) observing and interpreting signals, (7) goal negotiation, (8) explicit collaboration in autonomy and planning mechanisms, (9) attention management and (10) controlling the costs of coordination. The research directions proposed in the literature aim to address one or more of these challenges. The proposed approaches are grounded in joint intention theory, shared plans... These approaches have been applied successfully in interactive robotics on the basis of dialogue and/or planning models [Rich et al., 2001; Clodic et al., 2009].
Efficient cooperation, which cannot take place without communicating and characterizing the exchanged signals, is a major challenge of interactive robotics. Drawing on our expertise in social signal processing, we focused on the detection of the subjective events and actions required for successful cooperation. This chapter presents our contributions to interactive robotics and, more precisely, to the mechanisms for engaging in and maintaining interactions by exploiting the characterization of social signals. The following sections specify the nature of the social signals studied as well as the issues raised by the automatic characterization of engagement.
3.2.1 Definitions
Engagement is defined as the process by which partners establish, maintain and end interactions [Sidner et al., 2004]. In [Le Maitre and Chetouani, 2011], we reviewed the social cues of engagement.
Eye contact is generally considered the social signal most indicative of the degree of engagement in communication [Couture-Beil et al., 2010; Rich et al., 2010; Sidner et al., 2004; Castellano et al., 2009; Ishii et al., 2011]. A detailed analysis of the role of gaze in interaction is given in [Kendon, 1967]. The study shows the importance of context in interpreting this social cue (e.g. gaze associated with speech, gaze during the other's speech turn...). Eye contact is a predictor of turn changes [Nakano and Ishii, 2010; Ishii et al., 2011; Duncan, 1972; Goodwin, 1986], of attention [Argyle and Cook, 1976] and of approval [Goffman, 1963]. An interesting study by Shimada et al. [2011] shows the value of eye contact for the acceptability of robots.
Peters et al. [2009] proposed a model of engagement based on an action-cognition-perception loop, which distinguishes several aspects of engagement: perception (e.g. detection of social cues), cognition (e.g. internal state: motivation), action (e.g. expression of interest). A subjective dimension is introduced to reflect the experience felt by individuals. Engagement is not a simple process: its characterization, detection and understanding remain open problems that must be addressed to design interactive robots.
3.2.2 Automatic characterization of engagement
The mechanisms that establish an interaction are numerous and depend on context. As a complement to the study of interactional synchrony, the analysis and characterization of engagement aim at developing robotic systems with advanced cooperation and coordination capabilities, making long-term interactions conceivable (e.g. from a few hours to several months).
Problem statement
The difficulty of automatically characterizing gaze direction motivates the search for other social cues. Goffman [1963] proposed the concept of "face engagement", describing the use of eye contact, gaze and head dynamics in establishing and regulating interactions. Automatic detection of engagement then amounts to: (1) face detection and (2) classification of facial expressions or head gestures. Several engagement detection systems have been proposed along these lines in the literature [Mutlu et al., 2009; Rich et al., 2010; Ishii et al., 2011].
Other cues can be used for the automatic characterization of engagement. Castellano et al. [2009] proposed combining eye contact and smiling, in a game scenario, as an indicator of the degree of engagement. This characterization is enriched with contextual information such as the state of the game and the behaviour of the robot (facial expressions). Posture and quantity of motion [Sanghvi et al., 2011] and proxemics [Shi et al., 2011; Michalowski et al., 2006] are also relevant non-verbal cues for characterizing the degree of engagement. Mower et al. [2007] proposed an alternative route, exploiting physiological signals (skin temperature and conductance) to estimate engagement from implicit information.
Positioning
Characterizing engagement is necessary in a large number of interactive situations. This need is heightened when interacting with people with impairments. Engagement is also an indicator of the quality of the interaction that can be exploited to improve interfaces.
Our work has focused on the study of the non-verbal signals that establish and regulate interaction:
- Detection of the speaker's intention and emotion: prosody is considered the carrier of intention and emotion. Our characterization approach consisted in reusing our motherese (infant-directed speech) classification algorithms and applying them to the categorization of robot-directed speech.
- Dynamics of engagement: in an experiment aimed at developing sensitive agents, we proposed to introduce a dynamic model of communication that sustains the exchange of social signals. This statistical model is inspired by phatic communication.
- Continuous characterization of the degree of engagement: when interacting with elderly people, with or without mild cognitive impairment, the robot needs to evaluate continuously the degree of engagement of its partner in a given task. Drawing on work on the evaluation of human-robot cooperation, we introduced an engagement metric that can be related to the user's cognitive load.
- Assistive robotics: the aim is to propose innovative systems to assist people with disabilities by exploiting social signals, in particular engagement. We also enriched the characterization of the user's state through the analysis of physiological signals, thus offering an alternative channel of interaction with the robot.
It should be noted that our work on the characterization of affective speech (Chapter 1) naturally finds its place in interactive robotics. Part of this work is currently being ported to robotic systems (FP7 MICHELANGELO and FUI PRAMAD2 projects). For the sake of brevity, we do not present this work, which is methodologically close to that described in Chapter 1.
The contributions presented in this chapter were carried out within projects funded by the ANR (TecSan'09 ROBADOM and, to a lesser extent, ANR MIRAS) or within collaborations (eNTERFACE'08 summer school, European COST Action 2102). Most of the methods described are integrated into assistive robotic systems and tested in natural conditions: interaction with patients with various pathologies, deployment in clinical departments.
3.3 Non-verbal supports of the dynamics of an interaction
3.3.1 Phatic communication
Dynamics are necessary to any communication and are also an indicator of the quality of the interaction (Chapter 2). They are largely absent from robotic systems, which leads to interactions that are often perceived as laborious. The robot's lack of expressiveness regarding its intentions and/or internal states is one of the major causes of the breakdown of the interactive loop.
The interactive loop is sustained by phatic communication, which may be verbal or non-verbal: vocalizations (e.g. "hum-hum", "yes"), reformulations or requests for clarification, head movements (e.g. nodding), gestures, facial expressions (e.g. smiling), gaze... This is referred to as the "back-channel" [Allwood et al., 1992], as opposed to the "main channel", which is the focus of most research efforts in interactive robotics. The dynamics created by the listener's production of back-channels differ from those of speech turns: back-channels aim to support the sender in conveying the message.
The automatic and efficient generation of multimodal back-channel signals, generally coupled with dialogue systems, is a required step for the deployment of communicating agents (virtual or robotic) [Wrede et al., 2010; Morency, 2010; Schroder et al., 2011; Vinciarelli et al., 2011].
3.3.2 Modelling communication
The work presented in this section was carried out during the eNTERFACE'08 summer school2. I was the principal investigator of the project "Multimodal Communication with Robots and Virtual Agents", with co-investigators Thierry Dutoit (Université de Mons, Belgium), Jean-Claude Martin (LIMSI) and Catherine Pelachaud (Institut Telecom). The principle of this school is to bring together senior researchers and students from several universities around a project for 4 weeks. I thus had the opportunity to co-supervise several students: Samer Al Moubayed (KTH), Malek Baklouti (Thales), Ammar Mahdhaoui (UPMC), Stanislav Ondas (Technical University of Kosice, Slovakia), Jérôme Urbain (Mons) and Yilmaz Mehmet (Koc University, Turkey).
2 http://enterface08.limsi.fr/static/projects/7/e08-project7.pdf
Project description
We adopted an experimental setting originally defined to study human communication [McNeill, 1992]: a quasi-monologue in which a speaker tells a story and a listener expresses engagement through non-verbal cues. We proposed to develop a platform common to virtual agents and robots for regulating the interaction. The robot (or agent) analyses communication signals: prosody, gesture, facial expression (smile) and touch. After interpreting these signals, the robot/agent must be able to produce non-verbal signals to express its engagement in the interaction ("back-channel").
The targeted application is the development of a sensitive listener (active listening): the human tells a story and the robot is able to show its interest and engagement during the narration (so as not to break the interactive loop). A database was collected in several languages (English, Arabic, French, Slovak and Turkish). The annotation focused on certain social signals: smile, head gestures (nodding) and acoustic prominence (rapid change of the voice during speech). The database is described in [Al Moubayed et al., 2009].
Statistical modelling of the interaction
The approach is similar to that proposed in section 2.3. From the annotation of interactive (human-human) situations, we estimated bigram models that identify structures of the interaction:
"If some signal (e.g. head-nod | pause | pitch accent) is received, then the listener sends some feedback_signal with probability X."
Some of these structures have been studied in the literature [Ward and Tsukahara, 2000; Maatman et al., 2005]. Our approach consisted in confirming the presence of these structures in our models through the statistical significance of a given n-gram. Another contribution was to propose new structures that handle multimodality:
- Mono-modal signal ⇒ mono-modal feedback: head_nod is received, then the listener sends head_nod_medium.
- Mono-modal signal ⇒ multi-modal feedback: smile is received, then the listener sends head_nod and smile.
- Multi-modal signal ⇒ mono-modal feedback: head_activity_high and pitch_prominence are received, then the listener sends head_nod_fast.
- Multi-modal signal ⇒ multi-modal feedback: pitch_prominence and smile are received, then the listener sends head_nod and smile.
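Rules of this form can be estimated by simple counting over an annotated sequence. The sketch below is a minimal illustration, assuming that the annotation is a chronological list of (role, label) events and that a multimodal signal is encoded as a single combined label; the function name and the event format are hypothetical and do not describe the tooling used during the project.

```python
from collections import Counter, defaultdict

def estimate_feedback_rules(events):
    """events: chronological list of (role, label), role in {"speaker", "listener"}.

    Returns P(listener feedback | preceding speaker signal), estimated from
    adjacent pairs of events (bigram counts).
    """
    pair_counts = defaultdict(Counter)  # signal -> feedback -> count
    signal_counts = Counter()           # signal -> occurrences followed by any event
    for (role1, label1), (role2, label2) in zip(events, events[1:]):
        if role1 != "speaker":
            continue
        signal_counts[label1] += 1
        if role2 == "listener":
            pair_counts[label1][label2] += 1
    return {
        signal: {fb: c / signal_counts[signal] for fb, c in feedbacks.items()}
        for signal, feedbacks in pair_counts.items()
    }

# Invented annotation: head_nod is answered twice out of three occurrences.
events = [
    ("speaker", "head_nod"), ("listener", "head_nod_medium"),
    ("speaker", "pause"),
    ("speaker", "head_nod"), ("listener", "head_nod_medium"),
    ("speaker", "head_nod"),
    ("speaker", "smile"), ("listener", "smile"),
]
rules = estimate_feedback_rules(events)
print(rules["head_nod"]["head_nod_medium"])  # 2 answers out of 3 occurrences
```

Each entry of the returned dictionary reads as one rule of the form "if the signal is received, then the listener sends this feedback with probability X"; testing the significance of a given bigram is a separate step that is not shown here.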
N-grams allow data-driven modelling. Our prediction system learns the sequences of events produced by the speaker (e.g. head_activity_high and pitch_prominence) that lead the listener to emit a back-channel. As noted in Chapter 2, one limit of this approach is its dependence on annotation (choice of labels and quality of annotation). To address this problem when modelling the dynamics of back-channels, unsupervised n-gram adaptation methods can be used (online adaptation to the behaviour of the human partner). In addition, an integrative method would better qualify the importance of behaviours (section 2.3).
Détection automatique de signaux sociaux
The social signals studied are smiling, the amount of head movement, head gestures (e.g. "head_nod", "head_shake") and acoustic prominence. The detection of facial cues was carried out by Malek Baklouti (Thales) under my supervision. The techniques used rely on the detection and tracking of facial feature points [Al Moubayed et al., 2009]. A voice activity detector, adapted to the real-time context, was also developed to characterize pauses. In this manuscript, we only return to the detection of prominence, defined as an abrupt change in speech rhythm. This break in the rhythm of narration calls for the listener's attention in the form of feedback. The detection of these events relies on the Hotelling distance (cf. section 1.4.2, equation 1.17). A Gaussian model of the speaker's prosodic parameters, estimated online, allows speaker-specific detection of prominent prosodic events [Al Moubayed et al., 2009].
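As an illustration of online, speaker-specific detection, the sketch below keeps a running Gaussian model of prosodic feature vectors and flags frames that lie far from it. It is a simplification under stated assumptions: the covariance is taken as diagonal, the distance is a squared Mahalanobis-type distance rather than equation 1.17 (not reproduced here), and `OnlineGaussian`, `detect_prominence`, `threshold` and `warmup` are hypothetical names and values.

```python
class OnlineGaussian:
    """Running mean and variance per feature (Welford's algorithm)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sums of squared deviations

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def distance(self, x, eps=1e-6):
        """Squared distance to the model, assuming a diagonal covariance."""
        if self.n < 2:
            return 0.0
        total = 0.0
        for i, xi in enumerate(x):
            var = self.m2[i] / (self.n - 1)
            total += (xi - self.mean[i]) ** 2 / (var + eps)
        return total

def detect_prominence(frames, threshold=9.0, warmup=5):
    """Return indices of frames that deviate strongly from the speaker model."""
    model = OnlineGaussian(len(frames[0]))
    events = []
    for t, x in enumerate(frames):
        if model.n >= warmup and model.distance(x) > threshold:
            events.append(t)   # prominent frame: do not adapt the model to it
        else:
            model.update(x)    # ordinary frame: keep adapting to the speaker
    return events
```

Skipping the update on flagged frames is a design choice that keeps outliers from inflating the variance; other adaptation schemes are possible.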
The algorithms developed during the summer school have been made available to the scientific community. As part of a collaboration with the partners of the Emotirob project (footnote 3), coordinated by Dominique Duhaut (Université de Bretagne Sud), Sébastien Saint-Aimé was able to integrate prominence detection into his human-robot interaction system.
Generation of non-verbal signals
The principle of the non-verbal signal generation model is described in figure 3.1. "Back-channel" prediction relies on hidden Markov models previously trained on the human-human database. The probabilities output by the models are analysed (e.g. temporal smoothing) in order to make a discrete decision on the type of "back-channel" to emit. Once the decision is made, the algorithm deactivates all the models for a few seconds so as not to generate several "back-channels".
3 http://www-valoria.univ-ubs.fr/emotirob/
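The decision stage described above (smoothing, discrete decision, temporary deactivation) can be sketched as follows. This is only an illustration: the exponential smoothing, the `threshold`, `alpha` and `refractory` values, and the function name `decide_backchannels` are assumptions, and time is counted in frames rather than seconds.

```python
def decide_backchannels(prob_stream, threshold=0.6, alpha=0.5, refractory=3):
    """Turn per-frame model probabilities into discrete back-channel decisions.

    prob_stream: list of dicts {backchannel_type: probability}, one per frame.
    Returns a list of (frame_index, backchannel_type).
    """
    smoothed = {}
    decisions = []
    silent_until = -1  # last frame of the current deactivation period
    for t, probs in enumerate(prob_stream):
        # Exponential smoothing of each model's output probability.
        for name, p in probs.items():
            previous = smoothed.get(name, p)
            smoothed[name] = alpha * p + (1 - alpha) * previous
        if t <= silent_until:
            continue  # models are deactivated: no decision on this frame
        best = max(smoothed, key=smoothed.get)
        if smoothed[best] >= threshold:
            decisions.append((t, best))
            silent_until = t + refractory
    return decisions
```

The deactivation period prevents a single sustained peak in the probabilities from triggering a burst of identical feedbacks.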
The architecture is independent of the interactive platform (e.g. Aibo, GRETA). Its flexibility comes from transmitting to the agents (robotic or virtual) high-level information described in the BML (Behavior Markup Language) control language [Vilhjálmsson et al., 2007]. To this end, we defined reactive behaviours (e.g. head movement) adapted to each platform. The evaluation of the architecture is described in [Al Moubayed et al., 2009]. Note that this flexibility allowed rapid deployment on several platforms (Aibo, Emotirob) during the "Villes Européennes des Sciences" exhibition in November 2008 at the Grand Palais (Paris).
Fig. 3.1: Principle of the non-verbal feedback generation model

3.4 Characterization of the degree of engagement
3.4.1 Addressee detection
An important step in developing an interactive system is identifying the addressee. Face detection and/or sound source localization are the strategies generally followed. Within the ANR ROBADOM project, we developed an addressee detection algorithm that exploits audio-visual synchrony. The principle of the detection, described in figure 3.2, relies on (1) the detection of potential partners ("on-view") and (2) the detection of a talking face by correlating audiovisual features (MFCC and DCT of the mouth region) ("on-talk").
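The "on-talk" step can be illustrated by a minimal sketch that correlates one audio feature track with one mouth feature track per visible face. It assumes that both tracks have already been reduced to one value per frame (the MFCC and DCT extraction is not shown), and the names `pearson`, `detect_addressee` and `min_corr` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant track carries no synchrony information
    return cov / (sx * sy)

def detect_addressee(audio_track, mouth_tracks, min_corr=0.5):
    """Return the id of the on-view face best synchronized with the audio.

    mouth_tracks: dict {face_id: per-frame mouth feature} for on-view faces.
    Returns None when no face is sufficiently correlated (nobody on-talk).
    """
    best_id, best_corr = None, min_corr
    for face_id, track in mouth_tracks.items():
        corr = pearson(audio_track, track)
        if corr >= best_corr:
            best_id, best_corr = face_id, corr
    return best_id
```

Returning None when no correlation exceeds the threshold keeps the on-view and on-talk decisions separate: a face can be visible without being the talking one.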
Particle-filter tracking of the addressee makes it possible to keep contact with the addressee [Chetouani et al., 2010] (cf. figure 3.3).

Fig. 3.2: Principle of addressee detection: on-view + on-talk. (a) Addressee detection (video and audio inputs, potential partners detection, on-talk detection, addressee detection); (b) Detection example
Addressee detection based on the cascade "on-view vs off-view" then "on-talk vs off-talk" is often too simplistic. It is indeed possible to talk with an addressee who is not visible and, conversely, to see an individual who has no intention of starting an interaction. Modelling the dynamics of the "on-view vs off-view" and "on-talk vs off-talk" states is required. Oppermann et al. [2001] stress the importance of detecting the "off-talk" register for improving the performance of speech recognition systems, dialogue relevance, etc. However, detecting it is not an easy task. For example, when a speaker starts reading instructions, lexical information turns out not to be discriminative. Context helps resolve detection ambiguities [Batliner et al., 2007; Lunsford et al., 2005].
For example, in the interactive situation of the ROBADOM project, described in figure 3.4, the patient continually changes addressee: computer (cognitive stimulation exercise), robot. However, this change of addressee does not necessarily reflect disengagement from the task.
The users of the experimental cognitive stimulation setup are patients with Mild Cognitive Impairment (MCI). Cognitive stimulation is identified as one of the methodologies that can attenuate the decline, in elderly people, of certain cognitive functions (e.g. memory, attention) [Yanguas et al., 2008]. Decreased attention and increased cognitive load reduce the patient's degree of engagement in cognitive stimulation exercises. Following the study of these interactive situations, in the form of dialogue (Wizard of Oz), we observed an increase in self-verbalizations (self-talk) in patients with mild cognitive impairment. This internal dialogue is considered an indicator of the patient's degree of engagement in the task.
One of the challenges of social robotics is to propose interactive systems able to assess the degree of engagement and to propose actions (verbal and/or non-verbal) that improve engagement. The next section presents our contributions to the characterization of engagement via self-talk.

Fig. 3.3: Maintaining eye contact: implementation on the Jazz robot
Fig. 3.4: Triadic situation: the case of the patient - stimulation exercise - therapist/robot interaction (ROBADOM project [Chetouani et al., 2010])
3.4.2 From self-talk to an engagement metric
Self-talk is an internal dialogue or private speech (self-directed speech) [Diaz and Berk, 1992]. It differs from speech addressed directly to the addressee, here the robot or the computer, which is termed social speech because it is explicitly addressed to others (system/robot directed speech). Self-talk is an indicator of the self-regulation of behaviours induced by the task (difficulties and performance) [Lunsford et al., 2005; Fernyhough and Fradley, 2005; Vygotsky, 1986]. Lunsford et al. [2005] studied the audiovisual cues of self-talk and concluded that contextual information (e.g. gaze, user activity) is important.
As part of Jade Le Maitre's thesis, we proposed to assess the degree of engagement of elderly people in interactive situations by characterizing self-talk [Le Maitre and Chetouani, 2011]. The targeted applications are the design of interfaces endowed with social intelligence, thereby improving the acceptability of assistive systems (in collaboration with the LUSAGE laboratory, AP-HP Broca, headed by Anne-Sophie Rigaud). Investigating self-regulation behaviours offers new perspectives for understanding individual interaction strategies.
Audiovisual corpus
An audiovisual corpus was built on the basis of the cognitive stimulation protocol (cf. figure 3.4). The corpus was collected in the Gerontology department of Broca Hospital with patients aged 66 to 88, some of whom have mild cognitive impairment.
The corpus is described in [Le Maitre and Chetouani, 2011]. The annotation is audiovisual and follows the protocol defined in [Lunsford et al., 2005], which consists in analysing the user's gaze and verbal production. Table 3.1 describes the distribution of verbal productions across the 8 patients. The duration of the productions ranges from 1 to 2.5 s. It is also worth noting that, for the same dialogue task, patients have very different strategies: amount of verbalization, ratio between speech registers. The MCI patients (3 and 7) produce more self-verbalizations. We proposed to automatically assess patients' individual strategies in order to estimate a degree of engagement in the task.
Self-talk detection
Hacker et al. [2006] showed the relevance of prosodic characterization for discriminating between the on-talk and off-talk registers. In addition, our experience in characterizing the register of speech addressed to an infant (infant-directed speech) (section 1.4) or to a robot shows that prosody plays a fundamental role in the expression of these dialogue acts. We opted for a characterization of rhythm: low-frequency analysis of the rhythmic envelope (cf. section 1.4). The rhythmic characterization is motivated (1) by the analysis of the collected speech registers (on-talk and self-talk) and (2) by the results of [Hacker et al., 2006] showing the importance of duration in distinguishing between these two speech registers.

Tab. 3.1: Amount of self-verbalization and of speech addressed to the system

Patients                  1    2    3    4    5    6    7    8   Total
Self-Talk                10    1  106   14   30   37   58   49     315
System Directed Speech   19    7   85    2    6   20   37   55     231
Table 3.2 presents the results. Among the important results, the characterization of signal energy is more discriminative than that of the fundamental frequency (pitch). The main reason for this difference is that self-talk is produced for oneself and therefore often with lower energy. Rhythm brings a non-negligible gain, notably when combined with the SVM classifier (cf. table 3.2). The scores obtained with this last configuration make it possible to consider using automatic self-talk detection in an engagement assessment system.
Tab. 3.2: Recognition scores (10-fold cross-validation)

Features                   Decision Tree   k-NN     SVM
Pitch                      49.8%           53.35%   52.16%
Energy                     55.54%          54.29%   59.51%
Rhythm                     52.78%          56.58%   56.97%
Pitch + Energy             57.42%          59.28%   64.31%
Pitch + Energy + Rhythm    55.46%          58.20%   71.62%
Engagement metric
Figure 3.5 describes the principle of assessing the level of engagement in the dialogue task (cf. figure 3.4). Our approach exploits the fact that self-talk production is an indicator of the patient's cognitive load.

Fig. 3.5: Description of the proposed system for assessing the degree of engagement
Inspired by the work of [Olsen and Goodrich, 2003] on metrics in human-robot interaction, we proposed in [Le Maitre and Chetouani, 2011] to characterize the human's "interaction effort" (IE: Interaction Effort). IE is a unitless measure reflecting the effort the user devotes to the interaction. Estimating this measure is not easy because it requires advanced techniques for assessing effective interaction (e.g. eye-tracker and/or brain activity). The originality of our work is to consider (1) speech addressed to the robot (on-talk) associated with eye contact (on-view) as effective interaction and (2) the production of self-talk as an indicator of the cognitive load related to task difficulty. The interaction effort is estimated by:

$$IE = \frac{SDS}{SDS + ST} \qquad (3.1)$$

where SDS is the duration of speech addressed to the robot and ST the duration of self-verbalizations (self-talk).
The numerator characterizes effective interaction, whereas the denominator is an indicator of the user's overall verbal production. IE is a unitless measure (0 ≤ IE ≤ 1). An interaction is considered efficient if it requires no self-regulation behaviours (IE ≈ 1). IE measures the quality of the interaction, which we interpret, at least for this task, as an indication of the patient's degree of engagement. Self-regulation behaviours allow patients, in some cases, to improve their performance [Fernyhough and Fradley, 2005; Vygotsky, 1986].
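Equation 3.1 translates directly into code. The sketch below is only an illustration: the function name `interaction_effort` and the handling of the case where the user says nothing are assumptions, since the text does not specify what happens when SDS + ST = 0.

```python
def interaction_effort(sds, st):
    """Interaction effort IE = SDS / (SDS + ST), as in equation 3.1.

    sds: total duration of system/robot directed speech (seconds).
    st:  total duration of self-talk (seconds).
    """
    if sds < 0 or st < 0:
        raise ValueError("durations must be non-negative")
    total = sds + st
    if total == 0:
        # Assumption: IE is undefined when no verbal production is observed.
        raise ValueError("no verbal production: IE is undefined")
    return sds / total
```

For instance, with hypothetical durations of 3 s of robot-directed speech and 1 s of self-talk, `interaction_effort(3.0, 1.0)` gives 0.75; a user producing only self-talk gets 0, and a user producing none gets 1.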
We studied the degree of engagement of several elderly people in the triadic task (cf. figure 3.4). Table 3.3 presents the results of this experiment for manual annotation and for automatic detection of self-talk. The IE measure reflects the patients' level of engagement in the task. For example, patients 4 and 5 interacted very little with the robot while exhibiting many self-regulation behaviours (cf. table 3.1). This results in a very low degree of engagement (≤ 0.2).
The differences observed between manual annotation and self-talk detection are due (1) to the performance of the self-talk detector (table 3.2) and (2) to the automatic voice activity detection which, despite its adaptation to the robotic context [Al Moubayed et al., 2009], does not perform well enough on sounds with highly variable energy. Nevertheless, we find the same trends, namely a high IE value for strong engagement. By the definition of the IE metric (equation 3.1), the number of verbalizations directly affects the estimate: the lower it is, the poorer the estimate of the degree of engagement (under-estimation for patient 4). The IE metric is not intended to be a generic measure of engagement. It is only suited to dialogue situations with elderly people.
We can nevertheless note that the IE metric yields trends that prove useful for understanding individual behaviours. These engagement levels will subsequently be used to modify the robot's behaviour: production of verbal and/or non-verbal behaviours. This step is currently being studied with the partners of the ROBADOM project.
Tab. 3.3: Estimation of the interaction effort (degree of engagement)

Patients               1      2      3      4      5      6      7      8
Annotation             0.5    0.83   0.45   0.13   0.20   0.43   0.42   0.57
Self-talk detection    0.62   0.78   0.53   0.08   0.26   0.46   0.38   0.63
3.5 Assistive robotics
Our contributions in social intelligence build on application domains related to assisting people with impairments, which provides a rich experimental ground for developing new interaction models. Another motivation for this field of application is the societal impact, anticipated or proven, of assistive robotics [Feil-Seifer and Mataric, 2005; Tapus et al., 2007; Broekens et al., 2009].
The following sections illustrate some achievements and experiments carried out in collaborative projects aimed at developing assistive robots.
3.5.1 Multi-modal interface
Sections 3.5 and 3.4 underlined the importance of characterizing social signals for establishing, maintaining and assessing engagement in a given task. In assistive robotics for elderly people, it is essential to take into account cognitive decline, which affects attention, memory, etc. This decline affects engagement in complex tasks. Social robots are meant to help patients by offering encouragement, guidance, etc. [Tapus et al., 2007]. Optimizing the signals exchanged (nature and dynamics) is essential for improving the acceptability and the effect of assistive robots.
As part of a collaboration with the company Robosoft, formalized as a co-supervision of Consuelo Granata's thesis, we proposed to improve the engagement of elderly people in interactive tasks through the design of a new multi-modal interface. The principle of the interface is described in figure 3.6(a). Our idea relies on exploiting cross-modality: visual support for the audio stream. In [Granata et al., 2010], we proposed a system allowing the user to interact either by speech or with the touch screen. Speech synthesis is associated with a text or a graphical representation, thus creating a visual support for the message. A study made it possible to identify and evaluate the preferences of elderly users with or without mild cognitive impairment. The system deployed by Robosoft on service robots is shown in figure 3.6(b).
Fig. 3.6: Principle of the multi-modal interface deployed in a service robot for elderly people. (a) Visual support for the auditory modality; (b) Interface example
3.5.2 Engagement in a physical interaction
Engagement in a task is not characterized only by verbal and non-verbal signals. Several interactive situations involving patients (most often elderly) require the analysis of implicit cues of engagement: dual task (cognitive load), pathologies that make communication difficult or impossible (e.g. advanced stages of Alzheimer's disease), etc. Mower et al. [2007] exploit physiological signals (skin conductance) to predict users' intention to end an exercise.
One of the benefits of robotics is that it offers an interface that is both cognitive and physical. It is thus possible to interact with the robot through physical interaction (e.g. touch, manipulation). As part of the co-supervision of Cong Zong's thesis with Xavier Clady, we worked on estimating the state of a patient handling a robotic walker (figure 3.7(a)) (ANR MIRAS project). The idea is to perceive, continuously, the user's movements and intentions by measuring physiological signals (e.g. heart rate) and gait dynamics.
The degree of engagement of a patient handling a walker is estimated by the physiological cost index of walking (PCI, Physiological Cost Index). The PCI is a simple, non-invasive and functional tool, commonly used in clinical practice, for measuring the energy expended during walking. It consists in measuring the change in heart rate between rest and the end of a walking test, divided by the walking speed. Using the sensors that will be installed on the robot, we will be able to measure and analyse the physiological cost index of walking in real time. However, several difficulties have been identified:
Heart rate variation in elderly subjects is very large, which will limit the interpretation and robustness of the PCI analysis.
The PCI is calibrated for a standard walking test over a fixed distance, which is not necessarily compatible with unconstrained use of a walker.
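The definition of the index given above (heart rate change divided by walking speed) can be written as a short function. This is a sketch under assumptions: heart rates are taken in beats per minute and speed in metres per minute, so that the result is in beats per metre, and the name `physiological_cost_index` and its parameters are introduced here for illustration.

```python
def physiological_cost_index(hr_rest_bpm, hr_walk_bpm, distance_m, duration_s):
    """Physiological cost index: heart rate change divided by walking speed.

    hr_rest_bpm: heart rate at rest (beats/min).
    hr_walk_bpm: heart rate at the end of the walking test (beats/min).
    distance_m, duration_s: distance walked and time taken.
    Returns the index in beats per metre (assumed unit convention).
    """
    if distance_m <= 0 or duration_s <= 0:
        raise ValueError("distance and duration must be positive")
    speed_m_per_min = distance_m / (duration_s / 60.0)
    return (hr_walk_bpm - hr_rest_bpm) / speed_m_per_min
```

With hypothetical values of 70 beats/min at rest, 100 beats/min after walking 10 m in 20 s, the speed is 30 m/min and the index is about 1 beat per metre. The two difficulties listed above correspond to noise in the numerator and to the absence of a fixed `distance_m` in unconstrained use.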
The integrative nature of robotics requires these issues to be addressed in parallel. We developed a system for non-linear and adaptive characterization of physiological signals. Characterizing the phases of gait while the walker is being handled is a required step for estimating the physiological cost index. This work is part of the development of mobility aid systems: Personal Aids for Mobility [Lacey and MacNamara, 2000; Spenko and Dubowsky, 2006]. The challenge is to endow these robots with advanced human perception capabilities (including physiological state) in order to offer a set of services: navigation assistance, physiological monitoring, etc.
3.5.2.1 Physiological signals
Physiological signals give access to the user's internal state and are indicators of engagement [Mower et al., 2007] and of emotional state [Kim and André, 2008]. Because the signals are very heterogeneous in nature (figure 3.8), their processing often calls for advanced signal processing methods. The originality of our approach is to exploit the rhythmic components of physiological signals [Cong and Chetouani, 2009]. The rhythm characterization based on the Hilbert-Huang transform (HHT) (section 1.4.2) was adapted to the processing of physiological signals. Empirical mode decomposition (EMD) decomposes the physiological signal into amplitude- and frequency-modulated components (intrinsic mode functions, IMF). Our approach exploited this adaptive, non-linear decomposition to detect oscillating components induced by emotions. This approach introduces a fission step in the processing (figure 3.8).

Fig. 3.7: Illustrations of work exploiting physiological signals. (a) Robotic walker; (b) Acquisition platform; (c) Synthesis example
Fig. 3.8: Characterization of physiological signals based on data fission
In [Cong and Chetouani, 2009], we present a classification system exploiting the fission of physiological signals: heart rate, respiration, skin conductance and the electrical activity of muscles. We used the University of Augsburg database [Kim and André, 2008]. The SVM classification results are presented in table 3.4. The baseline system uses statistics applied directly to the physiological signals [Kim and André, 2008]. The fusion approach estimates a mean frequency of the signal from all the intrinsic modes. The fission approach selects the modes relevant to emotion recognition. This last approach achieves the best score while reducing the dimension of the feature vector (cf. table 3.4).
The University of Augsburg database [Kim and André, 2008] offers a limited setting that is, above all, different from the targeted applications (elderly people handling a walker). Completing the design of the robotic walker (figure 3.7(a)) will make it possible to include the sensors needed to acquire physiological signals: heart rate and skin conductance (contact measurements). Robosoft is in charge of designing this device, which should be available by the end of Cong Zong's thesis (spring 2012).
Tab. 3.4: Classification scores from four physiological signals

                      Baseline          HHT-based 'Fission'   HHT-based 'Fusion'
                      (32 parameters)   (28 parameters)       (24 parameters)
Using four signals    71%               76%                   62%
3.5.2.2 Gait characterization
Gait breaks down into four phases: (1) postural tone, (2) initiation of the first step, (3) movement and (4) termination. The device used, infra-red (IR) sensors positioned at the level of the patient's legs (cf. figure 3.7(b)), aims to characterize the stance and swing phases of gait.
Six patients aged 77 to 90, with various physical disorders (including fallers), used the walker equipped with the measurement device [Cong et al., 2010] (Charles-Foix and Henri Mondor hospitals). We evaluated the characterization performance during a 10 m walking test. The results, presented in table 3.5, are consistent with those obtained with an accelerometer (a sensor widely used for gait characterization).
Tab. 3.5: Estimation of temporal gait parameters

Patients               1       2       3       4       5       6
Stance time (s)        1.09    1.03    0.55    1.84    1.31    1.12
                       (68%)   (70%)   (65%)   (78%)   (67%)   (68%)
Swing time (s)         0.52    0.43    0.30    0.53    0.65    0.53
                       (32%)   (30%)   (35%)   (22%)   (33%)   (32%)
Cadence (steps/min)    74.7    83.5    142.8   50.7    61.2    72.1
Characterizing gait during everyday use of the walker requires segmenting homogeneous walking phases. The challenge is to decompose usage into elementary actions. Gait characterization during these elementary actions, combined with the characterization of physiological signals, paves the way for estimating the physiological cost index. In [Cong et al., 2010], we presented an algorithm for segmenting the signals from the IR sensors. The algorithm compares adjacent windows using similarity metrics (generalized likelihood ratio (GLR) and Kullback-Leibler divergence). In a speed change detection task, IR sensors turn out to perform less well than the accelerometer. The latter is worn directly by the patient, allowing a more faithful characterization of movements.
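The comparison of adjacent windows can be sketched with the Kullback-Leibler divergence. The code below is an illustration rather than the algorithm of [Cong et al., 2010]: it assumes a one-dimensional signal, models each window as a univariate Gaussian, uses the symmetrized divergence, and the names `symmetric_kl`, `detect_changes`, `win` and `threshold` are hypothetical.

```python
import math

def gaussian_stats(window, eps=1e-6):
    """Mean and (regularized) variance of a window of samples."""
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n + eps
    return mean, var

def symmetric_kl(w1, w2):
    """Symmetrized KL divergence between Gaussian fits of two windows."""
    m1, v1 = gaussian_stats(w1)
    m2, v2 = gaussian_stats(w2)
    d2 = (m1 - m2) ** 2
    kl12 = 0.5 * (math.log(v2 / v1) + (v1 + d2) / v2 - 1.0)
    kl21 = 0.5 * (math.log(v1 / v2) + (v2 + d2) / v1 - 1.0)
    return kl12 + kl21

def detect_changes(signal, win=10, threshold=5.0):
    """Return one change point per run of above-threshold divergences."""
    changes = []
    best_t, best_score = None, threshold
    for t in range(win, len(signal) - win + 1):
        score = symmetric_kl(signal[t - win:t], signal[t:t + win])
        if score >= threshold:
            # Inside a run: remember the position with the largest divergence.
            if best_t is None or score > best_score:
                best_t, best_score = t, score
        elif best_t is not None:
            changes.append(best_t)  # the run has ended: keep its peak
            best_t, best_score = None, threshold
    if best_t is not None:
        changes.append(best_t)
    return changes
```

Keeping only the peak of each above-threshold run avoids reporting the same change several times while the sliding windows straddle it.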
A model-based approach was proposed, in collaboration with Xavier Clady and Philippe Bidaud, as part of Cong Zong's thesis. The idea relies on estimating the parameters of the Human36 model [Wieber et al., 2008]. Motion capture (Codamotion) makes it possible to reconstruct this model perfectly and then use it to predict the trajectory of the centre of mass, which is an indicator of human stability during walking (cf. figure 3.9). The position of the legs, estimated by the IR sensors, is combined with a model of the upper limbs obtained from a 3D camera (cf. figure 3.7(b)). An example of model synthesis by the on-board capture system is shown in figure 3.7(c). The method yields a prediction of the centre-of-mass trajectory. As figure 3.9 shows, the characteristics of the gait cycle are preserved: rhythm and complexity [Cong et al., 2011]. The model-based approach gives finer access to gait dynamics.
Fig. 3.9: Centre-of-mass trajectory

3.6
Our contributions in social intelligence for personal robotics concern the estimation of the user's intention and engagement. The work presented in this chapter addresses the characterization of these social acts using multi-modal signals (speech, face, gesture, physiological signals and gait). This characterization builds on models developed in the context of speech signal analysis (Chapter 1) and communication dynamics (Chapter 2).
Engagement is a complex phenomenon involving communication signals, empathy, social bonds, etc. It is considered an indicator of how the interaction unfolds. The difficulties encountered in characterizing it reflect its various forms of manifestation: from dialogue acts to synchrony, including implicit signals (e.g. physiological signals).
In interactive robotics, we can identify three broad areas to address in designing personal robots endowed with social intelligence: (1) the characterization of verbal and non-verbal cues of engagement, (2) the generation of robot behaviours expressing engagement, and (3) the learning of models of interaction dynamics. Defining generic cues that make it possible to establish and maintain the basic contract of an interaction (human-human and human-robot) remains an open problem. Convergence with linguistics, pragmatics, cognitive science, psychology and neuroscience is undeniably one of the paths to follow. The analysis of social phenomena such as engagement or synchrony also requires studying very diverse interactive situations, like the experiments conducted in this manuscript.
The verbal and non-verbal behaviours generated by the robot play a fundamental role in the interaction. The robot's embodiment affects acceptability and interaction. Within the ROBADOM project, we are conducting a study on the acceptability of robotic systems by elderly people. The design of interactive devices that improve engagement, and interaction more generally, must combine several technologies (e.g. tablet, facial expressions) [Bidaud et al., 2010]. Synthesizing social actions requires defining a finite dictionary of multi-modal behaviours. The diversity of robotic platforms requires this dictionary to be tailored to each platform. Note that standardization efforts for behaviour synthesis (e.g. Behavior Markup Language) offer flexibility and ease of use.
Managing the timing of behaviours is a research direction to investigate. Data-driven approaches (from human-human to human-robot) require the development of advanced learning models. The generative models (HMM) presented in this chapter are clearly not sufficient. Discriminative models such as Hidden Conditional Random Fields introduce the constraints needed for a fine characterization of interaction dynamics from heterogeneous and dynamic sources. The integrative approach offered by non-negative matrix factorization should be exploitable on sequences of behaviours (Chapter 2).
These research directions will lay the foundations for designing interactive systems endowed with cooperation capabilities and, more generally, social intelligence.
Research project
Our research project is organized around three axes: (1) the dynamics of communication, (2) social intelligence and interfaces, and (3) the convergence between social signal processing and cognitive science, psychology, psychiatry and neuroscience. These research axes echo the themes of the IMI2S group (Intégration Multi-modale, Interaction et Signal Social), which we proposed to create within the Interfaces et Interactions team during the restructuring of ISIR. To our knowledge, this is a unique approach aiming to bring together social signal processing, interactive robotics, psychology and psychiatry.
Dynamics of communication
Social signal processing is concerned with the analysis of signals exchanged with humans and, as the work presented in this manuscript has illustrated, the dynamics of these exchanges is fundamental. Yet an objective definition of this dynamics remains an open problem. From a signal processing point of view, the dynamic component of communication is not directly observable. It can only be studied through communication signals and individual behaviours. We detail below the research axis on the dynamics of communication. This axis is structured around three questions considered fundamental.
Which characterization(s)?
The characterization of social signals, which are often multi-modal, is a crucial step towards understanding interaction. Models allowing dimensional characterization are emerging, but they are currently dedicated to mono-modal emotional signals. The dynamics of human communication is expressed in very diverse forms: use of multi-modal signals (e.g. gesture + speech), different time scales, unbalanced exchanges of signals between partners, etc. Consequently, an objective definition of the dynamics is imprecise. The low inter-judge agreement in subjective evaluations of interactional dynamics (e.g. cohesion, synchrony) reflects this difficulty.
Our experience in analysing the dynamics of communication in varied contexts brings out a few recurring patterns that help structure research on characterization. The analyses carried out suffer from an under-characterization of context. The difficulty lies in the fact that the contextual information of a given social signal depends on other social signals. For example, language is intrinsically linked to the dynamics of communication without being its exclusive component. Gaze resolves many ambiguities while itself being influenced by context (e.g. task, partners' intentions). Within the MULTI-STIM project, we recently extended the characterization of interactional synchrony described in Chapter 2 by proposing a paradigm of object manipulation on a computer. This paradigm allows the study, via an eye-tracker, of gaze direction as well as precise identification of the manipulated objects. This information was clearly required in our work and is being integrated into our characterization system.
The challenges in characterizing the dynamics of communication lie in (1) the ability to study very different social signals simultaneously (e.g. nature, time scales, role) and (2) the proposal of integrative models for inter-modal analysis.
Which model(s)?
Because an objective definition is lacking, modeling the dynamics is not straightforward. An integrative representation, such as synchrony matrices or non-negative matrices, appears to be a relevant avenue that we will continue to explore. Exploiting these integrative representations requires studying mathematical properties that yield a decomposition that is relevant and consistent with the goal of analyzing the dynamics of communication. Non-negative matrix factorization algorithms offer possibilities for modeling social behaviors. The difficulty lies in identifying faithful indicators of behaviors and in integrating them into the optimization criteria of the decompositions. Our work on social signal analysis should allow low-level indicators to be identified. The avenues considered include proposing discriminative models as early as the dynamics modeling stage.
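As a reminder of what a non-negative matrix factorization does, here is a minimal sketch using the classical multiplicative updates for the Frobenius criterion. It is not the algorithm developed in this work: the function names, the random initialization and the plain least-squares criterion (with no behavioral indicators in the objective) are assumptions for illustration only.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (n x m) as W (n x rank) times H (rank x m).

    Illustrative sketch with standard multiplicative updates; the criterion
    used in the manuscript may differ.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h
```

In a setting like the one described here, the rows of V could hold per-segment counts of annotated behaviors, so that each row of H reads as a group of co-occurring behaviors. Adding indicator-driven or discriminative terms to the criterion would change the update rules.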
The subjectivity of the social phenomena studied must be explicitly included in the models. Proposing semi-supervised learning algorithms generalized to multi-modal signals is identified as a necessary step. So far we have studied co-training algorithms that rely on the cooperation of classifiers during the learning phase. A common formalization with the integrative models mentioned above would make it possible to handle simultaneously the subjective and inter-modal aspects that describe contextual information. The latter is required to understand and interpret the dynamics of communication. One of the structuring axes of the IMI2S group recently formed at ISIR concerns intermodality. In this framework, we are currently conducting research on modeling the strategies used by young and elderly patients, with or without pathology, to recognize emotions conveyed audiovisually. Intermodal strategies are studied from the perspective of information fusion (classifier complementarity metrics).
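The text does not name the complementarity metrics used. As one possible illustration, the sketch below computes two common diversity measures between an audio and a visual classifier: the disagreement rate and the "oracle" accuracy (at least one of the two is right). The function names and the per-sample correctness encoding are assumptions.

```python
def disagreement(correct_a, correct_b):
    """Fraction of samples on which exactly one of the two classifiers is right.

    correct_a / correct_b: per-sample booleans (True if the classifier's
    decision matched the reference label). Hypothetical encoding.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both classifiers must be scored on the same samples")
    return sum(a != b for a, b in zip(correct_a, correct_b)) / len(correct_a)


def oracle_accuracy(correct_a, correct_b):
    """Upper bound on fusion accuracy: a sample counts if either classifier is right."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both classifiers must be scored on the same samples")
    return sum(a or b for a, b in zip(correct_a, correct_b)) / len(correct_a)
```

A high disagreement together with an oracle accuracy well above either individual accuracy suggests that the two modalities carry complementary information, which is what a fusion strategy can exploit.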
Evaluating the proposed models is a complex but important task. Building databases in realistic and spontaneous contexts enables interesting studies while raising questions about annotation, acquisition conditions, and so on. Nevertheless, the availability of such databases allows a rigorous comparison of the algorithms proposed by different teams. While pursuing this comparison effort, we also wish to propose new databases with richer paradigms (e.g. object manipulation, communication acts, long term). Clinical research offers a rigorous experimental framework suited to this ambition, but it requires a substantial mobilization of people and resources.
The IMI2S group, through its interdisciplinary composition, will make it possible to address the evaluation of models of the dynamics of communication. More generally, we believe that a research project on interactional dynamics must combine (1) characterization of communication signals, (2) integrative modeling, and (3) context modeling, while focusing on the interdependence of these components.
How can the dynamics be sustained?
Research on the dynamics of communication aims not only to define indicators of interaction but also to propose methodologies for sustaining these dynamics. As illustrated several times in this manuscript, a lack of synchrony tends to break social interaction (e.g. autism, personal robotics). Algorithms that help maintain engagement, and more generally interaction, are developed for specific interactive situations based on virtual or robotic agents.
Maintaining these dynamics relies on the ability to analyze the partner's social signals and to produce adequate responses. The models described above are intended to be used in this interactive context. The challenge lies in moving from an archaic mode of communication (e.g. back-channels of an active listener) to more elaborate communication. The FP7 Michelangelo project offers an experimental framework suited to this ambition: it aims to develop, in joint attention situations (e.g. therapist/parent - child - robot triad), a bio-feedback paradigm that makes the participants aware of the quality of the interaction. Social signal analysis will make it possible to identify and characterize indicators of the exchanges, which will then be used to predict robot behaviors. We plan to compare our approach, based on non-verbal signals, with one based on physiological signals (e.g. EEG). These steps will, of course, be carried out in an inter-disciplinary context.
Interfaces and social intelligence
Our grounding in robotics rests on the robot both as an experimental tool for social signal processing and as an assistive device. Because robotics is integrative by nature, social intelligence will only be fully expressed in large-scale projects involving perception of the environment, navigation, object manipulation, and so on. Changes in modes of communication (e.g. social networks, virtual and real worlds) call not only for adaptation but also for anticipation in research on social interaction.
Integrative projects
In addition to the MULTI-STIM and ROBADOM projects, we are involved in the FUI project PRAMAD2 (Plateforme Robotique d'Assistance et de Maintien à Domicile) of the FUI 11 program (Fonds unique interministériel). This project, coordinated by Orange, aims to develop simple, robust and communicating (internet-connected) robotic solutions in the homes of elderly people for long-term interactions (several weeks). PRAMAD2 brings together scientific and industrial partners from complementary fields: usage studies, mobile robotics, perception of the environment, ambient intelligence, serious games. Our contributions will concern the development of integrated social interaction systems (characterization of engagement), which will be combined with a dialogue system for assistance in daily activities (e.g. medication reminders) and for cognitive stimulation. One aspect not yet addressed in our work is the exploitation of the robot's mobility. The movement of the human and/or the robot is a social signal that is seldom used by the social signal processing community. Combining it with posture information will make it possible to study proxemics and its impact on engagement. Deploying robots in the home raises new questions. Human-robot interactions generally last only a few minutes, and moving to an interaction lasting several weeks is a major challenge. The robustness of algorithms and systems is obviously a dimension to be taken into account. Daily and/or weekly behaviors are sources of information whose characterization (sensors and algorithms) remains to be specified. Within the ROBADOM project, we conducted an experiment in which elderly patients were asked to wear an actimetry watch (e.g. accelerometer) for two weeks. A model inspired by the non-negative matrix decomposition developed for the study of home movies is used to structure the patients' activities. The extracted regularity will then be exploited by the robot to personalize its interventions.
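One plausible way to prepare actimetry recordings for such a decomposition is to aggregate them into a non-negative day-by-hour matrix. The sketch below does only that step. The input format (timestamped activity counts), the hourly resolution and the function name are assumptions, since the manuscript does not describe the preprocessing.

```python
from datetime import datetime


def daily_activity_matrix(samples):
    """Aggregate (timestamp, activity_count) pairs into a day x hour matrix.

    samples: list of (datetime, non-negative count) tuples, e.g. derived
    from a wrist-worn accelerometer. Hypothetical format for illustration.
    Returns the sorted list of days and one 24-value row per day.
    """
    days = sorted({stamp.date() for stamp, _ in samples})
    row_of = {day: i for i, day in enumerate(days)}
    matrix = [[0.0] * 24 for _ in days]
    for stamp, count in samples:
        # Sum all counts that fall within the same hour of the same day.
        matrix[row_of[stamp.date()]][stamp.hour] += count
    return days, matrix
```

Each row then describes one day as a 24-value activity profile. Factorizing the matrix with a non-negative decomposition would express each day as a mixture of a few recurring profiles, which is one way to read the "regularity" mentioned above.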
Our collaboration and integration efforts within large-scale projects will also be accompanied by an experimental component at ISIR, through the Robotex platforms and the experimental rooms (child and adolescent psychiatry department of the Pitié-Salpêtrière hospital, ISIR interaction room).
Virtual reality - robotics convergence
Our work has addressed social interaction with virtual and robotic agents. Within the interaction community, a drive toward the convergence of virtual and real worlds is emerging. Immersion and telepresence are examples of interactive situations that are likely to develop, with specific social interaction issues. In principle, we already have the elements needed to study these new experimental paradigms. We recently initiated a collaboration with ISIR's MAP group (Manipuler, Analyser et Percevoir les échelles micro et nanoscopiques) on intention detection and on characterizing the engagement of a human operator while manipulating virtual objects (e.g. molecules). This collaboration takes the form of the co-supervision of a thesis directed by Stéphane Régnier.
Haptics often serves as a mode of perception between virtual and real worlds. The scientific environment at ISIR, with the interaction team led by Vincent Hayward, favors the introduction of haptics into our models. One objective is to exploit this modality to assess the intentions and degree of engagement of human partners. Moreover, the active component of the haptic devices developed at ISIR would make it possible to recreate interactive loops, with our contribution focusing on the social component.
From clinical investigation to computational social sciences
The applications presented in this manuscript have demonstrated the relevance of social signal processing for clinical research. The long-term ambition of reality mining is to provide investigation tools similar to biomedical imaging. In other words, are we able to film interactions and produce a set of social indicators that faithfully reflect social dynamics? Several initiatives worldwide are moving toward this goal, such as those of Jeff Cohn at MIT for facial expressions or Andrew Meltzoff at the University of Washington for social learning. To this end, our efforts will focus on (1) the design of robust, accurate and interpretable models and (2) the collection of realistic data from a large number of patients with varied sensors (including the use of communicating objects). For example, thanks to our work, it is now conceivable to assess patients on emotion production and no longer solely on their ability to recognize stimuli.
Moving from individual modeling to group modeling (mainly a dyad in our work) requires the development of new models. Research conducted in psychology, psychiatry and the social sciences must also enrich these models. A similar problem, though often addressed at larger human and temporal scales, is found in an emerging field: computational social sciences. The proposed algorithms must take into account the dynamic and integrative dimensions of the exchanged signals. A fruitful dialogue is currently beginning between neuroscience, psychology, robotics and machine learning, to which we wish to contribute fully.
Curriculum vitæ
Individual Record
Mohamed CHETOUANI
Maître de Conférences (Associate Professor)
Section 61 of the Conseil National des Universités:
Computer Engineering, Automatic Control and Signal Processing
Institut des Systèmes Intelligents et de Robotique (ISIR)
UPMC - CNRS
UMR 7222
4 Place Jussieu
75252 Paris Cedex
- Career summary
- Scientific activities
- Teaching activities
- Collective responsibilities
- Research leadership
- Description of research activities
- Classified list of publications
Mohamed CHETOUANI
Institut des Systèmes Intelligents et de
Robotique UMR 7222
4 place Jussieu, 75252 Paris Cedex
Born 4 November 1978
French nationality
Married, 2 children
Tel.: 01-44-27-63-08
E-mail : [email protected]
Professional career
2001-2004: Doctoral student with a research allocation at Université Pierre et Marie Curie
2001-2004: Teaching assistant (moniteur) at Université Paris-Est Créteil Val de Marne (formerly Paris 12)
2005: Qualification for the position of Maître de Conférences, sections 27 and 61
2004-2005: ATER (temporary teaching and research associate) at Université Pierre et Marie Curie
April-May 2005: Visiting researcher at the University of Stirling (Scotland), Department of Computing Science & Mathematics (Prof. Amir Hussain); funding: Faculty Research Grant 2005 (University of Stirling)
July 2005: Visiting researcher at the Polytechnic University of Mataro (Barcelona), Signal Processing Group (Prof. Marcos Faundez-Zanuy); funded by the European COST Action 277 (Non-Linear Speech Processing), short-term mission grant
Since Sept. 2005: Maître de Conférences, section 61, at Université Pierre et Marie Curie
Since Sept. 2007: Lecturer at the Département Universitaire d'Enseignement et de Formation en Orthophonie (DUEFO), Faculté de Médecine Pierre et Marie Curie
Since Sept. 2008: Head of the Perception Artificielle et Handicap group at ISIR (7 permanent staff)
Education
June 1998: D.U.T. Génie Electrique et Informatique Industrielle, option Automatismes et Systèmes, mention Bien (ranked 1st), IUT of Université Paris XIII
June 1999: Licence d'Ingénierie électrique, mention Assez-Bien (ranked 1st), Institut Galilée, Université Paris XIII
June 2000: Maîtrise EEA (Electronique Electrotechnique Automatique), option Automatique et Informatique Industrielle, mention Bien (ranked 1st), Université Pierre et Marie Curie
June 2001: DEA Robotique et Systèmes Intelligents, option Contrôle des Systèmes Mécaniques, mention Bien (ranked 1st), Université Pierre et Marie Curie
Thesis
Doctoral thesis entitled "Codage neuro-prédictif pour l'extraction de caractéristiques de signaux de parole", defended on 14 December 2004 at the Laboratoire des Instruments et Systèmes, mention très honorable.
Scientific advisor: Prof. J.L. Zarader, Section 61, Université Pierre et Marie Curie (UPMC).
Jury: Prof. M. Milgram (President) - Dr. F. Bimbot (Reviewer) - Prof. M. Najim (Reviewer) - Dr. B. Gas - Dr. J.F. Bonastre
Scientific activities
Research team
Perception Artificielle et Handicap group
Scientific head of the group (7 permanent staff)
Research themes
Statistical methodologies: signal processing, feature extraction, pattern recognition and machine learning
Social signal processing: detection, analysis, fusion and recognition of signals
Human-robot interaction
Interface between engineering, psychology and cognitive sciences
Principal investigator of research projects
Emotion, Prosodie et Autisme project (2006-2011): Fondation France Telecom.
Mamanais (Motherese) project (2007-2011): Fondation de France, Autism program. Clinical lead: David Cohen (Head of the Child and Adolescent Psychiatry Department, Groupe Hospitalier de la Pitié-Salpêtrière).
Child-Computer Interaction project (2008-2010): Exploiting prosody. Hubert-Curien Program (PHC). Exchange of doctoral students and post-doctoral researchers with the LSA laboratory in Budapest (Laboratory of Speech Acoustics).
MULTI-STIM project (2010-2013): intelligent multisensory stimulation systems for children with multiple complex developmental disorder. UPMC Emergence program.
Scientific coordinator for the laboratory
European COST Action 2102 (2006-2011): Cross-Modal Analysis of Verbal and Non-Verbal Communication. Action coordinated by Prof. Anna Esposito (Italy).
ROBADOM project (2009-2012): impact of a home "butler" robot on the psycho-affective and cognitive state of elderly people with mild cognitive impairment. Project coordinated by Prof. Anne-Sophie Rigaud (Hôpital Broca). ANR Technologies de la Santé program.
FUI PRAMAD2 project (2011-2014): Plateforme Robotique d'Assistance et de Maintien à Domicile. FUI 11 program (Fonds unique interministériel). Project coordinated by Orange, starting 1 October 2011.
FP7 ICT-2011-7 MICHELANGELO project (2011-2014): Patient-centric model for remote management, treatment and rehabilitation of autistic children, ICT for Health, Ageing Well. Project coordinated by FIMI (Italy), starting 1 October 2011.
Participation in research projects
European COST Action 277 (2001-2005): Non-Linear Speech Processing. Action coordinated by Prof. Marcos Faundez-Zanuy.
MIRAS project (2008-2012): Multimodal Interactive Robot for Assistance in Strolling. Project coordinated by ROBOSOFT. ANR Technologies de la santé program.
Supervision activities
Holder of the PIR (Prime d'investissement en recherche) since 1 October 2009
Defended theses (3):
F. Ringeval, "Ancrages et modèles dynamiques de la prosodie : application à la reconnaissance des émotions actées et spontanées", allocataire-moniteur, thesis started in October 2006 and defended on 4 April 2011, 90% co-supervision with Jean-Luc Zarader. Reviewers: Hervé Glotin and Yannis Stylianou. Examiners: Olivier Adam, Bjoern Schuller, David Cohen, Jean-Luc Zarader. Currently a post-doctoral researcher in the DIVA group (Document, Image and Voice Analysis), University of Fribourg (Switzerland).
A. Mahdhaoui, "Analyse de signaux sociaux pour la modélisation de l'interaction face à face", allocation from the university president's quota, thesis started in October 2007 and defended on 14 December 2010, 90% co-supervision with Jean-Luc Zarader. Reviewers: Laurent Besacier and Alessandro Vinciarelli. Examiners: Maurice Milgram, Jean-Claude Martin, David Cohen, Jean-Luc Zarader. Currently a post-doctoral researcher at Orange Labs, Grenoble (France).
C. Saint-Georges, Doctor of Psychiatry, "Dynamique, synchronie, réciprocité et mamanais dans les interactions des bébés autistes à travers les films familiaux", doctoral thesis in science, started in October 2007 and defended on 30 September 2011; 40% co-supervision with David Cohen. Reviewers: Nicolas Georgieff, Colwyn Trevarthen. Examiners: Marie-Christine Laznik, Philippe Mazet, Filippo Muratori, Jacqueline Nadel, David Cohen.
Ongoing thesis co-supervisions (6):
C. Zong, "Caractérisation et modélisation de signaux physiologiques pour l'interaction homme-robot", ANR MIRAS, thesis started in December 2008, defense planned for January 2012 (maternity leave), 30% co-supervision with Xavier Clady (60%) and Philippe Bidaud.
C. Granata, "Interaction multi-modale pour la robotique d'assistance", CIFRE thesis with ROBOSOFT, started in October 2008, defense planned for January 2012, 30% co-supervision with Xavier Clady (60%) and Philippe Bidaud.
E. Delaherche, "Modélisation de la dynamique de l'interaction centrée humain : application à l'autisme", MULTI-STIM project, thesis started in October 2010, 90% co-supervision with Philippe Bidaud.
J. Le Maître, "Traitement de signaux sociaux pour l'interaction homme-robot", ROBADOM project, thesis started in October 2010, 90% co-supervision with Philippe Bidaud.
A. Parnandi, "Machine learning and social signal processing for human-robot interaction", MICHELANGELO project, thesis started in October 2011, 90% co-supervision with Philippe Bidaud.
L. Cohen, "Interactions multimodales avec une scène de manipulation virtuelle et/ou réelle", collaboration with ISIR's MAP group, thesis started in October 2011, 40% co-supervision with Stéphane Régnier (30%) and Sinan Haliyo.
Engineer supervision:
N. Melchior, research engineer in human-robot interaction, PRAMAD2 project, from 1 October 2011.
Supervision of dissertations and interns:
- Co-supervision (30%) of a speech therapy dissertation (2009): "Caractéristiques prosodiques des enfants et adolescents, autistes, dysharmoniques, dysphasiques et sans pathologie". Directed by David Cohen, co-supervised by Laurence Robel (Hôpital Necker-Enfants Malades), Mohamed Chetouani, Monique Plaza and Dominique Chauvin (Hôpital de la Pitié-Salpêtrière).
- 17 DEA/Master 2 and engineering school interns (4 to 6 months).
- 6 Master 1 and 2 interns, engineering school projects (1 to 2 months full time).
Publications
Type of publication (number): venues or titles
International journals (12): Pattern Recognition, IEEE Trans. on Audio Speech and Language Processing, Speech Communication, PLoS ONE, Cognitive Computation, Research in Autism Spectrum Disorders, International Journal of Methods in Psychiatric Research.
National journals (1): Traitement du signal.
Edited volumes (2): special issue of Speech Communication, Springer LNAI.
Book chapters (7): "Understanding Parent-Infant Behaviors Using Nonnegative Matrix Factorization", "Automatic Motherese Detection for Face-to-Face Interaction Analysis", "Maximising audiovisual correlation with automatic lip tracking and vowel based segmentation", "Exploiting a vowel based approach for acted emotion recognition", "Nonlinear predictive models: Overview and possibilities in speaker recognition", "Nonlinear speech enhancement: An introductory overview", "Non-linear speech feature extraction for phoneme classification and speaker recognition".
International conferences (49): ICPR, ICASSP, MLSP, ICRA, RO-MAN, ACM Multimedia, ICMI, ICORR, ICANN, IJCNN ...
Invited talks (5): training schools, NOLISP, IEEE Technical meeting.
National conferences and miscellaneous (9): JEP, JNRR, RFIA, GRETSI, RJCP ...
Patent (1): patent application No. 10 54317 of 2 June 2010.
Teaching activities
ATER, then Maître de Conférences, assigned to the UFR d'Ingénierie of Université Pierre et Marie Curie, for a teaching load since 2004 of 1461 hours (TD equivalent).
Teaching assistant (moniteur) assigned to the UFR Sciences et Technologies of Université Paris-Est Créteil Val de Marne (formerly Paris 12), for 192 TD hours from 2001 to 2004.
I have taught various courses covering signal processing, continuous and discrete signals and systems, automatic control, pattern recognition, machine learning, connectionist methods, industrial computing, computer architecture, industrial local area networks, and so on. The teaching load is broken down in the tables below.
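By type of teaching:
Lectures: 788 hours (TD equivalent)
Tutorials (TD): 271 hours (TD equivalent)
Lab sessions (TP): 732 hours (TD equivalent)
By level:
1st cycle: 763 hours (TD equivalent)
2nd cycle: 479 hours (TD equivalent)
3rd cycle: 548 hours (TD equivalent)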
Teaching responsibilities:
In charge of several course units (UEs): Continuous and discrete signals and systems (L3, 60 to 100 students), Interaction and verbal communication (M2, 15-20 students), Pattern recognition (M2, 15 students), Signal analysis and coding (M2, 10 students), Physics for speech therapists (150 students).
Teaching commitment:
Since my recruitment, I have taken charge of courses related to signal processing and pattern recognition. My teaching combines an engineering-training approach with a training-through-research approach. For example, in the pattern recognition courses, I have students study state-of-the-art algorithms in a realistic application context (e.g. face detection). For undergraduate (licence) courses, my approach is to group theoretical and practical teaching within the same week whenever possible. The signals and systems lecture thus takes place at the beginning of the week, followed by tutorials and lab sessions on the following days. This cycle is completed by a review of the concepts the following week.
Adapting to the students' level of competence is one element of my teaching. Since 2008, I have been in charge of the physics course for first-year speech therapy students. The diversity of these students and, for some, their complete unfamiliarity with mathematical concepts call for teaching based on audiovisual demonstration of phenomena (e.g. applets, sound synthesis for Fourier analysis).
Starting in 2009, I designed a new course on social interaction for the master's program in engineering sciences. This course comes in two versions: French-speaking (Master and engineering school) and English-speaking (International Master). The topics covered are the analysis, prediction and synthesis of social signals, with a component specific to interactive robotics. Project-based teaching places students in a collaborative learning setting. The projects consist in developing and implementing an interactive scenario proposed by the students themselves. Figure 1 shows an example project: speaker and word recognition for human-robot interaction. The success of interactive robotics projects among students is explained by the integrative skills they require: signal and image processing, synthesis and control of high-level behaviors (e.g. URBI), and performance evaluation. Students produce a summary report, a video of the interaction and a presentation.
Fig. 1: Example of an interactive robotics project
Summary of scientific, collective and administrative responsibilities
Scientific head (since 2008) of the laboratory's Perception Artificielle et Handicap group (7 permanent staff)
Elected member of the ISIR laboratory council
Elected full member (since 2006) of the commission de spécialistes, then of the selection committee, for section 61 at Université Pierre et Marie Curie.
Appointed external member of the section 61 selection committees of Université Paris-Est Créteil Val de Marne (since 2009) and Université d'Evry Val-d'Essonne (since 2010).
Co-leader of working group GT5 "Interactions personnes / systèmes robotiques" of the GDR Robotique (CNRS), with Rachid Alami (LAAS).
Research leadership and outreach
International exchanges:
Visiting researcher at the University of Stirling (Scotland), Department of Computing Science & Mathematics (Prof. Amir Hussain), holder of the "Faculty Research Grant 2005" (University of Stirling), April-May 2005.
Visiting researcher at the Polytechnic University of Mataro (Barcelona), Signal Processing Group (Prof. Marcos Faundez-Zanuy), funded by the European COST Action 277 (Non-Linear Speech Processing), short-term mission grant, July 2005.
Participation in the European COST Actions 277 (Non-linear Speech Processing) and 2102 (Cross-modal Analysis of Verbal and Non-Verbal Communication).
Leader of the project "Multimodal communication with robots and virtual agents" at the eNTERFACE'08 workshop/summer school. Co-leaders: Thierry Dutoit, Catherine Pelachaud and Jean-Claude Martin.
Collaboration with the University of Pisa: access to the largest database of home movies. Developmental analysis of parent-child interaction (autism, intellectual disability, typical development).
Organization of symposia, conferences and study days:
Organizer and program committee chair of the NOLISP'07 workshop: "ISCA Tutorial and Research Workshop on Non-Linear Speech Processing", 22-25 May 2007, Paris.
Organizer of the project "Multimodal communication with robots and virtual agents" at the eNTERFACE'08 workshop/summer school: 6 European students + 3 permanent staff.
Co-organizer of the ANNPR'08 workshop: "International Workshop on Artificial Neural Networks in Pattern Recognition", 2-4 July 2008 (Paris).
Public exhibition (14-16 November 2008): La Ville Européenne des Sciences, stand "Le Palais des robots. Robot mon ami".
Co-organizer, with Prof. David Cohen, of the PIRSTEC foresight workshop (Prospective interdisciplinaire en réseau pour les sciences et technologies cognitives): "Autisme et Prosodie : Quelles implications possibles ?", 2 October 2009, Hôpital de la Pitié-Salpêtrière.
Organizer of the workshop on Learning for Human-Robot Interaction Modeling, RSS 2010 (Robotics: Science and Systems), Zaragoza, Spain, 27 June 2010.
Organizer of several study days for GT5 "Interactions Personnes / Systèmes Robotiques" of the GDR Robotique and of special sessions for the JNRR: Cognitive interaction (2009), Cognitive robotics (2011).
Co-organizer, with Rachid Alami (LAAS), of the first edition of the Journées Nationales de la Robotique Interactive (JNRI 2011) in Paris.
Organization of a symposium on automatic social signal processing and interactive robotics in the field of psychiatry at the international congress IACAPAP 2012 (International Association for Child and Adolescent Psychiatry and Allied Professions).
Expert reviewing:
Expert for the ANR (on average two proposals per year since 2008 for the ANR Blanc program and for the thematic ANR programs CONTINT and TecSan).
Expert for the Natural Sciences and Engineering Research Council (2010).
External expert appointed by the University of Stirling (Scotland) for the evaluation of projects and candidates in 2009.
Member of the scientific committee of the PIR Longévité et Vieillissement program (interdisciplinary program) since 2010.
Editorial responsibilities:
Founding member of the editorial board of the journal Cognitive Computation (Springer). Accepted for indexing by ISI Thomson Reuters.
Editor of the Speech Communication special issue "Special Issue on Non-Linear and Non-Conventional Speech Processing", published in 2009.
Associate editor for the RO-MAN 2011 conference.
Member of the scientific committee of the following conferences: ICANN'05, WNSP'05 (Workshop on Non-Linear Speech Processing), ICEIS'06 (IEEE International Conference on Engineering in Intelligent Systems), NOLISP'07, NNAM'07 (International Conference on Neural Networks and Associative Memories), International Workshop on Verbal and Nonverbal Communication Behaviours (2007), ICPR'08, ICPR'10 (International Conference on Pattern Recognition), WCCI'08 (World Congress on Computational Intelligence), COST 2102 Training schools, JNRR, NOLISP 2011.
Session chair: NOLISP'05 (Speech Enhancement), ICANN'05 (Non-Linear Predictive Models For Speech Processing, Sound and Speech Recognition), NOLISP'09 (Non-conventional features).
Reviewer for international journals: IEEE Transactions on Audio and Speech Processing (2 articles), Speech Communication (Special Issue on Non-Conventional and Non-Linear Speech Processing), Neurocomputing (4 articles), Pattern Recognition (10 articles), Journal of Acoustical Society of America (1 article).
Reviewer for numerous conferences: NOLISP, WNSP, ICNSC, ICEIS, ICANN, ISNN, EUSIPCO, WCCI, ICPR, HRI, RO-MAN, ICRA.
Reviewer of books and chapters for Springer.
Participation in a thesis jury (outside UPMC):
Mr. Sébastien SAINT-AIME, "Conception et réalisation d'un robot compagnon expressif basé sur un modèle calculatoire des émotions", defended on 9 July 2010. Jury: Prof. Jacques Tisseau (President), Prof. Pascal Estraillier (Reviewer), Dr. Rachid Alami (Reviewer), Prof. Dominique Duhaut, Dr. Brigitte Le-Pévédic.
Research activities
Description of activities
My work concerns the analysis, characterisation, recognition and modelling of social signals and behaviours. The richness and complexity of communication signals and behaviours call for characterisations and models that are non-linear, adaptive and contextualised (person, environment, task, cognitive/affective state, etc.). Multimodal signal processing and statistical learning techniques provide relevant solutions for characterising interactions. The targeted application fields are interactive robotics, assistance to people with impairments, and modelling and objective assessment in cognitive science, particularly for pathologies (autism, Alzheimer's disease, mild cognitive impairment).
The automatic analysis of social signals in connection with psychology is an emerging research field called Social Signal Processing. Building on close collaborations with psychiatrists and psychologists of the Child and Adolescent Psychiatry department of the Pitié-Salpêtrière hospital, I proposed a specific area devoted to the processing of atypical social signals. This line of research is reflected in several concrete elements:
Co-supervision of PhD candidates and students in psychology (C. Saint-Georges, J. Demouy) and in engineering (A. Mahdhaoui, F. Ringeval, E. Delaherche), working in closely collaborating pairs.
Accepted request for researchers from the child and adolescent psychiatry department to join ISIR: Prof. D. Cohen (PU-PH), Dr. M. Plaza (CR CNRS), Dr. Chaby (MCF Paris 5).
Setting up of an experimental room and research premises within the clinical department.
Organisation of several multidisciplinary workshops (Pirstec, IACAPAP 2012).
Research projects: "Emotion, Prosodie et Autisme", Mamanais, Multi-STIM, together with outreach to and collaborations with other hospital teams (Broca for ROBADOM).
Visits and seminars by foreign researchers: Prof. Anna Esposito (head of the European COST 2102 action on cross-modal analysis of verbal and non-verbal communication) for 2 months, and Dr. Alessandro Vinciarelli (head of the SSPNet network of excellence, Social Signal Processing), both as reviewers of Ammar Mahdhaoui's thesis, along with a visit to the research team.
Invitations to summer schools (see publications).
Membership of the newly created SSPNet association (Social Signal Processing Association).
Industrial transfer (FUI PRAMAD2).
Proposal to create the IMI2S group (Intégration Multi-modale, Interaction et Signal Social) within ISIR during the restructuring of the laboratory.
Within this framework, my contributions fall into three areas:
Analysis and characterisation of social signals: contributions to the detection, characterisation and recognition of the interlocutor's state (identity, cognitive/affective/pathological state); non-linear and non-Gaussian models for speech signals; acoustic anchors of affective/cognitive states; information fusion and supervised, unsupervised and semi-supervised learning for event recognition.
Learning for the modelling of interactive behaviours: characterisation and recognition of signals and behaviours of regulation, engagement and synchrony in an interaction; short- and long-term modelling of behaviours; characterisation of context.
Cognitive, interactive and social robotics: engineering/cognitive science interface; social intelligence; assistance to people with impairments; robotic platform for stimulating behaviours in people with cognitive disorders.
Fig. 2. Illustration of research on characterisation: (a) dynamics of acoustic anchors for the analysis and recognition of emotions; (b) multi-view semi-supervised learning of the co-training type for reinforcing prediction/classification rules for social signals.
[Figure panels not reproduced. They show pages excerpted from the supervised theses: rhythm measures from conventional and non-conventional models together with Plutchik's wheel of emotions [PLU80], and the architecture of the co-learning system with data fusion (Fig. 3.5 of the source thesis).]
Fig. 3. Illustration of research on modelling: (a) unsupervised, multimodal model for assessing the level of synchrony in a dyad; (b) model of mother-infant interactive behaviours (third semester of life of an autistic child), based on hidden Markov model analysis and non-negative matrix factorisation of family home movies.
[Figure panels not reproduced. Panel labels: multimodal interactions; interaction modelling (n-gram); Markov diagrams; tf-idf codification; non-negative matrix factorisation; statistical analysis (GLMM, Generalized Linear Mixed Model); normalized mutual information. Embedded caption: "Figure 2. Analysis of parent-infant interaction: general principles".]
Fig. 4. Illustration of research on interactive robotics: (a) robotic platform for multimodal dialogue and multi-sensory stimulation (MULTI-STIM); (b) storytelling experiment: robot and virtual agent using, in real time, only non-verbal signals (prosody, acoustic prominence, head movements).
These contributions are carried out from a developmental perspective (from the young child to the elderly person). They also involve a substantial experimental component. We have thus built up expertise in collecting and analysing realistic databases:
Prosody and autism (dissertation of J. Demouy, thesis of F. Ringeval): 38 children with a pathology (autism, pervasive developmental disorder not otherwise specified, dysphasia) + 70 typically developing children, each taking part in tasks designed to assess their prosodic characteristics and emotional states.
Synchrony (thesis of E. Delaherche): 14 children with a pathology (autism) + 30 typically developing children collaborating with a therapist in a naturalistic task (e.g. a puzzle).
Social engagement (thesis of J. Le Maître): 8 patients with mild cognitive impairment (elderly people aged 70 to 95) in interactive robotics tasks (Wizard of Oz, automatic detection systems).
Assistance to people with impairments (thesis of C. Granata): 30 elderly people (>70 years old) using the multimodal dialogue system.
Entertainment robotics (exhibition at the Grand Palais): >100 people interacting with the EmotiROB robot (developed by Valoria), which used an automatic prosody analysis system developed under the direction of ISIR. A preliminary version was developed as open source within the eNTERFACE'08 project.
eNTERFACE08_STEAD database (Story TElling Audio-visual Database): 22 sessions (dyads) of interaction in 5 different languages (English, French, Slovak, Arabic), including interactions with a conversational agent and a robot, freely distributed within the eNTERFACE'08 project.
List of publications
The names of people who worked under my (co-)supervision are underlined.
Edited volumes:
1. Chetouani M., Hussain A., Gas B., Milgram M., Zarader J.-L. (editors). Advances in Nonlinear
Speech Processing, LNAI 4885, Springer Verlag, ISBN: 978-3-540-77346-7.
2. Chetouani M., Faundez-Zanuy M., Hussain A., Gas B., Zarader J.L., Paliwal, K. (2009). Guest
Editorial : Special issue on non-linear and non-conventional speech processing. Speech Communication. Page 713, 2009.
International journals:
1. Gas B., Zarader J.L., Chavy C., Chetouani M.- Discriminant neural predictive coding applied
to phoneme recognition. Neurocomputing. Vol 56 pages 141-166, 2004.
2. Monte-Moreno E., Chetouani M., Faundez-Zanuy M., Sole-Casals, J. - Maximum Likelihood
Linear Programming Data Fusion for Speaker Recognition. Speech Communication. Vol 51 No 9
pages 820-830, 2009.
3. Chetouani M., Faundez-Zanuy M., Gas B., Zarader J.L. - Investigation on LP-Residual Representations For Speaker Identification. Pattern Recognition. Vol 42 No 3 pages 487-494, 2009.
4. Charbuillet C., Gas B., Chetouani M., Zarader J.-L. - Optimizing feature complementarity by evolution strategy: Application to automatic speaker verification. Speech Communication. Vol 51 No 9 pages 724-731, 2009.
5. Chetouani M., Mahdhaoui A., Ringeval F.- Time-scale feature extractions for emotional speech
characterization. Cognitive Computation, Springer, publisher. Vol 1 No 2 pages 194-201, 2009.
6. Wu Y-H., Faucounau V., Granata C., Boespflug S., Riguet M., Pino M., Chetouani M., Rigaud A.S. - Personal service robot for the elderly in home: A preliminary experiment of human-robot interaction. Gerontechnology. Vol 9 No 2 page 260, 2010.
7. Saint-Georges C., Cassel R.S., Cohen D., Chetouani M., Laznik M-C., Maestro S., Muratori F.
- What studies of family home movies can teach us about autistic infants : A literature review.
Research in Autism Spectrum Disorders. Vol 4 No 3, pages 355-366, 2010.
8. Mahdhaoui A., Chetouani M., Cassel R.S., Saint-Georges C., Parlato E., Laznik M.C., Apicella F., Muratori F., Maestro S., Cohen D. - Computerized home video detection for motherese may help to study impaired interaction between infants who become autistic and their parents. International Journal of Methods in Psychiatric Research, Vol. 20, Issue 1, pages e6-e18, 2011.
9. Ringeval F., Demouy J., Szaszak G., Chetouani M., Robel L., Xavier J., Cohen D., Plaza M. - Automatic intonation recognition for the prosodic assessment of language impaired children. IEEE Transactions on Audio, Speech and Language Processing, Vol. 19, No. 5, pages 1328-1342, 2011.
10. Demouy J., Plaza M., Xavier J., Ringeval F., Chetouani M., Périsse D., Chauvin D., Viaux S., Golse B., Cohen D., Robel L. - Differential language markers of pathology in Autism, Pervasive Developmental Disorders Not Otherwise Specified and Specific Language Impairment. Research in Autism Spectrum Disorders, Vol. 5, Issue 4, pages 1402-1412, 2011.
11. Mahdhaoui A., Chetouani M. - Supervised and semi-supervised infant-directed speech classification for parent-infant interaction analysis. Speech Communication, Vol. 53, No. 9, pages 1149-1161, 2011.
12. Saint-Georges C., Mahdhaoui A., Chetouani M., Laznik M.C., Apicella F., Muratori P., Maestro S., Muratori F., Cohen D. - Do parents recognize autistic deviant behavior long before diagnosis? Taking into account interaction using computational methods. PLoS ONE, Vol. 6, No. 7: e22393, 2011.
National journal:
1. Gas B., Chetouani M., Zarader J.L. - Extraction de caractéristiques non linéaire et discriminante : application à la classification de phonèmes. Traitement du signal. Vol 24, 2007.
Book chapters:
1. Chetouani M., Faundez-Zanuy M., Gas B., Zarader J.L. - Non-linear speech feature extraction for phoneme classification and speaker recognition. Nonlinear speech modelling and applications, Springer Verlag, publisher, pages 340-350, 2005.
2. Hussain A., Chetouani M., Squartini S., Bastari A., Piazza F. - Nonlinear speech enhancement :
An introductory overview. Progress in Nonlinear Speech Processing, Springer Verlag, publisher,
pages 217-248, 2007.
3. Faundez-Zanuy M., Chetouani M. - Nonlinear predictive models : Overview and possibilities in
speaker recognition. Progress in Nonlinear Speech Processing, Springer Verlag, publisher, pages
170-189, 2007.
4. Ringeval F., Chetouani M. - Exploiting a vowel based approach for acted emotion recognition.
Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction. Selected papers from COST Action 2102 International Workshop, Springer Verlag, publisher. Vol LNAI 5042
pages 243-254, 2008.
5. Abel A., Hussain A., Nguyen Q., Ringeval F., Chetouani M., Milgram M.- Maximising audiovisual correlation with automatic lip tracking and vowel based segmentation, BioIDMultiComm
2009. Vol LNCS 5707 pages 65-72, 2009.
6. Mahdhaoui A., Chetouani M., Zong C., Cassel R.S., Saint-Georges C., Laznik M-C., Maestro S.,
Apicella F., Muratori F., Cohen D.- Automatic Motherese Detection for Face-to-Face Interaction
Analysis, Multimodal Signals : Cognitive and Algorithmic Issues, Springer Verlag, publisher. Vol
LNAI 5398 pages 248-255, 2009.
7. Mahdhaoui A., Chetouani M. - Understanding Parent-Infant Behaviors Using Non-negative
Matrix Factorization. Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces,
Springer Verlag, publisher. Vol LNCS 6456 pages 436-447, 2011.
Invited talks:
1. Chetouani M., Gas B., Zarader J.L. - Learning Vector Quantization and Neural Predictive Coding for Nonlinear Speech Feature Extraction. EUropean SIgnal Processing COnference 2004 (EUSIPCO'04). Vienna, Austria, 2004.
2. Chetouani M. - Non-linear predictive modelling for future speech processing applications. IEEE UKRI IAS Chapter sponsored Seminar & Technical Meeting. Stirling, 2005.
3. Hussain A., Chetouani M., Squartini S., Bastari A., Piazza F. - Up-to-date Review of Non-Linear Speech Enhancement. NOn LInear Speech Processing (NOLISP 05). Barcelona, Spain, 2005.
4. Chetouani M. - Human-centered multi-modal signal processing. 3rd COST 2102 International Training School on Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues, 2010.
5. Chetouani M. - Statistical methods for the characterization of impaired social interactions. 4th COST 2102 International Training School on Cognitive Behavioural Systems, 2011.
Conference papers with proceedings (international conferences):
1. Gas B., Zarader J.L., Chavy C., Chetouani M. - Discriminant Features Extraction by Predictive Neural Networks. International Conference on Signal, Speech and Image Processing (SSIP), pages 1831-1835, 2001.
2. Chetouani M., Gas B., Zarader J.L. - The modular neural predictive coding architecture. International CONference on Information Processing (ICONIP'02). Singapore, 2002.
3. Chetouani M., Gas B., Zarader J.L. - Extraction de caractéristiques par codage neuro-prédictif. Journées d'Etude sur la Parole (JEP'02). Nancy, France, 2002.
4. Chetouani M., Gas B., Zarader J.L. - Neural predictive coding for speech: the DFE-NPC. European Symposium on Artificial Neural Networks (ESANN02), pages 275-280. Bruges, Belgium, 2002.
5. Chetouani M., Gas B., Zarader J.L., Chavy C. - Discriminative training for neural predictive coding applied to speech features extraction. International Joint Conference on Neural Networks (IJCNN'02). Vol 1 pages 852-857. Honolulu, Hawaii, USA, 2002.
6. Chetouani M., Gas B., Zarader J.L. - Cooperative modular Neural Predictive Coding. Neural Networks for Signal Processing (NNSP'03). Toulouse, France, 2003.
7. Chetouani M., Gas B., Zarader J.L. - Maximization of the modelisation error ratio. Non-Linear Speech Processing (NOLISP03). Le Croisic, France, 2003.
8. Chetouani M., Gas B., Zarader J.L. - Modular neural predictive coding for discriminative feature extraction. International Conference on Speech and Signal Processing (ICASSP'03). Hong Kong, China, 2003.
9. Chetouani M., Faundez-Zanuy M., Gas B., Zarader J.L. - Non-Linear Speech Feature Extraction for Phoneme Classification and Speaker Recognition. International Summer School Neural Nets E.R. Caianiello, IX Course as a Tutorial Workshop on Nonlinear Speech Processing: Algorithms and Analysis. Vietri sul Mare (Salerno), Italy, 2004.
10. Chetouani M., Gas B., Zarader J.L. - Classifieur à prototypes et codage neuro-prédictif pour l'extraction non linéaire de caractéristiques pour la classification de phonèmes. Journées d'études sur la Parole (JEP 04). Rabat, Morocco, 2004.
11. Chetouani M., Faundez-Zanuy M., Gas B., Zarader J.L. - A New Nonlinear Feature Extraction Algorithm for Speaker Verification. International Conference on Spoken Language Processing (ICSLP 04). Jeju Island, Korea, 2004.
12. Chetouani M., Faundez-Zanuy M., Gas B., Zarader J.L. - A New Nonlinear speaker parameterization algorithm for speaker identification. Speaker Odyssey 2004. Toledo, Spain, 2004.
13. Gas B., Chetouani M., Zarader J.L., Charbuillet C. - Predictive Kohonen map for speech features extraction. ICANN'05 (International Conference on Artificial Neural Networks). Warsaw, Poland, 2005.
14. Chetouani M., Hussain A., Gas B., Faundez-Zanuy M. - Non-Linear predictive models for speech processing. ICANN'05 (International Conference on Artificial Neural Networks). Warsaw, Poland, 2005.
15. Gas B., Chetouani M., Zarader J.L., Feiz F. - The predictive self-organizing map: application to speech features extraction. WSOM'05 (Workshop on self-organizing map). Paris, France, 2005.
16. Chetouani M., Hussain A., Gas B., Zarader J.L. - New sub-band processing framework using non-linear predictive models for speech feature extraction. ISCA Tutorial and Research Workshop on NOn LInear Speech Processing (NOLISP 05). Barcelona, Spain, 2005.
17. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - New approach for speech feature extraction based on genetic algorithm. Non LInear Speech Processing Workshop (WNLSP 05). Crete, Greece, 2005.
18. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - Application d'un algorithme génétique à la synthèse d'un prétraitement non linéaire pour la segmentation et le regroupement du locuteur. JEP'06 (Journées d'Etudes sur la Parole). Dinard, France, 2006.
19. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - Filter Bank Design for Speaker Diarization Based on Genetic Algorithms. ICASSP'06 (IEEE International Conference on Acoustics, Speech and Signal Processing). Toulouse, France, 2006.
20. Chetouani M., Hussain A., Gas B., Zarader J.L. - Non-Linear Predictors based on the Functionally Expanded Neural Network for Speech Feature Extraction. ICEIS'06 (IEEE International Conference on Engineering in Intelligent Systems). Islamabad, Pakistan, 2006.
21. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - Multi Filter banks approach for speaker verification based on genetic algorithm. ISCA Tutorial on Nonlinear Speech Processing NOLISP'07. Paris, France, 2007.
22. Chetouani M. - Interaction with autistic infants. International Workshop on Verbal and Nonverbal Communication Behaviours. Vietri sul Mare, Italy, 2007.
23. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - Complementary features for speaker verification based on genetic algorithms. ICASSP'07 (IEEE International Conference on Acoustics, Speech and Signal Processing). Honolulu, Hawaii, USA, 2007.
24. Al Moubayed S., Baklouti M., Chetouani M., Dutoit T., Mahdhaoui A., Martin J-C, Ondas S., Pelachaud C., Urbain J., Yilmaz M. - Multimodal Feedback from Robots and Agents in a Storytelling Experiment. eNTERFACE'08, Proceedings of the 4th International Summer Workshop on Multi-Modal Interfaces, August 4-29, 2008, Paris-Orsay, France. Pages 43-55, 2008.
25. Ringeval F., Sztaho D., Chetouani M., Vicsi K. - Automatic prosodic disorders analysis for
impaired communication children. 1st Workshop on Child, Computer and Interaction - WOCCI,
IEEE International Conference on Multimodal Interfaces, 2008.
26. Ringeval F., Chetouani M. - A vowel based approach for acted emotion recognition. Interspeech
2008. Pages 2763-2766, 2008.
27. Mahdhaoui A., Chetouani M., Zong C. - Motherese Detection Based On Segmental and SupraSegmental Features. IAPR International Conference on Pattern Recognition, ICPR 2008. Tampa
Florida, USA, 2008.
28. Mahdhaoui A., Chetouani M. - Automatic motherese detection for Parent-Infant Interaction.
Speech and face to face communication, workshop dedicated to the memory of Christian Benoit.
Grenoble, France, 2008.
29. Ringeval F., Chetouani M. - Une approche basée voyelle pour la reconnaissance d'émotions
actées. JEP'08 (Journées d'Etudes sur la Parole). Avignon, France, 2008.
30. Dahmani H., Selouani S.A, Chetouani M., Doghmane N. - Prosody Modelling of Speech Aphasia : Case Study of Algerian Patients. International Conference on Information & Communication
Technologies : from Theory to Applications. Damascus, Syria, 2008.
31. Zong C., Chetouani M. - Hilbert-Huang transform based physiological signals analysis for emotion recognition. IEEE Symposium on Signal Processing and Information Technology (ISSPIT'09),
2009.
32. Mahdhaoui A., Ringeval F., Chetouani M. - Emotional speech characterization based on multifeatures fusion for face-to-face interaction. International Conference on Signals, Circuits and Systems (SCS09), 2009.
33. Chetouani M. - Multisensory Signal Processing for Emotion Recognition. Workshop on Current Challenges and Future Perspectives of Emotional Humanoid Robotics, IEEE International Conference on Robotics and Automation (ICRA'09), 2009.
34. Mahdhaoui A., Chetouani M., Cassel R.S., Saint-Georges C., Laznik M-C., Apicella F., Muratori F., Maestro S., Cohen D. - Home video segmentation for motherese may help to detect impaired interaction between infants. Innovative Research In Autism (IRIA2009), 2009.
35. Mahdhaoui A., Chetouani M. - A new approach for motherese detection using a semi-supervised algorithm. IEEE Workshop on Machine Learning for Signal Processing (MLSP'09), 2009.
36. Mahdhaoui A., Chetouani M., Kessous L. - Time-Frequency Features Extraction for Infant Directed Speech Discrimination. ISCA Tutorial and Research Workshop on Non-Linear Speech Processing (NOLISP09), 2009.
37. Ringeval F., Chetouani M. - Hilbert-Huang transform for non-linear characterization of speech
rhythm. ISCA Tutorial and Research Workshop on Non-Linear Speech Processing (NOLISP09),
2009.
38. Al Moubayed S., Baklouti M., Chetouani M., Dutoit, T., Mahdhaoui A., Martin J-C, Ondas S.,
Pelachaud C., Urbain J., Yilmaz M. - Generating Robot/Agent Backchannels During a Storytelling Experiment. IEEE International Conference on Robotics and Automation (ICRA'09), Japan,
2009.
39. Riviello M. T., Chetouani M., Cohen D., Esposito A. - On the perception of emotional voices : A
cross-cultural comparison among American, French and Italian subjects. COST 2102 International
Conference on Analysis of Verbal and Nonverbal Communication and Enactment : The Processing
Issues, 2010.
40. Zong, C., Chetouani M., Tapus A. - Automatic Gait Characterization for a Mobility Assistance
System. International Conference on Control, Automation, Robotics and Vision (ICARCV 2010),
2010.
41. Delaherche E., Chetouani M. - Multimodal coordination : exploring relevant features and measures. Second International Workshop on Social Signal Processing, ACM Multimedia, 2010.
42. Chetouani M., Wu Y., Jost C., Le Pevedic B., Fassert C., Cristancho-Lacroix V., Lassiaille S., Granata C., Tapus A., Duhaut D., Rigaud A.S. - Cognitive Services for Elderly People: The Robadom project. ECCE 2010 Workshop: Robots that Care, European Conference on Cognitive Ergonomics, 2010.
43. Granata C., Chetouani M., Tapus A., Bidaud P., Dupourque V. - Voice and Graphical based
Interfaces for Interaction with a Robot Dedicated to Elderly and People with Cognitive Disorders.
19th IEEE International Symposium in Robot and Human Interactive Communication (Ro-Man
2010), 2010.
44. Mahdhaoui A., Chetouani M. - Emotional Speech Classification Based On Multi View Characterization. IAPR International Conference on Pattern Recognition (ICPR), 2010.
45. Riviello M. T., Chetouani M., Cohen D., Esposito A. - Inferring emotional information from
vocal and visual cues : A cross-cultural comparison. COST 2102 Final Conference in conjunction
with the 4th Training school on Cognitive Behaviourial Systems, 2011.
46. Wu Y-H, Chetouani M., Cristancho-Lacroix V., Le Maître J., Jost C., Le Pevedic B., Duhaut D., Granata C., Rigaud A.S. - ROBADOM: The Impact of a Domestic Robot on Psychological and Cognitive State of the Elderly with Mild Cognitive Impairment. 5th CRI (Companion Robotics Institute) Workshop AAL User-Centric Companion Robotics Experimentoria, Supporting Socio-ethically Intelligent Assistive Technologies Adoption, 2011.
47. Zong C., Clady C., Chetouani M. - An Embedded Human Motion Capture System for An Assistive Walking Robot. International Conference on Rehabilitation Robotics (ICORR), 2011.
48. Delaherche E., Chetouani M. - Characterization of coordination in an imitation task: human evaluation and automatically computable cues. 13th International Conference on Multimodal Interaction (ICMI), 2011.
49. Delaherche E., Chetouani M. - Automatic recognition of coordination level in an imitation task. ACM International Conference on Multimedia, Third International Workshop on Social Signal Processing, 2011.
Communications in national conferences and working groups:
1. Chetouani M., Gas B., Zarader J.L. - Une architecture modulaire pour l'extraction de caractéristiques en reconnaissance de phonèmes. 19ème Colloque du GRETSI. Paris, France, 2003.
2. Chetouani M., Gas B., Zarader J.L. - Stratégies pour l'extraction de caractéristiques en reconnaissance de phonèmes. RJC'03, Réseau de jeunes chercheurs en parole. Grenoble, France, 2003.
3. Chetouani M., Gas B., Zarader J.L. - Coopération entre codeurs neuro-prédictifs pour l'extraction de caractéristiques en reconnaissance de phonèmes. RFIA'04 (Reconnaissance des Formes et Intelligence Artificielle). Toulouse, France, 2004.
4. Gas B., Charbuillet C., Chetouani M., Zarader J.L. - Paramètres NPC pour la segmentation et le regroupement de locuteurs dans un flux audio. Workshop sur l'Evaluation de Systèmes de Transcription enrichie d'Emissions Radiophoniques (ESTER), 2005.
5. Ketchazo C., Chetouani M. - Extraction de caractéristiques dans les signaux de parole pathologique. Journées de Phonétique Clinique. Grenoble, France, 2007.
6. Charbuillet C., Gas B., Chetouani M., Zarader J.L. - Combinaison de codeurs par algorithme génétique : Application à la vérification de locuteur. GRETSI'07. Troyes, France.
7. Dahmani H., Selouani S.A, Chetouani M., Doghmane N. - Ressources linguistiques pour l'assistance aux aphasiques d'une région de l'est algérien. Réseau de jeunes chercheurs en parole, RJCP'07. Paris, France, 2007.
8. Ringeval F., Chetouani M., Zarader J.L. - Analyse et identification automatique des troubles de la parole chez les enfants autistes. Réseau de jeunes chercheurs en parole, RJCP'07. Paris, France.
9. Chetouani M. - Interaction cognitive. Journées Nationales de Recherche en Robotique (JNRR'09), 2009.
Patent:
1. Bidaud Ph., Bouzit M., Chetouani M. - Support d'écran interactif. Patent application No. 10 54317 of 2 June 2010, extended to Europe, North America and Japan.
Submitted publications:
1. Delaherche E., Chetouani M., Mahdhaoui A., Saint-Georges C., Viaux S., Cohen D. - Evaluation of interpersonal synchrony: multidisciplinary approaches - IEEE Transactions on Affective Computing.
2. Le Maître J., Chetouani M. - Self-talk discrimination in Human-Robot Interaction Situations For Engagement Characterization - International Journal of Social Robotics (under revision).
3. Cassel R., Saint-Georges C., Mahdhaoui A., Chetouani M., Laznik M.C., Muratori F., Adrien J.-L., Cohen D. - Course of maternal prosodic incitation (motherese) during early development in autism: an exploratory home movie study - Interaction Studies (under revision).
Pattern recognition
Work initiated during a visit (spring 2005) to the university in Mataró (Marcos Faundez-Zanuy) within the European COST 277 action, Non-Linear Speech Processing.
Pattern Recognition 42 (2009) 487 -- 494
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
Investigation on LP-residual representations for speaker identification
M. Chetouani a,∗, M. Faundez-Zanuy b, B. Gas a, J.L. Zarader a
a Université Pierre et Marie Curie (UPMC), 4 Place Jussieu, 75252 Paris Cedex 05, France
b Escola Universitària Politècnica de Mataró, Barcelona, Spain
ARTICLE INFO
Article history:
Received 9 February 2007
Received in revised form 23 May 2008
Accepted 5 August 2008
Keywords:
Feature extraction
Speaker identification
LP-residue
Non-linear speech processing
ABSTRACT
Feature extraction is an essential and important step for speaker recognition systems. In this paper,
we propose to improve these systems by exploiting both conventional features such as mel frequency
cepstral coding (MFCC), linear predictive cepstral coding (LPCC) and non-conventional ones. The method
exploits information present in the linear predictive (LP) residual signal. The features extracted from
the LP-residue are then combined to the MFCC or the LPCC. We investigate two approaches termed as
temporal and frequential representations. The first one consists of an auto-regressive (AR) modelling of
the signal followed by a cepstral transformation in a similar way to the LPC–LPCC transformation. In order
to take into account the non-linear nature of the speech signals we used two estimation methods based
on second and third-order statistics. They are, respectively, termed as R-SOS-LPCC (residual plus secondorder statistic based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (higher
order). Concerning the frequential approach, we exploit a filter bank method called the power difference
of spectra in sub-band (PDSS) which measures the spectral flatness over the sub-bands. The resulting
features are named R-PDSS. The analysis of these proposed schemes are done over a speaker identification problem with two different databases. The first one is the Gaudi database and contains 49 speakers.
The main interest lies in the controlled acquisition conditions: mismatch between the microphones and
the interval sessions. The second database is the well-known NTIMIT corpus with 630 speakers. The
performances of the features are confirmed over this larger corpus. In addition, we propose to compare
traditional features and residual ones by the fusion of recognizers (feature extractor + classifier). The results show that residual features carry speaker-dependent features and the combination with the LPCC or
the MFCC shows global improvements in terms of robustness under different mismatches. A comparison
between the residual features under the opinion fusion framework gives us useful information about the
potential of both temporal and frequential representations.
© 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.08.008
* Corresponding author: M. Chetouani.
1. Introduction
During the last decades, significant efforts have been made to design efficient features for the improvement of speaker recognition systems. As a result, several features have been proposed. For instance, Jang et al. [1] proposed an approach based on speech signal decomposition using independent component analysis (ICA). It mainly consists of an optimisation of basis functions for statistically independent feature extraction. The resulting features, similar to Gabor wavelets, increase the speaker identification rate by 7.7% compared to the discrete cosine transform (DCT) for a subset of TIMIT. Following the speech production model (i.e. the source–filter model), some authors attempt to extract features known to be speaker-dependent, such as glottal information [2]. Mary et al. [3] used the potential of auto-associative neural networks for capturing short-segment (10–30 ms) and sub-segmental (1–5 ms) features extracted from linear predictive (LP) analysis. This leads to the modelling of not only traditional spectral features but also source and phase information. The results on speaker identification show good performances when these features are combined. Despite these investigations, state-of-the-art systems are mostly based on the mel-frequency cepstral coefficients (MFCC) or the linear predictive cepstral coefficients (LPCC). Indeed, these short-term features have proven their efficiency in terms of performances and are well suited to Gaussian mixture models (GMMs).
In this contribution, we propose to use additional features with
the traditional ones (MFCC and LPCC) for the improvement of recognition rates. These features are based on the LP-residual signal. The
paper investigates different representations for the design of a useful framework for conventional speaker recognition systems. Indeed,
in the case of LPCC based systems, the extraction of LP-residual features does not need too much computation.
Related works on LP-residual analysis are reported in Section 2.
Section 3 presents two different representations based on temporal
and frequential models, respectively. The proposed representations
are tested on two different databases described in Section 4. The
first one is the Gaudi database [4], which allows the performances to be measured under different controlled conditions: interval between the sessions and microphone mismatch. The second one is the well-known NTIMIT corpus, which has been extensively used in speaker recognition even though there is no mismatch between the sessions. Both databases
are used for speaker identification. The results of the experiments
are discussed in Section 5. Finally, we give conclusions and future
plans for the proposed work.
2. Related works and problem
Concerning speech production, it is generally assumed that the signals result from the excitation of the vocal tract. Under the framework of LP analysis, the vocal tract is associated with the filter (linear predictive coding, LPC) and the excitation with the residual signal. LP analysis consists in estimating the LPC coefficients by minimising the prediction error. The predicted sample $\hat{s}(n)$ results from a linear combination of the $p$ past samples [5]:
\[ \hat{s}(n) = -\sum_{k=1}^{p} a_k\, s(n-k) \tag{1} \]
The LPC coefficients $a_k$ are related to the vocal tract and may also partly capture speaker-dependent information. Indeed, features derived from these coefficients, namely the LPCC, are extensively used in speaker recognition tasks. The parameter $p$ (filter order) plays a major role: in speech recognition tasks the best scores are obtained with a 12th order, whereas in speaker recognition the most used order is 16.
Under the traditional LP analysis, the residual is obtained as the error between the current and the predicted samples:
\[ r(n) = s(n) - \hat{s}(n) \tag{2} \]
Theoretically, the residual is uncorrelated with the speech signal and is related to the excitation, which is speaker-dependent. These features are known as source features. However, recent works on non-linear speech processing have shown that the source-filter model is not suitable for modelling speech production [6,7]. Different phenomena that occur during production are non-linear and chaotic. From these investigations on non-linear processing, one can assume that there is a dependency between the speech signal and the residual.
Several investigations have been carried out to use this residual for the improvement of speaker recognition systems [3,8–12]. Thevenaz and Hügli [8] exploit the theoretical orthogonality between the filter (i.e. the LPC coefficients) model and the residue model. Their results confirm the complementary nature of these representations for speaker verification. As mentioned previously, neural networks have also been tested for the characterisation of the LP residual [3]. In Ref. [11], auto-associative neural networks are used for the characterisation of the linear residue. They show that speaker recognition systems can attain efficient rates by using only residual features.
For an efficient design, the methods should take into account the nature of the residual. In the case of the original speech signal, several investigations have been carried out [6,7,13–15]. The different phenomena (turbulence, chaos, etc.) [13] occurring during production, mainly due to physiological reasons, cause the presence of non-linearities in speech signals. These non-linearities have been characterised by statistical tests such as higher-order statistics, and the signal distributions confirm the non-linear and non-Gaussian assumptions [7,16]. Consequently, several representations attempting to model the speech signals have been investigated (for more details see Ref. [17]). As far as temporal models are concerned, we previously proposed to extend the auto-regressive (AR) model used in the LPC analysis (cf. Eq. (1)) by predictive neural networks [18,19].
Given the predicted samples $\hat{s}(n)$, the residual $r$ is obtained by subtracting the predicted signal from the original one (cf. Eq. (2)). The residual should contain all the information that is not modelled by the filter (cf. Eq. (1)). The estimation of the filter coefficients is based on second-order analysis (i.e. covariance, auto-correlation), which cannot model non-Gaussian processes. One can postulate that the residual has to be modelled not only by higher-order statistics but also by second-order statistics, because the estimation itself is imperfect (order p, algorithm, noise, etc.). From these considerations, several ways can be followed to model the residual. Non-linear modelling is one of the solutions used in several applications [11,20,21] due to the non-linear nature of the residual [18,22]. The results show the potential of this approach and confirm the presence of non-linearities. For instance, an interesting work by Thyssen et al. [21] suggests the presence of non-linearities in the residual, since several successive LPC analyses are required to remove all linear information from the residual. However, one has to be careful with this approach because, as noticed by Kubin [7], adaptive methods can lead to nearly Gaussian residual signals. Other solutions can be used, such as the wavelet transform in the wavelet octave coefficients of residues (WOCOR) features [12].
In this contribution, we propose to exploit the fact that the residue conveys all the information that is not modelled by the LPC filter (cf. Eq. (1)). Unlike previously proposed methods, mainly based on machine learning [10,20] or signal processing [12], the approach employed in this paper is based on the combination of temporal (second- and higher-order statistics for AR models) and frequential (filter banks) models. These investigations aim to show the potential of residual speech signal processing for speaker recognition tasks. The features extracted from the residual can be used as complementary ones with the LPCC or even with the MFCC.
3. Proposed representations for the LP-residue
The previous sections have shown the importance of residual signals for speaker recognition tasks. The efficiency of this additional feature depends entirely on a suitable representation. In this contribution, we investigate two different approaches, termed temporal and frequential.
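Both approaches start from the LP-residual of Eqs. (1) and (2). The following is a minimal sketch of that step, assuming the autocorrelation method and a direct solution of the normal equations; the function names `lpc_autocorrelation` and `lp_residual` are illustrative and do not come from the paper.

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """Estimate a_1..a_p with the sign convention of Eq. (1):
    s_hat(n) = -sum_k a_k s(n-k)."""
    s = np.asarray(s, dtype=float)
    # Autocorrelation lags R(0)..R(p) of the analysis frame
    R = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    # Normal equations: sum_k a_k R(|i-k|) = -R(i), for i = 1..p
    T = np.array([[R[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(T, -R[1:])

def lp_residual(s, a):
    """Eq. (2): r(n) = s(n) - s_hat(n) = s(n) + sum_k a_k s(n-k)."""
    s = np.asarray(s, dtype=float)
    A = np.concatenate(([1.0], a))        # inverse filter A(z) = 1 + sum a_k z^-k
    return np.convolve(s, A)[:len(s)]     # samples before n = 0 are taken as zero

# Usage on a toy first-order signal s(n) = 0.9^n
s = 0.9 ** np.arange(200)
a = lpc_autocorrelation(s, p=1)
r = lp_residual(s, a)
```

For this toy signal the predictor is exact, so the residual vanishes after the first sample; for real speech frames the residual keeps whatever the linear filter fails to capture.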
3.1. Temporal approach
The temporal approach is based on an AR model of the LP-residue:
\[ \hat{r}(n) = -\sum_{k=1}^{\gamma} \alpha_k\, r(n-k) \tag{3} \]
where $r$ and $\gamma$ represent the LP-residue and the filter order, respectively. To be efficient for speech applications, cepstral derived features have to be computed. The $\alpha_k$ coefficients are transformed into cepstral ones $c_k$ in a similar manner to the LPC–LPCC transformation.
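The LPC–LPCC transformation mentioned above is usually implemented as a recursion on the AR coefficients. The sketch below assumes the sign convention of Eq. (1), i.e. an all-pole model 1/A(z) with A(z) = 1 + sum_k a_k z^-k; the paper does not spell out the recursion it uses, so this is one standard form rather than the authors' exact code, and `lpc_to_cepstrum` is a hypothetical name.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1/A(z), A(z) = 1 + sum_k a_k z^-k.

    Recursion: c_n = -a_n - (1/n) * sum_{m=1}^{n-1} m * c_m * a_{n-m},
    with a_n taken as 0 for n > p.
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for m in range(1, n):
            if n - m <= p:
                acc += m * c[m - 1] * a[n - m - 1]
        c[n - 1] = -a_n - acc / n
    return c

# First-order check: -ln(1 + a x) = -a x + a^2 x^2 / 2 - a^3 x^3 / 3 + ...
c = lpc_to_cepstrum([-0.5], n_ceps=3)
```

Applied to the AR coefficients estimated on the residual rather than on the speech signal, the same recursion would give residual cepstral features of the kind described in this section.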
For the feature extraction process, two methods are investigated: second- and higher-order statistics. The first one basically consists of an LPC analysis of the residual r followed by a cepstral derivation, resulting in LPCC-equivalent features. The features obtained are denoted R-SOS-LPCC in order to distinguish them from the well-known LPCC features. LP cepstral models of the residue have already been tested on speaker recognition [23], leading to some improvements. In contrast to [23], where the LP analysis of the residual is combined with the MFCC by a linear discriminant analysis, the residual models are considered here as additional features (like the Δ coefficients). The next method is also based on an AR model (Eq. (3)) but with the estimation of higher-order statistics.

Fig. 1. Temporal processing applied to the residual signal r. (Blocks: speech signal, LP analysis, inverse filtering, residual signal, then second-order analysis giving the R-SOS-LPCC coefficients and third-order analysis giving the R-HOS-LPCC coefficients.)

Fig. 2. Principle of the power difference of spectra in sub-band (PDSS) applied to the residual signal r.
The traditional LPC analysis is based on second-order statistics [5], such as the covariance and auto-correlation methods. The LPC coefficients (Eq. (1)) are obtained by solving the Yule–Walker equations [5], defined as a function of the coefficients $a_k$ and the auto-correlation R (i.e. a second-order statistic). A natural extension of this procedure consists of defining equivalent Yule–Walker equations with higher-order statistics, such as third-order or fourth-order moments. In speech recognition, Paliwal et al. [24] applied similar ideas for the estimation of an AR model. They used a constrained third-order cumulant approach, denoted C, resulting in equivalent Yule–Walker equations:
\[ \sum_{k=0}^{p} a_k\, C_k(i, j) = 0 \tag{4} \]
with $1 \le i \le p$, $0 \le j \le i$.
The third-order cumulant of signal s is defined as
\[ C_k(i, j) = \sum_{m=p+1}^{M} s_{m-k}\, s_{m-i}\, s_{m-j} \tag{5} \]
where M is the analysis window size which is equivalent to the one
used for the auto-correlation in the LPC computation.
Following this formulation, a traditional recursion algorithm is used for the estimation of the AR coefficients [24]. Cepstral features, derived by a procedure similar to the LPC–LPCC transformation, are then applied to noisy speech recognition. The results show that at low SNR (20 dB) the cumulant estimation outperforms the auto-correlation one, but this is not the case for higher SNRs.
In this contribution, we use similar models but, rather than applying them to the signal s (Eq. (1)), we apply them to the residual
r (Eq. (3)). They are named R-HOS-LPCC. Fig. 1 represents the temporal analyses compared in this paper.
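To make Eqs. (4) and (5) concrete, here is a small sketch that builds the third-order statistics and solves the resulting over-determined system. It assumes the normalisation a_0 = 1 and replaces the recursion of [24] with an ordinary least-squares solve, which is a simplification chosen for illustration; the function names are hypothetical.

```python
import numpy as np

def third_order_cumulant(s, k, i, j, p):
    """Eq. (5): C_k(i, j) = sum_{m=p+1}^{M} s_{m-k} s_{m-i} s_{m-j}.

    Arrays are 0-based here, so m runs over p..len(s)-1.
    """
    s = np.asarray(s, dtype=float)
    m = np.arange(p, len(s))
    return float(np.sum(s[m - k] * s[m - i] * s[m - j]))

def ar_from_cumulants(s, p):
    """Solve Eq. (4) for a_1..a_p with a_0 = 1 (assumed), in the
    least-squares sense over all pairs 1 <= i <= p, 0 <= j <= i."""
    rows, rhs = [], []
    for i in range(1, p + 1):
        for j in range(0, i + 1):
            rows.append([third_order_cumulant(s, k, i, j, p)
                         for k in range(1, p + 1)])
            rhs.append(-third_order_cumulant(s, 0, i, j, p))
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return a

# Toy check on a noise-free first-order signal s(n) = 0.5^n
s = 0.5 ** np.arange(20)
a = ar_from_cumulants(s, p=1)
```

In the setting of this paper the input would be a frame of the LP-residual r, normalised to [-1, +1], rather than the toy signal used here.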
3.2. Frequential approach
Unlike the previous approach, in this section, we describe frequential processing of the residual signal r (Eq. (2)). This approach
was originally proposed by Hayakawa et al. [25] and was called the
power difference of spectra in sub-band (PDSS). They tested it on a
speaker identification problem. The R-PDSS features gave a rate of
66.9% and the combination with LPCC features gave 99% (99.8% for
the LPCC alone).
The R-PDSS features are obtained by the following steps (cf.
Fig. 2):
• Calculate the LP-residual r.
• Compute the fast Fourier transform of the residual, using zero padding in order to increase the frequency resolution: $S = |\mathrm{fft}(\mathrm{residue})|^2$.
• Group the power spectrum into M sub-bands.
• Calculate the ratio of the geometric to the arithmetic mean of the power spectrum of the ith sub-band and subtract it from 1:
\[ \text{R-PDSS}(i) = 1 - \frac{\left(\prod_{k=L_i}^{H_i} S(k)\right)^{1/N_i}}{\frac{1}{N_i}\sum_{k=L_i}^{H_i} S(k)} \tag{6} \]
where $N_i = H_i - L_i + 1$ is the number of frequency samples in the ith sub-band. $L_i$ and $H_i$ are, respectively, the lower and upper frequency limits of the ith sub-band. The same bandwidth is used for all the sub-bands.
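The steps above can be sketched in a few lines of NumPy. The FFT size (`n_fft=1024`) and the small floor added before the logarithm are assumptions made for this illustration, not values given in the paper; `r_pdss` is a hypothetical function name.

```python
import numpy as np

def r_pdss(residual, n_bands=16, n_fft=1024):
    """Eq. (6): one minus the ratio of geometric to arithmetic mean of the
    power spectrum in each of n_bands equal-width sub-bands."""
    r = np.asarray(residual, dtype=float)
    S = np.abs(np.fft.rfft(r, n=n_fft)) ** 2       # zero-padded power spectrum
    S = S[: (len(S) // n_bands) * n_bands]         # keep a whole number of bins per band
    bands = S.reshape(n_bands, -1) + 1e-12         # floor avoids log(0) (assumption)
    geo = np.exp(np.mean(np.log(bands), axis=1))   # geometric mean per sub-band
    arith = np.mean(bands, axis=1)                 # arithmetic mean per sub-band
    return 1.0 - geo / arith

# A unit impulse has a flat spectrum, so every sub-band value is close to 0
impulse = np.zeros(240)
impulse[0] = 1.0
flat = r_pdss(impulse)

# A random frame gives values between 0 (flat) and 1 (peaky)
rng = np.random.default_rng(0)
feat = r_pdss(rng.standard_normal(240))
```

With `n_bands=16` the output has the same dimension as the 16 coefficients used for the other features in the experiments.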
Cepstrum analysis of the residual has also been investigated in speech recognition [26]: filter bank analysis of the one-sided autocorrelation of the residual r plus a cepstral transformation. The features obtained, named residual cepstrum (RCEP), carry some linguistic information and, in combination with the LPCC, improve the recognition rates. This result and the previous arguments (cf. Section 2) concerning the source-filter model are interesting because they show that linguistic and speaker information are present in both features: LPCC and residual. The rest of this contribution is dedicated to the experiments and the discussion of the proposed features.
4. Experimental conditions
This section describes the corpora used and the different tasks addressed for the evaluation of the proposed feature extraction schemes. These features are compared to the most used methods, namely the MFCC and the LPCC. The dimension of the feature vectors is set to 16 for both the traditional and the residual ones (cf. Section 3: the AR order of the residual model and the number of sub-bands M are both 16).
4.1. Databases
4.1.1. Gaudi
The Gaudi database [4,27] was originally designed in order to
measure the performances under different controlled conditions:
language, interval session and microphone. The corpus is composed
of:
• 49 speakers;
• four sessions with different tasks (isolated numbers, connected numbers, text reading, conversational speech, etc.);
• for each session, the utterances have been acquired in two languages (Catalan and Spanish) and simultaneously with different
microphones as described in Table 1.
In this contribution, the training protocol consists of using one text reading with an average duration of 1 min, taken from session 1 and MIC1. Consequently, training is always done with M1. For the tests, we use nine phonologically balanced utterances (Spanish, 3–5 s each), identical for all the speakers and sessions, under the conditions M1–M6. We focus on the first three sessions with different microphones (cf. Table 2). The number of tests is 49 × 9 = 441 for each condition and the average score is estimated on 49 × 9 × 6 = 2646 tests.
The speech signal has been down-sampled to 8 kHz (producing a telephone bandwidth), pre-emphasised by a first-order filter whose transfer function is $H(z) = 1 - 0.95 z^{-1}$ and normalised to [−1, +1] (for cumulant estimation). A 30 ms Hamming window is used, and the overlap between adjacent frames is 2/3. A parameterised vector of 16th order was computed for each feature extraction method.
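The Gaudi front-end described above (pre-emphasis, amplitude normalisation, 30 ms Hamming frames with 2/3 overlap at 8 kHz) can be sketched as follows. This is an illustrative reading of the text, not the authors' code; the function names and the peak-based normalisation are assumptions.

```python
import numpy as np

def preemphasise(s, coeff=0.95):
    """First-order filter H(z) = 1 - coeff * z^-1."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - coeff * s[:-1])

def normalise(s):
    """Scale to [-1, +1] by the peak absolute value (assumed normalisation)."""
    s = np.asarray(s, dtype=float)
    peak = np.max(np.abs(s))
    return s / peak if peak > 0 else s

def frame_signal(s, fs=8000, win_ms=30.0, overlap=2.0 / 3.0):
    """Split into Hamming-windowed frames of win_ms with the given overlap."""
    s = np.asarray(s, dtype=float)
    win = int(round(fs * win_ms / 1000.0))      # 240 samples at 8 kHz
    hop = int(round(win * (1.0 - overlap)))     # 80 samples for a 2/3 overlap
    n_frames = 1 + (len(s) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return s[idx] * np.hamming(win)

# One second of signal at 8 kHz
x = normalise(preemphasise(np.ones(8000)))
frames = frame_signal(x)
```

Each row of `frames` is then passed to the feature extractor (LPCC, MFCC or one of the residual schemes) to produce one 16-dimensional vector.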
Table 1
The microphones used for the Gaudi database

| Ref. | Microphone | Description |
|------|--------------|-------------|
| MIC1 | SONY ECM 66B | Lapel unidirectional electret (≈ 10 cm from the speaker) |
| MIC2 | AKG D40S | Dynamic cardioid (≈ 30 cm from the speaker) |
| MIC3 | AKG C420 | Head-mounted (low-cost microphone) |

Table 2
Different sessions and microphones

| Ref. | Session | Microphone |
|------|---------|------------|
| M1 | 1 | MIC1 |
| M2 | 1 | MIC2 |
| M3 | 2 | MIC1 |
| M4 | 2 | MIC2 |
| M5 | 3 | MIC1 |
| M6 | 3 | MIC3 |

4.1.2. NTIMIT
The NTIMIT database [28] is a telephone version of the TIMIT corpus including local and long distance calls. The database contains 630 speakers (438 male and 192 female), each of whom has uttered 10 sentences:
• Two different sentences, SA1 and SA2. They are the same across the 630 speakers and have an average duration of 2.9 s.
• Eight sentences that differ across the speakers: three SI (average duration of 2.9 s) and five SX sentences (average duration of 3.2 s).
Contrary to the Gaudi database (cf. Section 4.1.1), NTIMIT contains only single-session recordings with a fixed handset. However, this database has been widely used for speaker recognition applications [29–31]. Despite the absence of session mismatch, results on this database are useful because they can be compared easily. Many training and test protocols have been defined for NTIMIT [29–31]. In this article, we use the protocol called “long training–short test”, initially proposed by Bimbot et al. [29], which consists of:
• “long training”: the five SX sentences are concatenated as a single reference pattern for each speaker. As a result, the “long training” pattern average duration is 14.4 s.
• “short test”: SA and SI sentences are tested separately, resulting in 630 × 5 = 3150 tests (with an average duration of 3.2 s).
The training duration is shorter than the one used for the Gaudi protocol, but there are more test utterances and consequently the obtained results have a higher statistical significance [29].
The speech signal is recorded through a high quality microphone and is sampled at 16 kHz but with a bandwidth of 300–3400 Hz (telephone bandwidth). A 31.5 ms Hamming window is used at a frame rate of 10 ms.
4.2. Speaker identification method
For the evaluation of the feature schemes, we test them on the speaker identification problem using both databases: Gaudi (cf. Section 4.1.1) and NTIMIT (cf. Section 4.1.2).
The speaker models have been designed by a simple second-order statistic method. A covariance matrix (C) is computed for each speaker and the arithmetic-harmonic sphericity measure [32] is applied for comparison:
\[ \mu(C_j, C_{test}) = \log\left(\mathrm{tr}(C_{test} C_j^{-1})\, \mathrm{tr}(C_j C_{test}^{-1})\right) - 2\log(P) \tag{7} \]
where tr is the trace of the matrix and P is the dimension of the feature vector (P = 16). The number of parameters for each speaker model is $(P^2 + P)/2$ (the covariance matrix is symmetric).
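The arithmetic-harmonic sphericity measure of Eq. (7) in Section 4.2 and the resulting nearest-model decision can be sketched as below. The helper names `ahs_distance` and `identify` are hypothetical, and the covariance matrices are assumed to be invertible.

```python
import numpy as np

def ahs_distance(C_j, C_test):
    """Eq. (7): log(tr(C_test C_j^-1) * tr(C_j C_test^-1)) - 2 log(P)."""
    P = C_j.shape[0]
    t1 = np.trace(C_test @ np.linalg.inv(C_j))
    t2 = np.trace(C_j @ np.linalg.inv(C_test))
    return np.log(t1 * t2) - 2.0 * np.log(P)

def identify(models, X_test):
    """models: dict speaker -> covariance matrix; X_test: (frames, P) features.
    Returns the speaker whose model is closest to the test covariance."""
    C_test = np.cov(X_test, rowvar=False)
    return min(models, key=lambda spk: ahs_distance(models[spk], C_test))

# Toy example with P = 2
models = {"a": np.eye(2), "b": np.diag([1.0, 4.0])}
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
winner = identify(models, X)
```

The measure is zero when the two matrices are proportional, so it is insensitive to a global gain difference between training and test features.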
5. Results and discussions
5.1. Mismatch identification
Mismatch conditions due to acquisition or interval sessions seriously decrease the recognition rates of speaker recognition systems. As previously described in the experimental section (cf. Section
4.1.1), the Gaudi database is used for speaker identification under
controlled conditions.
Table 3 presents the speaker identification rates for the different conditions. Baseline results are represented by both the MFCC and the LPCC features. With no mismatch (training and test on M1), the best baseline results are achieved by the MFCC. However, if residual information (R-SOS-LPCC, R-HOS-LPCC or R-PDSS) is added to the LPCC features, improvements are obtained, but the number of features also increases from P (16) to 2 × P (32), resulting in more computation. Looking at the performances of the residual information, temporal and frequential representations (cf. Section 3) alone give non-negligible results: more than 80% correct speaker identification.

Table 3
Correct speaker identification rates for mismatch training (with M1) and test for temporal, frequential and mixed methods

| Type | Feature extraction | M1 | M2 | M3 | M4 | M5 | M6 | Average |
|------|--------------------|----|----|----|----|----|----|---------|
| Temporal | LPCC | 94.78 | 73.7 | 74.60 | 66.213 | 55.33 | 52.15 | 69.46 |
| Temporal | R-SOS-LPCC | 87.98 | 63.72 | 60.32 | 59.18 | 44.45 | 43.99 | 59.94 |
| Temporal | R-HOS-LPCC | 83.45 | 55.33 | 57.14 | 50.79 | 42.40 | 33.10 | 53.70 |
| Temporal | LPCC + R-SOS-LPCC | 97.5 | 81.86 | 79.82 | 71.43 | 56.92 | 62.81 | 75.05 |
| Temporal | LPCC + R-HOS-LPCC | 97.96 | 80.04 | 80.04 | 70.521 | 58.05 | 59.64 | 74.37 |
| Frequential | MFCC | 97.50 | 76.64 | 78.23 | 72.34 | 57.59 | 62.36 | 74.11 |
| Frequential | R-PDSS | 82.09 | 59.86 | 62.36 | 60.99 | 45.35 | 42.18 | 58.80 |
| Mixed | LPCC + R-PDSS | 99.77 | 82.54 | 85.26 | 83.22 | 66.43 | 67.35 | 80.76 |
Concerning the mismatch conditions, as it can be expected, for
all the features the identification rates decrease. However, the loss
of performances differs according to the mismatch: interval session
and/or microphone (cf. Section 4.1.1). The impact of the acquisition
is more important than the interval session impact. When the microphone changes for the same session, for instance M2, the performances are degraded and the rates are more or less equivalent
to the interval session mismatch with the same microphone M3
(cf. Table 3). For conventional features, MFCC features give the best
results for these different conditions. The speaker-dependent information contained in the residual is also non-negligible, even if the
conditions differ seriously. Moreover, when the residual features are
added to the LPCC as complementary features, the robustness under the different mismatches is clearly improved resulting in better
identification rates than the LPCC or the MFCC alone.
The tests carried out for the M5 and M6 mismatches (long interval, microphone, cf. Section 4.1.1) are mostly equivalent for all the features (cf. Table 3). These tests are interesting because they give information about the robustness of the features for real applications.
However, for all the features except MFCC, the performance slightly
decreases. Once again, the robustness is improved by using residual information and the conventional LPCC resulting in at least
equivalent MFCC results or even better.
In order to compare the performances of these features under the different mismatches, we compute the average speaker identification rate of each feature over the conditions M1–M6 (last column of Table 3). For conventional features, the best results are achieved by the MFCC. For the residual information, the performances of the R-SOS-LPCC and R-PDSS are mostly equivalent and are better than those of the higher-order statistic based features, namely the R-HOS-LPCC. As previously mentioned (cf. Section 2), after an LPC analysis, linear information is still present in the LP-residue.
Concerning the additional features, the LPCC plus the residual information improves the recognition rates. The average performances show that the temporal methods are mostly equivalent. This means that linear and non-linear information, modelled by the R-SOS-LPCC and the R-HOS-LPCC, respectively, carry speaker-dependent information and are complementary to the LPCC. This can be explained by two main remarks:
• Due to the imperfect LPC analysis, the LP-residue still carries Gaussian information, which is modelled by the R-SOS-LPCC features (cf. Section 2).
• The R-HOS-LPCC model can represent non-Gaussian distributions, but it is limited by the fact that it is only a third-order based model (cf. Section 3).
Table 4
Correct speaker identification rates for the NTIMIT database

| MFCC | LPCC | R-SOS-LPCC | R-HOS-LPCC | R-PDSS |
|------|------|------------|------------|--------|
| 27.3 | 24.6 | 8.22 | 5.08 | 8.73 |
In order to overcome these limitations, non-linear models have
been directly applied to the speech signal such as predictive neural
networks [9,19] resulting in improvements of the speaker identification rates. Those models are inspired by the LPC analysis since
they are a direct extension of them. For instance in the neural
predictive coding (NPC) scheme [19], the neural weights are used
as features. Furthermore, this model can be initialised by the LPC
analysis.
5.2. Large database
The previous Gaudi database shows that residual information carries speaker-dependent information and it is true for all types of
models (temporal or frequential). In this section, we propose to confirm these results by doing training and test on a larger database
such as the NTIMIT (cf. Section 4.1.2).
Table 4 presents the speaker identification rates for the whole
NTIMIT database (630 speakers) with respect to the feature extraction methods. Baseline results (MFCC and LPCC) are the best ones
and are more-or-less equivalent, for the same “long training–short
test”, to the results obtained in Ref. [29]. One can notice that with a
different protocol or classifier (i.e. GMMs, support vector machines),
better results can be expected as noted in Refs. [29–31].
The results of the residual models for the whole NTIMIT database confirm the presence of speaker-dependent information but, as can be expected, they are worse than those of the traditional features (MFCC and LPCC). Concerning the temporal models, the linear model R-SOS-LPCC gives the best results. We previously noticed a similar behaviour, which can be explained by the imperfect LPC analysis and by the limits of the non-Gaussian model based on third-order statistics (cf. Section 5.1). The speaker identification rates given by the R-PDSS method are higher than those of the temporal representations.
For the Gaudi database (cf. Section 5.1), we show that residual
models can be used as complementary features for a global improvement of the recognition rates. Rather than doing that, we propose,
in the next section, the fusion of these features in order to evaluate
this complementarity.
5.3. Opinion fusion
Information fusion is an important and effective stage for global improvements of the recognition rates. In this subsection, our purpose is to evaluate and to compare the features. We combine the outputs of the recognizers (i.e. covariance matrix, cf. Section 4.2) for all the features (i.e. conventional and non-conventional ones). This scheme is known as opinion fusion [33,34].

Table 5
Experimental results for different combinations (temporal)

| Type | Feature extraction | LPCC | R-SOS-LPCC | R-HOS-LPCC |
|------|--------------------|------|------------|------------|
| Temporal | R-SOS-LPCC | 28.06 | | |
| Temporal | R-HOS-LPCC | 26.19 | 12.54 | |
| Frequential | R-PDSS | 28.09 | 14.92 | 12.06 |
| Frequential | MFCC | 31.75 | 34.98 | 33.33 |

Table 6
Experimental results for different combinations (frequential)

| Type | Feature extraction | MFCC | R-PDSS |
|------|--------------------|------|--------|
| Temporal | LPCC | 31.75 | 28.09 |
| Temporal | R-SOS-LPCC | 34.98 | 14.92 |
| Temporal | R-HOS-LPCC | 33.33 | 12.06 |
| Frequential | R-PDSS | 33.33 | |

Table 7
Selected combination factor for the results shown in Tables 5 and 6

| Recognizer 1 | LPCC | R-SOS-LPCC | R-HOS-LPCC | R-PDSS |
|--------------|------|------------|------------|--------|
| MFCC | 0.91 | 0.83 | 0.72 | 0.66 |
| LPCC | | 0.57 | 0.58 | 0.51 |
| R-SOS-LPCC | | | 0.28 | 0.29 |
| R-HOS-LPCC | | | | 0.46 |

The indicated factors give the best scores (following Eq. (9)).
The opinion fusion procedure mainly consists of the following steps:
(1) Distance normalisation [35]:
\[ o'_i = \frac{1}{1 + e^{-k_i}} \tag{8} \]
with $k_i = \dfrac{o_i - (m_i - 2\sigma_i)}{2\sigma_i}$, where $o_i$ is the opinion of classifier $i$, $o'_i \in [0, 1]$ is the normalised opinion, and $m_i$, $\sigma_i$ are the mean and the standard deviation of the opinions of classifier $i$ using the genuine speakers (intra-distances).
(2) Weighted sum combination with trained rule [34,35]:
\[ O = \lambda\, o_1 + (1 - \lambda)\, o_2 \tag{9} \]
where $o_1$, $o_2$ are the scores (distances) provided by each classifier and $\lambda$ is a weighting or combination factor. A high value of $\lambda$ implies a high importance of recognizer 1 (feature extractor plus classifier).
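The two fusion steps can be sketched as follows. The grid search used to pick the combination factor is an assumption about what "trained rule" means in practice; the paper only states that the reported factors give the best scores. All function names are illustrative.

```python
import numpy as np

def normalise_opinion(o, m, sigma):
    """Eq. (8): sigmoid normalisation of a distance o, given the mean m and
    standard deviation sigma of the genuine (intra-speaker) distances."""
    k = (o - (m - 2.0 * sigma)) / (2.0 * sigma)
    return 1.0 / (1.0 + np.exp(-k))

def fuse(o1, o2, lam):
    """Eq. (9): weighted sum of two normalised opinions."""
    return lam * o1 + (1.0 - lam) * o2

def best_factor(d1, d2, labels, grid=np.linspace(0.0, 1.0, 101)):
    """d1, d2: (tests, speakers) normalised distances from two recognizers;
    labels: true speaker index per test. Returns (factor, identification rate)
    for the factor on the grid that maximises the rate (assumed training rule)."""
    rates = [np.mean(np.argmin(fuse(d1, d2, lam), axis=1) == labels)
             for lam in grid]
    best = int(np.argmax(rates))
    return grid[best], rates[best]

# Toy example: recognizer 1 is right on both tests, recognizer 2 is wrong
d1 = np.array([[0.1, 0.9], [0.8, 0.2]])
d2 = np.array([[0.9, 0.1], [0.1, 0.9]])
lam, rate = best_factor(d1, d2, labels=np.array([0, 1]))
```

Since the opinions are distances, the fused decision keeps the speaker with the smallest combined value; a factor close to 1 means the decision relies mostly on recognizer 1.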
The fusion scores with the different features are presented in
Tables 5 and 6 and the scores without fusion have been reported in
Table 4. One could expect the fusion of the best individual features, the MFCC and the LPCC, to give the best results. But, in Tables 5 and 6, the best scores are obtained by the MFCC/R-SOS-LPCC couple and, moreover, the fusion of the MFCC with any of the residual features is better than the MFCC–LPCC fusion. This result shows that the combination of the MFCC and residual features is efficient for a global improvement. The combination factor gives useful information about the contribution of each method (cf. Table 7). Although the MFCC/R-SOS-LPCC couple gives the best scores, the MFCC contribution in that couple (combination factor 0.83) is higher than in the couples with the R-HOS-LPCC (0.72) and the R-PDSS (0.66). One can also notice that the robustness of the LPCC (cf. Table 4) is clearly improved by the proposed schemes (cf. Tables 5 and 6). Regarding the combination factors (cf. Table 7), the contributions of the LPCC and of the residual features are mostly of the same order.
Concerning the fusion of the residual features with one another, it brings improvements, but the attained scores are clearly lower than those of the MFCC and the LPCC alone (cf. Tables 4–6). However, these experiments have also been carried out in order to compare the residual models with one another. The R-SOS-LPCC/R-HOS-LPCC fusion is interesting because it compares two predictive models based on second- and third-order statistics, respectively (cf. Section 3). With a combination factor of 0.28, it seems that the second-order statistic based model (i.e. R-SOS-LPCC) carries less speaker-dependent information than the third-order one (R-HOS-LPCC), which seems to contradict the results of Table 4. However, the R-SOS-LPCC/R-PDSS fusion gives better results with the same behaviour, which means that the speaker-dependent information is not present in the same way in all the features. These results show that the exploitation of the complementarity between the features can be improved by suitable representations.
Finally, for temporal/frequential fusion, the best scores are obtained with a small contribution of the R-SOS-LPCC. A larger contribution of the R-HOS-LPCC is needed (cf. Table 7), but a worse score is obtained for this temporal/frequential fusion.
The results obtained using fusion show that the performances
and the robustness of the traditional features (MFCC and LPCC) are
improved by the residual ones. And, as one can expect, the contribution of the conventional features is higher than that of the residual ones.
Concerning the combination of residual features, the best scores are
obtained by the fusion of the second-order model (R-SOS-LPCC) and
the frequential one (R-PDSS).
6. Conclusions
In this paper, we proposed to extract features from the LP-residual
for the improvement of speaker identification systems. Several models have been investigated based on temporal and frequential approaches. The temporal models are based on an auto-regressive (AR)
filter and the coefficients of this model are estimated by second (SOS)
or higher-order (HOSs) statistics. The SOS based model is obtained by
the application of a traditional LPC analysis to the residue followed
by a cepstral transformation of the LPC coefficients. The resulting features are termed R-SOS-LPCC features. Following the same scheme
and the recent works on non-linear speech processing, we proposed
to use higher-order statistics for the improvement of the modelling
resulting in features called R-HOS-LPCC features. Concerning the frequential approach, a filter bank is investigated termed as the power
difference of spectra in sub-band (PDSS) which can be interpreted
143
as a sub-band version of the spectral flatness measure. The key idea
is to extract frequential information from the LP-residue.
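As a rough illustration of the two feature families summarized above, the following sketch shows the cepstral transformation of LPC coefficients and a sub-band flatness measure in the spirit of PDSS. The function names, the sign convention A(z) = 1 + sum_k a_k z^-k, the uniform band split and the form 1 - (geometric mean / arithmetic mean) are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole model 1/A(z).

    Assumed convention: A(z) = 1 + sum_k a[k-1] * z^-k.
    The gain term c[0] is ignored.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += k * c[k] * a[n - k - 1]
        c[n] = -a_n - acc / n
    return c[1:]

def pdss(power_spectrum, n_bands, floor=1e-12):
    """Sub-band flatness: 1 - geometric_mean / arithmetic_mean per band.

    This form of PDSS is an assumption; bands are uniform and any
    leftover bins at the end of the spectrum are dropped.
    """
    size = len(power_spectrum) // n_bands
    values = []
    for b in range(n_bands):
        band = [max(s, floor) for s in power_spectrum[b * size:(b + 1) * size]]
        arithmetic = sum(band) / len(band)
        geometric = math.exp(sum(math.log(s) for s in band) / len(band))
        values.append(1.0 - geometric / arithmetic)
    return values
```

In this sketch, applying `lpc_to_cepstrum` to LPC coefficients estimated on the LP-residue would play the role of the R-SOS-LPCC step, while `pdss` would be applied to the power spectrum of the residue. A flat band gives a value near 0 and a strongly peaked band a value near 1.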
These temporal and frequential approaches are evaluated in a
speaker identification task. Firstly, we evaluated the robustness of
the features (R-SOS-LPCC, R-HOS-LPCC and R-PDSS) with controlled
conditions: interval between the sessions, microphones. The obtained results show that residual information improves the speaker
identification scores (at least 7% better than the LPCC alone). The
R-HOS-LPCC features give worse results than the R-SOS-LPCC and
it has been partly justified by the presence of linear information
in the LP-residue and the modelling limitation of the R-HOS-LPCC
(third-order based). The best speaker identification rates have been
attained by the combination of the LPCC and the R-PDSS features.
Secondly, the different features have been tested on the well-known
NTIMIT database following the “long training–short test” protocol.
The results on this larger corpus confirm that the LP-residue carries
speaker-dependent information. In order to evaluate the potential of
the residual features for the global improvement of speaker recognition systems, we proposed to compare the recognizers (feature
extractor + classifier) by the opinion fusion framework. Once again
the robustness of the LPCC is clearly improved by the combination
with residual features. And we can notice that the residual features
can also be used with the MFCC, which initially gives best scores
alone, for a global improvement. We also focused on the fusion of the
residual features between them in order to evaluate their respective
performances showing that temporal (R-SOS-LPCC and R-HOS-LPCC)
and frequential (R-PDSS) features convey complementary information due to the different extraction schemes: AR model and filter
bank.
This investigation of the LP-residue gives us useful information about
the properties of the signal. Clearly, speaker-dependent information
is present and has to be used with conventional features such
as the MFCC or the LPCC. Moreover, the robustness to the recognition conditions (interval between sessions, microphones and telephone) is
improved. However, one can notice that this last point can be significantly improved by the use of robust methods such as cepstral
mean subtraction (CMS). Concerning future work, the limitation of the R-HOS-LPCC model, mainly due to its estimation (third-order statistics), should be investigated. This can be done by using
higher orders (e.g. fourth) or a combination of them. It can also
be done with non-linear models such as neural networks, for example the
NPC scheme [36]. Furthermore, in this contribution we used the LP-residue, but other strategies can be followed, such as the analysis of the
NLP-residue (non-linear) as done in Ref. [9].
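Cepstral mean subtraction, mentioned above as a robust method, can be sketched in a few lines. A stationary convolutive channel (microphone, telephone line) becomes an additive offset in the cepstral domain, so removing the per-coefficient mean over an utterance attenuates it. The function name and the list-of-frames layout are assumptions made for illustration.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean computed over one utterance.

    frames: list of cepstral vectors (one list of floats per frame).
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[j] for f in frames) / n_frames for j in range(n_coeffs)]
    return [[f[j] - means[j] for j in range(n_coeffs)] for f in frames]
```

After this step each coefficient has zero mean over the utterance, whatever constant offset the channel added.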
Acknowledgements
A part of this research was carried out during a visit at the
Escola Universitària Politècnica de Mataró, Barcelona, Spain, and was
funded by the European COST action. This work has been supported
by FEDER and MEC, TEC2006-13141-C03-02/TCM.
References
[1] G.J. Jang, T.L. Lee, Y.H. Oh, Learning statistically efficient features for speaker
recognition, Neurocomputing 49 (2002) 329–348.
[2] R.E. Slyh, E.G. Hansen, T.R. Anderson, Glottal modeling and closed-phase
analysis for speaker recognition, in: Proceedings of the ISCA Tutorial and
Research Workshop on Speaker and Language Recognition (Odyssey'04), 2004,
pp. 315–322.
[3] L. Mary, K. Sri Rama Murty, S.R. Mahadeva Prasanna, B. Yegnanaraya, Features
for speaker and language identification, in: Proceedings of the ISCA Tutorial
and Research Workshop on Speaker and Language Recognition (Odyssey'04),
2004, pp. 323–328.
[4] J. Ortega, et al., Ahumada: a large speech corpus in Spanish for speaker
identification and verification, in: Proceedings of the IEEE ICASSP'98, vol. 2,
1998, pp. 773–775.
[5] B.S. Atal, S.L. Hanauer, Speech analysis and synthesis by linear prediction of
speech wave, J. Acoust. Soc. Am. 50 (1971) 637–655.
[6] M. Faundez-Zanuy, G. Kubin, W.B. Kleijn, P. Maragos, S. McLaughlin, A.
Esposito, A. Hussain, J. Schoentgen, Nonlinear speech processing: overview and
applications, Control Intelligent Syst. 30 (1) (2002) 1–10.
[7] G. Kubin, Nonlinear processing of speech, in: W.B. Kleijn, K.K. Paliwal (Eds.),
Speech Coding and Synthesis, 1995, pp. 557–610.
[8] P. Thevenaz, H. Hügli, Usefulness of the LPC-residue in text-independent speaker
verification, Speech Commun. 17 (1–2) (1995) 145–157.
[9] M. Faundez, D. Rodriguez, Speaker recognition using residual signal of linear
and nonlinear prediction models, ICSLP 2 (1998) 121–124.
[10] B. Yegnanaraya, K.S. Reddy, S.P. Kishore, Source and system features for speaker
recognition using AANN models, in: Proceedings of the IEEE ICASSP, 2001,
pp. 409–412.
[11] S.R. Mahadeva Prasanna, C.S. Gupta, B. Yegnanaraya, Extraction of speaker-specific excitation from linear prediction residual of speech, Speech Commun.
48 (2006) 1243–1261.
[12] N. Zheng, T. Lee, P.C. Ching, Integration of complementary acoustic features for
speaker recognition, IEEE Signal Process. Lett., 2006.
[13] A. Esposito, M. Marinaro, Some notes on nonlinearities of speech, in: G. Chollet,
et al. (Eds.), Nonlinear Speech Modeling, Lecture Notes in Artificial Intelligence,
vol. 3445, 2005, pp. 1–4.
[14] S. McLaughlin, S. Hovell, A. Lowry, Identification of nonlinearities in vowel
generation, in: Proceedings of the EUSIPCO, 1988, pp. 1133–1136.
[15] H. Teager, S. Teager, Evidence for nonlinear sound production mechanisms in
the vocal tract, in: Proceedings of the NATO ASI on Speech Production and
Speech Modeling, vol. II, 1989, pp. 241–261.
[16] S. Gazor, W. Zhang, Speech probability distribution, IEEE Signal Process. Lett.
10 (7) (2003) 204–207.
[17] G. Chollet, A. Esposito, M. Faundez-Zanuy, M. Marinaro, Nonlinear speech
modeling and applications, in: Lecture Notes in Artificial Intelligence, vol. 3445,
2005.
[18] M. Faundez, D. Rodriguez, Speaker recognition by means of a combination of
linear and nonlinear predictive models, in: Proceedings of the IEEE ICASSP'99,
1999.
[19] M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, A new nonlinear speaker
parameterization algorithm for speaker identification, in: Proceedings of the
ISCA Tutorial and Research Workshop on Speaker and Language Recognition
(Odyssey'04), 2004, pp. 309–314.
[20] E. Rank, G. Kubin, Nonlinear synthesis of vowels in the LP residual domain
with a regularized RBF network, in: Proceedings of the IWANN, vol. 2085(II),
2001, pp. 746–753.
[21] J. Thyssen, H. Nielsen, S.D. Hansen, Non-linearities short-term prediction in
speech coding, in: Proceedings of the IEEE ICASSP'94, vol. 1, 1994, pp. 185–188.
[22] C. Tao, J. Mu, X. Xu, G. Du, Chaotic characteristics of speech signal and its LPC
residual, Acoust. Sci. Technol. 25 (1) (2004) 50–53.
[23] S.H. Chen, H.C. Wang, Improvement of speaker recognition by combining
residual and prosodic features with acoustic features, in: Proceedings of the
IEEE ICASSP'04, vol. 1, 2004, pp. 93–96.
[24] K.K. Paliwal, M.M. Sondhi, Recognition of noisy speech using cumulant-based
linear prediction analysis, in: Proceedings of the IEEE ICASSP'91, vol. 1, 1991,
pp. 429–432.
[25] S. Hayakawa, K. Takeda, F. Itakura, Speaker identification using harmonic
structure of LP-residual spectrum, in: Audio Video Biometric Personal
Authentification, Lecture Notes in Computer Science, vol. 1206, Springer, Berlin,
1997, pp. 253–260.
[26] J. He, L. Liu, G. Palm, On the use of residual cepstrum in speech recognition,
in: Proceedings of the IEEE ICASSP'96, vol. 1, 1991, pp. 5–8.
[27] A. Satue-Villar, M. Faundez-Zanuy, On the relevance of language in speaker
recognition, in: Proceedings of the EUROSPEECH'99, vol. 3, 1999, pp. 1231–1234.
[28] C. Jankowski, A. Kalyanswamy, S. Basson, J. Spitz, NTIMIT: a phonetically
balanced, continuous speech, telephone bandwidth speech database, in:
Proceedings of the IEEE ICASSP, vol. 1, 1990, pp. 109–112.
[29] F. Bimbot, I. Magrin-Chagnolleau, L. Mathan, Second-order statistical measures
for text-independent speaker identification, Speech Commun. 17 (1995)
177–192.
[30] D.A. Reynolds, Speaker identification and verification using Gaussian mixture
speaker models, Speech Commun. 17 (1995) 91–108.
[31] L. Besacier, J.F. Bonastre, Subband architecture for automatic speaker
recognition, Signal Process. 80 (2000) 1245–1259.
[32] F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in: Proceedings of the EUROSPEECH'91, 1999,
pp. 169–172.
[33] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Trans.
Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[34] M. Faundez-Zanuy, Data fusion in biometrics, IEEE Aerosp. Electron. Syst. Mag.
20 (1) (2005) 34–38.
[35] C. Sanderson, Information fusion and person verification using speech and face
information, IDIAP Research Report 02-33, 1–37, September 2002.
[36] M. Chetouani, M. Faundez-Zanuy, B. Gas, J.L. Zarader, Non-linear speech feature
extraction for phoneme classification and speaker recognition, in: G. Chollet
et al. (Eds.), Nonlinear Speech Modeling, Lecture Notes in Artificial Intelligence,
vol. 3445, 2005, pp. 344–350.
About the Author—M. CHETOUANI received the M.S. degree in Robotics and Intelligent Systems from the University Pierre and Marie Curie (UPMC), Paris, 2001. He received
the Ph.D. degree in Speech Signal Processing from the same university in 2004. In 2005, he was an invited Visiting Research Fellow at the Department of Computer Science
and Mathematics of the University of Stirling, UK. He was also an invited researcher at the Signal Processing Group of Escola Universitaria Politecnica de Mataro, Barcelona,
Spain. He is currently an Associate Professor in Signal Processing and Pattern Recognition at the UPMC. His research activities, carried out at the Institute of Intelligent
Systems and Robotics, cover the areas of non-linear speech processing, feature extraction and pattern classification for speech, speaker and language recognition. He is a
member of different scientific societies (ISCA, AFCP and ISIS). He has also served as chairman, reviewer and member of scientific committees of several journals, conferences
and workshops.
Cognitive Computation
Work carried out as part of the PhD theses of Fabien Ringeval and Ammar
Mahdhaoui.
Cogn Comput (2009) 1:194–201
DOI 10.1007/s12559-009-9016-9
Time-Scale Feature Extractions for Emotional Speech
Characterization
Applied to Human Centered Interaction Analysis
Mohamed Chetouani · Ammar Mahdhaoui ·
Fabien Ringeval
Published online: 22 April 2009
© Springer Science+Business Media, LLC 2009
Abstract Emotional speech characterization is an
important issue for the understanding of interaction. This
article discusses the time-scale analysis problem in feature
extraction for emotional speech processing. We describe a
computational framework for combining segmental and
supra-segmental features for emotional speech detection.
The statistical fusion is based on the estimation of local
a posteriori class probabilities and the overall decision
employs weighting factors directly related to the duration
of the individual speech segments. This strategy is applied
to a real-world application: detection of Italian motherese
in authentic and longitudinal parent–infant interaction at
home. The results suggest that short- and long-term information, respectively, represented by the short-term spectrum and the prosody parameters (fundamental frequency
and energy) provide a robust and efficient time-scale
analysis. A similar fusion methodology is also investigated
by the use of a phonetic-specific characterization process.
This strategy is motivated by the fact that there are variations across emotional states at the phoneme level. A time-scale based on both vowels and consonants is proposed and
it provides a relevant and discriminant feature space for
acted emotion recognition. The experimental results on two
different databases, Berlin (German) and Aholab (Basque),
show that the best performance is obtained by our phoneme-dependent approach. These findings demonstrate the
relevance of taking into account phoneme dependency
(vowels/consonants) for emotional speech characterization.
M. Chetouani (✉) · A. Mahdhaoui · F. Ringeval
Institut des Systèmes Intelligents et Robotique (ISIR), Université
Pierre et Marie Curie-Paris6 (UPMC), 4 Place Jussieu, 75252
Paris Cedex, France
e-mail: [email protected]
Keywords Emotional speech · Time-scales analysis · Feature extraction · Statistical fusion · Data-driven approach
Introduction
In the past few years, many attempts have been made to
exploit computational models for human interaction analysis. This interaction can be directed towards other human
partners but also towards machines (computers, virtual agents, or
robots). Computational models aim to characterize signals
emitted by human beings during interaction. Various
frameworks are currently being used to analyze and to
understand the interaction. One of them comes from cognitive psychology and focuses on emotion [1]. The key idea
of this concept, also termed affective computing, is that
people perceive others' emotions through stereotyped signals (facial expressions, prosody, gestures, etc.). Another
framework, coming from the field of linguistics, aims at understanding the meaning of these signals. Indeed, humans
employ different strategies in order to convey the same
message using multi-modal signals such as specific words,
tone of voice, gesture, or more generally body language [2,
3]. Recently, a new framework has been introduced for the
study of interaction termed as Social Signal Processing
(SSP) [4] which focuses on the analysis of social signals by
measuring the amplitude, frequency, and timing of prosody, facial movement, and gesture. SSP is different from
the previously mentioned frameworks in the sense that it
consists of non-linguistic and unconscious signals. More
specifically, SSP aims to predict human behaviors or attitudes (agreement, interest, attention, etc.) by the analysis of
non-verbal signals and it is considered as a separate
channel of communication.
Most of the frameworks proposed in the literature for the
understanding of interaction are based on the analysis of
verbal and non-verbal signals [1, 3, 5]. The verbal component has been extensively investigated by the speech
processing community. Non-verbal signals are expressed in
a different way among the modalities. In [5], five different
non-verbal behavioral cues have been defined: physical
appearance, gestures and postures, face and eyes behaviors,
vocal behavior, and space and environment behaviors. The
combination of different codes makes it possible to convey
various kinds of information such as emotion or intention, and is also
useful for managing interaction and/or sending relational
messages (dominance, persuasion, embarrassment, etc.).
In this article, we focus on the analysis of a specific class
of non-verbal behaviors which accompanies the verbal
message, termed vocal behaviors in [5]. This class
groups empty speech pauses (silences), non-verbal vocalizations (e.g., filled pauses, laughter, cries), speaking
styles (e.g., emotion, intention), and also turn-taking
patterns. Even if these behaviors do not always have lexical
meanings, they play a major role during natural interactions. Many efforts have been made to extract features, with
no clear consensus on the most efficient ones [4, 6].
However, the prosody channel, characterized by the fundamental frequency (f0), the energy and the duration of
sounds, has various functions in human communication
since it serves to convey linguistic information, but also
para-linguistic (e.g., speaker's state), and non-linguistic
information (e.g., age) [7, 8].
The remainder of this article presents various strategies
for the fusion of time-scale features in order to study
interactions. Section ‘‘Units for Emotional Speech Characterization’’ reports previous works in the literature
associated with time-scale with a focus on unit selection
problem for emotion recognition. Section ‘‘Combining
Frame-Level and Segment-Based Approach for Intention
Recognition in Infant-Directed Speech’’ describes the statistical framework for the fusion of frame and segment
level features for infant-directed speech discrimination.
Section ‘‘Data-Driven Approach for Time-Scale Feature
Extraction’’ highlights the relevance of the pseudo-phonetic strategy for emotion recognition and provides results
and discussion for time-scale analysis.
Units for Emotional Speech Characterization
The characterization scheme can be divided in two main
steps: feature extraction and pattern classification. Regarding the first step, most methods are based on statistical
measures of pitch, energy, and duration [6]. These statistical
features (e.g., mean, range, max, min, etc.) have also been
found to be related to human perception of emotions [9–11].
These features are usually termed as supra-segmental in
contrast to segmental features (short-term) such as the Mel
Frequency Cepstral Coefficients (MFCC) intensively used in
speech processing. The classification step employs traditional machine learning and pattern recognition techniques
such as distance based (nearest neighbor k-nn), decision
trees, Gaussian Mixture Models (GMM), Support Vector
Machines (SVM), and fusion of different methods [12].
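The statistical (supra-segmental) description of a speaker turn mentioned above can be sketched as follows. The function names, the feature naming scheme and the restriction to F0 and energy are assumptions made for illustration; the cited works use larger feature sets.

```python
def contour_statistics(values):
    """Mean, min, max and range of one prosodic contour (e.g. F0 in Hz)."""
    mean = sum(values) / len(values)
    lo, hi = min(values), max(values)
    return {"mean": mean, "min": lo, "max": hi, "range": hi - lo}

def turn_level_features(f0, energy):
    """Static feature vector for a whole speaker turn (hypothetical layout)."""
    features = {}
    for name, contour in (("f0", f0), ("energy", energy)):
        for stat, value in contour_statistics(contour).items():
            features[f"{name}_{stat}"] = value
    return features
```

The resulting fixed-length vector is what a static classifier such as k-nn or SVM would take as input, one vector per turn.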
One particular aspect of the speech emotion recognition
process is the use of both static features (statistics) and
static classifiers (e.g., k-nn or SVM). Indeed, the standard
unit is the speaker turn level [12–14] which consists in the
characterization of a whole sentence by a large number of
features. This approach assumes that the emotional state is
not changing during the speaker turn level. Even if the turn
level approach has proven its efficiency, other units have
been investigated for the exploitation of dynamical aspects
of emotion. The methods can be divided into two groups:
machine learning and data-driven methods.
Machine Learning Based Units
This approach employs machine learning techniques such as
Hidden Markov Models [13]. Speech and speaker recognition
techniques (short-term features and statistical modeling
with GMM or HMM) have been successfully combined with a
traditional turn-level approach [15]. In [16], a time-scale is identified by the extraction of short-term features
(25 ms windows, MFCC) and the use of statistical
modeling (HMM). The authors call this time-scale the
chunk level. Once the HMM are trained (one for each emotion
class), a Viterbi segmentation is applied resulting in specific
sub-turn units that depend on emotion changes. Tested on
emotion recognition tasks, the chunk level approach outperforms syllable based segmentation. This was mainly due to
the fact that the proposed approach produces longer segments
than the syllable segmentation method.
Data-Driven Units
The second approach aims at exploiting various knowledge
about speech signals for the definition of units. For
instance, voiced segments are known to convey more relevant information about emotion and focusing on these
segments has been proven to be efficient [1, 12]. Various
methods have been investigated for combining different
levels [12, 14–17]. In [12], the Segment Based Approach
(SBA) proposes to divide the whole utterance (turn level)
into N voiced segments and then to characterize each voiced
segment. The utterance based approach consists of the
computation of statistical features (F0, energy, spectral
shape) on the whole utterance while the SBA aims at
describing more precisely each voiced segment. From this
local description an estimation of a posteriori class probabilities is done and the whole decision consists in merging
the probabilities.
The SBA technique has been applied to emotion recognition for different well-known corpora and it outperforms the traditional utterance based feature extraction
technique with k-nn classifiers (best classifier for these
databases [12]): BabyEars 61.5% vs. 68.7% (SBA), Kismet
82.2% vs. 86.6% (SBA). However, with the same framework, different corpora (Berlin and Danish), and various
classifiers (k-nn, SVM) different results have been
achieved. For the Berlin corpus, SBA provides similar
performance for both k-nn and SVM but it is outperformed
by the traditional utterance level approach: k-nn 67.7% vs.
59.0% (SBA), SVM 75.5% vs. 65.5% (SBA). Once again
the performance is correlated with the length of the utterance: SBA provides better results for short sentences
(BabyEars, Kismet) while the turn level is more suited for
longer ones (Berlin). Additionally, it should be noted that
the performance also depends on the employed classifier as
it has been found for the Danish corpus for instance: k-nn
49.7% vs. 55.6% (SBA), SVM 63.5% vs. 56.8%.
Data-Fusion Approach
The above experiments highlight the need of investigations
into sub-units for emotional speech processing. In this
article, we propose to address this problem by data-fusion
of features extracted from different time-scales. The
investigations are carried out in two phases:
– No assumption on the sub-unit (see Section ‘‘Combining Frame-Level and Segment-Based Approach for Intention Recognition in Infant-Directed Speech’’): the idea is to
exploit speaker recognition techniques which are mainly
based on frame-level modeling (all the frames are
exploited for the characterization) as it is done in [16, 18].
– Data-driven approach (see Section ‘‘Data-Driven Approach for
Time-Scale Feature Extraction’’): speech signals are
characterized by prominent segments such as vowels
which are then employed as sub-units.
The next sections present the two phases applied to
different applications: motherese detection and traditional
emotion recognition tasks.
Combining Frame-Level and Segment-Based Approach
for Intention Recognition in Infant-Directed Speech
Expanded Intonation Contours
Communication of intentions is one of the major functions
of interaction that uses both linguistic (syntax, semantic)
and para-linguistic (prosody) elements. In the literature,
communication of intentions with infants has received
substantial attention [19, 20]. The main reason is that
infants are not yet linguistically competent and the communication of intentions is done by prosody. More specifically, the communication is done by the parents by a
specific register termed as infant-directed speech or
motherese [21–23].
From an acoustic point of view, motherese has a clear
signature (high pitch, exaggerated intonation contours).
The phonemes, and especially the vowels, are more clearly
articulated. Motherese has been shown to be preferred by
infants over adult-directed speech and might assist infants
in learning speech sounds. The exaggerated patterns
facilitate the discrimination between the phonemes or
sounds. Similarly to what happens with infants, several
works have investigated modifications of speech registers
when talking to animals [24], foreigners [20], or robots
[25–27]. The important conclusion from this literature is
the existence of common prosodic characteristics usually
termed as expanded intonation contours (or Fernald’s
prototypical contours) [19, 22] due to their exaggerated
contours: modulations of the fundamental frequency (F0)
(mean, range).
Investigations on the characterization of these expanded
contours have identified five categories [19]: rising, falling,
flat, bell-shaped, and complex contours of the F0. These
categories are used for the communication of intents such
as attention, prohibition, approval, or comfort. For
instance, rising contours aim at eliciting attention and
encouraging a response while bell-shaped contours aim at
maintaining attention. Consequently, adults convey intentional messages to infants by the use of these expanded
contours. Among the most characterized speaker’s intentions, one can cite: approval, attention, and prohibition.
The classification of intention from speech signals offers an
interesting application to the time-scale problem. Two
approaches can be investigated: the use of only prosodic
description of expanded intonation contours (voiced segments) or to also extract frame-level segments.
Motherese Detection
In order to study these intentional messages and more
specifically the influence on engagement in an ecological
environment, we followed a method usually employed for
the study of infant development: home movies analysis.
For more than 30 years, interest has been growing in
family home movies of autistic infants. Typically developing infants gaze at people, turn toward voices and express
interest in communication, especially in infant-directed
speech. In contrast, infants who become autistic are characterized by the presence of abnormalities in reciprocal
social interactions and in patterns of communications [28].
Recently, researchers in autism pathology and researchers
in early social interactions highlighted the importance of
infant-directed speech for infants who will become autistic
[29, 30]. Initial manual investigations [31] have shown a
positive impact on the interaction and especially on the
engagement: a response (vocalization, facial expression,
gesture, etc.) by the infant to the production of infant-directed speech by the parents.
The study of home movies is very important for future
research, but the use of this kind of database makes the
study difficult and time-consuming. The manual annotation of
these films is very costly in time, and automatic
detection of relevant events would be of great benefit to the
longitudinal study. For the analysis of the role of infant-directed speech during interaction, we developed an automatic motherese detection system [30, 32]. The speech
corpus used in these experiments is a collection of natural
and spontaneous interactions usually used for child development research (home movies). The corpus consists of
recordings in Italian of mothers and fathers as they
addressed their infants. The recordings were not made
by professionals, resulting in adverse conditions (noise,
camera, microphones, etc.). We focus on one home video
totaling 3 h of data describing the first year of an infant.
Verbal interactions of the mother have been carefully
annotated by two psycholinguists into two categories
(κ = 0.69): motherese and normal directed speech. From
this manual annotation, we extracted 100 utterances for
each class. The utterances are typically between 0.5 s and
4 s in length. For all the experiments in this paper a 10-fold
cross-validation method is employed.
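For readers unfamiliar with inter-annotator agreement, the sketch below shows how a chance-corrected agreement coefficient between two annotators can be computed. It assumes that the agreement value of 0.69 reported above is Cohen's kappa; the function name and the label encoding are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labelling the same items."""
    n = len(labels_a)
    # Proportion of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. The expression is undefined when chance agreement equals 1, which happens only if both annotators use a single identical label throughout.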
System Description
As a starting-point, and following the definition of motherese [21], we characterized the verbal interactions by the
extraction of supra-segmental features (prosody). To
evaluate the impact of frame-level feature extraction, segmental features are also employed. Consequently, the
utterances are characterized by both segmental (short-time
spectrum) and supra-segmental (statistics of fundamental
frequency, energy) features. These features aim at representing the verbal information for the next classification
stage based on machine learning techniques. Figure 1
shows a schematic overview of the final system [30, 32]
which is described in more detail in the following
paragraphs.
Fig. 1 Motherese classification system: fusion of features extracted from different time-scales
[Figure 1: block diagram. Signal → Segmental feature extraction → Classifier; Signal → Supra-segmental feature extraction → Classifier; both classifier outputs → Fusion]
Supra-Segmental Characterization
The supra-segmental characterization follows the Segment
Based Approach (see ‘‘Units for Emotional Speech
Characterization’’). Previous works on SBA [12] have
shown it to be more suited to short sentences, as is usually
the case in our corpus. The features consist of statistical
measures (mean, variance and range) of both the fundamental frequency (F0) and the short-time energy estimated
from voiced segments. An utterance Ux is segmented into
N voiced segments (Fxi) obtained by F0 extraction. Local
estimation of a posteriori probabilities is carried out for
each segment. The utterance classification combines the N
local estimations:
$$P(C_m \mid U_x) = \sum_{x_i=1}^{N} P(C_m \mid F_{x_i}) \times \mathrm{length}(F_{x_i}) \qquad (1)$$
where Cm represents the class membership.
The duration of the segments is introduced as weights of
a posteriori probabilities: importance of the measured
voiced segment (length(Fxi)) with respect to the length of
the utterance. The estimation has been carried out for
various classifiers in [30, 32] and GMMs have been found
to give good performance (number of parameters versus
performance).
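The duration-weighted combination of Eq. (1) can be sketched as below. The function name and the data layout are assumptions; dividing by the total voiced duration is also an assumption added so that the output sums to one, and it does not change which class obtains the highest score.

```python
def utterance_posteriors(segment_posteriors, segment_lengths):
    """Combine per-segment class posteriors with duration weights.

    segment_posteriors: one list of class posteriors per voiced segment.
    segment_lengths: duration of each voiced segment (e.g. in seconds).
    """
    n_classes = len(segment_posteriors[0])
    scores = [0.0] * n_classes
    for posteriors, length in zip(segment_posteriors, segment_lengths):
        for m in range(n_classes):
            scores[m] += posteriors[m] * length
    total = sum(segment_lengths)  # normalization is an added assumption
    return [s / total for s in scores]
```

With this weighting, a long voiced segment that is confidently classified dominates several short, ambiguous ones.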
Segmental Characterization
For the computation of segmental features, a 20 ms window is used, with an overlap of 1/2 between adjacent frames.
Mel Frequency Cepstrum Coefficients (MFCC) of
order 16 were computed. We exploit traditional speaker
recognition techniques [33]. For the whole utterance Ux,
a posteriori probabilities are estimated resulting in the
estimation of Pseg(Cm|Ux). The estimation can be carried
out for different time-scales: voiced, unvoiced, and whole sentence.
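The framing step described above (20 ms windows with half-window overlap) can be sketched as follows. The function name is hypothetical and the sampling rate is left as a parameter because the text does not state it; the MFCC computation itself is not shown.

```python
def frame_signal(samples, sample_rate, window_s=0.020, overlap=0.5):
    """Split a signal into fixed-length overlapping frames.

    Trailing samples that do not fill a complete window are dropped.
    """
    window = int(round(window_s * sample_rate))
    hop = max(1, int(round(window * (1.0 - overlap))))
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]
```

At an assumed rate of 16 kHz this gives 320-sample frames with a 160-sample hop, so one second of speech yields 99 frames.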
To evaluate the system performance we used the
receiver operating characteristic (ROC) methodology [34].
A ROC curve represents the tradeoff between the true
positives (TPR = true positive rate) and false positives
(FPR = false positive rate) as the classifier output threshold
value is varied. A quantitative measure, the area under
ROC curve (AUC), is computed and it represents the
overall performance of the classifier over the entire range
of thresholds. The results for different time-scales are
presented in Table 1. As can be expected, voiced segments
provide better results than unvoiced ones. However, the
best results are obtained by using the whole sentence, as is
usually done in speaker recognition, showing that authentic
emotional speech recognition is still an open issue compared to acted speech.
Table 1 Infant-directed speech discrimination performance of different time-scales for segmental features
Time-scale        Area under the ROC
Voiced            0.78
Unvoiced          0.55
Whole sentence    0.93
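The area under the ROC curve used above can be computed without tracing the curve, since it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation; the function name and the 0/1 label encoding are assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. motherese), 0 otherwise.
    Ties between a positive and a negative score count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic pairwise loop is acceptable for a few hundred utterances; larger sets would call for a rank-based implementation.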
Fusion of Time-Scales
The segmental and supra-segmental characterizations provide different temporal information and a combination of
them should improve the accuracy of the detector. Many
decision techniques can be employed [35, 36] but we
investigated a simple weighted sum of likelihoods from the
different classifiers:
$$C_l = \lambda \log P_{\mathrm{seg}}(C_m \mid U_x) + (1 - \lambda) \log P_{\mathrm{supra}}(C_m \mid U_x) \qquad (2)$$
with l = 1 (motherese) or 2 (normal directed speech). λ
denotes the weighting coefficient.
For the GMM classifier, the likelihoods can be easily
computed from a posteriori probabilities (Pseg(Cm|Ux),
Psupra(Cm|Ux)) [37]. The weighting factor λ is automatically
optimized in order to obtain the best results on the training
part of the database. Since we employed a 10-fold cross-validation methodology, we present the means of the
weighting factors.
Figure 2 presents the obtained ROC curves for segmental and supra-segmental features and the best combination (λ = 0.6). The weighting factor reveals a balance
between the two different time-scales.
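The weighted fusion of Eq. (2) and the tuning of its weighting coefficient on training data can be sketched as follows. The coefficient is named `lam` in the code; the grid search, the accuracy criterion and all function names are assumptions, since the text only states that the weight is optimized automatically on the training part.

```python
import math

def fused_score(p_seg, p_supra, lam):
    """Weighted sum of log-probabilities for one class (inputs must be > 0)."""
    return lam * math.log(p_seg) + (1.0 - lam) * math.log(p_supra)

def classify(p_seg, p_supra, lam):
    """Return the index of the class with the highest fused score."""
    scores = [fused_score(ps, pp, lam) for ps, pp in zip(p_seg, p_supra)]
    return scores.index(max(scores))

def tune_lambda(train, grid=None):
    """Pick the weight giving the best training accuracy (assumed criterion).

    train: list of (p_seg, p_supra, true_class) tuples.
    Ties are resolved in favour of the smallest weight in the grid.
    """
    grid = grid or [i / 10 for i in range(11)]

    def accuracy(lam):
        hits = sum(classify(ps, pp, lam) == y for ps, pp, y in train)
        return hits / len(train)

    return max(grid, key=accuracy)
```

In a 10-fold setting, `tune_lambda` would be called once per fold on the training part only, and the reported weight would be the mean over folds.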
The above experimental results clearly show that even if
motherese is defined as the modulation of supra-segmental
features, using this basic definition does not produce efficient results (supra-segmental models). Real-world applications, such as analysis of home movies with authentic
interactions and with a noisy environment, require the
combination of the initial definition (supra-segmental features) with short-term features such as the MFCC as details
of the short-term spectrum. Once again, for an efficient
characterization, one should employ several features from
different time-scales.
In this section we used short- and long-term features
extracted from the short-term spectrum (MFCC) and from
the evolution of supra-segmental features (statistics of F0,
energy). By definition, the last set of features is extracted
only from the voiced segments. Consequently all the
voiced segments are processed identically even if very
well-known distinctions exist between them (e.g., vowels
versus consonants).
Fig. 2 ROC curve for segmental and supra-segmental systems
Data-Driven Approach for Time-Scale Feature Extraction
Nature of the Segments
The last section showed the relevance of combining frame
and turn level approaches for emotional speech processing.
One of the main limitations of this method lies in the fact
that no sub-units are clearly identified: all the frames are
exploited, as is usually done in speech and speaker recognition tasks. In this section, we propose to extract the
frame-level features on specific units defined here by taking into
account the nature of the segments: vowel or consonant.
Several investigations have been carried out on the relation
of the nature of phonemes and emotional/affective states
[17, 38–41]. All these works highlight the dependency
between emotional states and the produced phonemes. In
addition, vowel sounds seem to convey more emotional
information than voiced consonant sounds [40]. These
results motivate the need for analysis at different time-scales
for emotional speech processing.
We recently proposed a new feature extraction scheme
aiming at exploiting the nature of phonemes [41]. The
approach, described in Fig. 3, uses a first segmentation
phase by the help of the Divergence Forward Backward
(DFB) algorithm [42]. The resulting stationary segments
are then classified as vowels by a criterion based on a
spectral structure measure. This process is language independent and does not aim at the exact identification of
phonemes, as could be done by a phonetic alignment.
As a result, the obtained segments are termed pseudo-phonetic units. This method was introduced for
automatic language identification [43] and consists in
characterizing pseudo-syllables, which are defined
by gathering the consonants preceding the detected vowels
(CnV structure). The study of these pseudo-syllables made
possible the characterization of two main groups of languages described in the literature: stress-timed (English, German) and syllable-timed (French and Spanish). We recently
evaluated this segmentation system for both emotional and
non-emotional speech with an average vowel error rate of
23.29% [41].
Fig. 3 Pseudo-phonetic approach: feature extraction, classification and fusion
Cogn Comput (2009) 1:194–201
Corpora
We evaluate a time-scale analysis by using transcribed
emotional databases: Berlin and Aholab. The Berlin corpus
[44] is commonly used for emotion recognition. Ten
utterances (five short and five long) that could be used in
everyday communication were emotionally colored
by 10 gender-balanced native German actors, with high
quality recording equipment (anechoic chamber). A total of
535 sentences rated at least 60% natural and at least 80% recognizable by 20 listeners in a perception test
were kept and phonetically labeled in a narrow transcription. The Berlin corpus has a lexicon of 59 phonemes
(24 vowels and 35 consonants). The Aholab corpus [45] is
composed of 702 sentences coming from a set of different
sources: Basque newspapers, texts from several novels and
others. From all these corpora (over 580,000 sentences), a
reduced set of sentences was extracted, keeping the
original frequency of the diphonemes as far as possible. Then, a lexical balancing was applied to obtain the
702 sentences. Concerning the emotions, two gender-balanced
professional speakers acted out the sentences
in a semi-professional studio. The Aholab corpus has a
lexicon of 35 phonemes (5 vowels and 30 consonants).
Classification With the Vowel–Consonant Time-Scale
The vowel–consonant time-scale is now exploited for
the emotion recognition problem through the automatic
pseudo-phonetic characterization (Fig. 3). We followed a
segment-based approach (SBA) (equation 1) similar to what
has been done for infant-directed speech discrimination (see
‘‘Combining Frame-Level and Segment-Based Approach
for Intention Recognition in Infant-Directed Speech’’). But
here the segments are categorized as vowels and consonants.
The utterance decision is made by the fusion of the local
a posteriori class probabilities. This approach can be viewed
as a segment dependent based approach:
$$E_i = \arg\max_i \{ k_{Vow} P(C_i \mid Vow) + k_{Cons} P(C_i \mid Cons) \} \qquad (3)$$
where P(Ci|Vow) and P(Ci|Cons) denote the local a posteriori class probabilities respectively estimated from vowel
and consonant segments. kVow and kCons represent the
weighting factors for the fusion process. Different strategies have been employed for the estimation of the
weighting factors [41]: static and adaptive (depending on
the vowel–consonant duration ratio). Here, we report
results for the static fusion process and the optimization is
done on training data (as previously described in Section ‘‘Combining Frame-Level and Segment-Based
Approach for Intention Recognition in Infant-Directed
Speech’’).
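The static fusion of equation (3) can be illustrated with a small sketch. Several choices here are assumptions made for the example and are not stated in the text: the local posteriors are obtained by averaging per-segment posteriors, the weights 0.6 and 0.4 are arbitrary, and the function names and toy values are hypothetical.

```python
import numpy as np

def local_posterior(segment_posteriors):
    """Pool per-segment class posteriors by averaging (assumed pooling rule)."""
    return np.mean(np.asarray(segment_posteriors, dtype=float), axis=0)

def classify_utterance(vowel_posteriors, consonant_posteriors, k_vow, k_cons):
    """Equation (3): argmax over classes of k_Vow*P(C_i|Vow) + k_Cons*P(C_i|Cons)."""
    p_vow = local_posterior(vowel_posteriors)
    p_cons = local_posterior(consonant_posteriors)
    scores = k_vow * p_vow + k_cons * p_cons
    return int(np.argmax(scores)), scores

# Toy utterance (illustrative only): 3 emotion classes,
# 2 vowel segments and 3 consonant segments.
vowels = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
consonants = [[0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3]]
emotion, scores = classify_utterance(vowels, consonants, k_vow=0.6, k_cons=0.4)
```

In this toy case the consonant segments alone would favour the second class, while the fused score follows the more confident vowel segments. An adaptive variant would make the two weights depend on the vowel–consonant duration ratio, but the exact rule is given in [41] rather than here, so it is not sketched.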
The segment dependent approach has been used for
classification [41]; here we report results for the segmental characterization (MFCC) only, with a k-NN classifier
at different time-scales. Table 2 presents the obtained
classification scores for both Berlin and Aholab databases.
The extraction of segmental features from
voiced segments gives better results than from unvoiced ones,
and their fusion brings no gain on Berlin and only a marginal one on Aholab (Table 2).
Similar results have also been found for the communicative
intent classification (see ‘‘Combining Frame-Level and
Segment-Based Approach for Intention Recognition in
Infant-Directed Speech’’), but the main difference lies in
the impact of taking all the frames (voiced and unvoiced)
for authentic and noisy data, as is the case for the
motherese application (see Table 1).
Table 2 Segmental based emotion recognition rates for different time-scales
Time-scale                                Berlin (%)   Aholab (%)
Voiced                                    73.80        99.08
Unvoiced                                  49.00        87.35
Static fusion (voiced/unvoiced)           73.80        99.83
Vowels (transcription)                    76.90        99.46
Consonants (transcription)                69.66        97.60
Static fusion (transcription)             78.51        99.47
Vowels (detected)                         73.20        98.47
Consonants (detected)                     65.60        98.25
Static fusion (detected)                  77.80        99.59
By using the transcription, we extracted the same features but from vowel and consonant segments. Promising
results are obtained by the vowel time-scale for emotional
speech processing: for the Berlin corpus, we obtained
76.90% for the vowel time-scale and 69.66% for the consonant time-scale. Using the automatic and imperfect
segmentation procedure (Fig. 3), we obtain
73.20% for vowels and 65.60% for consonants. In
addition, we investigated the fusion of these
segment-dependent levels: the best results are still obtained with the
transcription (78.51%), but the pseudo-phonetic approach
(77.80%) is more efficient than the initial voiced-segment approach
(73.80%).
The classification results can be correlated to the number
of speakers in the databases (Berlin: 10 versus Aholab: 2).
The Aholab corpus presents fewer confusions between
durations than the Berlin corpus and consequently the
results are better.
Conclusion and Perspectives
This article presents a method for the combination of timescale features: segmental (acoustic)/supra-segmental features (prosody) and also vowel/consonant phonemes. The
case studies provided (authentic and longitudinal interactions, acted corpus) illustrate the usefulness of combining
different time-scale feature extractions for emotional
speech classification. The advantages of this approach are
the increase in robustness and also the integration of perceptual knowledge related to emotional sounds. The literature has shown the relative prominence of vowel sounds
in the perception of emotions [9–11] and the reported
framework makes it possible to employ this phenomenon.
Our future work will be devoted to the characterization
of another important phenomenon: rhythm. The
role of rhythm in the perception of sounds is very important
[46] and it has been shown to be efficient for language
identification [43, 47]. Most of the models proposed in the
literature for the extraction of rhythmic features require the
definition of a rhythmic unit (e.g., vowels, syllable) and a
metric (inter, intra units)[48, 49]. A first application of
these models to emotional speech processing reveals
promising results [41].
References
1. Picard R. Affective computing. Cambridge, MA: MIT Press;
1997.
2. Argyle M. Bodily communication. 2nd edn. Madison: International Universities Press; 1988.
3. Kendon A, Harris RM, Key MR. Organization of behavior in face
to face interactions. The Hague: Mouton; 1975.
4. Pentland A. Social signal processing. IEEE Signal Process Mag.
2007;24(4):108–11.
5. Vinciarelli A, Pantic M, Bourlard H, Pentland A. Social signals,
their function, and automatic analysis: a survey. In: IEEE international conference on multimodal interfaces (ICMI’08). 2008. p.
61–8.
6. Schuller B, Batliner A, Seppi D, Steidl S, Vogt T, Wagner J, et al.
The relevance of feature type for the automatic classification of
emotional user states: low level descriptors and functionals. In:
Proceedings of interspeech; 2007. p. 2253–6.
7. Keller E. The Analysis of voice quality in speech processing. In:
Chollet G, Esposito A, Faundez-Zanuy M, et al. editors. Lecture
notes in computer science, vol. 3445/2005. New York: Springer;
2005. p. 54–73.
8. Campbell N. On the use of nonverbal speech sounds in human
communication. In: Esposito A, et al. editors. Verbal and nonverbal communicational behaviours, LNAI 4775. Berlin, Heidelberg: Springer; 2007. p. 117–128.
9. Williams CE, Stevens KN. Emotions and speech: some acoustic
correlates. J Acoust Soc Am. 1972;52:1238–50.
10. Scherer KR. Vocal affect expression: a review and a model for
future research. Psychol Bull. 1986;99(2):143–65.
11. Murray IR, Arnott JL. Toward the simulation of emotion in
synthetic speech: a review of the literature on human vocal
emotion. J Acoust Soc Am. 1993;93(2):1097–108.
12. Shami M, Verhelst W. An evaluation of the robustness of existing
supervised machine learning approaches to the classification of
emotions in speech. Speech Commun. 2007;49(3):201–12.
13. Schuller B, Rigoll G, Lang M. Hidden Markov model-based
speech emotion recognition. In: Proceedings of ICASSP’03, vol.
2. 2003. p. 1–4.
14. Lee Z, Zhao Y. Recognizing emotions in speech using short-term
and long-term features. In: Proceedings ICSLP 98; 1998. p.
2255–58.
15. Vlasenko B, Schuller B, Wendemuth A, Rigoll G. Frame vs. turn-level: emotion recognition from speech considering static and
dynamic processing. Affect Comput Intell Interact. 2007;139–47.
16. Schuller B, Vlasenko B, Minguez R, Rigoll G, Wendemuth A.
Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In: Proceedings of IEEE automatic
speech recognition and understanding workshop (ASRU 2007),
9–13 Dec 2007, Kyoto, Japan; 2007. p. 596–600.
17. Jiang DN, Cai L-H. Speech emotion classification with the
combination of statistic features and temporal features. In: Proceedings of ICME 2004 IEEE, Taipei, Taiwan; 2004. p. 1967–71.
18. Kim S, Georgiou P, Lee S, Narayanan S. Real-time emotion
detection system using speech: multi-modal fusion of different
timescale features. In: IEEE international workshop on multimedia signal processing; 2007.
19. Fernald A, Simon T. Expanded intonation contours in mothers’
speech to newborns. Dev Psychol. 1984;20(1):104–13.
20. Uther M, Knoll MA, Burnham D. Do you speak E-NG-L-I-SH? A
comparison of foreigner- and infant directed speech. Speech
Commun. 2007;49:2–7.
21. Fernald A, Kuhl P. Acoustic determinants of infant preference for
Motherese speech. Infant Behav Dev. 1987;10:279–93.
22. Fernald A. Intonation and communicative intent in mothers’
speech to infants: is the melody the message? Child Dev.
1989;60:1497–510.
23. Slaney M, McRoberts G. Baby ears: a recognition system for
affective vocalizations. Speech Commun. 2003;39(3–4):367–84.
24. Burnham D, Kitamura C, Vollmer-Conna U. What’s new,
Pussycat? On talking to babies and animals. Science.
2002;296:1435.
25. Varchavskaia P, Fitzpatrick P, Breazeal C. Characterizing and
processing robot-directed speech. In: Proceedings of the IEEE/
RAS international conference on humanoid robots. Tokyo, Japan,
22–24 Nov 2001.
26. Batliner A, Biersack S, Steidl S. The prosody of pet robot
directed speech: evidence from children. In: Proceedings of
speech prosody; 2006. p. 1–4.
27. Breazeal C, Aryananda L. Recognition of affective communicative
intent in robot-directed speech. Auton Robots. 2002;12:83–104.
28. Maestro S, et al. Early behavioral development in autistic children: the first 2 years of life through home movies. Psychopathology. 2001;34:147–52.
29. Muratori F, Maestro S. Autism as a downstream effect of primary
difficulties in intersubjectivity interacting with abnormal development of brain connectivity. Int J Dialog Sci Fall. 2007;2(1):93–118.
30. Mahdhaoui A, Chetouani M, Zong C, Cassel RS, Saint-Georges
C, Laznik M-C, et al. Automatic Motherese detection for face-to-face interaction analysis. In: Esposito A, et al. editors. Multimodal signals: cognitive and algorithmic issues. Berlin:
Springer; 2009. p. 248–55.
31. Laznik MC, Maestro S, Muratori F, Parlato E. Les interactions
sonores entre les bébés devenus autistes et leurs parents. In:
Castarède MF, Konopczynski G, editors. Au commencement était
la voix. Ramonville Saint-Agne: Érès; 2005. p. 171–81.
32. Mahdhaoui A, Chetouani M, Zong C. Motherese detection based
on segmental and supra-segmental features. In: IAPR international conference on pattern recognition, ICPR 2008; 2008.
33. Chetouani M, Faundez-Zanuy M, Gas B, Zarader JL. Investigation on LP-residual representations for speaker identification.
Pattern Recogn. 2009;42(3):487–94.
34. Duda RO, Hart PE, Stork DG. Pattern classification. 2nd edn.
New York: Wiley; 2000.
35. Kuncheva LI. Combining pattern classifiers: methods and algorithms. Wiley-Interscience; 2004.
36. Monte-Moreno E, Chetouani M, Faundez-Zanuy M, Sole-Casals
J. Maximum likelihood linear programming data fusion for
speaker recognition. Speech Commun; 2009 (in press).
37. Reynolds D. Speaker identification and verification using
Gaussian mixture speaker models. Speech Commun.
1995;17:91–108.
38. Leinonen L, Hiltunen T, Linnankoski I, Laakso MJ. Expression
of emotional–motivational connotations with a one-word utterance. J Acoust Soc Am. 1997;102(3):1853–63.
39. Pereira C, Watson C. Some acoustic characteristics of emotion.
In: International conference on spoken language processing
(ICSLP98); 1998. p. 927–30.
40. Lee CM, Yildirim S, Bulut M, Kazemzadeh A, Busso C, Deng Z,
Lee S, Narayanan S. Effects of emotion on different phoneme
classes. J Acoust Soc Am. 2004;116:2481.
41. Ringeval F, Chetouani M. A vowel based approach for acted
emotion recognition. In: Proceedings of interspeech’08; 2008.
42. André-Obrecht R. A new statistical approach for automatic speech
segmentation. IEEE Trans ASSP. 1988;36(1):29–40.
43. Rouas JL, Farinas J, Pellegrino F, André-Obrecht R. Rhythmic
unit extraction and modelling for automatic language identification. Speech Commun. 2005;47(4):436–56.
44. Burkhardt F. et al. A database of German emotional speech. In:
Proceedings of Interspeech; 2005. p. 1517–20.
45. Saratxaga I, Navas E, Hernaez I, Luengo I. Designing and
recording an emotional speech database for corpus based synthesis in Basque. In: Proceedings of LREC; 2006. p. 2126–9.
46. Keller E, Port R. Speech timing: Approaches to speech rhythm.
Special session on timing. In: Proceedings of the international
congress of phonetic sciences; 2007. p. 327–29.
47. Tincoff R, Hauser M, Tsao F, Spaepen G, Ramus F, Mehler J.
The role of speech rhythm in language discrimination: further
tests with a nonhuman primate. Dev Sci. 2005;8(1):26–35.
48. Ramus F, Nespor M, Mehler J. Correlates of linguistic rhythm in
the speech signal. Cognition. 1999;73(3):265–92.
49. Grabe E, Low EL. Durational variability in speech and the
rhythm class hypothesis. Papers in Laboratory Phonology 7,
Mouton; 2002.
IEEE Transactions on Audio, Speech and Language Processing
Work carried out as part of Fabien Ringeval's PhD thesis, Julie Demouy's speech-therapy dissertation, and the visit of György Szaszák (postdoctoral researcher at the Laboratory of Speech Acoustics, Budapest, Hungary).
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011
Automatic Intonation Recognition for the Prosodic
Assessment of Language-Impaired Children
Fabien Ringeval, Julie Demouy, György Szaszák, Mohamed Chetouani, Laurence Robel, Jean Xavier,
David Cohen, and Monique Plaza
Abstract—This study presents a preliminary investigation into
the automatic assessment of language-impaired children’s (LIC)
prosodic skills in one grammatical aspect: sentence modalities.
Three types of language impairments were studied: autism disorder (AD), pervasive developmental disorder-not otherwise
specified (PDD-NOS), and specific language impairment (SLI).
A control group of typically developing (TD) children that was
both age and gender matched with LIC was used for the analysis.
All of the children were asked to imitate sentences that provided
different types of intonation (e.g., descending and rising contours).
An automatic system was then used to assess LIC’s prosodic
skills by comparing the intonation recognition scores with those
obtained by the control group. The results showed that all LIC
have difficulties in reproducing intonation contours because they
achieved significantly lower recognition scores than TD children
on almost all studied intonations. Regarding the
“Rising” intonation, only SLI children had high recognition
scores similar to TD children, which suggests a more pronounced
pragmatic impairment in AD and PDD-NOS children. The automatic approach used in this study to assess LIC’s prosodic skills
confirms the clinical descriptions of the subjects’ communication
impairments.
Index Terms: Automatic intonation recognition, prosodic skills
assessment, social communication impairments.
I. INTRODUCTION
SPEECH is a complex waveform that conveys a lot of
useful information for interpersonal communication and
human–machine interaction. Indeed, a speaker not only produces
a raw message composed of textual information when
he or she speaks but also transmits a wide set of information
that modulates and enhances the meaning of the produced
message [1]. This additional information is conveyed in speech
by prosody and can be directly (e.g., through sentence modality
or word focus) or indirectly (e.g., idiosyncrasy) linked to the
message. To properly communicate, knowledge of the pre-established codes that are being used is also required. Indeed, the
richness of social interactions shared by two speakers through
speech strongly depends on their ability to use a full range of
pre-established codes. These codes link acoustic speech realization and both linguistic- and social-related meanings. The
acquisition and correct use of such codes in speech thus play
an essential role in the inter-subjective development and social
interaction abilities of children. This crucial step of speech
acquisition relies on cognition and is supposed to be functional
in the early stages of a child’s life [2].
Manuscript received April 17, 2010; revised August 15, 2010 and October
15, 2010; accepted October 18, 2010. Date of publication October 28, 2010;
date of current version May 13, 2011. This work was supported in part by the
French Ministry of Research and Superior Teaching and by the Hubert–Curien
partnership between France (EGIDE www.egide.asso.fr) and Hungary (TéT,
OMFB-00364/2008). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Renato De Mori.
F. Ringeval and M. Chetouani are with the Institute of Intelligent Systems
and Robotics, University Pierre and Marie Curie, 75005 Paris, France.
J. Demouy and J. Xavier are with the Department of Child and Adolescent
Psychiatry, Hôpital de la Pitié-Salpêtrière, University Pierre and Marie Curie,
75013 Paris, France.
G. Szaszák is with the Department for Telecommunication and Media Informatics, Budapest University of Technology and Economics, H-1117 Budapest,
Hungary.
L. Robel is with the Department of Child and Adolescent Psychiatry, Hôpital
Necker-Enfants Malades, 75015 Paris, France.
D. Cohen and M. Plaza are with the Department of Child and Adolescent
Psychiatry, Hôpital de la Pitié-Salpêtrière, University Pierre and Marie Curie,
75013 Paris, France, and also with the Institute of Intelligent Systems and
Robotics, University Pierre and Marie Curie, 75005 Paris, France.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2010.2090147
A. Prosody
Prosody is defined as the supra-segmental properties of the
speech signal that modulate and enhance its meaning. It aims to
construct discourse through expressive language at several communication levels, i.e., grammatical, pragmatic, and affective
prosody [3]. Grammatical prosody is used to signal syntactic
information within sentences [4]. Stress is used to signal, for
example, whether a token is being used as a noun (CONvict) or a
verb (conVICT). Pitch contours signal the ends of utterances and
denote whether they are, for example, questions (rising pitch)
or statements (falling pitch). Pragmatic prosody conveys the
speaker’s intentions or the hierarchy of information within the
utterance [3] and results in optional changes in the way an utterance is expressed [5]. Thus, it carries social information beyond
that conveyed by the syntax of the sentence. Affective prosody
serves a more global function than those served by the prior two
forms. It conveys a speaker’s general state of feeling [6] and includes associated changes in register when talking to different
listeners (e.g., peers, young children or people of higher social
status) [3].
Because prosodic deficits contribute to language, communication and social interaction disorders and lead to social
isolation, the atypical prosody in individuals with communication disorders became a research topic. It appears that
prosodic awareness underpins language skills, and a deficiency
in prosody may affect both language development and social
interaction.
1558-7916/$26.00 © 2010 IEEE
RINGEVAL et al.: AUTOMATIC INTONATION RECOGNITION FOR THE PROSODIC ASSESSMENT OF LANGUAGE IMPAIRED CHILDREN
B. Prosodic Disorders in Language-Impaired Children
Most children presenting speech impairments have limited
social interactions, which contributes to social isolation. A developmental language disorder may be secondary to hearing
loss or acquired brain injury, or may occur without specific
cause [7]. In this case, international classifications distinguish
specific language impairment (SLI), on one hand, and language
impairment symptomatic of a developmental disorder (e.g., Pervasive Developmental Disorders-PDD) on the other. The former
can affect both expressive and receptive language and is defined
as a “pure” language impairment [8]. The latter, PDD, is characterized by severe deficits and pervasive impairment in several areas of development such as reciprocal social interactions,
communication skills and stereotyped behaviors, interests, and
activities [9]. Three main disorders have been described [7]: 1)
autistic disorder (AD), which manifests as early onset language
impairment quite similar to that of SLI [10] and symptoms in
all areas that characterize PDD; 2) Asperger’s Syndrome, which
does not evince language delay; and 3) pervasive developmental
disorder-not otherwise specified (PDD-NOS), which is characterized by social, communicative and/or stereotypic impairments that are less severe than in AD and appear later in life.
Language-impaired children (LIC) may also show prosodic
disorders: AD children often sound different from their peers,
which adds a barrier to social integration [11]. Furthermore,
the prosodic communication barrier is often persistent while
other language skills improve [12]. Such disorders notably affect acoustic features such as pitch, loudness, voice quality, and
speech timing (i.e., rhythm).
The characteristics of the described LIC prosodic disorders
are various and seem to be connected with the type of language
impairment.
Specific Language Impairment: Intonation has been studied
very little in children with SLI [13]. Some researchers hypothesized that intonation provides reliable cues to grammatical
structure by referring to the theory of phonological bootstrapping [14], which claims that prosodic processing of spoken
language allows children to identify and then acquire grammatical structures as inputs. Consequently, difficulties in the
processing of prosodic features such as intonation and rhythm
may generate language difficulties [15]. While some studies
concluded that SLI patients do not have significant intonation
deficits and that intonation is independent of both morphosyntactic and segmental phonological impairments [16]–[18],
some others have shown small but significant deficits [13], [19],
[20]. With regard to intonation contour production, Wells and
Peppé [13] found that SLI children produced less congruent
contours than typically developing children. The authors hypothesized that SLI children understand the pragmatic context
but fail to select the corresponding contour. On the topic of
intonation imitation tasks, the results seem contradictory.
Van der Meulen et al. [21] and Wells and Peppé [13] found
that SLI children were less able to imitate prosodic features.
Several interpretations were proposed: 1) the weakness was
due to the task itself rather than to a true prosodic impairment
[21]; 2) a failure in working memory was more involved than
prosodic skills [21]; and 3) deficits in intonation production
at the phonetic level were sufficient to explain the failure to
imitate prosodic features [13]. Conversely, Snow [17] reported
that children with SLI showed a typical use of falling tones
and Marshall et al. [18] did not find any difference in the
ability to imitate intonation contours between SLI and typically
developing children.
Pervasive Developmental Disorders: Abnormal prosody was
identified as a core feature of individuals with autism [22]. The
observed prosodic differences include monotonic or machinelike intonation, aberrant stress patterns, deficits in pitch and intensity control and a “concerned” voice quality. These inappropriate patterns related to communication/sociability ratings
tend to persist over time even while other language skills improve [23]. Many studies have tried to define the prosodic features in Autism Spectrum Disorder (ASD) patients (for a review see [13]). With regard to intonation contour production
and intonation contour imitation tasks, the results are contradictory. In a reading-aloud task, Fosnot and Jun [24] found that
AD children did not distinguish questions and statements; all utterances sounded like statements. In an imitation condition task,
AD children performed better. The authors concluded that AD
subjects can produce intonation contours although they do not
use them or understand their communicative value. They also
observed a correlation between intonation imitation skills and
autism severity, which suggests that the ability to reproduce intonation contours could be an index of autism severity. Paul et
al. [3] found no difference between AD and TD children in the
use of intonation to distinguish questions and statements. Peppé
and McCann [25] observed a tendency for AD subjects to utter
a sentence that sounds like a question when a statement was appropriate. Le Normand et al. [26] found that children with AD
produced more words with flat contours than typically developing children. Paul et al. [27] documented the abilities to reproduce stress in a nonsense syllable imitation task of an ASD
group that included members with high-functioning autism, Asperger’s syndrome and PDD-NOS. Perceptual ratings and instrumental measures revealed small but significant differences
between ASD and typical speakers.
Most studies have aimed to determine whether AD or SLI
children’s prosodic skills differed from those of typically
developing children. They rarely sought to determine whether
the prosodic skills differed between diagnostic categories. We
must note that whereas AD diagnostic criteria are quite clear,
PDD-NOS is mostly diagnosed by default [28]; its criteria are
relatively vague, and it is statistically the largest diagnosed
category [29].
Language researchers and clinicians share the challenging
objective of evaluating LIC prosodic skills by using appropriate
tests. They aim to determine the LIC prosodic characteristics
to improve diagnosis and enhance children’s social interaction
abilities by adapting remediation protocols to the type of disorder. In this study, we used automated methods to assess one
aspect of the grammatical prosodic functions: sentence modalities (cf. Section I-A).
C. Prosody Assessment Procedures
Existing prosody assessment procedures such as the American ones [3], [30], the British PROP [31], the Swedish one [20],
and the PEPS-C [32] require expert judgments to evaluate the
child’s prosodic skills. For example, prosody can be evaluated
by recording a speech sample and agreeing on the transcribed
communicative functions and prosody forms. This method,
based on various protocols, requires an expert transcription.
As the speech is unconstrained during the recording of the
child, the sample necessarily involves various forms of prosody
between the speakers, which complicates the acoustic data
analysis. Thus, most of the prosodic communication levels (i.e.,
grammatical, pragmatic and affective, cf. Section I-A) are assessed using the PEPS-C with a constrained speech framework.
The program delivers pictures on a laptop screen both as stimuli
for expressive utterances (output) and as response choices to
acoustic stimuli played by the computer (input). For the input
assessment, there are only two possible responses for each
proposed item to avoid undue demand on auditory memory.
As mentioned by the authors, this feature creates a bias that
is hopefully reduced by the relatively large number of items
available for each task. For the output assessment, the examiner
has to judge whether the sentences produced by the children
can be matched with the prosodic stimuli of each task. Scoring
options given to the tester are categorized into two or three
possibilities to score the imitation such as “good/fair/poor” or
“right/wrong.” As the number of available items for judging
the production of prosody is particularly low, this procedure
does not require a high level of expertise. However, we might
wonder whether the richness of prosody can be evaluated (or
categorized) in such a discrete way. Alternatively, using many
more evaluation items could make it difficult for the tester to
choose the most relevant ones.
Some recent studies have proposed automatic systems to assess prosody production [33], speech disorders [34] or even
early literacy [35] in children. Multiple challenges will be faced
by such systems in characterizing the prosodic variability of
LIC. Whereas acoustic characteristics extracted by many automatic speech recognition (ASR) systems are segmental (i.e.,
computed over a time-fixed sliding window that is typically 32
ms with an overlap ratio of 1/2), prosodic features are extracted
in a supra-segmental framework (i.e., computed over various
time scales). Speech prosody concerns many perceptual features
(e.g., pitch, loudness, voice quality, and rhythm) that are all included in the speech waveform. Moreover, these acoustic correlates of prosody present high variability due to a set of contextual (e.g., disturbances due to the recording environment) and
speaker’s idiosyncratic variables (e.g., affect [36] and speaking
style [37]). Acoustic, lexical, and linguistic characteristics of solicited and spontaneous children’s speech were also correlated
with age and gender [38].
As characterizing speech prosody is difficult, six design principles were defined in [33]: 1) highly constraining methods to
reduce unwanted prosodic variability due to assessment procedure contextual factors; 2) a “prosodic minimal pairs” design
for one task to study prosodic contrast; 3) robust acoustic features to ideally detect automatically the speaker’s turns, pitch
errors and mispronunciations; 4) fusion of relevant features to
find the importance of each on the other in these disorders; 5)
both global and dynamical features to catch specific contrasts
of prosody; and 6) parameter-free techniques in which the algorithms
either are based on established facts about prosody (e.g.,
the phrase-final lengthening phenomenon) or are developed in
exploratory analyses of a separate data set whose characteristics
are quite different from the main data in terms of speakers.
The system proposed by van Santen et al. [33] assesses
prosody on grammatical (lexical stress and phrase boundary),
pragmatic (focus and style), and affective functions. Scores
are evaluated by both humans and a machine through spectral,
fundamental frequency and temporal information. In almost
all tasks, it was found that the automated scores correlated
with the mean human judgments approximately as well as the
judges’ individual scores. Similar results were found with the
system termed PEAKS [34] wherein speech recognition tools
based on hidden Markov models (HMMs) were used to assess
speech and voice disorders in subjects with conditions such as
a removed larynx and cleft lip or palate. Therefore, automatic
assessments of both speech and prosodic disorders are able to
perform as well as human judges specifically when the system
tends to include the requirements mentioned by [33].
D. Aims of This Study
Our main objective was to propose an automatic procedure
to assess LIC prosodic skills. This procedure must differentiate LIC patients from TD children using prosodic impairment,
which is a known clinical characteristic of LIC (cf. Section I-B).
It should also overcome the difficulties created by categorizing
the evaluations and by human judging bias (cf. Section I-C). These
requirements were motivated by two observations: 1) the acoustic correlates
of prosody are perceptually much too complex to be fully categorized into items by humans; and 2) these features cannot be
reliably judged by humans, who have subjective opinions [39],
inasmuch as inter-judge variability is also problematic. Indeed, biases and inconsistencies in perceptual judgment were
documented [40], and the relevant features for characterizing
prosody in speech were defined [41], [42]. However, despite
progress in extracting a wide set of prosodic features, there is
no clear consensus today about the most efficient features.
In the present study, we focused on the French language and
on one aspect of the prosodic grammatical functions: sentence
modalities (cf. Section I-A). As the correspondences between
“prosody” and “sentence-type” are language specific, the intonation itself was classified in the present work. We aimed to
compare the performances among different children’s groups
(e.g., TD, AD, PDD-NOS and SLI) in a proposed intonation
imitation task by using automated approaches.
Imitation tasks can commonly be completed by LIC patients, even
those with autism [43]. This ability can therefore be used to test
a patient's prosodic skills without the limitations imposed by their language
disability. Imitation tasks nevertheless introduce bias in the data because the
produced speech is not natural and spontaneous. Consequently,
the intonation contours that were reproduced by subjects may
not correspond with the original ones. However, all subjects
were confronted with the same task of a single protocol of data
recording (cf. Section V-B). Moreover, the prosodic patterns that
served to characterize the intonation contours were collected
from TD children (cf. Section III-D). In other words, the bias
introduced by TD children in the proposed task was included in
the system’s configuration. In this paper, any significant deviation
from this bias will be considered to be related to grammatical prosodic skill impairments, i.e., intonation contours imitation deficiencies.
The methodological novelty brought by this study lies in the
combination of static and dynamic approaches to automatically
characterize the intonation contours. The static approach corresponds to a typical state-of-the-art system: statistical measures
were computed on pitch and energy features, and a decision was
made on a sentence. The dynamic approach was based on hidden
Markov models wherein a given intonation contour is described
by a set of prosodic states [44].
The following section presents previous works that accomplished intonation contours recognition. Systems that were used
in this study are described in Section III. The recruitment and the
clinical evaluation of the subjects are presented in Section IV.
The material used for the experiments is given in Section V. Results are provided in Section VI while Section VII is devoted to
a discussion, and Section VIII contains our conclusions.
II. RELATED WORKS IN INTONATION RECOGNITION
The automatic characterization of prosody was intensively
studied during the last decade for several purposes such
as emotion, speaker, and speech recognition [45]–[47] and
infant-directed speech, question, dysfluency, and certainty detection [48]–[51]. The performance achieved by these systems
is clearly degraded when they deal with spontaneous speech or
certain specific voice cases (e.g., due to the age of a child [52] or
a pathology [53]). The approaches used for automatically processing prosody must deal with three key questions: 1) the time
scale to define the extraction locus of features (e.g., speaker
turn and specific acoustic or phonetic containers such as voiced
segments or vowels) [54]; 2) the set of prosodic descriptors
used for characterizing prosody (e.g., low-level descriptors or
language models); and 3) the choice of a recognition scheme
for automatic decisions on the a priori classes of the prosodic
features. Fusion techniques were proposed to face this apparent
complexity [55], [56]. A fusion can be achieved on the three
key points mentioned above, e.g., unit-based (vowel/consonant)
fusion [57], features-based (acoustic/prosodic) fusion [58], and
classifier-based fusion [59].
Methods that are used to characterize the intonation should be
based on pitch features because the categories they must identify
are defined by the pitch contour. However, systems found in the
literature have shown that the inclusion of other types of information such as energy and duration is necessary to achieve good
performance [60], [61]. Furthermore, detection of motherese,
i.e., the specific language characterized by high pitch values and
variability that is used by a mother when speaking to her child,
requires other types of features than those derived from pitch
to reach satisfactory recognition scores [59].
Narayanan et al. proposed a system that used features derived
from the Rise-Fall-Connection (RFC) model of pitch with an
$n$-gram prosodic language model for four-way pitch accent labeling [60]. RFC analysis considers a prosodic event as being
comprised of two parts: a rise component followed by a fall
component. Each component is described by two parameters:
amplitude and duration. In addition, the peak value of pitch for
the event and its position within the utterance are recorded in
the RFC model. A recognition score of 56.4% was achieved
by this system on the Boston University Radio News Corpus
(BURNC), which includes 3 hours of read speech (radio quality)
produced by six adults.
Fig. 1. Scheme of the intonation recognition system.
Rosenberg et al. compared the discriminative usefulness of
units such as vowels, syllables, and word levels in the analysis
of acoustic indicators of pitch accent [61]. Features were derived from pitch, energy, and duration through a set of statistical measures (e.g., max, min, mean, and standard deviation)
and normalized to speakers by a z-score. By using logistic regression models, word level was found to provide the best score
on the BURNC corpus with a recognition rate of 82.9%.
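The per-speaker z-score normalization mentioned above can be illustrated with a minimal sketch. The function name and the pitch values are hypothetical, and the population standard deviation is an assumption; [61] may use a different estimator.

```python
from statistics import mean, pstdev

def zscore_by_speaker(values, speakers):
    """Normalize each value by the mean and standard deviation of its speaker."""
    grouped = {}
    for value, speaker in zip(values, speakers):
        grouped.setdefault(speaker, []).append(value)
    stats = {s: (mean(vs), pstdev(vs)) for s, vs in grouped.items()}
    normalized = []
    for value, speaker in zip(values, speakers):
        mu, sigma = stats[speaker]
        # A constant contour has no spread; map it to 0 rather than divide by 0.
        normalized.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalized

# Hypothetical mean pitch values (Hz) for two speakers with different registers.
pitch_hz = [200.0, 220.0, 240.0, 100.0, 110.0, 120.0]
speaker = ["a", "a", "a", "b", "b", "b"]
normalized = zscore_by_speaker(pitch_hz, speaker)
```

After normalization the two speakers occupy the same scale, so a low value for one speaker is comparable to a low value for the other.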
In a system proposed by Szaszák et al. [44], an HMM-based
classifier was developed with the aim of evaluating intonation
production in a speech training application for hearing impaired
children. This system was used to classify five intonation classes
and was compared to subjective test results. The automatic classifier provided a recognition rate of 51.9%, whereas humans
achieved 69.4%. A part of this work was reused in this study as a
so-called “dynamic pitch contour classifier” (cf. Section III-B).
III. INTONATION CONTOURS RECOGNITION
The processing stream proposed in this study includes steps
of prosodic information extraction and classification (Fig. 1).
However, even if the data collection phase is realized up-stream
(cf. Section V-B), the methods used for characterizing the intonation correspond to a recognition system. As the intonation
contours analyzed in this study were provided by the imitation
of prerecorded sentences, the speaker turn unit was used as a
data input for the recognition system. This unit refers to the moment where a child imitates one sentence. Therefore, this study
does not deal with read or spontaneous speech but rather with
constrained speech where spontaneity may be found according
to the child.
During the features extraction step, both pitch and energy
features, i.e., low-level descriptors (LLDs), were extracted from
the speech by using the Snack toolkit [62]. The fundamental
frequency was calculated by the ESPS method with a frame
rate of 10 ms. Pre-processing steps included an anti-octave
jump filter to reduce pitch estimation errors. Furthermore, pitch
was linearly extrapolated on unvoiced segments (no longer than
250 ms, empirically) and smoothed by an 11-point averaging
filter. Energy was also smoothed with the same filter. Pitch and
energy features were then normalized to reduce inter-speaker
and recording-condition variability. Fundamental frequency
values were divided by the average value of all voiced frames,
and energy was normalized to 0 dB. Finally, both first-order and
second-order derivatives ($\Delta$ and $\Delta\Delta$) were computed from the
pitch and energy features so that a given intonation contour was
described by six prosodic LLDs, as a basis for the following
characterization steps.
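The normalization and derivative steps above can be sketched as follows. This is only an illustration under stated assumptions: the pitch track is taken as already estimated and filled in over unvoiced segments, the derivatives are approximated by simple frame-to-frame differences, and "normalized to 0 dB" is read as shifting the energy contour so that its maximum sits at 0 dB. All function and key names are hypothetical.

```python
def moving_average(x, width=11):
    """Centered moving average; the window shrinks near the edges."""
    half = width // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def delta(x):
    """Frame-to-frame difference, padded so the length matches x."""
    return [0.0] + [b - a for a, b in zip(x, x[1:])]

def prosodic_llds(f0_hz, energy_db, voiced):
    """Return six LLDs: pitch, energy and their first and second differences."""
    f0 = moving_average(f0_hz)
    voiced_f0 = [f for f, v in zip(f0, voiced) if v]
    mean_f0 = sum(voiced_f0) / len(voiced_f0)
    pitch = [f / mean_f0 for f in f0]          # divide by mean of voiced frames
    smoothed_energy = moving_average(energy_db)
    peak = max(smoothed_energy)
    energy = [e - peak for e in smoothed_energy]  # assumption: peak at 0 dB
    d_pitch, d_energy = delta(pitch), delta(energy)
    return {
        "pitch": pitch, "d_pitch": d_pitch, "dd_pitch": delta(d_pitch),
        "energy": energy, "d_energy": d_energy, "dd_energy": delta(d_energy),
    }

# Hypothetical 30-frame contours (10 ms frames), all frames voiced.
f0 = [180.0 + 2.0 * i for i in range(30)]
energy = [-30.0 + (i % 7) for i in range(30)]
llds = prosodic_llds(f0, energy, [True] * 30)
```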
Intonation contours were then separately characterized by
both static and dynamic approaches (cf. Fig. 1). Before the
classification step, the static approach requires the extraction
of LLD statistical measures, whereas the dynamic approach is
optimized to directly process the prosodic LLDs. As these two
approaches were processing prosody in distinctive ways, we assumed that they were providing complementary descriptions of
the intonation contours. Output probabilities returned by each
system were thus fused to get a final label of the recognized
intonation. A ten-fold cross-validation scheme was used for
the experiments to reduce the influence of data splitting in both
the learning and testing phases [63]. The folds were stratified,
i.e., intonation contours were equally distributed in the learning
data sets to ensure that under-represented intonation contours were
not disadvantaged during the experiments.
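A stratified ten-fold split of this kind can be sketched as below. The round-robin assignment and the label counts are illustrative assumptions, not the exact procedure or data of the study.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of this class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def train_test_splits(folds):
    """Yield (learning indices, testing indices) for each fold in turn."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# Hypothetical class sizes; "Floating" is the smallest class.
labels = (["Descending"] * 40 + ["Falling"] * 30
          + ["Floating"] * 10 + ["Rising"] * 20)
folds = stratified_folds(labels)
```

With these counts every fold receives exactly one "Floating" sentence, so the smallest class is present in each learning and testing set.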
A. Static Classification of the Intonation Contour
This approach is a typical system for classifying prosodic
information by making an intonation decision on a sentence using LLD statistical measures concatenated into a
super-vector. Prosodic features, e.g., pitch, energy and their
derivatives ($\Delta$ and $\Delta\Delta$), were characterized by a set of 27 statistical measures (Table I) such that 162 features in total composed
the super-vector that was used to describe the intonation in the
static approach. The set of statistical measures included not
only traditional ones such as maximum, minimum, the four first
statistical moments, and quartiles but also perturbation-related
coefficients (e.g., jitter and shimmer), RFC derived features
(e.g., the relative positions of the minimum and maximum
values) and features issued from question detection systems
(e.g., the proportion/mean of rising/descending values) [49].
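The construction of the super-vector can be illustrated with a reduced set of measures. The sketch below uses only ten hypothetical functionals per contour, not the 27 of Table I; with 27 measures applied to the six LLDs the same concatenation gives 27 × 6 = 162 features. The quartile convention is that of Python's `statistics.quantiles` default, which is an assumption.

```python
from statistics import mean, pstdev, quantiles

def functionals(contour):
    """A handful of sentence-level statistics for one LLD contour."""
    q1, median, q3 = quantiles(contour, n=4)
    last = len(contour) - 1
    rising = sum(1 for a, b in zip(contour, contour[1:]) if b > a)
    return {
        "max": max(contour),
        "min": min(contour),
        "mean": mean(contour),
        "std": pstdev(contour),
        "q1": q1, "median": median, "q3": q3,
        "rel_pos_max": contour.index(max(contour)) / last,
        "rel_pos_min": contour.index(min(contour)) / last,
        "prop_rising": rising / last,
    }

def super_vector(llds):
    """Concatenate the functionals of every LLD in a fixed order."""
    vector = []
    for name in sorted(llds):
        stats = functionals(llds[name])
        vector.extend(stats[key] for key in sorted(stats))
    return vector

# Hypothetical contour reused for all six LLDs, only to show the dimensions.
contour = [1.0, 3.0, 2.0, 5.0, 4.0]
names = ("pitch", "d_pitch", "dd_pitch", "energy", "d_energy", "dd_energy")
llds = {name: contour for name in names}
vector = super_vector(llds)  # 10 functionals x 6 LLDs = 60 values
```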
The ability of these features to discriminate and characterize
the intonation contours was evaluated by the RELIEF-F algorithm [64] in a ten-fold cross-validation framework. RELIEF-F
was based on the computation of both a priori and a posteriori
entropy of the features according to the intonation contours.
This algorithm was used to initialize a sequential forward selection (SFS) approach for the classification step. Ranked features
were sequentially inserted in the prosodic features super-vector,
and we only kept those that created an improvement in the classification task. This procedure has permitted us to identify the
relevant prosodic features for intonation contour characterization. However, the classification task was done 162 times, i.e.,
the number of extracted features in total. A $k$-nearest-neighbors algorithm was used to classify the features ($k$ was set to
three); the $k$-nn classifier estimates the maximum likelihood on
a posteriori probabilities $P_{\mathrm{stat}}(I_n \mid S)$ of recognizing an intonation contour $I_n$
($N$ intonation classes) on a tested sentence $S$ by
searching the $k$ labels (issued from a learning phase) that contain
the closest set of prosodic features to those issued from the
tested sentence $S$. The recognized intonation $\hat{I}_{\mathrm{stat}}$ was obtained
by an $\arg\max$ function on the estimates of the a posteriori probabilities (1) [63]:
$$\hat{I}_{\mathrm{stat}} = \arg\max_{1 \le n \le N} P_{\mathrm{stat}}(I_n \mid S) \qquad (1)$$
TABLE I
SET OF STATISTICAL MEASURES USED FOR STATIC MODELING OF PROSODY
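The decision in (1) can be sketched with a small nearest-neighbor example. The sketch assumes a Euclidean distance and estimates each a posteriori probability as the share of that class among the $k = 3$ nearest learning vectors; the two-dimensional feature vectors are hypothetical stand-ins for the selected prosodic features.

```python
import math
from collections import Counter

def knn_posteriors(train_vectors, train_labels, query, classes, k=3):
    """Estimate P(class | query) as the share of each class among the k nearest."""
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return {c: votes[c] / k for c in classes}

def recognize(posteriors):
    """Return the class with the highest estimated posterior (arg max)."""
    return max(posteriors, key=posteriors.get)

classes = ["Descending", "Falling", "Floating", "Rising"]
# Hypothetical learning data: two features per sentence.
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.0)]
train_y = ["Falling", "Falling", "Rising", "Rising", "Descending"]
post = knn_posteriors(train_x, train_y, (1.0, 0.9), classes)
label = recognize(post)
```

Here two of the three nearest neighbors are "Rising", so the estimated posterior for that class is 2/3 and it is the recognized intonation.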
B. Dynamic Classification of the Intonation Contour
The dynamic pitch contour classifier used hidden Markov
models (HMMs) to characterize the intonation contours by
using prosodic LLDs provided by the feature extraction steps.
This system was analogous to an ASR system; however, the
features were based on pitch and energy, and the prosodic
contours were thus modeled instead of phoneme spectra or
cepstra. The dynamic description of intonation requires a determination of both the location and the duration of the intonation
units that represent different states in the prosodic contours
(Fig. 2). Statistical distributions of the LLDs were estimated by
Gaussian mixture models (GMMs) as mixtures of up to eight
Gaussian components. Observation vectors (prosodic states in
Fig. 2) were six-dimensional, i.e., equal to the number of LLDs.
Because some sentences were conveying intonation with much
shorter duration than others, both a fixed and a varying number
of states were used according to sentence duration to set the
HMMs for the experiments. Models with a fixed number of 11 states
patterned by eight Gaussian mixtures were found to yield the
best recognition performance in empirical optimization for
Hungarian. In this case, the same configuration was applied
to French because the intonations we wished to characterize
were identical to those studied in [44]. Additionally, a silence
model was used to set the HMM’s configuration states for
the beginning and the ending of a sentence. The recognized
intonation $\hat{I}_{\mathrm{dyn}}$ was obtained by an $\arg\max$ function on the
a posteriori probabilities $P_{\mathrm{dyn}}(I_n \mid S)$ of the $N$ intonation classes $I_n$ (2):
$$\hat{I}_{\mathrm{dyn}} = \arg\max_{1 \le n \le N} P_{\mathrm{dyn}}(I_n \mid S) = \arg\max_{1 \le n \le N} \frac{P(O_S \mid I_n)\, P(I_n)}{P(O_S)} \qquad (2)$$
The estimation of $P_{\mathrm{dyn}}(I_n \mid S)$ was decomposed in the same
manner as in speech recognition; according to Bayes’ rule,
$P(O_S \mid I_n)$ specifies the prosodic probability of observations $O_S$
extracted from a tested sentence $S$, where $P(I_n)$ is the probability
associated with the intonation contours and $P(O_S)$ is the
probability associated with the sentences.
Fig. 2. Principle of HMM prosodic modeling of pitch values extracted from a sentence.
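The Bayes decomposition behind (2) can be illustrated numerically. The sketch below does not implement HMM training or decoding: it assumes that each intonation model has already returned a log-likelihood for the observations of a sentence, and the values and the uniform priors are hypothetical.

```python
import math

def posteriors_from_loglik(loglik, priors):
    """Apply Bayes' rule to per-class log-likelihoods log P(O | I_n)."""
    joint = {c: loglik[c] + math.log(priors[c]) for c in loglik}
    peak = max(joint.values())
    # log P(O) through the log-sum-exp trick, which stays stable for the
    # very negative scores produced by long observation sequences.
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in joint.items()}

# Hypothetical log-likelihoods returned by four intonation models.
loglik = {"Descending": -1210.0, "Falling": -1205.0,
          "Floating": -1230.0, "Rising": -1207.0}
priors = {c: 0.25 for c in loglik}
post = posteriors_from_loglik(loglik, priors)
recognized = max(post, key=post.get)
```

Because the denominator is the same for every class, the recognized label depends only on the numerator; the normalization matters when the posteriors are later combined with those of another classifier.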
C. Fusion of the Classifiers
Because the static and dynamic classifiers provide different
information by using distinct processes to characterize the intonation,
a combination of the two should improve recognition
performance. Although many sophisticated decision techniques
do exist to fuse them [55], [56], we used a weighted sum of the
a posteriori probabilities:
$$\hat{I} = \arg\max_{1 \le n \le N} \left[ \alpha\, P_{\mathrm{stat}}(I_n \mid S) + (1 - \alpha)\, P_{\mathrm{dyn}}(I_n \mid S) \right] \qquad (3)$$
This approach is suitable because it provides the contribution
of each classifier used in the fusion. In (3), the label $\hat{I}$ of the
final recognized intonation contour is attributed to a sentence $S$
by weighting the a posteriori probabilities provided by both
static and dynamic based classifiers, $P_{\mathrm{stat}}(I_n \mid S)$ and $P_{\mathrm{dyn}}(I_n \mid S)$, by a factor
$\alpha$ ($0 \le \alpha \le 1$). To assess the similarity between these two classifiers, we
calculated the $Q$ statistic [50]:
$$Q = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}} \qquad (4)$$
where $N^{00}$ is the number of times both classifiers are wrong,
$N^{11}$ is the number of times both classifiers are correct, $N^{10}$ is
the number of times when the first classifier is correct and the
second is wrong and $N^{01}$ is the number of times when the first
classifier is wrong and the second classifier is correct. The $Q$
statistic takes values between [−1; 1] and the closer the value
is to 0, the more dissimilar the classifiers are. For example,
$Q = 0$ represents total dissimilarity between the two
classifiers. The $Q$ statistic was used to evaluate how complementary the audio and visual information is for dysfluency detection in a child’s spontaneous speech [50].
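Both the weighted sum in (3) and the $Q$ statistic in (4) can be written in a few lines. The sketch below is illustrative only; the posterior values and the correctness patterns of the two classifiers are hypothetical.

```python
def fuse(p_static, p_dynamic, alpha):
    """Weighted sum of two posterior dictionaries, as in (3)."""
    return {c: alpha * p_static[c] + (1.0 - alpha) * p_dynamic[c]
            for c in p_static}

def q_statistic(truth, pred_a, pred_b):
    """Q statistic of (4), computed from the joint correctness of two classifiers."""
    n11 = n00 = n10 = n01 = 0
    for t, a, b in zip(truth, pred_a, pred_b):
        ok_a, ok_b = a == t, b == t
        if ok_a and ok_b:
            n11 += 1
        elif not ok_a and not ok_b:
            n00 += 1
        elif ok_a:
            n10 += 1
        else:
            n01 += 1
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined when both products are zero")
    return (n11 * n00 - n01 * n10) / denominator

# Fusion of hypothetical posteriors with equal weights.
fused = fuse({"Rising": 0.6, "Falling": 0.4},
             {"Rising": 0.2, "Falling": 0.8}, alpha=0.5)
label = max(fused, key=fused.get)

# Eight hypothetical sentences: 1 marks the true label, 0 a wrong prediction.
truth = [1] * 8
pred_a = [1, 1, 1, 1, 1, 0, 0, 0]
pred_b = [1, 1, 1, 0, 0, 1, 1, 0]
q = q_statistic(truth, pred_a, pred_b)  # (3*1 - 2*2) / (3*1 + 2*2) = -1/7
```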
D. Recognition Strategies
Recognition systems were first used on the control group data
to define the target scores for the intonation contours. To achieve
this goal, TD children’s sentences were stratified according to
the intonation in a ten-fold cross-validated fashion and the a
posteriori probabilities provided by both static and dynamic intonation classifiers were fused according to (3). LIC prosodic
abilities were then analyzed by testing their intonation contours
with a recognition system that had learned those produced by
the control group (Fig. 3).
The TD children’s recognition scheme was thus cross-validated with those of LIC: testing folds of each LIC group were
all processed with the ten learning folds that were used to classify the TD children’s intonation contours. Each testing fold
provided by data from the LIC was thus processed ten times.
For comparison, the relevant features set that was obtained for
TD children by the static classifier was used to classify the LIC
intonation contours. However, the optimal weights for fusion
of both static and dynamic classifiers were estimated for each
group separately, i.e., TD, AD, PDD-NOS, and SLI.
Fig. 3. Strategies for intonation contours recognition.
IV. RECRUITMENT AND CLINICAL EVALUATIONS OF SUBJECTS
A. Subjects
Thirty-five monolingual French-speaking subjects aged 6 to
18 years old were recruited in two university departments of
child and adolescent psychiatry located in Paris, France (Université Pierre et Marie Curie/Pitié-Salpêtrière Hospital and Université René Descartes/Necker Hospital). The patients consulted for PDD or SLI and were diagnosed with AD, PDD-NOS, or SLI according to the DSM-IV criteria [8]. Socio-demographic and clinical characteristics of the subjects are summarized in Table II.
To investigate whether prosodic skills differed from those of
TD children, a monolingual control group matched for
chronological age (mean age … years; standard deviation … years)
with a ratio of 2 TD to 1 LIC child was recruited
in elementary, secondary, and high schools. None of the TD
subjects had a history of speech, language, hearing, or general
learning problems.
AD and PDD-NOS groups were assigned from patients’
scores on the Autism Diagnostic Interview-Revised [66]
and the Childhood Autism Rating Scale [67]. The psychiatric
assessments and parental interviews were conducted by four
child-psychiatrists specialized in autism. Of note, all PDD-NOS
also fulfilled diagnostic criteria for Multiple Complex Developmental Disorder [68], [69], a research diagnosis used to limit
PDD-NOS heterogeneity and improve its stability over time
[70]. SLI subjects were administered a formal diagnosis of
SLI by speech pathologists and child psychiatrists specialized
in language impairments. They all fulfilled criteria for Mixed
Phonologic–Syntactic Disorder according to Rapin and Allen’s
classification of Developmental Dysphasia [9]. This syndrome
includes poor articulation skills, ungrammatical utterances and
comprehension skills better than language production although
inadequate overall for their age. All LIC subjects received a
psychometric assessment for which they obtained Performance
Intellectual Quotient scores above 70, which meant that none
of the subjects showed mental retardation.
TABLE II
SOCIODEMOGRAPHIC AND CLINICAL CHARACTERISTICS OF SUBJECTS
Statistics are given in the following style: Mean (±SD);
AD: autism disorder; PDD-NOS: pervasive developmental
disorder-not otherwise specified; SLI: specific language impairment;
SD: standard deviation; ADI-R: autism diagnostic interview-revised [66];
CARS: childhood autism rating scale [67].
B. Basic Language Skills of Pathologic Subjects
To compare basic language skills between pathological
groups, all subjects were administered an oral language assessment using three tasks from the ELO Battery [71]: 1) Receptive
Vocabulary; 2) Expressive Vocabulary; and 3) Word Repetition.
ELO is dedicated to children 3–11 years old. Although many
subjects of our study were older than 11, their oral language
difficulties did not allow the use of other tests because of a
substantial floor effect. Consequently, we adjusted the scoring
system and defined severity levels. We determined for
each subject the corresponding age for each score and calculated the discrepancy between “verbal age” and “chronological
age.” The difference was converted into severity levels using
a five-level Likert-scale with 0 standing for the expected level
at that chronological age, 1 standing for a 1-year deviation
from the expected level at that chronological age, 2 for a 2-year
deviation, 3 for a 3-year deviation, and 4 standing for 4 or more
years of deviation.
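The conversion from age discrepancy to severity level can be sketched as follows. The function name is hypothetical, and two points are assumptions that the text does not specify: fractional delays are truncated to whole years, and a verbal age above the chronological age is mapped to level 0.

```python
def severity_level(verbal_age, chronological_age):
    """Map the delay (in years) between chronological and verbal age to 0..4."""
    delay = chronological_age - verbal_age
    # Assumption: truncate to whole years, then clip to the five-level scale.
    return max(0, min(4, int(delay)))

levels = [
    severity_level(9, 9),    # no delay
    severity_level(7, 9),    # 2-year delay
    severity_level(5, 12),   # 7-year delay, capped at the top level
    severity_level(10, 9),   # verbal age ahead of chronological age
]
```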
Receptive Vocabulary: This task containing 20 items requires
word comprehension. The examiner gives the patient a picture
booklet and tells him or her: “Show me the picture in which there
is a .” The subject has to select from among four pictures the
one corresponding to the uttered word. Each correct identification gives one point, and the maximum score is 20.
Expressive Vocabulary: This task containing 50 items calls
for the naming of pictures. The examiner gives the patient a
booklet comprised of object pictures and asks him or her “What
is this?” followed by “What is he/she doing?” for the final ten
pictures, which show actions. Each correct answer gives one
point and the maximum score for objects is 20 for children from
3 to 6, 32 for children from 6 to 8, and 50 for children over 9.
TABLE III
BASIC LANGUAGE SKILLS OF PATHOLOGIC SUBJECTS
Statistics are given in the following style: Mean (±SD);
AD: autism disorder; PDD-NOS: pervasive developmental disorder-not
otherwise specified; SLI: specific language impairment.
TABLE IV
SPEECH MATERIAL FOR THE INTONATION IMITATION TASK
Word Repetition: This task is comprised of 2 series of 16
words and requires verbal encoding and decoding. The first series contains disyllabic words with few consonant groups. The
second contains longer words with many consonant groups,
which allows the observation of any phonological disorders.
The examiner says “Now, you are going to repeat exactly what
I say. Listen carefully, I won’t repeat.” Then, the patient repeats
the 32 words, and the maximum score is 32.
As expected given clinical performance skills in oral communication, no significant differences were found in vocabulary
tasks depending on the groups’ mean severity levels (Table III):
… for the receptive task and … for the expressive
task. All three groups showed an equivalent delay of 1 to 2 years
relative to their chronological ages. The three groups were similarly impaired in the word repetition task, which requires phonological skills. The average delay was 3 years relative to their
chronological ages (…).
V. DATABASE DESIGN
A. Speech Materials
Our main goal was to compare the children’s abilities to reproduce different types of intonation contours. In order to facilitate reproducibility and to avoid undue cognitive demand,
the sentences were phonetically easy and relatively short. According to French prosody, 26 sentences representing different
modalities (Table IV) and four types of intonations (Fig. 4)
were defined for the imitation task. Sentences were recorded by
means of the Wavesurfer speech analysis tool [72]. This tool was
also used to validate that the intonation contour of the sentences
matched the patterns of each intonation category (Fig. 4). The
reader should be careful with the English translations of
the sentences given in Table IV, as they may produce different
intonation contours due to French prosodic dependencies.
B. Recording the Sentences
Children were recorded in their usual environment, i.e., the
clinic for LIC and elementary school/high school for the control
group. A mid-quality microphone (Logitech USB Desktop)
plugged into a laptop running Audacity software was used for the
recordings. In order to limit the perception of the intonation
groups among the subjects, sentences were randomly played
with an order that was fixed prior to the recordings. During the
imitation task, subjects were asked to repeat exactly the sentences they had heard even if they did not catch one or several
words. If the prosodic contours of the sentences were reproduced with too much exaggeration or the children showed difficulties, then
the sentences were replayed a couple of times.
To ensure that clean speech was analyzed in this study, the
recorded data were carefully controlled. Indeed, the reproduced
sentences should, as far as possible, be free of false starts, repetitions, noises from the environment and speech not related to
the task. All of these perturbations were found in the recordings. As they might influence the decision taken on the sentences
when characterizing their intonation, sentences reproduced by
the children were thus manually segmented and post-processed.
Noisy sentences were only kept when they presented false-starts
or repetitions that could be suppressed without changing the
intonation contour of the sentence. All other noisy sentences
were rejected so that from a total of 2813 recorded sentences,
2772 sentences equivalent to 1 hour of speech in total were kept
for analysis (Table V).
Fig. 4. Groups of intonation according to the prosodic contour: (a) “Descending pitch,” (b) “Falling pitch,” (c) “Floating pitch” and (d) “Rising pitch.” (a): “That’s
Rémy who will be content.,” (b): “As I’m happy!,” (c): “Anna will come with you.,” (d): “Really?” Estimated pitch values are shown as solid lines while the
prosodic prototypes are shown as dashed lines.
TABLE V
QUANTITY OF ANALYZED SENTENCES
REF: speech material; TD: typically developing; AD: autism disorder;
PDD: pervasive developmental disorders not-otherwise specified;
SLI: specific language impairment.
TABLE VI
SENTENCE DURATION STATISTICS OF TYPICALLY DEVELOPING CHILDREN
Statistics for sentence duration (in s) are given in the following style:
Mean (±SD); REF: reference sentences; TD: typically developing.
TABLE VII
STATIC, DYNAMIC AND FUSION INTONATION RECOGNITION PERFORMANCES
FOR TYPICALLY DEVELOPING CHILDREN
Performances are given as percentage of recognition
from a stratified ten-fold cross-validation based approach.
VI. RESULTS
Experiments conducted to study the children’s prosodic
abilities in the proposed intonation imitation task were divided
into two main steps. The first step was composed of a duration
analysis of the reproduced sentences by means of statistical
measures such as mean and standard deviation values. In the
second step, we used the classification approaches described
in Section III to automatically characterize the intonation. The
recognition scores of TD children are seen as targets to which
we can compare the LIC. Any significant deviation from the
mean TD children’s score will be thus considered to be relevant
to grammatical prosodic skill impairments, i.e., intonation contours imitation deficiencies. A non-parametric method was used
to make a statistical comparison between children’s groups, i.e.,
a p-value was estimated by the Kruskal–Wallis method. The
p-value corresponds to the probability that the compared data
have issued from the same population; $p < 0.05$ is commonly
used as an alternative hypothesis where there is less than 5% of
chance that the data have issued from an identical population.
A. Typically Developing Children
Sentence Duration: Results showed that the patterns of sentence duration were conserved for all intonation groups when
the sentences were reproduced by TD children (…). Consequently,
the TD children’s imitations of the intonation contours have conserved the duration patterns of the original sentences (Table VI).
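The Kruskal–Wallis comparison used throughout this section can be sketched in plain Python. The implementation below is a minimal illustration with the usual tie correction and a chi-square approximation for the p-value; the duration values are hypothetical and do not come from the study.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def chi2_survival(x, df):
    """P(X > x) for a chi-square variable with an integer number of df."""
    half = max(x, 0.0) / 2.0
    if df % 2 == 0:
        a, total = 1.0, math.exp(-half)
    else:
        a, total = 0.5, math.erfc(math.sqrt(half))
    # Recurrence on the regularized upper incomplete gamma function.
    while a < df / 2.0:
        total += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return total

def kruskal_wallis(*groups):
    """Return the H statistic and its chi-square approximated p-value."""
    values = [v for group in groups for v in group]
    n = len(values)
    ranks = average_ranks(values)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    ties = {}
    for v in values:
        ties[v] = ties.get(v, 0) + 1
    correction = 1.0 - sum(c ** 3 - c for c in ties.values()) / (n ** 3 - n)
    if correction == 0.0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, chi2_survival(h, len(groups) - 1)

# Hypothetical sentence durations (s) for two groups of four sentences.
td_durations = [1.1, 1.2, 1.3, 1.4]
lic_durations = [1.5, 1.6, 1.7, 1.8]
h, p = kruskal_wallis(td_durations, lic_durations)
```

In this toy example the two groups do not overlap at all, and the resulting p-value falls below the 0.05 threshold.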
Intonation Recognition: Recognition scores on TD children’s intonation contours are given in Table VII. For comparison, we calculated the performance of a naïve classifier, which
always attributes the label of the most represented intonation,
e.g., “Descending,” to a given sentence. The $Q$ statistics (cf.
Section III-C) were computed for each intonation to evaluate
the similarity between classifiers during the classification task.
The naïve recognition rate of the four intonations studied in
this paper was 31%. The proposed system raises this to 70%,
i.e., more than twice the chance score, for 73 TD subjects aged
6 to 18. This recognition rate is equal to the average value of
scores that were obtained by other authors on the same type of
task, i.e., intonation contours recognition, but on adult speech
data and for only six speakers [60], [61]. Indeed, the age effect on the performance of speech processing systems has been
shown to be a serious disturbing factor especially when dealing
with young children [52]. Surprisingly, the static and dynamic
classifiers were similar for the “Floating” intonation even when
the dynamic recognition score was clearly higher than the static
one (Table VII). However, because this intonation contains the
smallest set of sentences (cf. Table IV), a small dissimilarity
between classifiers was sufficient to improve the recognition
performance. The concept of exploiting the complementarity
of the classifiers used to characterize the intonation contours
(cf. Section III-C) was validated as some contours were better
recognized by either the static or dynamic approach. Whereas
both “Rising” and “Floating” intonations were very well recognized by the system, “Descending” and “Falling” intonations
provided the lowest recognition performances. The low recognition score of the “Falling” intonation may be explained by
the fact that this intonation was represented by sentences that
contained too many ambiguous modalities (e.g., question/order/
counseling etc.) compared with the others.
TABLE VIII
CONFUSION MATRIX OF THE INTONATION RECOGNITION FOR
TYPICALLY DEVELOPING CHILDREN
Tested intonations are given in rows while recognized ones are given in
columns. Diagonal values from top-left to bottom-right thus correspond to
sentences that were correctly recognized by the system while all others are
miscategorized.
TABLE IX
RELEVANT PROSODIC FEATURES SET IDENTIFIED BY STATIC RECOGNITION
R: raw data (i.e., static descriptor), $\Delta$: first-order derivative, $\Delta\Delta$: second-order
derivative ($\Delta$ and $\Delta\Delta$ are both dynamic descriptors).
Fig. 5. Fusion recognition scores as function of weight alpha attributed to both
static ($\alpha$) and dynamic ($1 - \alpha$) classifiers.
The best recognition scores provided by the fusion of the
two classifiers were principally conveyed by the static approach
rather than by the dynamic one (Fig. 5).
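The role of the weight alpha in Fig. 5 can be illustrated with a minimal Python sketch. It assumes, as a simplification, that each classifier outputs one score per intonation and that fusion is a weighted sum (alpha for the static scores, 1 - alpha for the dynamic ones); the fusion rule actually used is the one defined in Section III-C. All function names, the score format, and the grid search are hypothetical.

```python
# Hypothetical label order; only the four intonation names come from the paper.
INTONATIONS = ["Descending", "Falling", "Floating", "Rising"]


def fuse_scores(static_scores, dynamic_scores, alpha):
    """Weighted sum of per-class scores; alpha weights the static classifier."""
    return [alpha * s + (1.0 - alpha) * d
            for s, d in zip(static_scores, dynamic_scores)]


def predict(static_scores, dynamic_scores, alpha):
    """Return the intonation with the highest fused score."""
    fused = fuse_scores(static_scores, dynamic_scores, alpha)
    return INTONATIONS[max(range(len(fused)), key=fused.__getitem__)]


def recognition_rate(samples, alpha):
    """samples: list of (static_scores, dynamic_scores, true_label) tuples."""
    correct = sum(predict(s, d, alpha) == label for s, d, label in samples)
    return correct / len(samples)


def best_alpha(samples, steps=10):
    """Grid search over alpha in [0, 1]; ties keep the smallest alpha."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: recognition_rate(samples, a))
```

With this convention, a best alpha above 0.5 means that the fused decision relies mostly on the static classifier, and a value below 0.5 means that it relies mostly on the dynamic one.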
As the “Floating” intonation had a descending trend, it was
confused with the “Descending” and “Falling” intonations but
never with “Rising” (Table VIII). The “Rising” intonation appeared to be very specific because it was very well-recognized
and was only confused with “Falling.” Confusions with respect
to the “Falling” intonation group were numerous as shown by
the scores, and were principally conveyed by both the “Descending” and “Floating” intonations.
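The reading convention used for the confusion matrices (tested intonations in rows, recognized ones in columns) can be made concrete with a short Python sketch. The helper names and the input format (two parallel lists of labels) are assumptions made for illustration only.

```python
from collections import Counter


def confusion_matrix(tested, recognized, labels):
    """Rows: tested intonation; columns: recognized intonation."""
    counts = Counter(zip(tested, recognized))
    return [[counts[(t, r)] for r in labels] for t in labels]


def per_class_recognition(matrix):
    """Diagonal count over row total for each tested class (0.0 if the row is empty)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]
```

Off-diagonal cells in a given row show which intonations a tested contour was mistaken for, which is how statements such as "never confused with Rising" are read from the tables.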
The set of relevant prosodic features that was provided by
the SFS method, which was used for the static-based intonation classification (cf. Section III-A), is mostly constituted of
both first-order (Δ) and second-order (ΔΔ) derivatives (Table IX): 26 of the 27 relevant features
were issued from these measures. Features extracted from
pitch are more numerous than those from energy, which may be
due to the fact that we exclusively focused on the pitch contour
when recording the sentences (cf. Section V-A). About half of
the features set include measures issued from typical question
detection systems, i.e., values or differences between values at
onset/target/offset and relative positions of extrema in the sentence. The others are composed of traditional statistical measures of prosody (e.g., quartiles, slope, and standard deviation
values). All 27 relevant features provided by the SFS method
during static classification were statistically significant for characterizing
the four types of intonation contours.
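The kinds of static measures named above (onset and offset values, relative position of an extremum, quartiles, slope, standard deviation, each computed on the raw contour and on its derivatives) can be sketched as follows in Python. This is only an illustration: the derivatives are approximated by simple frame differences, and the feature names and their number are hypothetical and do not reproduce the set listed in Table IX.

```python
import statistics


def delta(contour):
    """First-order frame difference, a simple stand-in for a derivative."""
    return [b - a for a, b in zip(contour, contour[1:])]


def slope(contour):
    """Least-squares slope of the contour against the frame index."""
    n = len(contour)
    mean_x = (n - 1) / 2
    mean_y = sum(contour) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(contour))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def functionals(contour):
    """Statistical measures computed over one contour (needs >= 2 values)."""
    q1, q2, q3 = statistics.quantiles(contour, n=4)
    return {
        "onset": contour[0],
        "offset": contour[-1],
        "offset_minus_onset": contour[-1] - contour[0],
        "rel_pos_max": contour.index(max(contour)) / (len(contour) - 1),
        "q1": q1, "median": q2, "q3": q3,
        "slope": slope(contour),
        "std": statistics.pstdev(contour),
    }


def static_features(pitch):
    """Functionals of the raw contour (R) and of its first and second differences.

    Requires at least four frames so that the second difference has two values.
    """
    d1 = delta(pitch)
    d2 = delta(d1)
    feats = {}
    for name, series in (("R", pitch), ("D", d1), ("DD", d2)):
        for key, value in functionals(series).items():
            feats[f"{name}_{key}"] = value
    return feats
```

The same functions could be applied to an energy contour; a feature-selection step would then decide which of the resulting measures are kept.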
B. Language-Impaired Children
Sentence Duration: All intonations that were reproduced by
LIC appeared to be strongly different from those of TD children when comparing sentence duration:
the duration was lengthened by 30% for the first three intonations and
by more than 60% for the “Rising” contour (Table X). Moreover, the group composed of SLI children produced significantly
longer sentences than all other groups of children except for the
case of the “Rising” intonation.
Intonation Recognition: The contributions from the two classification approaches that were used to characterize the intonation contours were similar among all pathologic groups but
different from that for TD children
(Fig. 6). The dynamic approach was thus found
to be more efficient than the static one for comparing the LIC’s
intonation features with those of TD children.
The Q statistics between the classifiers were higher for LIC
than TD children so that even after recognizing that dynamic
processing was most suitable for LIC, both the static and
TABLE X
SENTENCE DURATION STATISTICS OF THE GROUPS
Statistics for sentence duration (in s) are given in the following style:
;
: alternative hypothesis is
true when comparing data between child groups, i.e., T, A, P, and S;
REF: reference sentences; TD (T): typically developing; AD (A): autism
disorder; PDD (P): pervasive developmental disorders not otherwise specified;
SLI (S): specific language impairment.
TABLE XII
FUSION INTONATION RECOGNITION PERFORMANCES
Performances are given as percentage of recognition;
:
alternative hypothesis is true when comparing data from child groups,
i.e., T, A, P, and S; TD (T): typically developing; AD (A): autism disorder;
PDD (P): pervasive developmental disorders not-otherwise specified;
SLI (S): specific language impairment.
TABLE XIII
CONFUSION MATRIX OF THE INTONATION RECOGNITION FOR
AUTISTIC DIAGNOSED CHILDREN
TABLE XIV
CONFUSION MATRIX OF THE INTONATION RECOGNITION FOR
PERVASIVE-DEVELOPMENTAL-DISORDER DIAGNOSED CHILDREN
Fig. 6. Fusion recognition scores as a function of the weight alpha attributed to both the static and dynamic classifiers.
TABLE XI
Q STATISTICS BETWEEN STATIC AND DYNAMIC CLASSIFIERS
TABLE XV
CONFUSION MATRIX OF THE INTONATION RECOGNITION FOR SPECIFIC
LANGUAGE IMPAIRMENT DIAGNOSED CHILDREN
dynamic intonation recognition methods had less dissimilarity
than for TD children (Table XI).
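For reference, the pairwise Q statistic described in [65] can be computed from the per-sentence correctness of two classifiers. The sketch below is a minimal Python illustration of that definition; the function name and the input format (two lists of booleans, one entry per sentence) are assumptions and are not part of the study.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q statistic from per-sentence correctness of two classifiers.

    Q = (N11 * N00 - N01 * N10) / (N11 * N00 + N01 * N10), where N11 counts
    sentences both classifiers recognized, N00 those both missed, and
    N10 / N01 those recognized by only the first / only the second.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    den = n11 * n00 + n01 * n10
    if den == 0:
        raise ValueError("Q statistic is undefined for these counts")
    return (n11 * n00 - n01 * n10) / den
```

Values close to 1 indicate classifiers that tend to succeed and fail on the same sentences (low diversity), values near 0 indicate statistically independent decisions, and negative values indicate classifiers that err on different sentences. A higher Q therefore corresponds to less dissimilarity between the static and dynamic classifiers.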
LIC recognition scores were close to those of TD children
and similar between LIC groups for the “Descending” intonation while all other intonations were significantly different
between TD children and LIC (Table XII). However, the system had very high recognition rates for the “Rising”
intonation for SLI and TD children whereas it performed significantly
worse for both AD and PDD-NOS. Although
some differences were found between LIC groups for this intonation, the LIC global mean scores only showed dissimilarity
with TD.
The misjudgments made by the recognition system for LIC
were approximately similar to those seen for TD children
(Tables XIII–XV). For all LIC, the “Floating” intonation was
similarly confused with “Descending” and “Falling” and was
never confused with “Rising.” However, the “Rising” intonation
was rarely confused when two other intonations were tested.
This intonation appeared to be very different from the other
three but not for the TD group in which more errors were found
when the “Falling” intonation was tested.
VII. DISCUSSION
This study investigated the feasibility of using an automatic recognition system to compare prosodic abilities of LIC
(Tables II and III) to those of TD children in an intonation
imitation task. A set of 26 sentences, including statements and
questions (Table IV) over four intonation types (Fig. 4), was
used for the intonation imitation task. We manually collected
2772 sentences from recordings of children. Two different
approaches were then fused to characterize the intonation
contours through prosodic LLD: static (statistical measures)
and dynamic (HMM features). The system performed well
for TD children except in the case of the “Falling” intonation, which had a recognition rate of only 55%. This low
score may be due to the fact that too many ambiguous speech
modalities were included in the “Falling” intonation group
(e.g., question/order/counseling etc.). The static recognition
approach provided a list of 27 features, almost all of which were
dynamic descriptors, i.e., delta and delta-delta. This approach
contributed more than the dynamic approach (i.e., HMM)
to the fusion.
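The feature list discussed here came from a sequential forward selection (SFS) procedure. A generic greedy version of SFS can be sketched in Python as follows; the scoring callback, the stopping rule (stop when no candidate improves the score), and all names are assumptions made for illustration and may differ from the configuration used in the study.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedy SFS: repeatedly add the feature that most improves score(subset).

    `score` is a user-supplied function mapping a list of feature names to a
    number to maximize (e.g., a cross-validated recognition rate).
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Evaluate every remaining feature added to the current subset.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no candidate improves the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Because the search is greedy, the returned subset depends on the scoring function and is not guaranteed to be globally optimal.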
Concerning LIC (AD, PDD-NOS, and SLI), the assessment
of basic language skills [71] showed that 1) there was no significant difference among the groups’ mean severity levels and 2)
all three groups presented a similar delay when compared to TD
children. In the intonation imitation task, the sentence duration
of all LIC subjects was significantly longer than for TD children. The sentence lengthening phenomenon added about 30%
for the first three intonations and more than 60% for the “Rising”
intonation. Therefore, all LIC subjects presented difficulties in
imitating intonation contours with respect to duration especially
for the “Rising” intonation (short questions). This result correlates with the hypothesis that rising tones may be more difficult
to produce than falling tones in children [16]. It also correlates
with the results of some clinical studies for SLI [13], [19]–[21],
AD [24]–[26], and PDD-NOS [27] children although some contradictory results were found for SLI [18].
The best approach to recognize LIC intonation was clearly
based on a dynamic characterization of prosody, i.e., using
HMM. On the contrary, the best fusion approach favored static
characterization of prosody for TD children. Although scores
of the LIC’s intonation contours recognition were similar to
those of TD children for the “Descending” sentences group,
i.e., statements in this study, these scores were not
achieved in the same way. This difference showed that LIC
reproduced statement sentences similarly to TD children, but
they all tended to use prosodic contour transitions rather than
statistically specific features to convey the modality.
All other tested intonations were significantly different
between TD children and LIC. LIC demonstrated
more difficulties in the imitation of prosodic contours than
TD children except for the “Descending” intonation, i.e.,
statements in this study. However, SLI and TD children had
very high recognition rates for the “Rising” intonation whereas
both AD and PDD-NOS performed significantly worse. This
result is consistent with studies that showed PDD children have
more difficulties at imitating questions than statements [24] as
well as short and long prosodic items [25], [27]. As pragmatic
prosody was strongly conveyed by the “Rising” intonation due
to the short questions, it is not surprising that such intonation
recognition differences were found between SLI and the PDDs.
Indeed, both AD and PDD-NOS show pragmatic deficits
in communication, whereas SLI involves only pure language
impairments. Moreover, Snow hypothesized [16] that rising
pitch requires more effort in physiological speech production
than falling tones and that some assumptions could be made
regarding the child’s ability or intention to match the adult’s
speech. Because the “Rising” intonation included very short
sentences (half the duration) compared with others, which
involves low working memory load, SLI children were not
disadvantaged compared to PDDs as was found in [13].
Whereas some significant differences were found in the LIC’s
groups with the “Rising” intonation, the global mean recognition scores did not show any dissimilarity between children. All
LIC subjects showed similar difficulties in the administered intonation imitation task as compared to TD children, whereas
differences between SLI and both AD and PDD-NOS only appeared on the “Rising” intonation; the latter is probably linked to
deficits in the pragmatic prosody abilities of AD and PDD-NOS.
The automatic approach used in this study to assess LIC
prosodic skills in an intonation imitation task confirms the
clinical descriptions of the subjects’ communication impairments. Consequently, it may be a useful tool to adapt prosody
remediation protocols to improve both LIC’s social communication and interaction abilities. The proposed technology could
thus be integrated into a fully automated system that would
be exploited by speech therapists. Data could be
manually acquired by the clinician while reference data, i.e.,
provided by TD children, would have already been collected
and made available to train the prosodic models required by
the classifiers. However, because intonation contours and the
associated sentences proposed in this study are language dependent, they eventually must be adapted to intonation studies
in languages other than French.
Future research will examine the affective prosody of LIC
and TD children. Emotions were elicited during a story-telling
task with an illustrated book that contains various emotional
situations. Automatic systems will serve to characterize and
compare the elicited emotional prosodic particulars of LIC and
TD children. Investigations will focus on several questions:
1) can LIC understand depicted emotions and convey relevant
prosodic features for emotional story-telling; 2) do TD children
and LIC groups achieve similarly in the task; and 3) are there
some types of prosodic features that are preferred to convey
emotional prosody (e.g., rhythm, intonation, or voice quality)?
VIII. CONCLUSION
This study addressed the feasibility of designing a system that
automatically assesses a child’s grammatical prosodic skills,
i.e., intonation contours imitation. This task is traditionally administered by speech therapists, but we proposed the use of automatic methods to characterize the intonation. We have compared the performance of such a system on groups of children,
i.e., TD and LIC (e.g., AD, PDD-NOS, and SLI).
The records on which this study was conducted include the
information based on both perception and production of the intonation contour. The administered task was very simple because it was based on the imitation of sentences conveying different types of modality through the intonation contour. Consequently, the basic skills of the subjects in the perception and
the reproduction of prosody were analyzed together. The results
conveyed by this study have shown that the LIC have the ability
to imitate the “Descending” intonation contours similarly to TD children.
Both groups obtained close scores from the automatic intonation
recognition system. However, LIC did not achieve those scores in the same way as the
TD children. Indeed, a dynamic modeling of prosody has led to
superior performance on the intonation recognition of all LIC’s
groups, while a static modeling of prosody has provided a better
contribution for TD children. Moreover, the sentence duration
of all LIC subjects was significantly longer than that of the TD subjects
(the sentence lengthening phenomenon was about 30% for the first
three intonations and more than 60% for the “Rising” intonation
that conveys pragmatic prosody). In addition, this intonation has not led
to degradations in the performances of the SLI subjects, unlike
the PDDs, who are known to have pragmatic deficiencies in
prosody.
The literature has shown that a separate analysis of the
prosodic skills of LIC in the production and the perception
of the intonation leads to contradictory results; [16]–[18]
versus [13]–[15] and [19]–[21] for SLI children, and [3] versus
[24]–[27] for the PDDs. Consequently, we used a simple technique to collect data for this study. The data collected during
the imitation task include both perception and production of the
intonation contours, and the results obtained by the automatic
analysis of the data have yielded descriptions
that are associated with the clinical diagnosis of the LIC. As
the system proposed in this study is based on the automatic
processing of speech, its interest for the diagnosis of LIC
through prosody is thus fully justified. Moreover, this system
could be integrated into software, such as the SPECO [73],
that would be exploited by speech therapists to use prosodic
remediation protocols adapted to the subjects. It would thus
serve to improve both the LIC’s social communication and
interaction abilities.
REFERENCES
[1] S. Ananthakrishnan and S. Narayanan, “Unsupervised adaptation of
categorical prosody models for prosody labeling and speech recognition,” IEEE Trans. Audio, Speech Lang. Process., vol. 17, no. 1, pp.
138–149, Jan. 2009.
[2] P. K. Kuhl, “Early language acquisition: Cracking the speech code,”
Nature Rev. Neurosci., vol. 5, pp. 831–843, Nov. 2004.
[3] R. Paul, A. Augustyn, A. Klin, and F. R. Volkmar, “Perception and
production of prosody by speakers with autism spectrum disorders,” J.
Autism Develop. Disorders, vol. 35, no. 2, pp. 205–220, Apr. 2005.
[4] P. Warren, “Parsing and prosody: An introduction,” Lang. Cognitive
Process., Psychol. Press, vol. 11, pp. 1–16, 1996.
[5] D. Van Lancker, D. Canter, and D. Terbeek, “Disambiguation of
ditropic sentences: Acoustic and phonetic cues,” J. Speech Hear. Res.,
vol. 24, no. 3, pp. 330–335, Sep. 1981.
[6] E. Winner, The Point of Words: Children’s Understanding of Metaphor
and Irony. Cambridge, MA: Harvard Univ. Press, 1988.
[7] D. Bolinger, Intonation and Its Uses: Melody in Grammar and Discourse. Stanford, CA: Stanford Univ. Press, Aug. 1989.
[8] Diagnostic and Statistical Manual of Mental Disorders, 4th
ed. Washington, DC: American Psychiatric Assoc., 1994.
[9] I. Rapin and D. A. Allen, “Developmental language: Nosological consideration,” in Neuropsychology of Language, Reading and Spelling,
V. Kvik, Ed. New York: Academic Press, 1983.
[10] L. Wing and J. Gould, “Severe impairments of social interaction and
associated abnormalities in children: Epidemiology and classification,”
J. Autism Develop. Disorders, vol. 9, no. 1, pp. 21–29, Mar. 1979.
[11] D. A. Allen and I. Rapin, “Autistic children are also dysphasic,” in
Neurobiology of Infantile Autism, H. Naruse and E. M. Ornitz, Eds.
Amsterdam, The Netherlands: Excerpta Medica, 1992, pp. 157–168.
[12] J. McCann and S. Peppé, “Prosody in autism: A critical review,” Int. J.
Lang. Commun. Disorders, vol. 38, no. 4, pp. 325–350, May 2003.
[13] B. Wells and S. Peppé, “Intonation abilities of children with speech and
language impairments,” J. Speech, Lang. Hear. Res., vol. 46, pp. 5–20,
Feb. 2003.
[14] J. Morgan and K. Demuth, Signal to Syntax: Bootstrapping From
Speech to Grammar in Early Acquisition. Mahwah, NJ: Erlbaum,
1996.
[15] S. Weinert, “Sprach- und Gedächtnisprobleme dysphasischsprachgestörter Kinder: Sind rhytmisch-prosodische Defizite eine
Ursache?,” in [Language and Short-Term Memory Problems of Specifically Language Impaired Children: Are Rhythmic Prosodic Deficits a
Cause?] Rhytmus Ein interdisziplinäres Handbuch, K. Müller and G.
Aschersleben, Eds. Bern, Switzerland: Huber, 2000, pp. 255–283.
[16] D. Snow, “Children’s imitations of intonation contours: Are rising
tones more difficult than falling tones?,” J. Speech, Lang. Hear. Res.,
vol. 41, pp. 576–587, Jun. 1998.
[17] D. Snow, “Prosodic markers of syntactic boundaries in the speech of
4-year-old children with normal and disordered language development,” J. Speech, Lang. Hear. Res., vol. 41, pp. 1158–1170, Oct. 1998.
[18] C. R. Marshall, S. Harcourt Brown, F. Ramus, and H. J. K Van der
Lely, “The link between prosody and language skills in children with
SLI and/or dyslexia,” Int. J. Lang. Commun. Disorders, vol. 44, no. 4,
pp. 466–488, Jul. 2009.
[19] P. Hargrove and C. P. Sheran, “The use of stress by language impaired
children,” J. Commun. Disorders, vol. 22, no. 5, pp. 361–373, Oct.
1989.
[20] C. Samuelsson, C. Scocco, and U. Nettelbladt, “Towards assessment
of prosodic abilities in Swedish children with language impairment,”
Logopedics Phoniatrics Vocology, vol. 28, no. 4, pp. 156–166, Oct.
2003.
[21] S. Van der Meulen and P. Janssen, “Prosodic abilities in children with
Specific Language Impairment,” J. Commun. Disorders, vol. 30, pp.
155–170, May–Jun. 1997.
[22] L. Kanner, “Autistic disturbances of affective contact,” Nervous Child,
vol. 2, pp. 217–250, 1943.
[23] R. Paul, L. Shriberg, J. Mc Sweeny, D. Ciccheti, A. Klin, and F.
Volkmar, “Brief report: Relations between prosodic performance and
communication and socialization ratings in high functioning speakers
with autism spectrum disorders,” J. Autism Develop. Disorders, vol.
35, no. 6, pp. 861–869, Dec. 2005.
[24] S. Fosnot and S. Jun, “Prosodic characteristics in children with
stuttering or autism during reading and imitation,” in Proc. 14th
Annu. Congr. Phonetic Sci., San Francisco, CA., Aug. 1–7, 1999, pp.
103–115.
[25] J. McCann, S. Peppé, F. Gibbon, A. O’Hare, and M. Rutherford,
“Prosody and its relationship to language in school-aged children with
high functioning autism,” Int. J. Lang. Commun. Disorders, vol. 47,
no. 6, pp. 682–702, Nov. 2007.
[26] M. T. Le Normand, S. Boushaba, and A. Lacheret-Dujour, “Prosodic
disturbances in autistic children speaking French,” in Proc. Speech
Prosody, Campinas, Brazil, May 6–9, 2008, pp. 195–198.
[27] R. Paul, N. Bianchi, A. Agustyn, A. Klin, and F. Volkmar, “Production
of syllable stress in speakers with autism spectrum disorders,” Research
in Autism Spectrum Disorders, vol. 2, pp. 110–124, Jan.–Mar. 2008.
[28] F. Volkmar, Handbook of Autism and Pervasive Develop. Disorders.
Hoboken, NJ: Wiley, 2005.
[29] E. Fombonne, “Epidemiological surveys of autism and other pervasive
developmental disorders: An update,” J. Autism Develop. Disorders,
vol. 33, no. 4, Aug. 2003.
[30] L. D. Shriberg, J. Kwiatkowski, and C. Rasmussen, The Prosody-Voice Screening Profile. Tucson, AZ: Communication Skill Builders,
1990.
[31] D. Crystal, Profiling Linguist. Disability. London, U.K.: Edward
Arnold, 1982.
[32] P. Martínez-Castilla and S. Peppé, “Developing a test of prosodic
ability for speakers of Iberian-Spanish,” Speech Commun., vol. 50, no.
11–12, pp. 900–915, Mar. 2008.
[33] J. P. H. van Santen, E. T. Prud’hommeaux, and L. M. Black, “Automated assessment of prosody production,” Speech Commun., vol. 51,
no. 11, pp. 1082–1097, Nov. 2009.
[34] A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M.
Schuster, and E. Nöth, “PEAKS—A system for the automatic evaluation of voice and speech disorder,” Speech Commun., vol. 51, no. 5, pp.
425–437, May 2009.
[35] M. Black, J. Tepperman, A. Kazemzadeh, S. Lee, and S. Narayanan,
“Automatic pronunciation verification of English letter-names for early
literacy assessment of preliterate children,” in Proc. ICASSP, Taipei,
Taiwan, Apr. 19–24, 2009, pp. 4861–4864.
[36] C. Min Lee and S. Narayanan, “Toward detecting emotions in spoken
dialogs,” IEEE Trans. Speech Audio Process., vol. 13, no. 2, pp.
293–303, Mar. 2005.
[37] G. P. M. Laan, “The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and
read speaking style,” Speech Commun., vol. 22, pp. 43–65, Mar.
1997.
[38] A. Potamianos and S. Narayanan, “A review of the acoustic and linguistic properties of children’s speech,” in Proc. IEEE 9th Workshop
Multimedia Signal Process., Chania, Greece, Oct. 23, 2007, pp. 22–25.
[39] R. D. Kent, “Hearing and believing: Some limits to the auditory-perceptual assessment of speech and voice disorders,” Amer. J. Speech-Lang.
Pathol., vol. 5, no. 3, pp. 7–23, Aug. 1996.
[40] A. Tversky, “Intransitivity of preferences,” Psychol. Rev., vol. 76, pp.
31–48, Jan. 1969.
[41] A. Pentland, “Social signal processing,” IEEE Signal Process. Mag.,
vol. 24, no. 4, pp. 108–111, Jul. 2007.
[42] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L.
Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, “The
relevance of feature type for the automatic classification of emotional
user states: Low level descriptors and functionals,” in Proc. Interspeech
ICSLP, Antwerp, Belgium, Aug. 27–31, 2007, pp. 2253–2256.
[43] J. Nadel, “Imitation and imitation recognition: Functional use in preverbal infants and nonverbal children with autism,” in The Imitative
Mind: Development, Evolution and Brain Bases, A. N. Meltzoff and
W. Prinz, Eds. Cambridge, MA: Cambridge Univ. Press, 2002, pp.
2–14.
[44] G. Szaszák, D. Sztahó, and K. Vicsi, “Automatic intonation classification for speech training systems,” in Proc. Interspeech, Brighton, U.K.,
Sep. 6–10, 2009, pp. 1899–1902.
[45] D. Ververidis and C. Kotropoulos, “Emotional speech recognition: Resources, features and methods,” Speech Commun., vol. 48, no. 9, pp.
1162–1181, Sep. 2006.
[46] A. G. Adami, “Modeling prosodic differences for speaker recognition,”
Speech Commun., vol. 49, no. 4, pp. 1162–1181, Apr. 2007.
[47] D. H. Milone and A. J. Rubio, “Prosodic and accentual information
for automatic speech recognition,” IEEE Trans. Speech Audio Process.,
vol. 11, no. 4, pp. 321–333, Jul. 2003.
[48] A. Mahdhaoui, M. Chetouani, C. Zong, R. S. Cassel, C. Saint-Georges,
M.-C. Laznik, S. Maestro, F. Apicella, F. Muratori, and D. Cohen, “Automatic motherese detection for face-to-face interaction analysis,” Multimodal Signals: Cognitive and Algorithmic Issues, vol. LNAI 5398,
pp. 248–255, Feb. 2009, Springer-Verlag.
[49] V.-M. Quang, L. Besacier, and E. Castelli, “Automatic question
detection: Prosodic-lexical features and crosslingual experiments,” in
Proc. Interspeech ICSLP, Antwerp, Belgium, Aug. 27–31, 2007, pp.
2257–2260.
[50] S. Yildirim and S. Narayanan, “Automatic detection of disfluency
boundaries in spontaneous speech of children using audio-visual
information,” IEEE Trans. Audio Speech Lang. Process., vol. 17, no.
1, pp. 2–12, Jan. 2009.
[51] H. Pon-Barry and S. Shieber, “The importance of sub-utterance
prosody in predicting level of certainty,” in Proc. Human Lang. Tech.
Conf., Poznan, Poland, May 31–Jun. 5 2009, pp. 105–108.
[52] D. Elenius and M. Blomberg, “Comparing speech recognition for
adults and children,” in Proc. FONETIK, Stockholm, Sweden, May
26–28, 2004, pp. 105–108.
[53] J.-F. Bonastre, C. Fredouille, A. Ghio, A. Giovanni, G. Pouchoulin,
J. Révis, B. Teston, and P. Yu, “Complementary approaches for voice
disorder assessment,” in Proc. Interspeech ICSLP, Antwerp, Belgium,
Aug. 27–31, 2007, pp. 1194–1197.
[54] M. Chetouani, A. Mahdhaoui, and F. Ringeval, “Time-scale feature extractions for emotional speech characterization,” Cognitive Comp., vol.
1, no. 2, pp. 194–201, 2009, Springer.
[55] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. Hoboken, NJ: Wiley, 2004.
[56] E. Monte-Moreno, M. Chetouani, M. Faundez-Zanuy, and J.
Sole-Casals, “Maximum likelihood linear programming data fusion for speaker recognition,” Speech Commun., vol. 51, no. 9, pp.
820–830, Sep. 2009.
[57] F. Ringeval and M. Chetouani, “A vowel based approach for acted
emotion recognition,” in Proc. Interspeech, Brisbane, Australia, Sep.
22–26, 2008, pp. 2763–2766.
[58] A. Mahdhaoui, F. Ringeval, and M. Chetouani, “Emotional speech
characterization based on multi-features fusion for face-to-face communication,” in Proc. Int. Conf. SCS, Jerba, Tunisia, Nov. 6–8,
2009.
[59] A. Mahdhaoui, M. Chetouani, and C. Zong, “Motherese detection
based on segmental and supra-segmental features,” in Proc. Int. Conf.
Pattern Recogn., Tampa, FL., Dec. 8–11, 2008.
[60] S. Ananthakrishnan and S. Narayanan, “Fine-grained pitch accent and
boundary tones labeling with parametric f0 features,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Process., Las Vegas, NV, Mar. 30–Apr.
4 2008, pp. 4545–4548.
[61] A. Rosenberg and J. Hirschberg, “Detecting pitch accents at the word,
syllable and vowel level,” in Proc. Human Lang. Tech.: 2009 Annu.
Conf. North Amer. Chapter Assoc. for Comput. Ling., Boulder, CO,
May 31–Jun. 5 2009, pp. 81–84.
[62] Snack Sound Toolkit [Online]. Available: http://www.speech.kth.se/
snack/
[63] R.-O. Duda, P.-E. Hart, and D.-G. Stork, Pattern Classification, 2nd
ed. New York: Wiley, 2000.
[64] M. Robnik-Šikonja and I. Kononenko, “Theoretical and empirical analysis of ReliefF and RReliefF,” Mach. Learn. J., vol. 53, pp. 23–69, Oct.–Nov.
2003.
[65] L. Kuncheva and C. Whitaker, “Measure of diversity in classifier ensembles,” Mach. Learn., vol. 51, no. 2, pp. 181–207, May 2003.
[66] C. Lord, M. Rutter, and A. Le Couteur, “Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of
individuals with possible pervasive developmental disorders,” J. Autism
Develop. Disorders, vol. 24, no. 5, pp. 659–685, 1994.
[67] E. Schopler, R. Reichler, R. Devellis, and K. Daly, “Toward objective classification of childhood autism: Childhood Autism Rating Scale
(CARS),” J. Autism Develop. Disorders, vol. 10, no. 1, pp. 91–103,
1980.
[68] R. Van der Gaag, J. Buitelaar, E. Van den Ban, M. Bezemer, L. Njio,
and H. Van Engeland, “A controlled multivariate chart review of multiple complex developmental disorder,” J. Amer. Acad. Child Adolesc.
Psychiatry, vol. 34, pp. 1096–1106, 1995.
[69] J. Buitelaar and R. Van der Gaag, “Diagnostic rules for children with
PDD-NOS and multiple complex developmental disorder,” J. Child
Psychol. Psychiatry, vol. 39, pp. 91–919, 1998.
[70] E. Rondeau, L. Klein, A. Masse, N. Bodeau, D. Cohen, and J. M. Guilé,
“Is pervasive developmental disorder not otherwise specified less stable
than autistic disorder?,” J. Autism Develop. Disorder, 2010, to be published.
[71] A. Khomsi, Evaluation du Langage Oral. Paris, France: ECPA, 2001.
[72] K. Sjölander and J. Beskow, “WaveSurfer—An open source speech
tool,” in Proc. 6th ICSLP, Beijing, China, Oct. 2000, vol. 4, pp.
464–467 [Online]. Available: http://www.speech.kth.se/wavesurfer/
[73] K. Vicsi, A Multimedia Multilingual Teaching and Training System
for Speech Handicapped Children Univ. of Technol. and Economics, Dept. of Telecommunications and Telematics, Final Annual
Report, Speech Corrector, SPECO-977126 [Online]. Available:
http://alpha.tmit.bme.hu/speech/speco/index.html, 09.1998–08.2001
Fabien Ringeval received the B.S. degree in
electrics, electronic and informatics engineering
from the National Technologic Institute (IUT) of
Chartres, Chartres, France, in 2003, and the M.S.
degree in speech and image signal processing from
the University Pierre and Marie Curie (UPMC),
Paris, France, in 2006.
He has been with the Institute of Intelligent
Systems and Robotics, UPMC, since 2006. He
is currently a Teaching and Research Assistant
with this institute. His research interests concern
automatic speech processing, i.e., the automatic characterization of both the
verbal (e.g., intonation recognition) and the nonverbal communication (e.g.,
emotion recognition). He is a member of the French Association of Spoken
Communication (AFCP), of the International Speech Communication Association (ISCA) and of the Workgroup on Information, Signal, Image and Vision
(GDR-ISIS).
Julie Demouy received the degree of Speech and Language Therapist from the
School of Medicine of Paris, University Pierre and Marie Curie (UPMC), Paris,
France, in 2009.
She is currently with the University Department of Child and Adolescent Psychiatry at La Pitié Salpêtrière Hospital, Paris.
György Szaszák received the M.S. degree in electrical engineering from the Budapest University
for Technology and Economics (BUTE), Budapest,
Hungary, 2002 and the Ph.D. degree from Laboratory
of Speech Acoustics, Department of Telecommunications and Media Informatics, BUTE in 2009.
His Ph.D. dissertation addresses the exploitation of
prosody in speech recognition systems with a focus
on the agglutinating languages.
He has been with the Laboratory of Speech
Acoustics, Department of Telecommunications and
Media Informatics, BUTE, since 2002. His main research topics are related
to speech recognition, prosody and databases, and both the verbal and the
nonverbal communication.
Dr. Szaszák is a member of the International Speech Communication Association (ISCA).
Mohamed Chetouani received the M.S. degree in
robotics and intelligent systems from the University
Pierre and Marie Curie (UPMC), Paris, France, 2001
and the Ph.D. degree in speech signal processing
from UPMC in 2004.
In 2005, he was an invited Visiting Research
Fellow at the Department of Computer Science and
Mathematics, University of Stirling, Stirling, U.K.
He was also an invited Researcher at the Signal
Processing Group, Escola Universitaria Politecnica
de Mataro, Barcelona, Spain. He is currently an
Associate Professor in Signal Processing and Pattern Recognition at the
UPMC. His research activities cover the areas of nonlinear speech processing,
feature extraction, and pattern classification for speech, speaker, and language
recognition.
Dr. Chetouani is a member of different scientific societies (e.g., ISCA, AFCP,
ISIS). He has also served as chairman, reviewer, and member of scientific committees of several journals, conferences, and workshops.
Laurence Robel received the M.D. and Ph.D. degrees in both molecular neuropharmacology and developmental biology from the University Pierre and
Marie Curie (UPMC), Paris, France.
She is currently coordinating the autism and
learning disorders clinics for young children in the
Department of Child and Adolescent Psychiatry,
Hôpital Necker-Enfants Malades, Paris, France, as a
Child Psychiatrist.
Jean Xavier received the Ph.D. degree in psychology
from the University Paris Diderot, Paris, France, in
2008.
He is specialized in child and adolescent psychiatry and was certified in 2000. He is an M.D. in the
Hôpital de la Pitié-Salpêtrière, Paris, France, and is
head of an outpatient child unit dedicated to PDD
including autism. He also works in the field of
learning disabilities.
Dr. Xavier is a member of the French Society of Child and Adolescent
Psychiatry.
David Cohen received the M.S. degree in neurosciences from the University Pierre and Marie Curie
(UPMC), Paris, France, and the Ecole Normale
Supérieure, Paris, in 1987, and the M.D. degree from
the Hôpital Necker-Enfants Malades, Paris, France,
in 1992.
He specialized in child and adolescent psychiatry
and was certified in 1993. His first field of research
was severe mood disorders in adolescents, the topic of
his Ph.D. degree in neurosciences (2002). He is
Professor at the UPMC and head of the Department
of Child and Adolescent Psychiatry, La Salpêtrière hospital, Paris. His group
runs research programs in the field of autism and other pervasive developmental disorders, severe mood disorder in adolescents, and childhood onset
schizophrenia and catatonia.
Dr. Cohen is a member of the International Association of Child and Adolescent Psychiatry and Allied Disciplines, the European College of Neuro-Psychopharmacology, the European Society of Child and Adolescent Psychiatry,
and the International Society of Adolescent Psychiatry.
Monique Plaza received the Ph.D. degree in psychology from the University Paris Ouest Nanterre La
Défense, Nanterre, France, in 1984.
She is a Researcher in the National Center for
Scientific Research (CNRS), Paris, France. She develops research topics about intermodal processing
during the life span, and in developmental, neurological, and psychiatric pathologies. In childhood,
she studies specific (oral and written) language
difficulties, PDD, and PDD-NOS. In adulthood, she
works with patients suffering from Grade II gliomas
(benign cerebral tumors), whose slow development allows the brain to
compensate for the dysfunction generated by the tumor infiltration. Working
in an interdisciplinary frame, she is specifically interested in brain models
emphasizing plasticity and connectivity mechanisms and thus participates
in studies using fMRI and cerebral stimulation during awake surgery. She
develops psychological models emphasizing the interactions between cognitive
functions and the interfacing between emotion and cognition. As a clinical
researcher, she is interested in the practical applications of theoretical studies
(diagnosis and remediation).
Speech Communication
Work carried out as part of Ammar Mahdhaoui's thesis.
Available online at www.sciencedirect.com
Speech Communication 53 (2011) 1149–1161
www.elsevier.com/locate/specom
Supervised and semi-supervised infant-directed speech classification
for parent-infant interaction analysis
Ammar Mahdhaoui ⇑, Mohamed Chetouani
Univ Paris 06, F-75005, Paris, France CNRS, UMR 7222, ISIR, Institut des Systèmes Intelligents et de Robotique, F-75005, Paris, France
Available online 14 May 2011
Abstract
This paper describes the development of an infant-directed speech discrimination system for parent–infant interaction analysis. Different feature sets for emotion recognition were investigated using two classification techniques: supervised and semi-supervised. The
classification experiments were carried out with short pre-segmented adult-directed speech and infant-directed speech segments extracted
from real-life family home movies (with durations typically between 0.5 s and 4 s). The experimental results show that in the case of
supervised learning, spectral features play a major role in the infant-directed speech discrimination. However, a major difficulty of using
natural corpora is that the annotation process is time-consuming, and the expression of emotion is much more complex than in acted
speech. Furthermore, interlabeler agreement and annotation label confidences are important issues to address. To overcome these problems, we propose a new semi-supervised approach based on the standard co-training algorithm exploiting labelled and unlabelled data. It
offers a framework to take advantage of supervised classifiers trained by different features. The proposed dynamic weighted co-training
approach combines various features and classifiers usually used in emotion recognition in order to learn from different views. Our experiments demonstrate the validity and effectiveness of this method for a real-life corpus such as home movies.
© 2011 Elsevier B.V. All rights reserved.
Keywords: Infant-directed speech; Emotion recognition; Face-to-face interaction; Data fusion; Semi-supervised learning
1. Introduction
Parent–infant interactions play a major role in the development of the cognitive, perceptual and motor skills of
infants, and this role is emphasised for developmental disorders. Typically developing infants gaze at people, turn
toward voices and express interest in communication. In
contrast, infants who will later become autistic are characterised by the presence of abnormalities in reciprocal social
interactions and by a restricted, stereotyped and repetitive
repertoire of behaviours, interests and activities (autism
pathology is defined by ICD 10: International classification
of diseases and related health problems (footnote 1) and DSM IV:
Diagnostic and statistical manual of mental disorders (footnote 2))
(Association, 1994). The quality of parent-infant interaction depends on a reciprocal process, an active dialogue
between parent and child based on the infant’s early competencies and the mother’s (or father’s) stimulations. In
addition, the infant’s development depends on social interaction with a caregiver who serves the infant’s needs for
emotional attachment.
⇑ Corresponding author. Tel.: +33 6 70 20 12 92; fax: +33 1 44 27 44 38.
E-mail addresses: [email protected] (A. Mahdhaoui),
[email protected] (M. Chetouani).
URL: http://people.isir.upmc.fr/mahdhaoui (A. Mahdhaoui).
1 http://www.who.int/classifications/icd/en/.
0167-6393/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2011.05.005
Researchers in language acquisition and researchers in
early social interactions have identified an important peculiarity that affects both the language and social development of infants; i.e., the way adults speak to infants. The
special kind of speech that is directed towards infants,
called infant-directed speech or “motherese” is a simplified
language/dialect/register (Fernald and Kuhl, 1987) that has
recently been shown to be crucial for engaging interactions
2 http://www.psych.org/mainmenu/research/dsmiv.aspx.
between parents and infants and very important for language acquisition (Kuhl, 2004). Moreover, this speech register has been shown to be preferred by infants over adult-directed speech (Cooper and Aslin, 1990) and might assist
infants in learning speech sounds (Fernald, 1985). From an
acoustic point of view, infant-directed speech has a clear
signature (high pitch, exaggerated intonation contours)
(Fernald, 1985; Grieser and Kuhl, 1988). The phonemes,
and especially the vowels, are more clearly articulated
(Burnham et al., 2002).
The importance of infant-directed speech has also been
highlighted by recent research on autism (Muratori and
Maestro, 2007; Mahdhaoui et al., 2011; Mahdhaoui
et al., 2009). Manual investigations (i.e., manual annotations) (Laznik et al., 2005), of parent-infant interactions
in home movies have shown that most positive sequences
(i.e., multimodal responses of the infant: vocalisation, gaze,
facial expression) were induced by infant-directed speech.
To study more specifically the influence on engagement in
an ecological environment, we followed a method usually
employed for the study of infant development: home movie
analysis (Saint-Georges et al., 2010).
The study of home movies is very important for future
research, but the use of this kind of database makes the
work very difficult and time-consuming. The manual annotation of these films is very costly, and the automatic detection of relevant events would be of great benefit to
longitudinal studies. For the analysis of the role of
infant-directed speech during interaction, we developed
an automatic infant-directed speech detection system
(Mahdhaoui et al., 2009; Mahdhaoui et al., 2008;
Chetouani et al., 2009), to enable emotion classification.
Motherese or infant-directed speech has been extensively
studied by the psychology community. However, to our
knowledge, there are no studies of infant-directed speech
in real-life interaction that employ machine learning techniques. In the literature, researchers in affective computing
and in emotion recognition have studied infant-directed
speech from acted databases (Slaney and McRoberts,
2003); the speech samples were recorded in the laboratory.
Recently, Inouea et al. (2011) developed a novel
approach to discriminate between infant-directed speech
and adult-directed speech by using mel-frequency cepstral
coefficients and a hidden Markov model-based speech discrimination algorithm. The average discrimination accuracy of the proposed algorithm is 84.34%, but still in
laboratory conditions (acted data). The paralinguistic characteristics of motherese have motivated several researchers to
employ recognition systems initially developed for emotion processing (Slaney and McRoberts, 2003; Shami and
Verhelst, 2007).
In this paper, we implemented a traditional supervised
method. We tested different machine learning techniques,
both statistical and parametric, with different feature
extraction methods (time/frequency domains). The GMM
classifier with cepstral MFCC (Mel-frequency cepstral coefficient) features was found to be the most efficient.
However, the supervised methods still have some significant limitations. Large amounts of labelled data are usually required, which is difficult in real-life applications;
manual annotation of data is very costly and time-consuming. Therefore, we investigate a semi-supervised
approach that does not require a large amount of annotated data for training. This method combines labelled
and unlabelled utterances to learn to discriminate between
infant-directed speech and adult-directed speech.
In the area of classification, many semi-supervised learning algorithms have been proposed, one of which is the co-training approach (Blum and Mitchell, 1998). Most applications of the co-training algorithm have been devoted to text
classification (Nigam et al., 2000; Zhu et al., 2003) and web
page categorisation (Blum and Mitchell, 1998; Zhou et al.,
2005). However, there are few studies related to semi-supervised learning for emotional speech recognition. The
co-training algorithm proposed by Blum and Mitchell
(1998) is a prominent achievement in semi-supervised
learning. It initially defines two classifiers on distinct attribute views of a small set of labelled data. Each
view is required to be conditionally independent of the
other and sufficient for learning a classification system.
Then, iteratively the predictions of each classifier on unlabelled examples are selected to increase the training data
set. This co-training algorithm and its variations (Goldman
and Zhou, 2000) have been applied in many areas because
of their theoretical justifications and experimental success.
In this study, we propose a semi-supervised algorithm
based on multi-view characterisation, which combines the
classification results of different views to obtain a single
estimate for each observation. The proposed algorithm is
a novel form of co-training, which is more suitable for
problems involving both classification and data fusion.
Algorithmically, the proposed co-training algorithm is
quite similar to other co-training methods available in the
literature. However, a number of novel improvements,
using different feature sets and dynamic weighting classifier
fusion, have been incorporated to make the proposed algorithm more suitable for multi-view classification problems.
The paper is organised as follows. Section 2 presents the
longitudinal speech corpus. Section 3 presents the different
feature extraction methods. Sections 4 and 5 present the
supervised and the semi-supervised methods. Section 6 presents the details of the proposed method of semi-supervised
classification of emotional speech with multi-view features.
Section 7 reports experimental comparisons of supervised
and semi-supervised methods on a discrimination task. In
the last section, some concluding remarks and the direction
for future works are presented.
2. Home movie: speech corpus
The speech corpus used in our study contains real parent/child interactions and consists of recordings of Italian
mothers as they addressed their infants. It is a collection
of natural and spontaneous interactions. This corpus
contains expressions of non-linguistic communication
(affective intent) conveyed by a parent to a preverbal child.
We decided to focus on the analysis of home movies
(real-life data) as it enables longitudinal study (months or
years) and gives information about the early behaviours
of autistic infants long before the diagnosis was made by
clinicians. However, this large corpus makes it inconvenient for people to review. Additionally, the recordings
were not made by professionals (they were made by parents), resulting in adverse conditions (noise, camera and microphone limitations, etc.). In addition, the recordings
and microphones limitations, etc). In addition, the recordings were made randomly in diverse conditions and situations (interaction situation, dinner, birthday, bath, etc.),
and only parents and other family members (e.g., grandparent, uncle) are present during the recordings.
All sequences were extracted from the Pisa home movies
database, which includes home movies from the first 18
months of life for three groups of children (typically developing, autistic, mentally retarded) (Maestro et al., 2005).
The home movies were recorded by the parents themselves. Each family used its own camera with only
one microphone. Due to the naturalness of home movies
(uncontrolled conditions: TV, many speakers, etc.), we
manually selected a set of videos with at least understandable audio data. The verbal interactions of the infant’s
mother were carefully annotated by two psycholinguists,
independently, into two categories: infant-directed speech
and adult-directed speech. To estimate the agreement
between the two annotators, we computed the Cohen’s
kappa (Cohen, 1960) as a measure of the intercoder agreement. Cohen’s kappa agreement is given by the following
equation:
$$\kappa = \frac{p(a) - p(e)}{1 - p(e)}, \tag{1}$$
where p(a) is the observed probability of agreement
between two annotators, and p(e) is the theoretical probability of chance agreement, using the annotated sample of
data to calculate the probabilities of each annotator. We
found a Cohen’s kappa equal to 0.82 (CI for Confidence
Interval: [95%CI: 0.75–0.90]), measured on 500 samples,
which corresponds to good agreement between the two
annotators.
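As an illustration of Eq. (1), the following minimal sketch computes Cohen's kappa for two annotators. The function name `cohen_kappa` and the toy label sequences are hypothetical and are not taken from the corpus; p(e) is estimated from each annotator's own label frequencies, as described above.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # p(a): observed proportion of items on which the two annotators agree
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p(e): chance agreement, from each annotator's own label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators always use one single label
    return (p_a - p_e) / (1 - p_e)

# Hypothetical toy annotations: IDS = infant-directed, ADS = adult-directed
annotator_1 = ["IDS", "IDS", "ADS", "ADS", "IDS", "ADS", "IDS", "ADS"]
annotator_2 = ["IDS", "IDS", "ADS", "IDS", "IDS", "ADS", "IDS", "ADS"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 7 of 8 items and the chance agreement is 0.5, so the sketch yields a kappa of 0.75.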
From this manual annotation, we randomly extracted
250 utterances for each category. The utterances are typically between 0.5 s and 4 s in length. Fig. 1 shows a distribution of infant-directed speech and adult-directed speech
utterances from 3 periods of the child’s life (0–6 months,
6–12 months and 12–18 months). The total duration of
utterances is about 15 min. Fig. 2 shows the duration distribution of infant-directed speech and adult-directed speech
utterances. It shows that there is no significant difference
between the durations of infant-directed speech and
adult-directed speech utterances.
We randomly divided the database into two parts: unlabelled data U (400 utterances balanced between motherese
and adult-directed speech) and labelled data L
(100 utterances balanced between motherese and adult-directed speech).
Fig. 1. Distribution of infant-directed speech and adult-directed speech
utterances during 3 periods of infant development.
3. Emotional speech characterisation
Feature extraction is an important stage in emotion recognition, and it has been shown that emotional speech can
be characterised by a large number of features (acoustics,
voice quality, prosodic, phonetic, lexical) (Schuller et al.,
2007). However, research on speech characterisation
and feature extraction shows that it is difficult to reach a consensus on emotional speech characterisation.
In this study, we computed time-domain and frequency-domain
features, which are usually investigated in emotion recognition (Truong and van Leeuwen, 2007; Shami and Verhelst,
2007). Moreover, different statistics are applied, resulting in
16 cepstral (f1), 70 prosodic (f2, f3, f4 and f5) and 96 perceptive features (f6, f7, f8 and f9), all of which have been
shown to be the most efficient (Truong and van Leeuwen,
2007; Kessous et al., 2007; Mahdhaoui et al., 2008). We
obtained 9 different feature vectors with different dimensions, which are presented in Table 1.
3.1. Cepstral features
Cepstral features such as MFCC are often successfully
used in speech and emotion recognition. The short-term
cepstral signatures of both infant-directed speech and
adult-directed speech are characterised by 16 MFCC features (often used for emotion recognition), which are
extracted every 20 ms, so the number of resulting feature
vectors is variable and depends on the length of the utterance (frame-level).
Fig. 2. Duration distribution of infant-directed speech and adult-directed speech utterances.
Table 1
Different feature sets.
f1    16 MFCCs
f2    Pitch(Min, Max, Range) + Energy(Min, Max, Range)
f3    35 statistics on the pitch
f4    35 statistics on the energy
f5    35 statistics on the pitch + 35 statistics on the energy
f6    Bark TL + SL + MV (96 statistics)
f7    Bark TL (32 statistics)
f8    Bark SL (32 statistics)
f9    Bark MV (32 statistics)
Table 2
32 statistics.
Maximum, minimum and mean value
Standard deviation
Variance
Skewness
Kurtosis
Interquartile range
Mean absolute deviation (MAD)
MAD based on medians, i.e. MEDIAN(ABS(X-MEDIAN(X)))
First and second coefficients of linear regression
First, second and third coefficients of quadratic regression
9 quantiles corresponding to the following cumulative probability values: 0.025, 0.125, 0.25, 0.375, 0.50, 0.625, 0.75, 0.875, 0.975
Quantile for cumulative probability values 1% and 9% and interquantile range between these two values
Absolute and sign of time interval between maximum and minimum appearances
3.2. Prosodic features
Several studies have shown the relevance of both the
fundamental frequency (F0) and energy features for emotion recognition applications (Truong and van Leeuwen,
2007). F0 and energy were estimated every 20 ms (Boersma
and Weenink, 2005), and we computed 3 statistics for each
voiced segment (segment-based method) (Shami and
Kamel, 2005): the mean, variance and range, for both F0
and short-time energy, resulting in a 6-dimensional vector.
In addition, 32 statistical features, presented in Table 2,
are extracted from the pitch contour and the loudness contour. Three other features are also extracted from these
contours with a histogram and by considering the maximum, the bin index of the maximum and the centre value
of the corresponding bin. These 3 features are relevant
for pitch and energy contour characterisation.
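The histogram-based features and a few of the Table 2 statistics can be sketched as follows. This is only an illustration under stated assumptions: the number of histogram bins (`n_bins`), the function names and the toy F0 contour are hypothetical, since the paper does not specify them, and only a handful of the 32 statistics are shown.

```python
import statistics

def histogram_features(contour, n_bins=10):
    """Return (histogram maximum, bin index of the maximum, centre of that bin)."""
    lo, hi = min(contour), max(contour)
    width = (hi - lo) / n_bins or 1.0          # avoid zero width for a flat contour
    counts = [0] * n_bins
    for value in contour:
        index = min(int((value - lo) / width), n_bins - 1)  # top edge goes to last bin
        counts[index] += 1
    peak = max(counts)
    peak_index = counts.index(peak)
    centre = lo + (peak_index + 0.5) * width
    return peak, peak_index, centre

def basic_statistics(contour):
    """A small subset of the Table 2 statistics (assumption: illustrative only)."""
    median = statistics.median(contour)
    return {
        "max": max(contour),
        "min": min(contour),
        "mean": statistics.fmean(contour),
        "std": statistics.pstdev(contour),
        "mad_median": statistics.median([abs(v - median) for v in contour]),
    }

# Hypothetical F0 contour in Hz, one value per 20 ms frame
f0 = [200, 210, 220, 400, 410, 215, 205, 225, 212, 208]
peak, peak_index, centre = histogram_features(f0, n_bins=5)
summary = basic_statistics(f0)
```

With five bins over 200 to 410 Hz, eight of the ten frames fall in the first bin, so the three histogram features are 8, 0 and 221.0.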
3.3. Perceptive features
Infant-directed speech and adult-directed speech sound
perceptually different (Cooper et al., 1997), and in this
work a Bark filter-bank spectral representation is employed to
investigate these perceptual differences.
The features based on the bark scale are considered to
provide more information by characterising the human
auditory system (Zwicker, 1961; Esposito and Marinaro,
2005). We extracted the bark time/frequency representation using an analysis window duration of 15 ms and a time
step of 5 ms with filters equally spaced by 1 bark (first filter
centred on first bark critical band) (Zwicker and Fastl,
1999). This computation on the full spectrum results in
29 filter bands. This representation can be described as a
discrete perceptive representation of the energy spectrum,
which can be qualified as a perceptive spectrogram. We
then extracted statistical features from this representation
either along the time axis or along the frequency axis, as
shown in Fig. 3. We also considered the average of energy
of the bands (a perceptive Long Term Average Spectrum)
and extracted statistical features from it. Thirty-two statistical features were used and applied (a) along the time axis
(Approach TL), (b) along the frequency axes (Approach
SL) and (c) on the average perceptive spectrum to obtain
a first set of 32 features (Approach MV).
a) Approach TL (for ‘Time Line’), Fig. 3(a): (step 1)
extracting 32 features on the spectral vector of each
time frame, then (step 2) averaging the values for
each of 32 features along the time axis to obtain a second set of 32 features.
b) Approach SL (for ‘Spectral Line’), Fig. 3(b): (step 1)
extracting 32 features along the time axis for each
spectral band and (step 2) averaging the 32 features
along the frequency axis to obtain a third set of 32
features.
c) Approach MV (for ‘Mean Values’): (step 1) averaging
the energy values of the bark spectral bands along the
time axis to obtain a long term average spectrum
using 29 bark bands and (step 2) extracting the 32 statistical features from this average spectrum.
Fig. 3. Method for extraction of bark-based features: along time axis (a) and along frequency axis (b).
The 32 statistical features, presented in Table 2, were
computed to model the dynamic variations of the bark
spectral perceptive representation.
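The three approaches (TL, SL and MV) differ only in the axis along which statistics are extracted and averaged. The sketch below makes this explicit on a toy perceptive spectrogram. It is an illustration under assumptions: `stats` is a reduced stand-in that returns 3 statistics instead of the 32 of Table 2, and the function names and toy values are hypothetical.

```python
from statistics import fmean

def stats(values):
    """Reduced stand-in for the 32 statistics of Table 2: min, max and mean only."""
    return [min(values), max(values), fmean(values)]

def average_columns(rows):
    """Average each statistic over a list of per-frame or per-band statistic vectors."""
    return [fmean(column) for column in zip(*rows)]

def bark_features(spectrogram):
    """spectrogram[t][b] is the energy of Bark band b in time frame t."""
    frames = spectrogram
    bands = list(zip(*spectrogram))
    # TL: statistics of each frame's spectral vector, then averaged along time
    tl = average_columns([stats(frame) for frame in frames])
    # SL: statistics of each band's time trajectory, then averaged along frequency
    sl = average_columns([stats(band) for band in bands])
    # MV: long-term average spectrum first, then statistics of that spectrum
    mv = stats([fmean(band) for band in bands])
    return tl, sl, mv

# Hypothetical toy spectrogram: 2 time frames, 3 Bark bands
toy = [[1, 5, 3],
       [2, 2, 8]]
tl, sl, mv = bark_features(toy)
```

Concatenating the three outputs corresponds to the feature set f6 of Table 1, while each output alone corresponds to f7, f8 and f9.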
4. Supervised classification
The supervised classification assumes that there is
already an existing categorisation of the data. In this classification form, the training data D are presented by an
ensemble X of feature vectors and their corresponding
labels Y:
$$D = \{(x_i, y_i) \mid x \in X,\; y \in Y\}_{i=1}^{n}. \tag{2}$$
Supervised classification consists of two steps: feature
extraction and pattern classification. The feature extraction step consists of characterising the data. After the
extraction of features, supervised classification is used to
categorise the data into classes corresponding to user-defined training classes. This can be implemented using
standard machine learning methods. In this study, four different classifiers, Gaussian mixture models (GMM)
(Reynolds, 1995), k-nearest neighbour (k-NN) classifiers (Duda
et al., 2000), SVM (Chang and Lin, 2001; Vapnik, 1995) and neural networks (MLP) (Eibe and Witten,
1999), were investigated.
In our work, all the classifiers were adapted to provide a
posterior probability to maintain a statistical classification
framework.
4.1. Gaussian mixture models
A Gaussian mixture model is a statistics based model for
modelling a statistical distribution of Gaussian Probability
Density Function (PDF). A Gaussian mixture density is a
weighted sum of M component densities (Reynolds, 1995)
given by:
$$p(x \mid C_m) = \sum_{i=1}^{M} \omega_i \, g_{(\mu_i, \Sigma_i)}(x), \tag{3}$$
where $p(x \mid C_m)$ is the probability density function of class
$C_m$ evaluated at x. Due to the binary classification task,
we define $C_1$ as the “infant-directed speech” class and $C_2$
as “adult-directed speech”. The vector x is a d-dimensional
vector, $g_{(\mu,\Sigma)}(x)$ are the component densities, and $\omega_i$ are the
mixture weights. Each component density is a d-variate
Gaussian function:
$$g_{(\mu,\Sigma)}(x) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\Sigma)}} \, e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)} \tag{4}$$
with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The mixture
weights $\omega_i$ satisfy the following constraint:
$$\sum_{i=1}^{M} \omega_i = 1. \tag{5}$$
The feature vector x is then modelled by the following posterior probability:
$$P_{gmm}(C_m \mid x) = \frac{p(x \mid C_m)\,P(C_m)}{p(x)}, \tag{6}$$
where $P(C_m)$ is the prior probability for class $C_m$, assuming
equal prior probabilities, and p(x) is the overall PDF evaluated at x.
4.2. k-nearest neighbours
The k-NN classifier (Duda et al., 2000) is a non-parametric technique that classifies the input vector with the
label of the majority of the k-nearest neighbours (prototypes). To maintain a common framework with the statistical classifiers, we estimate the posterior probability that a
given feature vector x belongs to class $C_m$ using k-NN estimation (Duda et al., 2000):
$$P_{knn}(C_m \mid x) = \frac{k_m}{k}, \tag{7}$$
where $k_m$ denotes the number of prototypes that belong to
the class $C_m$ among the k nearest neighbours.
4.3. Support vector machines
The support vector machine (SVM) is the optimal margin linear discriminant trained from a sample of l
independent and identically distributed instances:
$(x_1, y_1), \ldots, (x_l, y_l)$, where $x_i$ is the d-dimensional input
and $y_i \in \{-1, +1\}$ is its label in a two-class problem:
$y_i = +1$ if $x_i$ is a positive (+) example, and $y_i = -1$ if $x_i$ is a
negative example.
The basic idea behind SVM is to solve the following
model:
$$\min_{\omega} \; \frac{1}{2}\|\omega\|^{2} + C \sum_{i=1}^{l} \xi_i \tag{8}$$
$$\forall i,\quad y_i(\omega \cdot x_i + b) \ge 1 - \xi_i \tag{9}$$
which is a C-soft margin algorithm where $\omega$ and b are the
weight coefficients and bias term of the separating hyperplane, C is a predefined positive real number and $\xi_i$ are
slack variables (Vapnik, 1998). The first term of the objective function given in (8) ensures the regularisation by minimising the norm of the weight coefficients. The second
term tries to minimise the classification errors by introducing slack variables to allow some classification errors and
then minimising them. The constraint given in (9) is the
separation inequality, which tries to locate each instance
on the correct side of the separating hyperplane. Once $\omega$
and b are optimised, during the test, the discriminant is
used to estimate the labels:
$$\hat{y} = \mathrm{sign}(\omega \cdot x + b) \tag{10}$$
and we choose the positive class if $\hat{y} = +1$ and the negative
class if $\hat{y} = -1$. This model is generalised to learn nonlinear
discriminants by using kernel functions to map x to a new
space and learning a linear discriminant there.
The standard SVM does not provide posterior probabilities. However, to maintain a common framework with
other classifiers, the output of a classifier (SVM) should
be a posterior probability to enable post-processing. Consequently, to map the SVM outputs into probabilities, as
presented in Platt (1999), we must first train an SVM,
and then train the parameters of an additional sigmoid
function. In our work, we used LIBSVM (Chang and
Lin, 2001) with posterior probabilities outputs Psvm(Cmjx).
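The two-stage idea (a linear decision value, then a sigmoid that maps it to a probability) can be sketched as follows. This is not the LIBSVM implementation: the weight vector `w`, the bias `b` and the sigmoid parameters `A` and `B` are hypothetical values chosen for illustration, whereas in practice `w` and `b` come from SVM training and `A` and `B` are fitted on held-out decision values as described in Platt (1999).

```python
import math

def decision_value(w, b, x):
    """Signed score of x with respect to the separating hyperplane: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_label(w, b, x):
    """Label estimate of Eq. (10): sign of the decision value."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def platt_probability(f, A, B):
    """Platt's sigmoid: P(y = +1 | f) = 1 / (1 + exp(A * f + B)), with A < 0."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# Hypothetical trained parameters (assumptions, for illustration only)
w, b = [1.0, -1.0], 0.5
A, B = -2.0, 0.0

x_pos, x_neg = [2.0, 0.5], [0.0, 1.0]
p_pos = platt_probability(decision_value(w, b, x_pos), A, B)
p_neg = platt_probability(decision_value(w, b, x_neg), A, B)
```

The sigmoid is monotonic in the decision value, so with B = 0 the probability exceeds 0.5 exactly when the predicted label is +1.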
4.4. Neural network
The Neural Network structure used in this paper was
the Multilayer Perceptron (MLP). An MLP is a network
of simple neurons called perceptrons. The perceptron computes a single output from multiple real-valued inputs by
forming a linear combination according to its input weights
and then possibly transforming the output by some nonlinear activation function. Mathematically this can be written
as:
$$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) = \varphi(w^{T}x + b) \tag{11}$$
where w denotes the vector of weights, x is the vector of
inputs, b is the bias and $\varphi$ is the activation function.
It is proved in Bishop (1995) that for various parameter
optimisation strategies (such as gradient descent) with minimisation of the Mean Square Error function or Cross-Entropy Error function and the back-propagation technique used to compute derivatives of the error function
with respect to each of the free parameters, the trained network estimates the posterior probabilities of class membership Pmlp(Cmjx) directly.
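All four classifiers of this section are used through their posterior outputs. As a compact illustration of that common framework, the sketch below computes the k-NN estimate of Eq. (7) and the Bayes posterior of Eq. (6) from class-conditional likelihoods. The function names, prototypes and likelihood values are hypothetical, and the Euclidean distance is an assumption, since the paper does not state the metric used.

```python
import math

def knn_posterior(prototypes, x, k, target_class):
    """Eq. (7): fraction k_m / k of the k nearest prototypes that belong to target_class."""
    nearest = sorted(prototypes, key=lambda item: math.dist(item[0], x))[:k]
    k_m = sum(1 for _, label in nearest if label == target_class)
    return k_m / k

def bayes_posterior(likelihoods, priors):
    """Eq. (6): P(C_m | x) = p(x | C_m) P(C_m) / p(x), with p(x) summed over classes."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical 2-D prototypes with class labels 1 and 2
prototypes = [((0, 0), 1), ((0, 1), 1), ((1, 0), 1), ((5, 5), 2), ((6, 5), 2)]
p_knn_3 = knn_posterior(prototypes, (0.5, 0.5), k=3, target_class=1)
p_knn_5 = knn_posterior(prototypes, (0.5, 0.5), k=5, target_class=1)

# Hypothetical class-conditional likelihoods p(x | C_1), p(x | C_2) with equal priors
p_gmm = bayes_posterior([0.2, 0.05], [0.5, 0.5])
```

Because every classifier returns values in [0, 1] that sum to one over the classes, their outputs can later be combined on a common scale.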
5. Semi-Supervised classification
Supervised methods require a large number of labelled
utterances to enable efficient learning in real emotional
speech classification systems. However, the manual annotation of data is very costly and time consuming, so an extensive manual annotation of all the home movies is
unrealistic. Therefore, a learning algorithm with only a
few labelled data is required; i.e., a semi-supervised learning algorithm. In this section, we briefly describe two techniques for semi-supervised learning, namely, self-training
and co-training. Self-training and co-training algorithms
allow a classifier to start with a few labelled examples to
produce an initial weak classifier and later to combine
labelled and unlabelled data to improve the performance.
In the following, let us assume that we have a set L (usually
small) of labelled data, and a set U (usually large) of unlabelled data.
5.1. Self-training
The definition of self-training can be found in different
forms in the literature; however, we adopted the definition
of Nigam and Ghani (2000). In this method, we need only
one classifier and then only one feature set. For several iterations, the classifier labels the unlabelled data and converts
the most confidently predicted examples of each class into a
labelled training example.
Table 3 shows the pseudo-code for a typical self-training
algorithm. The self-training starts with a set of labelled
data L, and builds a classifier h, which is then applied to
the set of unlabelled data U. Only the n best classified utterances are added to the labelled set. The classifier is then
retrained on the new set of labelled examples, and the
process continues for several iterations.
Table 3
Self-training algorithm.
Given:
  a set L of Labelled examples
  a set U of Unlabelled examples
  a number n of examples to be added to L in each iteration
Loop:
  Use L to train the classifier h
  Allow h to label U
  Let T be the n examples in U on which h makes the most confident predictions
  Add T to L
  Remove T from U
End
Notice that only
one classifier is required, with no split of the features.
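The loop of Table 3 can be sketched in a few lines. The base classifier here is a deliberately simple one-dimensional nearest-centroid model, which is an assumption made only to keep the example self-contained (it is not one of the classifiers evaluated in the paper), and all names and toy values are hypothetical.

```python
from statistics import fmean

def train_centroids(labelled):
    """Hypothetical base classifier h: one mean value per class (1-D features)."""
    classes = {y for _, y in labelled}
    return {c: fmean([x for x, y in labelled if y == c]) for c in classes}

def predict(centroids, x):
    """Return (label, confidence); confidence grows with the gap between class distances."""
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    (d_best, label), (d_next, _) = dists[0], dists[1]
    return label, (d_next - d_best) / (d_next + d_best + 1e-12)

def self_training(labelled, unlabelled, n, iterations):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(iterations):
        if not unlabelled:
            break
        h = train_centroids(labelled)                        # Use L to train h
        scored = [(predict(h, x), x) for x in unlabelled]    # Allow h to label U
        scored.sort(key=lambda item: item[0][1], reverse=True)
        best = scored[:n]                                    # T: n most confident
        labelled += [(x, label) for (label, _), x in best]   # Add T to L
        for _, x in best:                                    # Remove T from U
            unlabelled.remove(x)
    return train_centroids(labelled), labelled, unlabelled

# Hypothetical toy data: two labelled points and five unlabelled ones
L0 = [(1.0, 0), (9.0, 1)]
U0 = [1.5, 2.0, 8.0, 8.5, 5.2]
model, L_final, U_final = self_training(L0, U0, n=2, iterations=3)
```

The ambiguous point 5.2 is labelled last, once the clearer examples have already been absorbed into L, which is the intended behaviour of confidence-ordered selection.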
5.2. Co-training
The co-training algorithm proposed in Blum and Mitchell (1998) is a prominent achievement in semi-supervised
learning. This algorithm and the related multi-view learning methods (Brefeld et al., 2006) assume that various classifiers are trained over multiple feature views of the same
labelled examples. These classifiers are encouraged to make
the same prediction on any unlabelled example.
As shown in Table 4, the method initially defines two
classifiers (h1 and h2) on distinct attribute views of a small
set of labelled data (L). Each view is required to be
conditionally independent of the other and sufficient for
learning a classification system. Then, iteratively, each classifier’s predictions on the unlabelled examples are selected
to increase the training data set.
Table 4
Co-training algorithm.
Given:
  a set L of Labelled examples
  a set U of Unlabelled examples
Loop:
  Use L to train classifier h1
  Use L to train classifier h2
  Allow h1 to label p1 positive and n1 negative examples from U
  Allow h2 to label p2 positive and n2 negative examples from U
  Add these self-labelled examples to L
  Remove these self-labelled examples from U
End
For each classifier, the
unlabelled examples classified with the highest confidence
are added to the labelled data set L, so that the two classifiers can contribute to increase the data set L. Both classifiers are re-trained on this augmented data set, and the
process is repeated a given number of times. The rationale
behind co-training is that one given classifier may assign
correct labels to certain examples, while it may be difficult
for others to do so. Therefore, each classifier can increase
the training set by adding examples that are very informative for the other classifier.
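A minimal two-view version of the Table 4 loop is sketched below. As before, the per-view nearest-centroid classifier, the function names, the `per_class` parameter (playing the role of p1 = n1 = p2 = n2) and the toy data are all assumptions for illustration. When the two classifiers disagree on an example, this sketch simply keeps the first label proposed, which is one possible convention and not something the original algorithm prescribes.

```python
from statistics import fmean

def train(labelled, view):
    """Hypothetical base classifier for one view: one centroid per class."""
    return {c: fmean([x[view] for x, y in labelled if y == c])
            for c in {y for _, y in labelled}}

def confidence(centroids, value):
    """Return (label, confidence) for a scalar feature value (two classes assumed)."""
    (d_best, label), (d_next, _) = sorted(
        (abs(value - m), c) for c, m in centroids.items())[:2]
    return label, (d_next - d_best) / (d_next + d_best + 1e-12)

def co_training(labelled, unlabelled, per_class=1, iterations=5):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(iterations):
        if not unlabelled:
            break
        chosen = {}
        for view in (0, 1):
            h = train(labelled, view)            # Use L to train h1, then h2
            scored = [(confidence(h, x[view]), x) for x in unlabelled]
            for cls in h:                        # most confident examples per class
                ranked = sorted((s for s in scored if s[0][0] == cls),
                                key=lambda s: s[0][1], reverse=True)
                for (label, _), x in ranked[:per_class]:
                    chosen.setdefault(x, label)  # first proposal wins on conflict
        labelled += list(chosen.items())         # Add self-labelled examples to L
        unlabelled = [x for x in unlabelled if x not in chosen]  # Remove from U
    return labelled, unlabelled

# Hypothetical examples: each one is a tuple (view-1 feature, view-2 feature)
L0 = [((1.0, 10.0), 0), ((9.0, 90.0), 1)]
U0 = [(2.0, 15.0), (8.0, 85.0), (3.0, 30.0), (7.0, 70.0)]
L_final, U_final = co_training(L0, U0)
```

Each classifier only ever sees its own view, yet both are retrained on the examples contributed by the other, which is the mechanism described in the paragraph above.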
This method can be generalised to be used with a large
number of views. Fig. 4 shows the general architecture of
a generalised co-training method based on multi-view characterisation. It considers v different views. For each iteration, we select an ensemble of $p_i$ positive examples and $n_i$
negative examples that are classified with the highest confidence. Then, we add the ensemble $T = \sum_{i=1}^{v} (p_i + n_i)$ to the
labelled data set L.
These semi-supervised algorithms and their variations
(Goldman and Zhou, 2000) have been applied in many
application areas because of their theoretical justifications
and experimental success.
6. Co-training algorithm based on multi view
characterisation
Many researchers have shown that multiple-view algorithms are superior to single-view method in solving
machine learning problems (Blum and Mitchell, 1998;
Muslea et al., 2000; Zhang and Sun, 2010). Different feature
sets and classifiers (views) can be employed to characterize
speech signals, and each of them may yield different
Fig. 4. Standard existing co-training algorithm based on multi-view characterization.
178
1156
Table 5
The proposed co-training algorithm.
Given:
  a set L of m labelled examples $\{(l_1^1, \ldots, l_1^v, y_1), \ldots, (l_m^1, \ldots, l_m^v, y_m)\}$ with labels $y_i \in \{1, 2\}$
  a set U of n unlabelled examples $\{(x_1^1, \ldots, x_1^v), \ldots, (x_n^1, \ldots, x_n^v)\}$
  v = number of views (classifiers)
Initialization:
  $\omega_k$ (weight of classifier k) = 1/v for all the views
While U is not empty
  A. Classify all the examples of the test database:
    Do for k = 1, 2, ..., v
      1. Use L to train each classifier $h_k$
      2. Classify all examples of U with each $h_k$
      3. Calculate the probability of classification for each example $x_i$ from U: $p(C_j|x_i) = \sum_{k=1}^{v} \omega_k \, h_k(C_j|x_i^k)$
      4. Labels($x_i$) = $\arg\max_j \, p(C_j|x_i)$
    End for
  B. Update the training (L) and test (U) databases:
    $U_j = \{z_1, \ldots, z_{n_j}\}$, the ensemble of examples classified as $C_j$
    Do for i = 1, 2, ..., $n_j$
      $p(C_j|z_i) = \frac{\sum_{k=1}^{v} \omega_k \, h_k(C_j|z_i^k)}{\sum_{k=1}^{v} \omega_k}$
    End for
    $\mathrm{margin}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} p(C_j|z_i)$
    Take as $T_j$ the examples of $U_j$ that were classified as $C_j$ with a probability greater than $\mathrm{margin}_j$.
    $T = \sum_j T_j$
    Add T to L and remove it from U
  C. Update weights: $\omega_k = \frac{\sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)}{\sum_{k=1}^{v} \sum_{i=1}^{\mathrm{size}(T)} h_k(z_i^k)}$
End While
prediction results. Therefore, the best solution is to use multiple-characterisation (views = feature + classifier) together
to predict the common class variable. Thus, the generalised
co-training algorithm shown in Fig. 4 uses different views for
classification. In the multi-view approach, the labelled data are represented by
$\{(x_1^1, \ldots, x_1^v, y_1), \ldots, (x_m^1, \ldots, x_m^v, y_m)\}$,
where v is the number of views, $y_i$ are the corresponding
labels, and m is the number of labelled examples.
However, the standard co-training algorithm does not
allow the fusion of different views in the same framework
to produce only one prediction per utterance. It takes the
prediction of each classifier separately. To overcome this
problem, we propose a co-training procedure that iteratively trains a base classifier within each view and then combines the classification results to obtain a single estimate for
each observation. The proposed algorithm is a novel form
of co-training, which is more suitable for problems involving both semi-supervised classification and data fusion.
The goal of the proposed co-training method is to incorporate all the information available from the different views
to accurately predict the class variable. Each group of features provides its own perspective, and the performance
improvements are obtained through the synergy between
the different views. The co-training framework is based on
the cooperation of different classifiers for the improvement
of classification rates. Each of them gives an individual prediction weighted by its classification confidence value. This
problem has a strong similarity to data fusion, which
involves incorporating several disparate groups of views
into a common framework for modelling data.
This algorithm is designed to improve the performance
of a learning machine with a few labelled utterances and
a large number of cheap unlabelled utterances.
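The fusion, selection and weight-update steps listed in Table 5 can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' code: the nested-list layout `posteriors[k][i][j]`, the function names, the reading of $h_k(z_i^k)$ in the weight update as view k's output for the class finally assigned to $z_i$, and the fallback to uniform weights when nothing is selected are all choices made for this sketch.

```python
from typing import Dict, List

# posteriors[k][i][j]: output of view k for unlabelled example i and class j
# (hypothetical layout chosen for this sketch).
Posteriors = List[List[List[float]]]


def fuse(posteriors: Posteriors, weights: List[float]) -> List[List[float]]:
    """Weighted sum of the view outputs, normalised by the total weight."""
    total = sum(weights)
    n_examples, n_classes = len(posteriors[0]), len(posteriors[0][0])
    return [[sum(w * view[i][j] for w, view in zip(weights, posteriors)) / total
             for j in range(n_classes)]
            for i in range(n_examples)]


def select_above_margin(fused: List[List[float]]) -> Dict[int, int]:
    """Label each example by argmax, then keep those whose fused probability
    exceeds the mean confidence (margin) of their class."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in fused]
    selected: Dict[int, int] = {}
    for j in set(labels):
        members = [i for i, y in enumerate(labels) if y == j]
        margin = sum(fused[i][j] for i in members) / len(members)
        for i in members:
            if fused[i][j] > margin:
                selected[i] = j
    return selected


def update_weights(posteriors: Posteriors, selected: Dict[int, int]) -> List[float]:
    """New weight of view k: its summed output on the selected examples,
    normalised over all views (assumed interpretation of step C)."""
    contrib = [sum(view[i][j] for i, j in selected.items()) for view in posteriors]
    total = sum(contrib)
    if total == 0:  # assumption: keep uniform weights if nothing was selected
        return [1.0 / len(posteriors)] * len(posteriors)
    return [c / total for c in contrib]


# Toy example: two views, four unlabelled utterances, two classes.
posteriors = [
    [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]],  # view 1
    [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]],  # view 2
]
weights = [0.5, 0.5]
fused = fuse(posteriors, weights)
selected = select_above_margin(fused)
new_weights = update_weights(posteriors, selected)
```

In this toy case the first and last utterances exceed their class margins, and the view that was more confident on them receives the larger weight for the next iteration.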
Given a set L of labelled utterances, a set U of unlabelled
utterances, and a set of different feature views Vi, the algorithm works as described in Table 5 and Fig. 5. First, to initialise the algorithm, we found the best feature set for each
classifier, as presented in Table 8. Second, we set all of the
initial weights equally so that $\omega_k = 1/v$, where v is the number of views (9 in our case). Third, while the unlabelled
database U is not empty, we repeat the following:
Classification: to classify all the unlabelled utterances,
the class of each utterance is obtained using a decision
function. In our case we take the class with the maximum likelihood; other decision functions can also be used.
Update the labelled and unlabelled databases: first we
take as U1 the utterances from U classified on Class 1
and as U2 those classified on Class 2. We then calculate the
classification confidence of each utterance and its mean over each class, which we
call the margin. In this step all the
classifiers cooperate to produce a single prediction, by combining the classifier outputs using a simple weighted sum:
$p(C_j|z_i) = \frac{\sum_{k=1}^{v} \omega_k \, h_k(C_j|z_i^k)}{\sum_{k=1}^{v} \omega_k}$,    (12)
$\mathrm{margin}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} p(C_j|z_i)$,    (13)
where $z_i^k$ is the feature view to be classified on the class $C_j$,
$\omega_k$ is the weight of the classifier $h_k$, v is the number of views
and $n_j$ is the number of segments classified on class $C_j$. The
margin value is in the interval [0, 1]. This number can be
interpreted as a measure of confidence, as is done for
SVM (Schapire and Singer, 1999). Then we take $T_j$ to be
the utterances from $U_j$ that were classified on class $C_j$ with
a probability greater than the mean value of classification
confidence (margin) of that class.
Update weights: finally, we update the weights of each
view, as described in Table 5. The new weight of each
classifier is proportional to its contribution to the final
classification. In other words, the weights of efficient
classifiers will be increased.
Fig. 5. Structure of the proposed co-training algorithm.
7. Experimental results
7.1. Experimental setup
Motherese detection is a binary classification problem,
and from a given confusion matrix we have different decisions: true/false positive (TP, FP) and true/false negative
(TN, FN). For supervised classification, we evaluated, using
10-fold cross-validation, the accuracy rate to compare the
performances of the different separate classifiers: (TP + TN)/
(TP + TN + FP + FN). We optimized the parameters of
the different classifiers, such as the number M of component densities for
GMM, the optimal number k of neighbours for k-NN, the optimal
kernel for SVM and the number of cells for MLP.
For the semi-supervised classification, the performance
of the classification system is given for different data sets.
First we randomly selected an ensemble U containing 400
examples and an ensemble L containing 100 examples balanced between motherese and adult-directed speech. Then,
in order to study the influence of the quantity of supervised learning data, we performed several experiments with
different numbers of labelled data, from 10% (10 examples)
to 100% (100 examples).
Note that for the standard co-training algorithm, we first
ran the algorithm with only two classifiers (the two best classifiers), as proposed in Blum and
Mitchell (1998) (Table 4), and then ran it
using all the classifiers, as shown in Fig. 4.
The supervised and semi-supervised classification systems were evaluated on multi-speaker data (speaker-independent). The speech segments were randomly extracted
from 10 home movies (10 different mothers). In addition,
as shown in Fig. 1, the speech segments were extracted from
three different periods of time (semester 1, semester 2, semester 3), which increases the data diversity since the voice
of the mothers changes from one semester to another.
7.2. Results of supervised classifiers
The performance of the different classifiers, each trained
with different feature sets (f1, f2, ..., f9), was evaluated on
the home movies database.
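The accuracy rate used to compare the classifiers in this section, (TP + TN)/(TP + TN + FP + FN), can be computed from binary labels as in the following sketch. The function names and the toy labels are hypothetical and serve only to illustrate the formula.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy rate (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total


# Toy example: 1 = motherese, 0 = adult-directed speech (made-up labels).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```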
Table 6 shows the best results of all the classifiers trained
with different feature sets. The best result was obtained
with GMM trained with cepstral MFCC (72.8% accuracy),
and the second best result was obtained with k-NN trained
with f4 (35 statistics on energy; 68.5% accuracy). Therefore, Table 6 shows
that cepstral MFCC outperforms the other features.
Regarding the prosodic features, best results are not
obtained with a GMM classifier but with k-NN and
SVM classifiers. In addition to the GMM, perceptive features provide satisfactory results using the MLP classifier.
To summarise, comparing the results of different feature
sets and taking into account the different classifiers, the
best performing feature set for infant-directed speech discrimination appears to be the cepstral MFCC. Regarding
the classifiers, we can observe that GMMs generalise better
over different test cases than the other classifiers do.
7.3. Results of semi-supervised classifiers
The algorithm works as described in Fig. 5. To initialise
the co-training algorithm, we consider the best configuration of each feature set over all the supervised classifiers,
using 10-fold cross-validation. We obtained 9 classifiers
(views) h1 to h9 as described in Table 8.
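As an illustration of this initialisation, the sketch below picks, for each feature set, the classifier with the highest accuracy. The accuracy values are transcribed from Table 6; the dictionary layout and the function name `best_classifiers` are assumptions of this sketch. Ties are reported rather than broken: for f5, k-NN and SVM both reach 65.5%, and Table 8 lists SVM.

```python
from typing import Dict, List

# Accuracy (%) per feature set and classifier, transcribed from Table 6.
ACCURACY: Dict[str, Dict[str, float]] = {
    "f1": {"GMM": 72.8, "k-NN": 57.7, "SVM": 59.4, "MLP": 61.4},
    "f2": {"GMM": 59.5, "k-NN": 55.7, "SVM": 54.7, "MLP": 50.2},
    "f3": {"GMM": 54.7, "k-NN": 55.0, "SVM": 50.0, "MLP": 50.0},
    "f4": {"GMM": 67.0, "k-NN": 68.5, "SVM": 65.5, "MLP": 58.5},
    "f5": {"GMM": 62.1, "k-NN": 65.5, "SVM": 65.5, "MLP": 54.5},
    "f6": {"GMM": 61.0, "k-NN": 50.5, "SVM": 49.0, "MLP": 54.5},
    "f7": {"GMM": 55.5, "k-NN": 51.0, "SVM": 52.0, "MLP": 58.5},
    "f8": {"GMM": 65.0, "k-NN": 52.0, "SVM": 50.5, "MLP": 55.5},
    "f9": {"GMM": 58.8, "k-NN": 50.5, "SVM": 50.5, "MLP": 64.0},
}


def best_classifiers(table: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """For each feature set, list the classifier(s) with the top accuracy."""
    best: Dict[str, List[str]] = {}
    for feature, scores in table.items():
        top = max(scores.values())
        best[feature] = [name for name, acc in scores.items() if acc == top]
    return best


views = best_classifiers(ACCURACY)
```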
Table 6
Accuracy (%) of separate classifiers using 10-fold cross-validation.
Feature set            GMM    k-NN   SVM    MLP
Cepstral feature
  f1                   72.8   57.7   59.4   61.4
Prosodic features
  f2                   59.5   55.7   54.7   50.2
  f3                   54.7   55.0   50.0   50.0
  f4                   67.0   68.5   65.5   58.5
  f5                   62.1   65.5   65.5   54.5
Perceptive features
  f6                   61.0   50.5   49.0   54.5
  f7                   55.5   51.0   52.0   58.5
  f8                   65.0   52.0   50.5   55.5
  f9                   58.8   50.5   50.5   64.0
The classification accuracy of the co-training algorithm
using multi-view feature sets with different numbers of
annotations is presented in Fig. 6 and Table 7. It can be
seen that our method can achieve efficient results in
infant-directed speech discrimination.
To further illustrate the advantage of the proposed
method, Table 7 and Fig. 6 show a direct comparison
between our co-training algorithm and the standard co-training algorithm. Our method achieves better results, 75.8% vs. 71.5%, using 100 labelled utterances.
In addition, Fig. 6 and Table 7 show that the performance
of the standard co-training algorithm that uses all the classifiers is worse than the performance of the algorithm using
only two classifiers, especially when few
labelled data are available for training. Although the standard co-training algorithm has been shown to be promising for different classification problems, it suffers from issues of divergence, where
errors in the newly classified data could cause the system
to run off track (Carlson, 2009). One approach to overcome this problem is to combine the different predictions
given by the different classifiers, so that all the classifiers
cooperate to obtain only one prediction per utterance.
The proposed co-training algorithm offers a framework
to take advantage of co-training learning and data fusion.
It combines the various features and classifiers in a co-training framework.
In addition, to illustrate the advantage of the proposed
multi-view method, especially in cases with very few annotations, we compare our method with the self-training
method with a single view. In our study, we investigated
the basic self-training algorithm, which replaces multiple
classifiers in the co-training procedure with the best classifier that employs the most efficient feature. We computed
GMM with the cepstral MFCC (h1) and prosodic features
(h2), and at each iteration we take only the utterance with
the best posterior probability.
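One step of this basic self-training scheme can be sketched as follows. This is an illustration only: `posterior` stands for any already trained single-view classifier that returns class posteriors (a hypothetical callable, not a function from the study), and the retraining that would follow each step is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def self_training_step(labelled: List[Tuple[Sequence[float], int]],
                       unlabelled: List[Sequence[float]],
                       posterior: Callable[[Sequence[float]], Sequence[float]]):
    """Move the single utterance with the best posterior probability from the
    unlabelled set to the labelled set, labelled with its predicted class."""
    best_index, best_label, best_prob = -1, -1, -1.0
    for i, x in enumerate(unlabelled):
        probs = posterior(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > best_prob:
            best_index, best_label, best_prob = i, label, probs[label]
    if best_index < 0:  # nothing left to label
        return labelled, unlabelled
    new_labelled = labelled + [(unlabelled[best_index], best_label)]
    new_unlabelled = unlabelled[:best_index] + unlabelled[best_index + 1:]
    return new_labelled, new_unlabelled


# Toy example: the "features" are already class posteriors, so the
# hypothetical classifier is the identity function.
pool = [[0.6, 0.4], [0.1, 0.9], [0.55, 0.45]]
L1, U1 = self_training_step([], pool, posterior=lambda x: x)
```

Because a single classifier labels its own training data, an early mistake is fed back into training, which is the divergence risk discussed in this section.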
Fig. 6 and Table 7 show a comparison between our co-training method and the self-training method. It can be
seen that our method outperforms the self-training
method, 75.8% vs. 70.3%, with 100 labelled utterances. In
addition, the proposed co-training method gives a satisfactory result in the case of very few annotations, 66.8% with
10 labelled utterances vs. 52.0% for the self-training
method. Comparing self-training and supervised methods,
Fig. 6 shows that the supervised algorithm outperforms
the self-training algorithm, since the self-training method suffers from issues of divergence (high risk of divergence)
(Carlson, 2009). The self-training algorithm makes errors
in the first iteration; the error rate therefore becomes
significant, and the classifier then learns from falsely classified examples. The risk of divergence is the major problem of the self-training algorithm (Carlson, 2009).
In addition, to illustrate the importance of the
semi-supervised method, we compared the performance of
the proposed semi-supervised method and the best supervised method (GMM-MFCC) using different numbers of
annotations (from 10 labelled data to 100 labelled data).
Fig. 6 and Table 7 show that the proposed co-training
method outperforms the supervised method, especially with
limited labelled data for training (always 400 utterances for
testing), 66.8% vs. 55.0% with 10 labelled utterances.
Moreover, Fig. 8 demonstrates that the proposed co-training algorithm performs better in the first several iterations (93.5% accuracy in the first iteration). This result is
quite reasonable because, as shown in Fig. 7, there are
many more correctly classified than falsely classified utterances in the first iteration (101 correctly classified utterances vs. 7 falsely classified utterances, i.e. 101/108 ≈ 93.5%). However, the
performance of the classification decreases in the last iterations because we are retraining the system on
utterances that were misclassified in previous iterations.
Fig. 6. Classification accuracy with different numbers of annotations.
Table 7
Classification accuracy (%) with different numbers of annotations for training and 400 utterances for testing.
Number of annotations for training                        10    20    30    40    50    60    70    80    90    100
Proposed co-training method                               66.8  65.3  63.5  67.0  69.8  72.3  72.5  71.8  74.0  75.8
Co-training standard (using h1 and h4)                    63.5  62.5  62.0  64.5  68.5  69.8  71.3  69.5  71.0  71.5
Co-training standard (using all the classifiers h1-h9)    57.0  58.5  58.5  61.0  64.0  67.0  67.3  68.0  69.0  68.5
Self-training (using h1: MFCC-GMM)                        52.0  50.0  50.0  54.0  55.0  62.5  61.0  65.0  69.0  70.3
Self-training (using h2: prosody-GMM)                     54.0  52.5  53.0  52.0  53.5  58.0  59.0  62.0  64.5  67.8
Supervised method: MFCC-GMM (best configuration)          55.0  59.3  59.5  61.5  68.5  71.0  70.0  69.8  71.5  72.8
Table 8
Initialization of co-training algorithm.
Classifiers (views)    Combination
h1                     GMM trained with f1
h2                     GMM trained with f2
h3                     k-NN trained with f3
h4                     k-NN trained with f4
h5                     SVM trained with f5
h6                     GMM trained with f6
h7                     MLP trained with f7
h8                     GMM trained with f8
h9                     MLP trained with f9
8. Conclusion
In this article, a co-training algorithm was presented to
combine different views to predict the common class variable for emotional speech classification. Our goal was to
develop a motherese detector by computing multi-features
and multi-classifiers to automatically discriminate pre-segmented infant-directed speech segments from manually
pre-segmented adult-directed segments, so as to enable
the study of parent-infant interactions and the investigation of the influence of this kind of speech on interaction
engagement. By using the more conventional features often
used in emotion recognition, such as cepstral MFCC, and
other features, including prosodic features with some statistics on the pitch and energy and bark features, we were
able to automatically discriminate infant-directed speech
segments. Using classification techniques that are often
used in speech/emotion recognition (GMM, k-NN, SVM
and MLP), we developed different classifiers and
tested them on a real-life home movies database. Our experimental results show that spectral features alone contain
much useful information for discrimination because they
outperform all other features investigated in this study.
Thus, we can conclude that cepstral MFCC alone can be
used effectively to discriminate infant-directed speech.
However, this method requires a large amount of
labelled data. Therefore, we investigated a semi-supervised
approach that combines labelled and unlabelled data for
classification. The proposed semi-supervised classification
framework allows the combination of multi-features and
the dynamic penalisation of each classifier by iteratively
calculating its classification confidence. The experimental
results demonstrate the efficiency of this method.
For our infant-directed speech classification experiments, we used only utterances that were already segmented (based on a human transcription). In other
words, the automatic segmentation of infant-directed
speech was not investigated in this study, but it can be
addressed in a follow-up study. Automatic infant-directed
speech segmentation can be seen as a separate problem,
which gives rise to other interesting questions, such as
how to define the beginning and the end of infant-directed
speech, and what kind of evaluation measures to use.
In addition, other issues remain to be investigated in the
future. We plan to test our semi-supervised classification
method on larger emotional speech databases. It will then
be interesting to investigate the complementarities of the
different views by analysing the evolution of the weights of
each classifier and to compare our algorithm with other
semi-supervised algorithms, especially algorithms using
multi-view features.
Fig. 7. Number of accurately and falsely classified utterances by iteration. [Plot: number of true and false classified utterances (axis from 0 to 120) against iterations 1 to 15.]
Fig. 8. Accuracy by iteration. [Plot: accuracy (axis from 0.74 to 0.94) against iterations 1 to 15.]
Acknowledgments
The authors would like to thank Filippo Muratori and
Fabio Apicella from the Scientific Institute Stella Maris of the
University of Pisa, Italy, who provided the data (family
home movies). We would also like to extend our thanks
to David Cohen and his staff, Raquel Sofia Cassel and
Catherine Saint-Georges, from the Department of Child
and Adolescent Psychiatry, AP-HP, Groupe Hospitalier
Pitié-Salpêtrière, Université Pierre et Marie Curie, Paris,
France, for their collaboration and for the manual database
annotation and data analysis. Finally, this work has been
partially funded by La Fondation de France.
References
American Psychiatric Association, 1994. The Diagnostic and Statistical Manual of Mental
Disorders, IV. Washington, D.C.
Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford
University Press.
Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with
co-training. In: Conf. on Computational Learning Theory.
Boersma, P., Weenink, D., 2005. Praat, doing phonetics by computer. Tech.
rep., Institute of Phonetic Sciences, University of Amsterdam, The Netherlands. URL <www.praat.org>.
Brefeld, U., Gaertner, T., Scheffer, T., Wrobel, S., 2006. Efficient co-regularized
least squares regression. In: Internat. Conf. on Machine Learning.
Burnham, C., Kitamura, C., Vollmer-Conna, U., 2002. What’s new
pussycat: on talking to animals and babies. Science 296, 1435.
Carlson, A., 2009. Coupled semi-supervised learning, Ph.D. thesis,
Carnegie Mellon University, Machine Learning Department.
Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. Tech.
rep., Department of Computer Science, National Taiwan University, Taipei.
URL http://www.csie.ntu.edu.tw/cjlin/libsvm/.
Chetouani, M., Mahdhaoui, A., Ringeval, F., 2009. Time-scale feature
extractions for emotional speech characterization. Cognitive Computation 1, 194–201.
Cohen, J., 1960. A coefficient of agreement for nominal scales.
Educational and Psychological Measurement 20, 37–46.
Cooper, R., Aslin, R., 1990. Preference for infant-directed speech in the
first month after birth. Child Development 61, 1584–1595.
Cooper, R., Abraham, J., Berman, S., Staska, M., 1997. The development
of infants' preference for motherese. Infant Behavior and Development
20 (4), 477–488.
Duda, R., Hart, P., Stork, D., 2000. Pattern Classification, second ed.
Wiley-Interscience.
Eibe, F., Witten, I., 1999. Data mining: practical machine learning tools
and techniques with Java implementations. The Morgan Kaufmann
Series in Data Management Systems.
Esposito, A., Marinaro, M., 2005. Nonlinear speech modeling and
applications. Springer, Berlin, Ch. Some notes on nonlinearities of
speech, pp. 1–14.
Fernald, A., 1985. Four-month-old infants prefer to listen to motherese.
Infant Behavior and Development 8, 181–195.
Fernald, A., Kuhl, P., 1987. Acoustic determinants of infant preference for
motherese speech. Infant Behavior and Development 10, 279–293.
Goldman, S., Zhou, Y., 2000. Enhancing supervised learning with
unlabeled data. In: Internat. Conf. on Machine Learning, pp. 327–334.
Grieser, D., Kuhl, P., 1988. Maternal speech to infants in a tonal
language: support for universal prosodic features in motherese.
Developmental Psychology 24, 14–20.
Inoue, T., Nakagawa, R., Kondou, M., Koga, T., Shinohara, K.,
2011. Discrimination between mothers' infant- and adult-directed speech using hidden Markov models. Neuroscience Research, 1–9.
Kessous, L., Amir, N., Cohen, R., 2007. Evaluation of perceptual time/
frequency representations for automatic classification of expressive
speech. In: paraling.
Kuhl, P., 2004. Early language acquisition: cracking the speech code.
Nature Reviews Neuroscience 5, 831–843.
Laznik, M., Maestro, S., Muratori, F., Parlato, E., 2005. Au commencement était la voix, Ramonville Saint-Agne: Eres, Ch. Les interactions
sonores entre les bébés devenus autistes et leurs parents, pp. 81–171.
Maestro, S., Muratori, F., Cavallaro, M., Pecini, C., Cesari, A., Paziente,
A., Stern, D., Golse, B., Palasio-Espasa, F., 2005. How young children
treat objects and people: an empirical study of the first year of life in
autism. Child Psychiatry and Human Development 35 (4), 383–396.
Mahdhaoui, A., Chetouani, M., Zong, C., 2008. Motherese detection
based on segmental and supra-segmental features. In: Internat. Conf.
on Pattern Recognition-ICPR, pp. 8–11.
Mahdhaoui, A., Chetouani, M., Zong, C., Cassel, R., Saint-Georges,
C., Laznik, M.-C., Maestro, S., Apicella, F., Muratori, F., Cohen,
D., 2009. Multimodal signals: cognitive and algorithmic issues.
Springer, Ch. Automatic Motherese detection for face-to-face
interaction analysis, pp. 248–55.
Mahdhaoui, A., Chetouani, M., Zong, C., Cassel, R., Saint-Georges,
C., Laznik, M.-C., Maestro, S., Apicella, F., Muratori, F., Cohen,
D., 2011. Computerized home video detection for motherese may
help to study impaired interaction between infants who become
autistic and their parents. International Journal of Methods in
Psychiatric Research 20 (1), e6–e18.
Muratori, F., Maestro, S., 2007. Autism as a downstream effect of primary
difficulties in intersubjectivity interacting with abnormal development
of brain connectivity. International Journal Dialogical Science Fall 2
(1), 93–118.
Muslea, I., Minton, S., Knoblock, C., 2000. Selective sampling with
redundant views. In: Proc. Association for the Advancement of
Artificial Intelligence, pp. 621–626.
Nigam, K., Ghani, R., 2000. Analyzing the effectiveness and applicability
of co-training. In: 9th Internat. Conf. on Information and Knowledge
Management, pp. 86–93.
Nigam, K., McCallum, A., Thrun, S., Mitchell, T., 2000. Text classification from labeled and unlabeled documents using EM. In: Internat. Conf.
on Machine Learning.
Platt, J., 1999. Advances in Large Margin Classifiers. MIT Press,
Cambridge, MA, Chapter: Probabilistic outputs for SVM and comparison to regularized likelihood methods.
Reynolds, D., 1995. Speaker identification and verification using Gaussian
mixture speaker models. Speech Communication 17 (1-2), 91–108.
Saint-Georges, C., Cassel, R., Cohen, D., Chetouani, M., Laznik, M.,
Maestro, S., Muratori, F., 2010. What studies of family home movies
can teach us about autistic infants: a literature review. Research in
Autism Spectrum Disorders 4 (3), 355–366.
Schapire, R., Singer, Y., 1999. Improved boosting algorithms using
confidence-rated predictions. Machine Learning 37 (3), 297–336.
Schuller, B., Batliner, A., Seppi, D., Steidl, S., Vogt, T., Wagner, J.,
Devillers, L., Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.,
2007. The relevance of feature type for the automatic classification of
emotional user states: low level descriptors and functionals. In:
Interspeech, pp. 2253–2256.
Shami, M., Kamel, M., 2005. Segment-based approach to the recognition
of emotions in speech. In: IEEE Multimedia and Expo.
Shami, M., Verhelst, W., 2007. An evaluation of the robustness of existing
supervised machine learning approaches to the classification of
emotions in speech. Speech Communication 49, 201–212.
Slaney, M., McRoberts, G., 2003. Babyears: a recognition system for
affective vocalizations. Speech Communication 39, 367–384.
Truong, K., van Leeuwen, D., 2007. Automatic discrimination between
laughter and speech. Speech Communication 49, 144–158.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer,
New York.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Zhang, Q., Sun, S., 2010. Multiple-view multiple-learner active learning.
Pattern Recognition 43 (9), 3113–3119.
Zhou, D., Schölkopf, B., Hofmann, T., 2005. Advances in Neural
Information Processing Systems (NIPS) 17. MIT Press, Cambridge,
MA, Ch. Semi-Supervised Learning on Directed Graphs, pp. 1633–1640.
Zhu, X., Lafferty, J., Ghahramani, Z., 2003. Semi-supervised learning
using gaussian fields and harmonic functions. In: Internat. Conf. on
Machine Learning, pp. 912–919.
Zwicker, E., 1961. Subdivision of the audible frequency range into critical
bands. Journal of the Acoustical Society of America 33 (2), 248.
Zwicker, E., Fastl, H., 1999. Psychoacoustics: Facts and Models. Springer
Verlag, Berlin.
PLOS ONE
Work carried out as part of the doctoral theses of Catherine Saint-Georges and
Ammar Mahdhaoui.
Do Parents Recognize Autistic Deviant Behavior Long
before Diagnosis? Taking into Account Interaction Using
Computational Methods
Catherine Saint-Georges1,2, Ammar Mahdhaoui2, Mohamed Chetouani2, Raquel S. Cassel1,2, Marie-Christine Laznik3, Fabio Apicella4, Pietro Muratori4, Sandra Maestro4, Filippo Muratori4, David Cohen1,2*
1 Department of Child and Adolescent Psychiatry, AP-HP, Groupe Hospitalier Pitié-Salpêtrière, Université Pierre et Marie Curie, Paris, France, 2 Institut des Systèmes
Intelligents et de Robotique, CNRS UMR 7222, Université Pierre et Marie Curie, Paris, France, 3 Department of Child and Adolescent Psychiatry, Association Santé Mentale
du 13ème, Paris, France, 4 Division of Child Neurology and Psychiatry, Stella Maris Scientific Institute, University of Pisa, Calombrone, Italy
Abstract
Background: To assess whether taking into account interaction synchrony would help to better differentiate autism (AD)
from intellectual disability (ID) and typical development (TD) in family home movies of infants aged less than 18 months, we
used computational methods.
Methodology and Principal Findings: First, we analyzed interactive sequences extracted from home movies of children
with AD (N = 15), ID (N = 12), or TD (N = 15) through the Infant and Caregiver Behavior Scale (ICBS). Second, discrete
behaviors between baby (BB) and Care Giver (CG) co-occurring in less than 3 seconds were selected as single interactive
patterns (or dyadic events) for analysis of the two directions of interaction (CG→BB and BB→CG) by group and semester. To
do so, we used a Markov assumption, a Generalized Linear Mixed Model, and non-negative matrix factorization. Compared
to TD children, BBs with AD exhibit a growing deviant development of interactive patterns whereas those with ID rather
show an initial delay of development. Parents of AD and ID do not differ very much from parents of TD when responding to
their child. However, when initiating interaction, parents use more touching and regulation up behaviors as early as the first
semester.
Conclusion: When studying interactive patterns, deviant autistic behaviors appear before 18 months. Parents seem to feel
the lack of interactive initiative and responsiveness of their babies and try to increasingly supply soliciting behaviors. Thus
we stress that credence should be given to parents’ intuition as they recognize, long before diagnosis, the pathological
process through the interactive pattern with their child.
Citation: Saint-Georges C, Mahdhaoui A, Chetouani M, Cassel RS, Laznik M-C, et al. (2011) Do Parents Recognize Autistic Deviant Behavior Long before Diagnosis?
Taking into Account Interaction Using Computational Methods. PLoS ONE 6(7): e22393. doi:10.1371/journal.pone.0022393
Editor: James G. Scott, The University of Queensland, Australia
Received February 14, 2011; Accepted June 21, 2011; Published July 27, 2011
Copyright: © 2011 Saint-Georges et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This study was supported by the Fondation de France and the Université Pierre et Marie Curie. The funders had no role in study design, data collection
and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
PLoS ONE | www.plosone.org | July 2011 | Volume 6 | Issue 7 | e22393
Early Parental Adaptation to Their Autistic Infant
Introduction
Early signs of autism
Autism is a severe psychiatric syndrome characterized by the
presence of abnormalities in reciprocal social interactions,
abnormal patterns of communication, and restricted and stereotyped behaviours starting before age 3 [1]. Autism is now a well-defined clinical syndrome after the third year of life, and
considerable progress in understanding its emergence in the first
two years of life has been achieved [2,3]. Although there have
been significant advances in describing single or multiple early
signs, detecting autism at an early age remains
challenging. Home movies (i.e., naturalistic films recorded by
parents during the first years of life) and direct observations of at-risk
infants are the two most important sources of information for
overcoming this problem. They have both described children with
autism disorder (AD) during the first 18 months as not displaying
the rigid patterns described in older children. In particular, AD
children can gaze at people, turn toward voices and express
interest in communication as typically developing (TD) infants do
[4,5]. However, in several studies, children who later develop AD
show, as early as the first year, less social behavior (e.g., looking
at others, especially at the face), communication skills (e.g.,
responding to name), inter-subjective initiative, and emotion
expression than TD infants. In the second year, early social signs
intensify; expressive and receptive language fails to develop, while
the lack of inter-subjective skills and of emotional expression
persists [4,5]. These insights from home movies have been
confirmed in studies of at-risk children [6,7,8,9] and in studies
using retrospective data from parental interviews to assess early
signs of AD (Guinchat et al., in revision). As regards specificity,
signs that differentiate AD children from children with intellectual
disability (ID) are limited to the second year: fewer responses to
name, fewer glances to others, lower eye contact quality and
quantity, less positive facial expression and fewer inter-subjective
behaviors (e.g., showing shared attention) [4,5]. To further
investigate early signs in the interactive field, Muratori et al. [10]
studied home movies of the first three semesters of life from AD,
ID and TD children with independent scoring of both baby (BB)
and caregiver (CG) behaviors and timing. AD infants displayed
impairments in "syntony", "maintaining social engagement",
"accepting invitation" and in "orienting to their name" (definitions are given in Table 1) as early as the first year of life in
comparison with TD children. At semester 3, some items
differentiated AD from TD while for other items AD showed
significantly lower scores compared to ID. In addition, they noted
that AD babies received less action than ID from their CG to
regulate down their arousal and mood.
Taking into account interaction
One of the main limitations of these studies is that they have
not or only poorly taken into account the importance of BB/
CG synchrony and reciprocity in the early interactions [11]. As
it is of seminal importance to have more insight not only into
early social competencies of infants who are developing autism
but also into interactive situations where they preferentially
Table 1. Infant's and caregiver's behaviors and meta-behaviors from the Infant and Caregiver Behavior Scale (ICBS).
Metabehavior
Item Behavior
Glossary
Child Behaviors (N = 29)
Behavior with
object
Vocali-zations
Orienting toward object
Gaze Following an object
The child shifts his/her gaze to follow the trajectory of an object.
Explorative activity with object
The child touches something by hands, mouth or other sensory-motor actions, to find out what it feels like.
Looking at object/around
The child directs his/her eyes towards an object, or simply looks around.
Smiling at object
The child intentionally smiles at object.
Enjoying with object
The child finds pleasure and satisfaction experiencing a physical or visual contact with an object.
Seeking contact with object
The child employs spontaneous and intentional movements to reach contact with an object.
Simple Vocalisation
The child produces sounds towards people or objects.
Crying
Orienting
Orienting toward people
toward people
Receptive to
people
The child directs his/her gaze towards a source of new sensory stimulation coming from an object
The child starts crying after a specific/non specific event.
The child directs his/her gaze towards a source of new sensory stimulation coming from a people
Gaze Following a person
The child shifts his/her gaze to follow the trajectory of another person.
Explorative activity with person
The child touches a person to find out what it feels like (by hands, mouth or other sensory-motor actions).
Looking at people
The child directs his/her eyes towards a human face.
Smiling at people
The child intentionally smiles at a person.
Enjoying with person
The child finds pleasure and satisfaction experiencing a physical or visual contact with a person.
Sintony *
The child shows signs of congruous expressions to affective solicitations, to the other’s mood.
Seeking people Seeking contact with person
Soliciting
Inter-subjective Anticipation of other’s intention
behavior
The child employs spontaneous and intentional movements to reach contact with a person.
The child displays a vocal or tactile action to attract the partner’s attention or to elicit another response.
The child makes anticipatory movements predicting the other’s action.
Communicative gestures
The child displays use of social gestures.
Referential gaze
The child shifts his/her gaze towards the caregiver to look for consultation in a specific situation.
Gaze following gaze
The child shifts his/her gaze to follow the gaze of another person.
Accept Invitation
The child’s behavior is attuned to the person’s solicitation within 3 seconds.
Orienting to name prompt
The child assumes a gaze direction towards the person who calls him/her by the name.
Imitation
The child repeats, after a short delay, another person’s action.
Pointing comprehensive/ declarative/
requestive
The child a) shifts his/her gaze towards the direction pointed by a person; b) points something in order to
share an experience; c) in order to obtain an object.
Maintaining social engagement *
The child takes up an active role within a two-way interaction in order to keep the other person involved.
The child interacts, vocalises and maintains turn taking.
Meaningful Vocalisation
The child intentionally produces sounds with a stable semantic meaning
Caregiver’s Behaviors (N = 8)
Reg-up/down
Regulation up * /down
Touching
Touching
Modulates the child’s arousal and mood, to either excite (reg-up) or calm (reg-down).
Stimulates the child requesting attention by touching him/her.
Vocalization
Vocalizing/naming/behavior request
Stimulates the child requesting attention by vocalizing, naming
Gesturingshowing
Gesturing/showing object
Stimulates the child requesting attention by gesturing or showing him object
doi:10.1371/journal.pone.0022393.t001
(standard situations). For each infant, the sequences were
organized in three periods of 6 months of age (≤6 months;
6 < age ≤ 12 months; >12 months). Sequences were randomly
selected by group and by semester. Preliminary t-test analysis
showed that the chosen video material was comparable across groups
and for each age range, in length and number of standard
situations.
emerge, we tried to overcome these caveats by applying to
previous data [10] new engineering techniques of interaction
analysis focusing on reciprocity and synchrony between BB and
CG. Recently, applying machine learning methods to explore
TD infant and mother behavior during interaction, Messinger
et al. [12] showed that developmental changes were most
evident when the probability of specific behaviors was
examined in specific interactive contexts. The aims of the
current study were to assess early social interactions of infants
with TD, ID and AD taking into account simultaneously: CG
behavior, BB behavior, synchrony of the interaction partners,
and finally, the two directions of interaction (from CG to BB
and from BB to CG). Among others, we hypothesized that (1)
infants with AD should exhibit a growing deviant social
development whereas those with ID should rather show an
initial delay of development; (2) CG of babies with atypical
development should feel very early the initial pathological
process and this feeling could be expressed through atypical/
unusual interactive patterns.
Computer-based coding system (Step 3)
The Observer 4.0® was configured for the application of the
Infant Caregiver Behavior Scale (ICBS) to the video media file-material. The ICBS (Table 1) is composed of 29 items referring to
the ability of the BB to engage in interactions and 8 items
describing CG solicitation or stimulation toward the infant to
obtain his/her attention. All target behaviors were described as events,
which take an instant of time. Caregiver regulation up and caregiver
regulation down were described as events and also as states, which
take a period of time and have a distinct start and end.
Four coders were trained to use the computer-based coding
system until they achieved a satisfactory agreement (Cohen’s Kappa
≥0.7). The standard situations derived from the HM of the three
groups of children (AD, ID and TD) were mixed, and each one
was rated by one trained coder blind to which group it
belonged. For a continuous verification of inter-rater agreement,
25% of standard situations were randomized and rated by two
coders independently. The final inter-rater reliability, calculated
directly by the Observer, showed a satisfactory mean Cohen’s κ
value ranging from 0.75 to 0.77.
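As an illustration of the agreement measure used above, the following minimal sketch computes Cohen's kappa for two coders who labelled the same sequence of events. The function name `cohen_kappa` and the toy labels are hypothetical; in the study the value was computed directly by the Observer software, not by code of this kind.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders rating the same items (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must rate the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy ratings of four events by two coders.
coder_1 = ["vocalizing", "vocalizing", "touching", "touching"]
coder_2 = ["vocalizing", "vocalizing", "touching", "vocalizing"]
kappa = cohen_kappa(coder_1, coder_2)
```

With these toy labels the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5, which would fall below the 0.7 training criterion mentioned above.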
Materials and Methods
General view of the study
The flow diagram of the study is summarized in Figure 1. Forty-two children were randomly selected from the Pisa Home Movie
database, with the following criteria: 15 who would be diagnosed
with AD, 12 with ID and 15 who would develop normally (step 1). All
scenes showing a situation in which social interaction could occur
(i.e. all scenes with an infant and an adult) were extracted and, if
necessary, segmented into short sequences in order to be scored (step
2). CG and BB behaviors were rated independently within each
interaction sequence according to a grid with a specific part for
each partner (step 3). An interaction database was created by
extracting [CG→BB] or [BB→CG] signals occurring "simultaneously", that is within a time window of 3 seconds (step 4). A
computational model based on a Markov assumption was
used to describe the interaction (step 5). Quantitative
statistics were performed to assess and compare the emergence of
interactive patterns by time and by group (step 6). To study these
interactive patterns with an integrative perspective, Non-negative
Matrix Factorization (NMF) was performed (step 7). Steps 1, 2,
and 3 have been described in a previous report where a full
description is available [10]. Here we only summarize them.
Creation of the interaction database (Step 4)
We first created an interaction database (step 4) by extracting all
interactive events defined as sequences of caregiver behavior and
infant behavior co-occurring within a time window of 3 seconds. The
whole interaction database was divided into two sets: (1) CG→BB
interactions, i.e. any child behaviors occurring within the 3 seconds
following any caregiver behavior (including events that occur within
the same second); (2) BB→CG interactions, i.e. any caregiver behaviors
occurring within the 3 seconds following any child behavior (again
including concomitant events). The 3-second window was based on the
available literature on synchrony [11]. Interactive events that
occurred at the same second were integrated in both sets of the
interaction database because it was too difficult to determine who was
primary or secondary in the interaction. Extraction was performed
using a Linux-based script. A sequence of n interactive patterns is
termed an n-gram, as is usual in natural language processing or
gene analysis. In this study, we only focused on bi-gram modeling.
Given the large number of possible types of interaction ([CG item ×
BB item] combinations = 8 × 29), and the low frequency of several
items in the database, we created five CG meta-behaviors (Vocal
solicitation, Touching, Gestural solicitation, Regulation up, Regulation down) and six BB meta-behaviors (Vocalizations, Inter-subjective
behavior, Seeking people, Receptive to people, Orienting toward
people, Behavior with object) by grouping ICBS items. Meta-behaviors
are shown in the left column of Table 1. Then we repeated
the extraction process to obtain, for each standard situation,
all sequences of caregiver meta-behavior and infant meta-behavior
occurring within a time window of 3 seconds.
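The windowed extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the event format `(time in seconds, actor, label)`, the function name `extract_bigrams` and the toy events are hypothetical, and the original extraction was done with a Linux-based script whose details are not given here.

```python
WINDOW_S = 3.0  # size of the interaction window, in seconds


def extract_bigrams(events, first_actor, second_actor, window=WINDOW_S):
    """Return (first label, second label) pairs where the second actor's
    behavior occurs 0 to `window` seconds after the first actor's behavior.

    `events` is a list of (time_s, actor, label) tuples (assumed format).
    Concomitant events (same timestamp) are kept, so they appear in both
    the CG->BB and the BB->CG sets, as described in the text.
    """
    ordered = sorted(events, key=lambda e: e[0])
    pairs = []
    for t1, actor1, label1 in ordered:
        if actor1 != first_actor:
            continue
        for t2, actor2, label2 in ordered:
            if actor2 == second_actor and 0 <= t2 - t1 <= window:
                pairs.append((label1, label2))
    return pairs


# Hypothetical annotated sequence using meta-behavior labels.
events = [
    (0.0, "CG", "Vocal solicitation"),
    (1.0, "BB", "Vocalizations"),
    (1.0, "CG", "Touching"),
    (6.0, "BB", "Behavior with object"),
]
cg_to_bb = extract_bigrams(events, "CG", "BB")
bb_to_cg = extract_bigrams(events, "BB", "CG")
```

In this toy sequence the events at second 1 are concomitant, so the pair they form is counted once in each direction, while the infant behavior at second 6 falls outside every window and produces no bi-gram.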
Participants (Step 1)
The study has been approved by the Ethical Committee of the
Stella Maris Institute/University of Pisa, Italy [13]. The Pisa
Home Movie data base includes three groups of children matched
for gender and socio-economic status, with home movies (HM)
running for a minimum of 10 minutes for each of the first 3
semesters of life. Group 1 includes 15 children (M/F: 10/5) with a
diagnosis of AD without any sign of regression confirmed with the
Autism Diagnostic Interview Revised [13]. Group 2 includes 12
children (M/F: 7/5) diagnosed with ID according to the DSM-IV
criteria and a Childhood Autism Rating Scale (CARS) [14] total
score under 25. The composite IQ score was below 70 for both
AD and ID (Figure 1). Group 3 includes 15 children (M/F: 9/6)
with a history of typical development confirmed by
non-pathological scores on the Child Behavior Check List [15].
Extraction of CG-BB interaction situations (Step 2)
An editor, blind to the children's diagnoses, selected from among the
HM of each child all segments running for at least 40 seconds where the
infant was visible and could be involved in human interaction
Figure 1. Flow diagram of the study. SES = Socio Economic Status; IQ = Intellectual quotient; CARS = Childhood Autism Rating Scale; CBCL = Child
Behavior Check List; SD = Standard Deviation; GLMM = Generalized Linear Mixed Model; *IQ matching only between ID and AD children and based on
Griffiths Mental Developmental Scale or Wechsler Intelligence Scale.
doi:10.1371/journal.pone.0022393.g001
Characterization of infant-caregiver interactive patterns (Step 5)
General principles of the analysis we used to investigate
interactive patterns by group and by time are summarized in
Figure 2. First, we aimed to describe infant-caregiver interaction by
time and by group and to assess the emergence of language and social
engagement by time and by group, as they are core issues of
autism. For each of the two sets of the database (i.e., the two
directions of interaction), assuming a Markovian process, we used
maximum likelihood estimation to estimate, by group and
semester, the probability (relative frequency) of each interactive
pattern or bi-gram (couple of CG and BB items) using
meta-behaviors only (6 × 5 for BB→CG and 5 × 6 for CG→BB).
Grouping all the more frequent (>1%) interactive patterns (or
bi-grams) allows designing Markov chains representing the parent-infant interaction. Markov diagrams were drawn using
Graphviz (see http://www.graphviz.org/).
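The estimation step above can be illustrated with a short sketch: the maximum likelihood estimate of a bi-gram probability is its relative frequency, and the patterns above the 1% threshold can be written as a Graphviz DOT graph. The function names, the toy bi-grams and the exact DOT styling are assumptions made for illustration; they do not reproduce the authors' scripts.

```python
from collections import Counter


def bigram_probabilities(bigrams):
    """Maximum likelihood estimate: relative frequency of each bi-gram
    among all bi-grams of one set (one group, one semester, one direction)."""
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def to_dot(probs, threshold=0.01):
    """Render bi-grams more frequent than `threshold` as a Graphviz digraph."""
    lines = ["digraph interaction {"]
    for (src, dst), p in sorted(probs.items(), key=lambda kv: -kv[1]):
        if p > threshold:
            lines.append(f'  "{src}" -> "{dst}" [label="{p:.1%}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical CG->BB bi-grams for one group and one semester.
bigrams = [("Vocal solicitation", "Vocalizations")] * 3 + [("Touching", "Receptive to people")]
probs = bigram_probabilities(bigrams)
dot_source = to_dot(probs)
```

The resulting `dot_source` string can be saved to a file and rendered with the Graphviz `dot` command-line tool.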
Figure 2. Analysis of parent-infant interaction: general principles. {CG→BB} ensemble of interactive patterns from caregiver (CG) to baby
(BB); {BB→CG} ensemble of interactive patterns from baby (BB) to caregiver (CG); GLMM = Generalized Linear Mixed Model.
Quantitative statistics (Step 6)
Statistical analyses were performed using R Software, Version
2.7 (The R Foundation for Statistical Computing). Analyses were
conducted separately on each of the two sets of the database
(CG→BB and BB→CG). We computed descriptive statistics of
each CG and BB interactive behavior and meta-behavior, by
group and by semester. To assess significant associations by group
and/or by time, we used a generalized linear mixed model
(GLMM). Using this model, we performed a linear regression that
was generalized to the variable distribution (here a quasi-Poisson
distribution) and with a random effect to take into account
patients’ auto-correlations [16]. The distribution of each item
behavior and meta-behavior was studied in order to compute
statistics with GLMM. All BB and CG meta-behaviors, 6 CG items
(Gesturing, Showing object, Vocalizing, Request Behavior,
Naming) and 9 BB items (Orienting to name, Exploring object,
Looking at object, Looking around, Looking at People, Contact
Object, Orienting to People, Simple Vocalizations, Smiling at
People) satisfied a "quasi-Poisson" law. Several other items
occurring with a low frequency were not statistically usable
because their distribution did not satisfy any known law. All BB
and CG items and meta-behaviors following a quasi-Poisson
distribution were included in the model.
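A quasi-Poisson model differs from a Poisson model in that the variance is allowed to be a multiple of the mean. As a rough illustration of that idea (and not of the GLMM fitted in R by the authors), the sketch below computes the variance-to-mean ratio of a set of behavior counts; the function name `dispersion_index` and the toy counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio of event counts.

    A value close to 1 is what a Poisson law predicts; a value clearly
    above 1 indicates over-dispersion, the situation a quasi-Poisson
    model is meant to accommodate.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when all counts are zero")
    return variance(counts) / m  # sample variance (n - 1 denominator)


# Hypothetical counts of one behavior across three home-movie sequences.
regular_counts = [2, 4, 6]
bursty_counts = [0, 0, 12]
regular_ratio = dispersion_index(regular_counts)
bursty_ratio = dispersion_index(bursty_counts)
```

Both toy samples have the same mean, but the second is far more dispersed, which is the kind of difference a dispersion parameter absorbs.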
We conducted two univariate analyses, first with Group as
independent variable for a given semester, and then with Time
(semester) as independent variable within the same group. Then
a multivariate analysis with both Time and Group was performed.
As we knew that (1) AD and ID children would not behave better
in interaction than TD and that (2) interactive behaviors change
with time in pathological and typical children, we used a one-tail
threshold of significance (t = 1.645 for p = 0.05) for each
calculation of p.
Figure 3. Markov diagram of the main early interactive patterns in typically developing children according to time and interaction
direction.
Computational model of infant-caregiver interaction
(Step 7)
Modeling and analyses done with Markov chains and GLMM
provide useful insights into the dynamics and relevance of individual
interactive patterns. In order to study these interactive patterns
with an integrative perspective, we proposed to employ a more
global approach using Non-negative Matrix Factorization (NMF)
[17]. All the m interactive patterns among the n movies have been
grouped into a matrix V.
NMF is an unsupervised feature extraction method involving
the decomposition of a non-negative matrix V (dimension n × m)
into two non-negative matrices W (n × k) and H (k × m) by a
multiplicative updates algorithm:
$$V \approx WH$$
The non-negativity constraints are relevant for the analysis of
human behaviors since they allow only additive, not subtractive,
combinations (part-based representation). The rank k of the
factorization represents the number of latent factors and is usually
chosen such that (n+m)k < nm. The rank k is interpreted as the
number of clusters resulting in groups of interactive behaviors.
Indeed, rows or columns of the decomposed matrices (H and W)
are usually considered to be the membership degree to a cluster.
NMF has been successfully used in various applications including
interpretation of social behaviors [18] and computational biology
[19]. Most of the studies have pointed out important requirements
such as the pre-processing of the data, optimization of the rank of
factorization (the number of clusters) and also the initialization.
Regarding the pre-processing, we used a method usually
employed in document analysis: tf-idf (term frequency-inverse
document frequency) [20]. This approach is based on the fact that
a query term that occurs in many documents may not be
discriminant and consequently should be given less weight than
one that occurs in few documents. In our work, terms refer to
interactive patterns while documents refer to home movies. The
key idea is to give more importance to an interactive pattern in a
given home movie if 1) the interactive behavior appears frequently
in the home movie and 2) the interactive behavior does not appear
frequently in other home movies. For a given interactive behavior
$t_i$ within a movie $d_j$, we estimated the term-frequency $tf_{ij}$:
$$tf_{ij} = \frac{n_{ij}}{\sum_{l} n_{lj}}$$
where $n_{ij}$ is the number of occurrences of the considered
interactive pattern ($t_i$) in the movie $d_j$, and the denominator refers
to the total number of occurrences of all the interactive patterns in the
movie $d_j$.
The inverse document frequency is a measure of the general
importance of the interactive pattern (a measure of informativeness) defined as the logarithm of the ratio of documents (movies) to
the number of documents containing a given term (interactive
pattern):
$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
where $|D|$ is the total number of movies in the database and
$|\{d : t_i \in d\}|$ is the number of movies containing the interaction
pattern $t_i$. Finally, the tf-idf representation is obtained by
multiplying the weights: $(tf\text{-}idf)_{ij} = tf_{ij} \times idf_i$.
The number of clusters is an important issue in the current work
since it will provide insights into the combination of interactive
patterns among groups and semesters. To determine the optimal k
which decomposes the samples into 'meaningful' clusters, we
investigated 'Homogeneity-Separation' since the standard definition of a good clustering is that of 'Homogeneity-Separation':
every element in a cluster must be highly similar (homogeneous) to
the other elements in the same cluster and highly dissimilar
(separation) to elements outside its own cluster.
The stochastic nature of NMF requires strategies to obtain
stable and reliable results that also depend on the initialization
process. In the current work we use a recent method proposed by
Boutsidis and Gallopoulos [21] termed Nonnegative Double
Singular Value Decomposition (NDSVD), which is based on
Singular Value Decomposition (SVD) but with non-negative
constraints. Unlike random approaches, NDSVD guarantees stable
results but not necessarily efficient ones; for this purpose multiple
runs of NDSVD have been carried out.
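The pre-processing and factorization described above can be sketched in a few lines of plain Python. This is a minimal illustration under explicit assumptions: the matrix is stored with one row per movie and one column per interactive pattern, the multiplicative updates are those commonly used for the squared-error objective, and the initialization is random with a fixed seed, which is a simplification swapped in for the SVD-based initialization used in the study. All function names and toy counts are hypothetical.

```python
import math
import random


def tf_idf(counts):
    """counts[j][i] = occurrences of pattern i in movie j (rows = movies)."""
    n_movies, n_patterns = len(counts), len(counts[0])
    # Number of movies in which each pattern occurs at least once.
    doc_freq = [sum(1 for row in counts if row[i] > 0) for i in range(n_patterns)]
    idf = [math.log(n_movies / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for row in counts:
        total = sum(row)  # all pattern occurrences in this movie
        weighted.append([(c / total if total else 0.0) * idf[i] for i, c in enumerate(row)])
    return weighted


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def reconstruction_error(v, w, h):
    """Squared Frobenius distance between V and WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into W (n x k) and H (k x m) by multiplicative updates.
    Random positive initialization (assumption; not the SVD-based scheme)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


def assign_clusters(h):
    """Cluster of each pattern = index of its largest membership degree in H."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]


# Hypothetical counts: 4 movies (rows) x 3 interactive patterns (columns).
toy_counts = [[3, 1, 0], [2, 0, 2], [0, 4, 1], [1, 1, 1]]
v = tf_idf(toy_counts)
w, h = nmf(v, k=1)  # k = 1 satisfies (n + m) k < n m for this 4 x 3 toy matrix
pattern_clusters = assign_clusters(h)
```

On real data, k would be selected by comparing several candidate ranks with a clustering quality criterion, as described in the text, rather than fixed in advance.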
In order to understand the developmental similarity of AD
children towards TD, and ID children towards TD, we calculated
the value of the Normalized Mutual Information (NMI) as
proposed by Strehl and Ghosh [22]. The NMI of two different
clusterings measures the agreement between the two clusterings:
$$NMI(y_1, y_2) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j}^{1,2} \log\left(\frac{n \times n_{i,j}^{1,2}}{n_i^{1} \times n_j^{2}}\right)}{\sqrt{\left(\sum_{i=1}^{k} n_i^{1} \log \frac{n_i^{1}}{n}\right)\left(\sum_{j=1}^{k} n_j^{2} \log \frac{n_j^{2}}{n}\right)}}$$
where $n_i^{1}$ is the number of interactive patterns belonging to cluster
$c_i$ using clustering $y_1$, $n_j^{2}$ is the number of interactive patterns
belonging to cluster $c_j$ using clustering $y_2$, $n_{i,j}^{1,2}$ is the number
of interactive patterns belonging to cluster $c_i$ using clustering $y_1$
and to cluster $c_j$ using $y_2$, and $n$ is the total number of
interactive patterns. One should note that
$NMI(y_1, y_1) = 1$, indicating the same clustering and consequently the same
interactive behaviors.
Results
Early interaction in TD children and significant
developmental changes
Figure 3 summarizes the Markov diagram of all interactive
patterns in TD children (at the meta-behavior level) occurring with
a frequency higher than 1% according to both interaction
direction ([CG→BB] or [BB→CG]) and semester. The diagram
accounts for 93.6% to 96% of the total interaction patterns according
to semester and direction of interaction. When CG starts
interaction, he/she predominantly uses vocal solicitation at all
semesters. BB responds with vocalization (38.6%), being receptive
to people (16%) and with object behaviors (8.9%) during the first
semester (S1). BB responds with vocalization (25.4%), with object
behaviors (18.8%) and being receptive to people (12.4%) during
S2. BB responds with vocalization (24.6%), with object behaviors
(22.9%) and intersubjective behaviors (19.1%) during S3. When
BB starts interaction, he/she preferentially uses vocalizations and being
receptive to people during S1, to which CG answers with
vocalizations (54.8%) and touching (12.1%). During S2, BB uses
behavior with object (28.8%), vocalizations (26.9%), being
receptive to people (17.8%) and intersubjective behaviors
(12.4%). CG answers predominantly with vocal solicitation.
During S3, patterns are similar but BB intersubjective behaviors
(21.9%) are much more frequent than being receptive to people
(7.3%).
For each interaction direction, Figure 4 shows the relative
distribution of meta-behaviors by semester, and summarizes the
GLMM model in TD children. Significant developmental changes
are indicated by an arrow (upward or downward according to a significant
increase or decrease). They are as follows: BB intersubjective
behaviors and seeking people behaviors, both as interaction
initiation [BB→CG] and response [CG→BB], increase from S1 to
S2. The increase continues from S2 to S3 as response [CG→BB]
for BB intersubjective meta-behavior whereas BB seeking people
behaviors decrease (only as response, too). However, during S3,
BB intersubjective behaviors become the second child solicitation
for CG. BB behavior with object becomes the first solicitation from
the BB as soon as S2, and also the first response of the BB at S3.
CG touching behaviors decrease in both directions from S1 to S2,
and from S2 to S3. CG gestural solicitation increases from S1 to
S2. CG vocal solicitation is predominant in all semesters. CG
regulation up/down are very low in TD children during
interactive patterns.
For the meta-behaviors that showed significant changes during
early development, we also tested the corresponding CG/BB
individual items included in the model (see Methods). Significant
results are as follows: BB orienting to name increases (p<0.001)
from S1 to S2 and decreases from S2 to S3 (p<0.001); BB contact
object increases (p<0.05) from S1 to S2; BB exploring object
increases (p<0.001) from S1 to S2 and again from S2 to S3
(p<0.001); BB looking around (p<0.05) and BB smiling at people
(p<0.05) decrease from S2 to S3. CG gesturing increases
(p<0.001) from S1 to S2 and then decreases (p<0.001) from S2
to S3; CG request behavior (p<0.05) and CG naming (p<0.01)
increase from S1 to S2.
Figure 4. Developmental view of meta-behaviors for typical infants. Top: Care-Givers towards Babies/Bottom: Babies towards Care-Givers. S =
Semester; see Table 1 for a brief description of the cited infant’s or care-giver’s behaviors and meta-behaviors. In brackets: % of this behavior within
all interactions of the group in the semester. The arrow indicates behaviors that significantly grow (upward arrow) or decrease (downward arrow) compared with the
previous semester (*p<0.05; **p<0.01; ***p<0.001).
Early interaction in AD and ID infants compared to that in
TD infants
Figure 5 and Figure 6 summarize the significant developmental
changes over time (represented by an arrow) and the significant
differences in the multivariate analysis (by group and by time
comparison) using the GLMM model in AD and ID children,
respectively.
Considering first child behavior, when CG starts interaction
[CG→BB], BB inter-subjective behaviors grow every semester
(p<0.01) whatever the group, but they are lower for ID than TD
(p<0.01) at S1. In contrast, for AD they are lower (p<0.05) globally (all
semesters combined) and tend to be lower (p<0.1) at
S3. When BB starts interaction [BB→CG], BB inter-subjective
behavior is again significantly lower (p<0.05) for ID than TD at
S1. From S1 to S2, unlike for TD, BB inter-subjective behavior
does not increase in either pathological group, but only children
with ID exhibit a significant increase of inter-subjective behavior
from S2 to S3. BB orienting toward people is lower (p<0.05) in
response at S1 for AD than TD. However, it significantly increases
(p<0.01) from S1 to S2 for AD (whereas TD remains stable). Other
BB meta-behaviors (vocalizations, seeking people, being receptive
to people, behavior with object) show no significant differences
between groups.
From a developmental point of view, AD children, unlike TD
children, show a significant increase (p<0.05) of receptive
behaviors from S1 to S2, and conversely, a much smaller increase
of seeking people behaviors (p<0.05) than TD (p<0.001). In
summary, from S1 to S2, AD children become more "open"
(receptive) and interested in an exchange (orienting toward people)
but only in a passive way (not seeking people); moreover at S3, the
decrease of BB receptive behaviors is striking in AD (p<0.01)
whereas this is not significant for TD children.
ID children do not show any increase of BB seeking people over
time but have high rates at S1. Like TD children but unlike AD
children, ID children don’t exhibit significant changes over time
either in BB receptive behaviors or in BB orienting toward people.
Unlike AD and TD children, ID children exhibit a significant
increase of BB behaviors with object from S1 to S2, but in every
semester they remain (non-significantly) below TD and AD.
Considering now CG behavior, CG vocal solicitation is always
higher for parents of TD children, but the difference never reaches
significance, either between groups or over time. CG gestural solicitation is lower at
S1 in the two pathological groups, reaching significance for parents
of ID children only in initiation [CG→BB] (p<0.05) and for
parents of AD children only in response [BB→CG] (p = 0.01).
However, for the three groups it increases significantly from S1 to
S2 in both directions of interaction, except in response for parents of ID
children. CG touching behavior does not change in CG of AD and
ID children from S1 to S2, while it decreases for parents of TD
children (p<0.001). Then from S2 to S3, it decreases in parents of
AD children as it does for parents of TD children. However at S3,
CG touching is higher for parents of AD and ID children
compared with TD children, in initiation [CG→BB] (p<0.05) and
with a tendency (p<0.05 for ID and p<0.1 for AD) in response
[BB→CG]. Finally, CG regulation-up duration is higher for
parents of ID and AD children (p<0.05) at S1. Then it decreases
(p<0.05) from S2 to S3 in all groups. However, at S3, it remains
higher (p<0.05) for parents of AD children.
For item behaviors included in the model (see Methods), all
semesters together (in the multivariate analysis), BB orienting to
name and BB exploring object appear lower in the AD group than
in TD (p<0.01 and p<0.001 respectively). With regard to the ID
group, BB looking at object, BB looking around and CG gesturing
appear lower than in the TD group (p<0.05). BB exploring object,
at S2 and S3, was lower for AD children (p<0.05 and p<0.01
respectively). As for other developmental changes for AD children,
from S1 to S2, unlike for TD, BB orienting toward people and BB
smiling at people increase (p<0.01 and p<0.05 respectively).
From S2 to S3, unlike for TD children, BB exploring object and
BB looking around don’t increase, and BB looking at people
decreases (p<0.05). From S1 to S2, CG touching increases non-significantly (while there is a significant decrease in the TD group:
p<0.001) and from S2 to S3, CG gesturing doesn’t decrease, and
CG naming decreases (p<0.05). For other items, the AD group
follows a development similar to that of typical children.
Figure 5. Developmental view of main interactive behaviors for infants with autism. Top: Care-Givers towards Babies/Bottom: Babies
towards Care-Givers. S = Semester; see Table 1 for a brief description of the cited infant’s or care-giver’s behaviors and meta-behaviors. In brackets: % of
this behavior within all interactions of the group in the semester. The arrow indicates behaviors that significantly grow (upward arrow) or decrease (downward arrow)
compared with the previous semester (*p<0.05; **p<0.01; ***p<0.001). The red color indicates a significant difference when compared with TD:
a behavior in red means that it differs in a group comparison (inside a given semester); an arrow in red means that the progression over time
differs from that of the TD children (meaning the arrow does not have the same direction). Significant p values are given in the text.
Developmental similarity between AD vs TD and ID vs TD
using Non-negative Matrix Factorization
To give a more general view of interactive patterns during
infancy, we also used non-negative matrix factorization. First, we
applied tf-idf (term frequency-inverse document frequency) to
transform the scene annotations into a representation suitable for
the clustering task. The best solutions of behavior signal clustering
for the 'Homogeneity-Separation' method yielded the following
number of clusters according to semester (S1, S2, S3): 11, 14 and 9
for TD; 5, 11, 14 for ID; 12, 8, 10 for AD.
To illustrate the developmental similarity of AD children
towards TD, and ID children towards TD, we calculated
Normalized Mutual Information (NMI) values between the
clustering results of TD/AD at each semester (0.48, 0.44, 0.37
for S1, S2, S3 respectively) and NMI values between the clustering
results of TD/ID at each semester (0.48, 0.50, 0.47 for S1, S2, S3
respectively). Figure 7 shows that NMI values between the
clustering results of TD/AD decrease over time, whereas NMI
values between the clustering results of TD/ID show stability over
time.
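To make the comparison above concrete, the following minimal sketch computes the NMI between two clusterings of the same set of interactive patterns, following the square-root normalization of Strehl and Ghosh. The function name, the cluster labels and the handling of the degenerate single-cluster case are assumptions made for illustration; the values reported in the text were not produced with this code.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_1, labels_2):
    """NMI between two clusterings given as per-pattern cluster labels."""
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("both clusterings must label the same, non-empty set of patterns")
    count_1 = Counter(labels_1)            # cluster sizes in clustering 1
    count_2 = Counter(labels_2)            # cluster sizes in clustering 2
    joint = Counter(zip(labels_1, labels_2))  # co-membership counts
    numerator = sum(
        n_ij * math.log(n * n_ij / (count_1[i] * count_2[j]))
        for (i, j), n_ij in joint.items()
    )
    sum_1 = sum(c * math.log(c / n) for c in count_1.values())
    sum_2 = sum(c * math.log(c / n) for c in count_2.values())
    denominator = math.sqrt(sum_1 * sum_2)
    if denominator == 0:
        # Assumed convention: two trivial one-cluster solutions agree fully,
        # otherwise no information is shared.
        return 1.0 if sum_1 == 0 and sum_2 == 0 else 0.0
    return numerator / denominator


# Hypothetical cluster labels for four interactive patterns in two groups.
clusters_group_a = [0, 0, 1, 1]
clusters_group_b = [1, 1, 0, 0]   # same partition, different label names
clusters_group_c = [0, 1, 0, 1]   # partition unrelated to the first one
same_partition = normalized_mutual_information(clusters_group_a, clusters_group_b)
unrelated = normalized_mutual_information(clusters_group_a, clusters_group_c)
```

The measure depends only on how patterns are grouped, not on the label names, which is why the first pair reaches the maximum value while the second pair shares no information.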
has many advantages. First, it allows to maintain attention on
antecedents and consequences of interactive behaviors; second it
allows to point out significant sequences that could be able to
prompt or inhibit social interaction in a naturalistic and
spontaneous way; third, it could produce insights for treatments
based on parent-infant engagement that are now considered to be
a fundamental part of many types of treatment. We discuss our
results separately with regard to typical and atypical developments
of interactive patterns. Throughout the discussion we put a series
of comparisons with results described in a previous paper on the
same subjects with the objective to demonstrate the added value of
a research on autism using engineering methods which has its
focus on interactive social sequences and not just on simple, or
even complex, behaviors.
Discussion
Summarizing CG«BB interactive patterns in typically
developing babies
As opposed to all previous home movies studies, the use of
engineering methods related to social signal processing allowed
focusing on dynamic parent«infant interaction instead of single
behaviors of the baby or of the parent. The focus on interaction
Among BB behaviors vocalizations are predominant from birth,
and exploring object grows significantly every semester until
behaviors with object become the first BB meta-behavior in the
second year. While seeking people peaks significantly at second
9
194
Figure 6. Developmental view of main interactive behaviors for infants with intellectual disability (ID). Top: Care-Givers towards
Babies/Down: Babies towards Care-Givers. S = Semester; See Table 1 for a brief description of cited infant’s or care-giver’s behaviors and metabehaviors. In brackets: % of this behavior inside the whole interactions of the group in the semester. The arrow indicates behaviors that significantly
grow ( ) or decrease ( ) compared with the previous semester (*p,0.05; **p,0.01; ***p,0.001). The red color indicates a significant difference
when compared with TD: behavior in red color means that it differs in a group comparison (inside a given semester); arrow in red color means that
the progression over time differs from that of the TD children (meaning the arrow has not the same direction). Significant p values of group
comparisons are given in the text.
semester so that as the child becomes gradually more active
(seeking people) and conscious (intersubjective acts) in the
relationship, parents follow suit by reducing their touching behavior,
while maintaining their vocalizations and increasing their gestural communication [30]. Indeed, the literature shows that mothers tailor
their communication to infants’ level of lexical-mapping development [28].
semester compared with next and previous semesters, intersubjective behavior continues to grow significantly over the
semesters. Thus in the second semester, a typical child is rather
seeking and attending to his care-giver and little by little turns to
objects, even inside the interaction (since our ‘‘filter’’ keeps only
behaviors that are included in an interactive dynamic). This
pattern describes the typical development of shared or joint
attention [23,24] and points out how this phenomenon is
entangled with both the simultaneous increase of inter-subjectivity
and with vocalizations.
Also among CG behaviors, vocalizations are predominant from
birth. We can assume that this type of stimulation, which has its
roots in animal communication, is the most powerful way to
strengthen the child's attention and affective communication. This
probably happens thanks to prosodic cues specific to infant-directed
speech [25,26], which are the subject of a parallel paper in which we
proposed a specific technological analysis of motherese [27].
Moreover, vocalizations lay the foundations of language acquisition
along with gestures [28,29]. Indeed, CG gestural solicitations
increase during the first year. In contrast, touching decreases every
What differs in AD and ID developments of interactive
patterns?
While ID infants seem to show an initial delay, they more or less
follow the developmental path of TD infants. Namely, after an
initial delay, their inter-subjective behaviors increase as they do in TD
infants, but one semester later. In the same way, ID children exhibit a
significant increase of behaviors with objects during the first year,
moving to catch up with TD functioning. In contrast, AD
children seem to develop differently. In particular, AD children show
less orienting toward people in the first semester, and thereafter
they exhibit a much smaller increase of seeking people behaviors
than TD children (whose score is multiplied by 4). As already described in a
interaction): this means that interactive moments are sustained
both in AD and in ID by CG Regulation up; TD babies do not
need such a large amount of this CG behavior to express their
sociality. Second, after the first birthday, regulation up remains
significantly higher only for AD. We can hypothesize that while
parents of both AD and ID feel from the first 6 months that their
baby needs to be more stimulated, afterwards only parents of AD
are confronted with a lack of social interest in their baby as he/she
appears to enter into a clearer pathological process in the third
semester. Indeed, AD children showed a lack of interest in people
from the first 6 months, an increase of engagement (even if more
passive) in the second semester, and then, after the first birthday,
also a sharp decline of receptive meta-behaviors. Third, this
special pattern of CG regulation up is associated, in the second
semester, with the fact that parents go on touching their child to
obtain a response (unlike TD children, there’s no decrease of
touching). The pattern composed of higher touching and longer
regulation up still remains present in the second year when parents
become more conscious of the difficulties to obtain a response.
In contrast, parental responses to inter-subjective behaviors do
not differ from those of parents of TD babies. The few differences in
the quantity of CG responses in the first semester can be attributed to
the babies’ lower inter-subjective behaviors, since a parental
response requires a soliciting child. In sum, it seems that, apart from
feeling that their baby needs to be stimulated, parents respond
globally in the same way to their baby when he/she starts an
interaction.
Figure 7. Developmental similarity between intellectual disability (ID) and typical development (TD) (red line) and
between autism disorder (AD) and typical development (blue
line) using Normalized Mutual Information (NMI) after
non-negative matrix factorization (S = semester).
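As an illustration of the similarity measure named in this caption, the following minimal sketch computes the NMI between two labelings of the same items, using the normalization by the geometric mean of the two entropies. How the labelings are derived from the non-negative matrix factorization is not described in this section, so the function name and the input format (two equal-length lists of cluster labels) are assumptions made only for the example.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings of the same items (hypothetical helper).

    Uses NMI = I(A; B) / sqrt(H(A) * H(B)); the input format is an assumption.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Mutual information computed from the joint and marginal counts.
    mi = sum(
        (n_ab / n) * math.log(n * n_ab / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    # Entropies of the two labelings.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling puts everything in one cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)
```

A value close to 1 means that the two groupings carry the same information up to a renaming of the clusters, while a value close to 0 means that they are unrelated.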
previous study [31], during the second semester there is an
increase of orienting toward people and in receptive behaviors,
especially smiling to people. But this increasing pattern, from an
interactive point of view, appears to be passive, and after the first
birthday these receptive behaviors dramatically decrease (note
that receptive behaviors remain stable both in TD and in ID
children). Thus, it seems that the real marker for atypical social
development is the weakness in initiating a social interaction:
without the increase of social initiative the ability to be receptive
and responding to others also becomes more scarce. Moreover,
inter-subjective behaviors, even if globally lower, become
specifically lower after the first birthday.
All these results are consistent with the hypothesis of a growing
deviant development in AD [1] whereas children with ID show
just a delay of social development, as illustrated in figure 7
summarizing the NMI values of non negative matrix factorization.
This deviant development also concerns BB exploring object,
which we did not find significant in the previous paper, whose focus
was on behaviors rather than on the interaction context. Indeed, in the present
study exploring object appears significantly reduced in the AD
group as early as the second half of the first year. This means that
AD babies explore objects less within the early
interactive context, and that, unlike for TD (and ID), exploring
object does not increase for AD after the first birthday. Thus the
child does explore objects, but outside a real social interaction: we
suggest that this pattern could be the expression of an early (and
growing) lack of joint attention in AD. Joint attention is known to
be deficient in older children with autism [32], and early lack of
joint attention is correlated with a poor social interaction [33].
With regard to CG behaviors, there are both differences and
similarities in terms of initiative and response. First of all, caregivers
show longer regulation up interaction and less
gestural solicitation toward their babies. We suppose that gestural solicitation is
reduced because it fails to get a response; as a confirmation, in the
previous paper [10] we described how CG soliciting by name
decreases as a consequence of the reduced orienting to name by AD
babies. On the other hand, the high regulation up has a different
meaning. First, CG Regulation up duration appears higher, in the
first 6 months and in both pathological groups, only in the
interactive context (it was not significant without the filter of
Clinical implication for early detection of autism
Over the past 20 years much attention has been dedicated to
behavioral indicators that would be present very early in life,
certainly in infancy. Nevertheless, prospective studies (such as sibling
studies) and retrospective studies (such as home video studies)
have not yet identified a clear prodrome, that is, a constellation of
unfailing early warning signs indicating the development of a
disease up to the time at which the clinical symptoms fulfill the
required criteria for a diagnosis [3]. Our study adds some general
guidelines that may help to identify a prodrome of
autism.
First, our interaction database (built by extracting all sequences of
caregiver behavior and infant behavior occurring within a time
window of 3 seconds) has provided some significant findings which
are detectable only during parent-infant interaction. Thus, we
propose that the best way to study the emergence of autism should
be based on interaction rather than on the behaviors of each member of
the dyad. Concepts such as synchrony [11], closely-fitting match
[34] and mutual adaptation could provide a great deal of help to
workers in the field of early detection of autism [35].
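The 3-second interaction filter mentioned above can be sketched as follows. This is only a minimal illustration: the event format (onset time in seconds, behavior label), the function name and the symmetric treatment of the two partners are assumptions, since the exact extraction criterion is not detailed in this section.

```python
def interactive_pairs(caregiver_events, infant_events, window=3.0):
    """Return caregiver/infant event pairs whose onsets are at most `window` seconds apart.

    Each event is a hypothetical (onset_seconds, label) tuple.
    """
    pairs = []
    for cg_time, cg_label in caregiver_events:
        for bb_time, bb_label in infant_events:
            # Keep the pair only if both behaviors fall in the same time window.
            if abs(cg_time - bb_time) <= window:
                pairs.append(((cg_time, cg_label), (bb_time, bb_label)))
    return pairs


# Illustrative (made-up) annotations for one scene.
caregiver = [(1.0, "vocalization"), (10.0, "touching")]
infant = [(2.5, "orienting to people"), (20.0, "smiling")]
pairs = interactive_pairs(caregiver, infant)
```

With these made-up annotations only the first caregiver vocalization and the first infant behavior are kept, since all other onsets are more than 3 seconds apart.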
Second, our study shows a course of autism characterized by a
decreasing atypical pattern in the second semester of life and
afterwards an increasing loss of contact. This pattern, which we have
named ‘fluctuating type of onset’ [36], does not seem unusual in
non-regressive autism, as in our sample. This finding could be of
seminal importance both for identifying the right windows
in screening programs (first six months of life or after the first
birthday) and for implementing timely, effective parent-infant
training in what appears to be a sensitive period, the second semester of
life.
Third, we can confirm that much credence should be given to
parents when they entrust their concerns to professionals (as shown
by retrospective parental questionnaires [37,38]). Moreover our
research shows that listening to parents can be supplemented by some
specific questions and/or observations about the hyper-stimulating
style of parents' interaction with their baby; in fact, we suggest
that this particular attitude betrays the presence of an under-active
baby (lack of initiative, inability to provoke or to anticipate others’
aims, hypo-activity) who needs to be stimulated. Thus through
this pattern of interaction parents seem to feel very early that
something is wrong with their baby, long before diagnosis.
Moreover, even if the BB intergroup differences do not reach
significance and are therefore not detectable by a stranger (e.g. the
pediatrician), some dynamic changes, like the significant longitudinal decrease of ‘‘receptive’’ meta-behavior after the first birthday,
should presumably be detectable by the child’s relatives.
recently developed an algorithmic tool to assess motherese in
home movies [27].
We conclude that using engineering methods to study social
interaction in home movies has improved our understanding of
early interactions. We can assume that, even if most BB behavior
intergroup differences do not reach statistical significance and
are therefore not detectable by a stranger [10], some interactive/dynamic
changes should be detectable by the child’s relatives. Here, the
results suggest that deviant autistic behaviors appear before 18
months when interactive patterns are studied. Furthermore, parents
of AD and ID children feel (consciously or not) the lack of
interactive initiative and responsiveness of their babies and try to
supply increasingly more soliciting behaviors. Thus we stress that
credence should be given to parents’ feelings, as they recognize,
long before diagnosis, the pathological process through the
interactive pattern with their child. These findings could help
early identification of AD by encouraging professionals to pay
more attention to parents’ concerns and ways of coping with their
child.
Limits of this study
The first limitation is the sample size. As we used rigorous
statistical methods taking into account the random subject effect
and autocorrelation, we did not always obtain an analyzable,
known distribution, and as scenes were very variable for a given
infant, some strong
tendencies did not reach statistical significance; a larger sample
would probably have yielded more analyzable and/or
significant results. Second, the analysis currently performed with
our interactive filter highlighted the interactive dynamics without
specifying the part played by each partner in the interaction. This
would require additional analysis (e.g. response rate to a given
stimulation) to determine this with accuracy and probably a larger
sample. And last, only behavioral aspects of the stimulations were
taken into account here, but qualitative emotional investment
should be assessed as well, for example with the analysis of prosody
(e.g., motherese); further research will focus on this question as we
Author Contributions
Conceived and designed the experiments: DC MC FM MCL SM.
Analyzed the data: CSG AM MC DC. Contributed reagents/materials/
analysis tools: DC FM CSG MC. Wrote the paper: CSG AM MC RC
MCL FA PM SM FM DC . Performed the clinical experiments: FA PM
SM. Performed the computational experiments: AM CSG RC. Wrote the
first draft of the manuscript: DC FM CSG MC AM.
References
1. American Psychiatric Association (1994) DSM-IV. APA Press: Washington DC.
2. Zwaigenbaum L, Bryson S, Lord C, Rogers S, Carter A, et al. (2009) Clinical
assessment and management of toddlers with suspected autism spectrum
disorder: insights from studies of high-risk infants. Pediatrics 123: 1383–1391.
3. Yirmiya N, Charman T (2010) The prodrome of autism: early behavioral and
biological signs, regression, peri- and post-natal development and genetics.
J Child Psychol Psychiatry 51: 432–458.
4. Palomo R, Belinchon M, Ozonoff S (2006) Autism and family home movies: a
comprehensive review. J Dev Behav Pediatr 27: S59–68.
5. Saint-Georges C, Cassel RS, Cohen D, Chetouani M, Laznik M-C, et al. (2010)
What Studies of Family Home Movies Can Teach Us about Autistic Infants: A
Literature Review. Research in Autism Spectrum Disorders 4: 355–366.
6. Landa R, Garrett-Mayer E (2006) Development in infants with autism spectrum
disorders: a prospective study. J Child Psychol Psychiatry 47: 629–638.
7. Landa RJ, Holman KC, Garrett-Mayer E (2007) Social and communication
development in toddlers with early and later diagnosis of autism spectrum
disorders. Arch Gen Psychiatry 64: 853–864.
8. Ozonoff S, Iosif AM, Baguio F, Cook IC, Hill MM, et al. (2010) A prospective
study of the emergence of early behavioral signs of autism. J Am Acad Child
Adolesc Psychiatry 49: 256–266. e251-252.
9. Zwaigenbaum L, Bryson S, Rogers T, Roberts W, Brian J, et al. (2005)
Behavioral manifestations of autism in the first year of life. Int J Dev Neurosci
23: 143–152.
10. Muratori F, Apicella F, Muratori P, Maestro S (2010) Intersubjective disruptions
and caregiver-infant interaction in early Autistic Disorder. Research in Autism
Spectrum Disorders 5: 408–417.
11. Feldman R (2007) Parent–infant synchrony and the construction of shared
timing; physiological precursors, developmental outcomes, and risk conditions.
Journal of Child Psychology and Psychiatry 48: 329–354.
12. Messinger D, Ruvolo P, Ekas N, Fogel A Applying machine learning to infant
interaction: The development is in the details. Neural Networks in press.
13. Rutter M, Le Couteur A, Lord C, eds (2003) ADI-R: the autism diagnostic
interview-revised. Los Angeles, CA: Western Psychological Services.
14. Schopler E, Reichler RJ, Renner BR, eds (1988) The Childhood Autism Rating
Scale. Los Angeles: Western Psychological Services.
15. Achenbach TM, Rescorla L, eds (2000) Manual for the ASEBA Preschool Forms
and Profile. Burlington, VT: ASEBA.
16. Cnann A, Laird N, Slasor P (1997) Using the general linear mixed model to
analyse unbalanced repeated measures and longitudinal data. Med 16:
2349–2380.
17. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix
factorization. Nature. pp 788–791.
18. Wu ZL, Cheng CW, Li CH (2008) Social and semantics analysis via nonnegative matrix factorization. 2008; Beijing. pp 1245–1246.
19. Devarajan K (2008) Nonnegative Matrix Factorization: An Analytical and
Interpretive Tool in Computational Biology. PLoS Comput Biol 4: e1000029.
20. Salton G, Buckley C (1988) Term-weighting approaches in automatic text
retrieval. Information Processing and Management 5: 513–523.
21. Boutsidis C, Gallopoulos E (2008) SVD based initialization: A head start for
nonnegative matrix factorization. Pattern Recognition 41: 1350 – 1362.
22. Strehl A, Ghosh J (2002) Cluster ensembles - a knowledge reuse framework for
combining multiple partitions. Machine Learning Research. pp 583–617.
23. Striano T, Stahl D (2005) Sensitivity to triadic attention in early infancy. Dev Sci
8: 333–343.
24. Kawai M, Namba K, Yato Y, Negayama K, Sogon S, et al. (2010)
Developmental Trends in Mother-Infant Interaction from 4-Months to 42-Months: Using an Observation Technique. Journal of Epidemiology 20:
S427–S434.
25. Kuhl PK (2001) Speech, language, and developmental change; Lacerda F,
VonHofsten C, Heimann M, editors. Mahwah: Lawrence Erlbaum Assoc Publ.
pp 111–133.
26. Fernald A, Kuhl P (1987) Acoustic determinants of infant preference for
motherese speech. Infant Behavior and Development 10: 279–293.
27. Mahdhaoui A, Chetouani M, Zong C, Cassel RS, Saint-Georges C, et al. (2011)
Automatic Motherese Detection for Face-to-Face Interaction Analysis. International Journal of Methods in Psychiatric Research 20(1): e6–e18.
28. Gogate LJ, Bahrick LE, Watson JD (2000) A study of multimodal motherese:
The role of temporal synchrony between verbal labels and gestures. Child
Development 71: 878–894.
29. McGregor KK (2008) Gesture supports children’s word learning. International
Journal of Speech-Language Pathology 10: 112–117.
30. Brand RJ, Shallcross WL, Sabatos MG, Massie KP (2007) Fine-grained analysis
of motionese: Eye gaze, object exchanges, and action units in infant- versus adult-directed action. Infancy 11: 203–214.
31. Maestro S, Muratori F, Cavallaro MC, Pecini C, Cesari A, et al. (2005) How
young children treat objects and people: an empirical study of the first year of life
in autism. Child Psychiatry Hum Dev 35: 383–396.
32. Colombi C, Liebal K, Tomasello M, Young G, Warneken F, et al. (2009)
Examining correlates of cooperation in autism: Imitation, joint attention, and
understanding intentions. Autism 13: 143–163.
33. Girardot AM, De Martino S, Rey V, Poinso F (2009) Étude des relations entre
l’imitation, l’interaction sociale et l’attention conjointe chez les enfants autistes.
Neuropsychiatrie de l’Enfance et de l’Adolescence 57: 267–274.
34. Yirmiya N, Gamliel I, Pilowsky T, Feldman R, Baron-Cohen S, et al. (2006) The
development of siblings of children with autism at 4 and 14 months: social
engagement, communication, and cognition. Journal of Child Psychology and
Psychiatry 47: 511–523.
35. Trevarthen C, Daniel S (2005) Disorganized rhythm and synchrony: Early signs
of autism and Rett syndrome. Brain & Development 27: S25–S34.
36. Maestro S, Muratori F, Barbieri F, Casella C, Cattaneo V, et al. (2001) Early
behavioral development in autistic children: the first 2 years of life through home
movies. Psychopathology 34: 147–152.
37. Young LR, Brewer N, Pattison C (2003) Parental Identification of Early
Behavioural Abnormalities in Children with Autistic Disorder. Autism 7:
125–143.
38. De Giacomo A, Fombonne E (1998) Parental recognition of developmental
abnormalities in autism. Eur Child Adolesc Psychiatry 7: 131–136.
ICRA 2009
Work carried out during the eNTERFACE'08 summer school project "MultiModal Communication with Virtual Agents and Robots", of which I was the principal investigator.
2009 IEEE International Conference on Robotics and Automation
Kobe International Conference Center
Kobe, Japan, May 12-17, 2009
Generating Robot/Agent Backchannels During a Storytelling
Experiment
S. Al Moubayed, M. Baklouti, M. Chetouani, T. Dutoit, A. Mahdhaoui,
J.-C. Martin, S. Ondas, C. Pelachaud, J. Urbain, M. Yilmaz
Abstract: This work presents the development of a real-time framework for research on the Multimodal Feedback
of Robots/Talking Agents in the context of Human Robot
Interaction (HRI) and Human Computer Interaction (HCI). To
evaluate the framework, a multimodal corpus was built (eNTERFACE STEAD), and the important multimodal
features were studied in order to build an active Robot/Agent listener for
a storytelling experience with humans. The experiments show
that even when the same reactive behavior models are built for
Robots and Talking Agents, the interpretation and the realization
of the communicated behavior differ because of the different
communicative channels that Robots/Agents offer: physical but
less human-like in Robots, and virtual but more expressive and
human-like in Talking Agents.
I. INTRODUCTION
In recent years, several methods have been proposed
to improve the interaction between humans
and talking agents or robots. The key idea of their design
is to develop agents/robots with various capabilities: establish/maintain interaction, show/perceive emotions, dialog,
display communicative gesture and gaze, exhibit distinctive
personality or learn/develop social capabilities [1], [2]. These social agents and robots aim at interacting naturally with
humans by exploiting these capabilities. In this paper,
we investigate one aspect of this social interaction:
engagement in the conversation [3]. The engagement
process makes it possible to regulate the interaction between
the human and the agent or the robot. This process is
obviously multi-modal (verbal and non-verbal) and requires
the involvement of both partners.
This paper deals with two different interaction types
namely Human-Robot Interaction (HRI) with the Sony AIBO
robot and Human-Computer Interaction (HCI) with an Embodied Conversational Agent (ECA). The term ECA has
S. Al Moubayed is with the Center for Speech Technology, Royal Institute of Technology (KTH), Sweden.
M. Baklouti is with Thalès, France.
M. Chetouani and A. Mahdhaoui are with the University Pierre and Marie Curie, France.
T. Dutoit and J. Urbain are with the Faculté Polytechnique de Mons, Belgium.
J.-C. Martin is with LIMSI, France.
S. Ondas is with the Technical University of Kosice, Slovakia.
C. Pelachaud is with INRIA, France.
M. Yilmaz is with Koc University, Turkey.
978-1-4244-2789-5/09/$25.00 ©2009 IEEE
been coined in Cassell et al. [4] and refers to human-like virtual characters that typically engage in face-to-face
communication with the human user. We have used GRETA
[5], an ECA whose interface obeys the SAIBA (Situation,
Agent, Intention, Behavior, Animation) architecture [6]. We
focused on the design of an open-source, real-time software
platform for designing the feedbacks provided by the robot
and the humanoid during the interaction¹. The multimodal
feedback problem we considered here was limited to facial
and neck movements by the agent (while the AIBO robot
uses all possible body movements, given its poor facial
expressivity): we did not pay attention to arm or body
gestures.
This paper is organized as follows. In section II, we
present the storytelling experiment used for the design of our
human robot/agent interaction system described in section
III. Section IV focuses on speech and face analysis modules
we have developed. We then give in sections V and VI a
description of the multi-modal generation of backchannels
including interpretation of communicative signals and the
implemented reactive behaviors of the agent and the robot.
Finally, section VII presents the details of the evaluation and
comparison in our HCI and HRI systems.
II. FACE-TO-FACE STORYTELLING EXPERIMENT
A. Data collection
In order to model the interaction between the speaker and
the listener during a storytelling experiment, we first recorded
and annotated a database of human-human interaction termed
eNTERFACE STEAD. This database was used for extracting
feedback rules (section II-B) but also for testing the multimodal feature extraction system (section IV).
We followed the McNeill lab framework [7]: one participant (the speaker), who has previously watched an animated
cartoon (Sylvester and Tweety), immediately retells the story to a listener.
The narration is accompanied by spontaneous
communicative signals (filled pauses, gestures, facial expressions...). 22 storytelling sessions were videotaped under different conditions: 4 languages (Arabic, French, Turkish and
Slovak). The videos have been annotated (by at least two
annotators per session) to describe simple communicative
signals of both speaker and listener: smile, head nod, head
shake, eyebrow movement and acoustic prominence.
1 The database and the source code for the software developed during the project are available online from the eNTERFACE08 web site:
www.enterface.net/enterface08.
TABLE I
AGREEMENT AMONG ANNOTATORS
Track name          Agreement (%)
Speaker Face        89.3
Speaker Acoustic    84.5
Listener Face       77.96
Listener Acoustic   95.97
Manual annotations of the videos were evaluated by computing inter-annotator agreement using the corrected kappa [8] computed in
the Anvil tool [9]. Table I presents the agreement among
annotators for each track. The best agreement
is obtained for the Listener Acoustic track, which is expected
since the listener is not supposed to speak and, when he/she
does, only simple sounds are produced (filled pauses). Other tracks,
such as Speaker Acoustic, have a lower agreement: the
speaker speaks throughout the session, so prominent
events are less identifiable. However, the agreement measures are high enough to allow us to assume that the selected
communicative signals might be reliably detected.
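To make the notion of chance-corrected agreement concrete, the sketch below computes, for two annotators labeling the same segments, the raw agreement, Cohen's kappa, and a kappa corrected with a uniform chance level of 1/k over the k categories. Treating the "corrected kappa" of [8] as this uniform-chance variant is an assumption made for the example, as are the function name and the label sequences; the values reported in Table I come from Anvil, not from this code.

```python
from collections import Counter


def agreement_scores(ann_a, ann_b):
    """Return (raw agreement, Cohen's kappa, uniform-chance corrected kappa).

    Hypothetical helper: both inputs are equal-length lists of category labels.
    """
    n = len(ann_a)
    if n == 0 or n != len(ann_b):
        raise ValueError("annotations must be non-empty and of equal length")
    categories = set(ann_a) | set(ann_b)
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are needed")
    # Observed proportion of segments on which the annotators agree.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    cohen = (p_o - p_e) / (1.0 - p_e)
    # Assumed "corrected" variant: chance level fixed to 1/k.
    corrected = (p_o - 1.0 / k) / (1.0 - 1.0 / k)
    return p_o, cohen, corrected


# Made-up annotations of four segments by two annotators.
scores = agreement_scores(["nod", "nod", "none", "none"],
                          ["nod", "none", "none", "none"])
```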
B. Extracting rules from data
Based on the selected communicative signals, we have
defined some rules to trigger feedbacks. The rules are based
on [10], [11], which mainly involved mono-modal
signals. The structure of such rules is as follows:
If some signal (e.g. head nod, pause or pitch accent) is
received, then the listener sends some feedback signal with
probability X.
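A rule of this form can be sketched as a lookup followed by a random draw. The dictionary layout, the function name and the probabilities below are hypothetical and only illustrate the structure "signal received, feedback sent with probability X".

```python
import random


def maybe_send_feedback(signal, rules, rng=random):
    """Apply a rule 'signal -> (feedback, probability)' (hypothetical format).

    Returns the feedback name if the rule fires, otherwise None.
    """
    if signal not in rules:
        return None
    feedback, probability = rules[signal]
    # rng.random() is uniform in [0, 1), so the rule fires with the given probability.
    return feedback if rng.random() < probability else None


# Illustrative rules with made-up probabilities.
example_rules = {
    "head nod": ("head nod", 1.0),
    "pause": ("smile", 0.0),
}
```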
We have extended these rules by analyzing the annotated data from our storytelling database. We looked at the correlation
not only between the speaker's mono-modal signals and the listener's
feedback, but also between the speaker's
multi-modal signals and the feedback. We define multi-modal
signals as any set of overlapping signals that are emitted by
the speaker.
For each mono-modal (resp. multi-modal) signal emitted
by the speaker, we count its number of occurrences.
Within the time window of each speaker's signal, we look
for co-occurring listener's signals. We compute the correlation
of occurrence between each speaker's signal and each listener's signal. This computation gives us a correlation matrix
between the speaker's and the listener's signals. This matrix can be
interpreted as follows: given a speaker's signal, it gives the probability that
the listener would send a given signal. In our system we
use this matrix to select the listener's feedback signals. When a
speaker's signal is detected, we choose from the correlation
matrix the signal (i.e. feedback) with the highest probability.
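The selection procedure described above can be sketched as follows. The exact correlation measure is not given in the text, so this example uses a simple stand-in: the fraction of occurrences of a speaker signal during which a given listener signal was observed. The annotation formats and function names are assumptions.

```python
from collections import defaultdict


def cooccurrence_table(speaker_signals, listener_signals):
    """Estimate, for each speaker signal, how often each listener signal co-occurs.

    speaker_signals: list of (label, start, end); listener_signals: list of
    (label, time). Both formats are hypothetical.
    """
    occurrences = defaultdict(int)
    joint = defaultdict(lambda: defaultdict(int))
    for s_label, start, end in speaker_signals:
        occurrences[s_label] += 1
        # Listener labels observed inside the time window of this speaker signal.
        seen = {l_label for l_label, t in listener_signals if start <= t <= end}
        for l_label in seen:
            joint[s_label][l_label] += 1
    return {
        s: {l: c / occurrences[s] for l, c in joint[s].items()}
        for s in occurrences
    }


def best_feedback(table, speaker_label):
    """Pick the listener signal with the highest estimated probability, if any."""
    row = table.get(speaker_label, {})
    return max(row, key=row.get) if row else None


# Made-up annotations: two speaker smiles and one pause.
speaker = [("smile", 0.0, 2.0), ("smile", 5.0, 7.0), ("pause", 10.0, 11.0)]
listener = [("head nod", 1.0), ("smile", 1.5), ("head nod", 6.0)]
table = cooccurrence_table(speaker, listener)
```

In this toy example the listener nods during both speaker smiles but smiles back only once, so a head nod is selected as feedback to a speaker smile.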
From this process, we identified a set of rules², among
them:
• Mono-modal signal ⇒ mono-modal feedback: head nod is received, then the listener sends head nod medium.
• Mono-modal signal ⇒ multi-modal feedback: smile is received, then the listener sends head nod and smile.
• Multi-modal signal ⇒ mono-modal feedback: head activity high and pitch prominence are received, then the listener sends head nod fast.
• Multi-modal signal ⇒ multi-modal feedback: pitch prominence and smile are received, then the listener sends head nod and smile.
2 A complete list can be found at: http://www.enterface.net/enterface08
These rules are implemented in our system in order to trigger
feedbacks; the multi-modal fusion module is responsible for
activating them (section V).
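The four example rules can be written as a small table that maps a set of speaker signals to a tuple of feedback signals. The sketch below is only illustrative: the data layout, the function name and, in particular, the choice to prefer the rule with the largest matching condition are assumptions and do not describe the actual fusion module.

```python
# Each rule: (set of speaker signals that must all be active, feedback signals).
RULES = [
    (frozenset({"head nod"}), ("head nod medium",)),
    (frozenset({"smile"}), ("head nod", "smile")),
    (frozenset({"head activity high", "pitch prominence"}), ("head nod fast",)),
    (frozenset({"pitch prominence", "smile"}), ("head nod", "smile")),
]


def select_feedback(active_signals, rules=RULES):
    """Return the feedback of the most specific rule matched by the active signals."""
    active = set(active_signals)
    matching = [(cond, fb) for cond, fb in rules if cond <= active]
    if not matching:
        return ()
    # Assumption: a multi-modal rule takes precedence over a mono-modal one.
    _, feedback = max(matching, key=lambda rule: len(rule[0]))
    return feedback
```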
III. SYSTEM DESIGN
Although human beings are all perfectly able to provide
natural feedback to a speaker telling a story, explaining
how and when they do it is a complex problem. ECAs are
increasingly used in this context to study and model human-human communication as well as to perform specific
automatic communication tasks with humans.
Examples are REA [12], an early system that realizes the
full action-reaction cycle of communication by interpreting
multimodal user input and generating multimodal agent behavior. Gandalf [13] provides real-time feedback to a human
user based on acoustic and visual analysis. In robotics,
various models have been proposed for the integration of
feedbacks during interaction [2]. Recently, the importance
of feedbacks for discourse adaptation has been highlighted
during an interaction with BIRON [14].
In a conversation, all interactants are active. Listeners provide the speaker with information about their view of, and engagement
in, the conversation. By sending acoustic or visual feedback
signals, listeners show whether they are paying attention, understanding or agreeing with what is being said. Taxonomies
of feedbacks, based on their meanings, have been proposed
[15], [16]. The key idea of this project is to automatically
detect the communicative signals in order to produce a
feedback. Contrary to the approach proposed in [14], we
focus on non-linguistic features (prosody, prominence) as well as
on head features (activity, shake, nod).
Our system is based on the architecture proposed by [5],
but progressively adapted to the context of storytelling
(figure 1). We developed several modules for the detection
and the fusion of the communicative signals from both audio
and video analysis. If these communicative signals match our
pre-defined rules, a feedback is triggered by the Realtime
BackChannelling module, resulting in two different messages
(described in section VI) conveying the same meaning.
IV. MULTI-MODAL FEATURE EXTRACTION
A. Speech Analysis
The main goal of the speech analysis component is to
extract features from the speech signal that have been previously identified as key moments for triggering feedbacks
(cf. section II). In this study, we do not use any linguistic
information to analyze the meaning of the utterances being
3750
201
Speech Features
Extractor
definitions are based on linguistic and/or phonetic units.
We propose, in this paper, another approach using statistical models for the detection of prominence. The key idea is to
assume that a prominent sound stands out from the previous
message. For instance, during our storytelling experiment,
speakers emphasize words, syllables when they want to focus
the attention of the listener on important information. These
emphasized segments are assumed to stand out from the
overall ones, which makes them salient.
Prominent detectors are usually based on acoustic parameters (fundamental frequency, energy, duration, spectral intensity) and machine learning techniques (Gaussian Mixture
Models, Conditional Random Fields)[23], [24]. Unsupervised methods have been also investigated such as the use of
Kullback-Leibler (KL) divergence as a measure of discrimination between prominent and non-prominent classes [25].
These statistical methods provide an unsupervised framework
adapted to our task. The KL divergence needs the estimation
of two covariance matrices (Gaussian assumption):
Face Feature
Extraction
(event,time)
(event,time)
Multimodal Fusion
Realtime BackChanneling
feedback
signal
AIBO BML
tag
GRETA BML
tag
+(µi −
Fig. 1.
Architecture of our interaction feedback model.
told by the speaker, but we focus on the prosodic cross-language features which may participate in the generation of the feedback by the listener.
1) Feature Extraction: Previous studies have shown that pitch movements, especially at the end of utterances, play an important role in turn taking and backchannelling during human dialogue [10]. In this work we use the following features extracted from the speaker's speech signal: Utterance beginning, Utterance end, Raising pitch, Falling pitch, Connection pitch, and Pitch prominence (cf. section II).
To extract utterance beginnings and endings, we developed a real-time implementation of a Voice Activity Detector (VAD), adapted from the SPHINX Vader functionality [18]. To extract pitch movements, we used a real-time implementation of the YIN fundamental frequency tracking algorithm [19]. We compensate for outliers and octave jumps of the F0 with a median filter of size 5 (60 msec). After extracting the pitch, the TILT model [20] is used to extract Raising pitch, Falling pitch and Connection pitch.
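The F0 smoothing step can be illustrated with a minimal sketch. It assumes the F0 track is a plain list of values in Hz, one per analysis frame; the function name `median_filter` and the sample values are hypothetical, and the shrinking window at the edges is a choice made here, not something stated in the paper.

```python
from statistics import median

def median_filter(f0, size=5):
    """Sliding median over an F0 track (Hz); the window shrinks at the edges."""
    half = size // 2
    smoothed = []
    for k in range(len(f0)):
        window = f0[max(0, k - half): k + half + 1]
        smoothed.append(median(window))
    return smoothed

# Hypothetical track with one octave jump up (240 Hz) and one down (61 Hz).
f0_track = [120.0, 121.0, 240.0, 122.0, 123.0, 61.0, 124.0, 125.0]
smoothed = median_filter(f0_track)
```

With a window of 5 frames, an isolated octave error is replaced by a neighbouring plausible value, while slow pitch movements are preserved for the subsequent TILT analysis.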
These algorithms are packaged for PureData (PD) [17], a graphical programming environment for real-time audio processing. PD is used as the audio provider, with a 16 kHz sampling rate. The package sends the id of a feature to the multi-modal fusion module whenever that feature is detected in the speech signal.
2) Pitch Prominence Detection: In the literature, several definitions of acoustically prominent events can be found, showing the diversity of this notion [21], [22]. Terken [22] defines prominence as words or syllables that are perceived as standing out from their environment. Most of the proposed
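$$KL_{ij} = \frac{1}{2}\left[\log\frac{|\Sigma_j|}{|\Sigma_i|} + \mathrm{tr}\left(\Sigma_i \Sigma_j^{-1}\right) + (\mu_i - \mu_j)^T \Sigma_j^{-1} (\mu_i - \mu_j) - d\right] \tag{1}$$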
where µi, µj and Σi, Σj denote the means and the covariance matrices of the i-th (past) and j-th (new event) speech segments, respectively, and d is the dimension of the speech feature vector. An event j is defined as prominent if its distance from the past segments (represented by segment i) is larger than a pre-defined threshold.
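As an illustration only, the KL-based test can be sketched for the one-dimensional case (d = 1, for example F0 alone), where the covariance matrices reduce to variances. The function names, the sample values and the threshold are assumptions introduced here; the paper does not give a threshold value.

```python
import math
from statistics import mean, pvariance

def gaussian_kl_1d(seg_i, seg_j):
    """KL divergence between Gaussians fitted to a past segment i and a new
    segment j, for d = 1. Both segments must have non-zero variance."""
    mu_i, mu_j = mean(seg_i), mean(seg_j)
    var_i, var_j = pvariance(seg_i), pvariance(seg_j)
    return 0.5 * (math.log(var_j / var_i)
                  + var_i / var_j
                  + (mu_i - mu_j) ** 2 / var_j
                  - 1.0)

def is_prominent(seg_i, seg_j, threshold=1.0):
    # The threshold value is hypothetical and would need tuning on data.
    return gaussian_kl_1d(seg_i, seg_j) > threshold

past = [118.0, 120.0, 122.0, 119.0, 121.0]   # hypothetical F0 values (Hz)
new_event = [150.0, 155.0, 160.0]
```

Two identical segments give a divergence of zero, and the value grows as the new segment moves away from the past one in mean or in spread.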
One major drawback of the KL divergence approach is that, since the new event is usually shorter than the past events, the estimation of its covariance matrix is less reliable. In addition, duration is well known to be an important perceptual cue for discriminating between sounds. Taking these points into account, we propose to use another statistical test, namely the Hotelling T2 distance, defined by:
$$H_{ij} = \frac{L_i L_j}{L_i + L_j}\,(\mu_i - \mu_j)^T \Sigma_{i \cup j}^{-1} (\mu_i - \mu_j) \tag{2}$$
where i ∪ j is the union of the i-th (past) and j-th (new event) segments, and Li and Lj denote the lengths of the segments. The Hotelling T2 distance is closely related to the Mahalanobis distance.
In this work only the fundamental frequency (F0) is used as a feature to calculate the Hotelling distance between two successive voiced segments. A prominence is thus detected when the Hotelling distance between the current and the preceding Gaussian distributions of F0 is higher than a threshold. The decision uses a distance threshold that decays over time, which adapts the detector to the speaker. Since we estimate a statistical model of the pitch for each voiced segment, we only estimate it when the segment contains enough pitch samples, i.e. when it lasts at least 175 msec.
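The following sketch shows one possible reading of this detector for scalar F0. The Hotelling distance follows equation (2) with the covariance of the union reduced to a variance. The decay rule, the threshold floor, the class name `ProminenceDetector` and the default of 15 frames (about 175 msec if frames are 12 msec apart, which is itself inferred from "size 5 (60 msec)") are all assumptions made for illustration, since the paper does not specify them.

```python
from statistics import mean, pvariance

def hotelling_1d(seg_i, seg_j):
    """Hotelling T2 distance of equation (2) for a scalar feature."""
    li, lj = len(seg_i), len(seg_j)
    var_union = pvariance(seg_i + seg_j)   # variance over the union i ∪ j
    if var_union == 0.0:
        return 0.0
    return (li * lj) / (li + lj) * (mean(seg_i) - mean(seg_j)) ** 2 / var_union

class ProminenceDetector:
    """Compares each voiced segment with the preceding one (hypothetical design)."""

    def __init__(self, floor=2.0, decay=0.9, min_samples=15):
        self.floor = floor              # lowest threshold allowed (assumed)
        self.decay = decay              # multiplicative decay per segment (assumed)
        self.min_samples = min_samples  # segments shorter than this are skipped
        self.threshold = floor
        self.prev = None

    def update(self, segment):
        if len(segment) < self.min_samples:
            return False                # too short to estimate a pitch model
        if self.prev is None:
            self.prev = segment
            return False
        h = hotelling_1d(self.prev, segment)
        prominent = h > self.threshold
        if prominent:
            self.threshold = h          # adapt to this speaker's range
        else:
            self.threshold = max(self.floor, self.threshold * self.decay)
        self.prev = segment
        return prominent
```

Because the factor Li Lj / (Li + Lj) grows with segment length, a short new event contributes less evidence than a long one, which is the motivation given above for preferring this distance to the KL divergence.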
B. Face Analysis
The main goal of the face analysis component (figure 2)
is to provide the feedback system with some knowledge of
communicative signals conveyed by the head of the speaker.
More specifically, we are mainly interested in detecting whether the speaker is shaking the head, smiling or showing a neutral expression. The components of this
module are responsible for face detection, head shake and
nod detection, mouth extraction, and head activity analysis.
They are detailed below.
Fig. 2. Overview of the face analysis module.
optical flow between a set of corresponding points in two successive frames. We make use of the Lucas-Kanade [28] algorithm implementation available in the OpenCV library.
Let n be the number of feature points and Pti(xi, yi) the i-th feature point defined by its 2D screen coordinates (xi, yi). We then define the overall velocity of the head as:
$$V = \begin{cases} V_x = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} (x_i - x_{i-1}) \\[2ex] V_y = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n} (y_i - y_{i-1}) \end{cases} \tag{3}$$
1) Face Detection: The face detection algorithm that we used exploits the Haar-like features initially proposed by Viola & Jones [26]. It is based on a cascade of boosted classifiers working with Haar-like features and trained with a few hundred sample views of faces. We used the trained classifier available in OpenCV. The face detection module outputs the coordinates of the faces found in the incoming images.
2) Smile Detection: Smile detection is performed in two
steps: mouth extraction followed by smile detection. We use
a colorimetric approach for mouth extraction. A thresholding
technique is used after a color space conversion to the YIQ
space. Once the mouth is extracted, we examine the ratio
between the two characteristic mouth dimensions, P1 P3 and
P2 P4 (figure 3), for smile detection. We assume that when
smiling, this ratio increases. The decision is obtained by
thresholding.
Fig. 4. Feature point velocity analysis.
Figure 4 shows the velocity curves along the vertical and horizontal axes. The sequence of movements represented is composed of one nod and two head shakes. We notice that the velocity curves are the sum of two signals: (1) a low-frequency noise signal representing the global head motion and (2) a high-frequency signal representing the head nods and head shakes.
The idea is then to use wavelet decomposition to remove the low-frequency signals. More precisely, we decomposed the signal using the symlet-6 wavelet. Figure 5 shows the reconstruction of the details at the first level of the signal shown in figure 4. The head nod and shake events can be reliably identified by this process.
Fig. 3. Smile detection: combining colorimetric and geometric approaches.
3) Head shake and nod detection: The purpose of this component is to detect whether the person is shaking or nodding the head. The idea is to analyze the motion of some feature points extracted from the face along the vertical and horizontal axes. Once the face has been detected in the image, we extract 100 feature points using the combined corner and edge detector defined by Harris [27]. Feature points are extracted in the central area of the face rectangle using offsets. These points are then tracked by calculating the
Fig. 5. Signal denoising via wavelets.
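The principle of keeping only the first-level details can be sketched without any external library. Note that this example swaps in the Haar wavelet for the symlet-6 wavelet used in the paper, purely to stay self-contained; the function name and the test signals are hypothetical, and a real implementation would rely on a wavelet library with the symlet-6 filter.

```python
def haar_detail_level1(signal):
    """Reconstruct the level-1 detail component of a signal (Haar wavelet).

    For each pair (x0, x1), the approximation is their average and the detail
    is half their difference; keeping only the detail removes slow drift.
    """
    n = len(signal) - len(signal) % 2      # ignore a trailing odd sample
    detail = []
    for k in range(0, n, 2):
        half_diff = (signal[k] - signal[k + 1]) / 2.0
        detail.extend([half_diff, -half_diff])
    return detail

# Slow drift (global head motion) plus a fast oscillation (nod or shake).
drift = [0.1 * t for t in range(8)]
mixed = [0.1 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(8)]
drift_detail = haar_detail_level1(drift)
mixed_detail = haar_detail_level1(mixed)
```

The slow ramp leaves almost nothing in the detail signal, whereas the fast oscillation passes through nearly unchanged, which is the behaviour exploited to isolate nods and shakes from global head motion.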
4) Head activity analysis: Analysis of recordings of the
storytelling experience has shown a correlation between the
head activity of both speaker and listener. To characterize
the head activity, we use the velocity of the feature points,
defined in (3), to quantify the overall activity A:
$$A = \sum_{t \,\in\, \text{time window}} \left(V_{x,t}^{2} + V_{y,t}^{2}\right) \tag{4}$$
where the time window is set to 60 frames (2 s at 30 frames/s).
TABLE II
QUANTIZATION OF THE HEAD ACTIVITY

Amplitude                      | Interpretation
< mean                         | LOW ACTIVITY
< mean + standard deviation    | MEDIUM ACTIVITY
Otherwise                      | HIGH ACTIVITY

VII. ASSESSMENT AND DISCUSSION
A. Experimental setup
This measure provides information about the head activity level. In order to quantize head activity into levels (high, medium or low), we analyzed the head activity of all the speakers of the eNTERFACE STEAD corpus. Assuming that the activity of a given speaker is Gaussian, we set the thresholds defined in table II. With these thresholds, the algorithm becomes more sensitive to any head movement of a stationary speaker, whereas it raises the thresholds for an active speaker, resulting in flexible, adaptive modeling.
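A minimal sketch of the activity pipeline is given below: mean point displacement between two frames (equation (3), read here as the displacement of each tracked point between successive frames, which is an interpretation), activity over a window (equation (4)), and quantization following table II. The function names and the sample history are hypothetical.

```python
from statistics import mean, pstdev

def head_velocity(prev_pts, curr_pts):
    """Mean displacement of the tracked points between two successive frames."""
    n = len(curr_pts)
    vx = sum(c[0] - p[0] for p, c in zip(prev_pts, curr_pts)) / n
    vy = sum(c[1] - p[1] for p, c in zip(prev_pts, curr_pts)) / n
    return vx, vy

def activity(velocities):
    """Sum of squared velocity components over a time window (equation 4)."""
    return sum(vx * vx + vy * vy for vx, vy in velocities)

def quantize(a, history):
    """Table II: thresholds from the speaker's own past activity values."""
    m, s = mean(history), pstdev(history)
    if a < m:
        return "LOW ACTIVITY"
    if a < m + s:
        return "MEDIUM ACTIVITY"
    return "HIGH ACTIVITY"

history = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical past activity values
```

Because the thresholds are computed from each speaker's own statistics, the same absolute movement can be rated high for a calm speaker and low for an animated one.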
V. MULTI-MODAL FUSION
The Multimodal Fusion Module works by activating probabilistic rules (cf. section II-B) depending on the multimodal events it receives. When a rule is completed, its output is sent as a feedback signal, in the form of a message, to the different agents/robots connected to the module.
The rules in this work are extracted from the analysis of the database annotations and hand-written using feedback rules defined in the literature [10] (section II-B). A rule takes a list of input events (mono- or multi-modal) and defines one output feedback signal (mono- or multi-modal). A rule can be made probabilistic by assigning it a probability, so that when several rules share the same input, each rule has its own probability of execution.
For real-time operation, each rule contains a response time variable, which defines when the output of the rule should be executed after the reception of the last input signal. If not all the input signals are received, the rule is deactivated after this period.
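The rule mechanism described above can be sketched as follows. This is only an illustrative reading of the text: the class names `Rule` and `Fusion`, the event names, and the choice of measuring the deactivation period from the first received input are assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Rule:
    inputs: frozenset      # events that must all be received
    output: str            # feedback signal to emit
    probability: float     # chance of executing once the inputs are complete
    response_time: float   # seconds: output delay, also the deactivation period

class Fusion:
    def __init__(self, rules, rng=None):
        self.rules = rules
        self.rng = rng or random.Random()
        self.pending = {}  # rule index -> (events seen so far, time of first event)

    def on_event(self, event, t):
        """Feed one (event, time) pair; return a list of (feedback, execution time)."""
        fired = []
        for idx, rule in enumerate(self.rules):
            if event not in rule.inputs:
                continue
            seen, start = self.pending.get(idx, (set(), t))
            if t - start > rule.response_time:
                seen, start = set(), t          # partial match expired: restart
            seen = seen | {event}
            if seen == rule.inputs:
                self.pending.pop(idx, None)
                if self.rng.random() < rule.probability:
                    fired.append((rule.output, t + rule.response_time))
            else:
                self.pending[idx] = (seen, start)
        return fired

# Hypothetical rule: falling pitch at an utterance end triggers a head nod.
nod_rule = Rule(frozenset({"utterance_end", "falling_pitch"}), "head_nod", 1.0, 0.5)
```

Giving several rules the same inputs but different probabilities yields varied feedback for identical events, which is how the text describes probabilistic rules.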
VI. REACTIVE BEHAVIORS
In our architecture, we aim to drive different types of virtual and/or physical agents simultaneously (figure 1). To ensure high flexibility, we use the same control language to drive all the agents: the Behavior Markup Language (BML) [6]. BML encodes multimodal behaviors independently of the animation parameters of the agents.
Through a mapping, we transform BML tags into MPEG-4 parameters for the GRETA agent and into mechanical movements for the AIBO robot. Various feedbacks are already available for GRETA, such as acceptance (head nod), non-acceptance (head shake) or smile. For AIBO, we developed similar feedbacks conveying the same meaning but in a different way. To develop the reactive behavior of AIBO, we used the URBI (Universal Real-Time Behavior Interface) library [29], which allows high-level control of the robot.
Evaluation research is still underway for virtual characters [30], [31] and for human-robot interaction [33]. Since the goal of the project was to compare the feedback provided by two types of embodiments (a virtual character and a robot) rather than to evaluate the multi-modal feedback rules implemented in each of these systems, we decided to have users tell a story to both GRETA and AIBO at the same time. An instruction form was given to each subject before the session. Users then watched the cartoon sequence and were asked to tell the story to both AIBO and GRETA (figure 6). Finally, users answered a questionnaire designed to compare both systems with respect to the realization of feedback (general comparison between the two listeners, evaluation of feedback quality, perception of feedback signals and general comments). Sessions were videotaped using a Canon XM1 3CCD digital camcorder.
The current evaluation aims at assessing the relevance of the characterization of communicative signals for the regulation of interaction. We performed only a pretest here; an ANOVA is not possible because the number of subjects is too small (10 users). In addition, no hypotheses were formulated about the expected results of the questionnaires.
Fig. 6. The assessment set-up.
As illustrated by figure 7, 8 out of 10 users estimated that GRETA understood the story better than AIBO. Yet 8 out of 10 users felt that AIBO looked more interested and liked the story more than GRETA did.
Fig. 7. Comparing the feedbacks provided by the virtual agent and the robot.
Further evaluations could be carried out with such a system. One possibility would be to have the speaker tell two different stories, one to GRETA and then another one to AIBO. The order of the listeners should be counterbalanced across subjects. This would avoid having the speaker switch attention between AIBO and GRETA. Perceptive tests on videos combining speakers and AIBO/GRETA listeners could also be designed to have subjects 1) compare random feedback with feedback generated by analyzing the user's behavior, or 2) rate whether the listener has been designed to listen to this speaker or not.
VIII. CONCLUSIONS AND FUTURE WORKS
We presented a multi-modal framework to extract and identify human communicative signals for the generation of robot/agent feedbacks during storytelling. We exploited face-to-face interaction analysis by highlighting communicative rules. A real-time feature extraction module has been presented, allowing the characterization of communicative events. These events are then interpreted by a fusion process for the generation of backchannel messages for both AIBO and GRETA. A simple evaluation was carried out, and the results show that there is a clear difference in the interpretation and realization of the communicative behavior between humans and agents/robots.
Our future work is devoted to the characterization of other communicative signals using the same modalities (speech and head). Prominence detection can be improved by the use of syllable-based analysis, which can be computed without linguistic information. Another important issue is the direction of gaze. This communicative signal conveys useful information during interaction, and both its automatic analysis (human) and generation (robot/agent) should be investigated.
IX. ACKNOWLEDGMENTS
We are grateful to Elisabetta Bevacqua for her advice in
the organization of our work and her help on interfacing our
software with GRETA. We also want to acknowledge Yannis
Stylianou for the feedback he gave during discussions on our
project. This project was partly funded by Région Wallonne,
in the framework of the NUMEDIART research program and
by the FP6 IP project CALLAS.
REFERENCES
[1] T. Fong, I. Nourbakhsh and K. Dautenhahn, A Survey of Socially Interactive Robots, Robotics and Autonomous Systems 42(3-4), 143-166, 2003.
[2] C. Breazeal, Social Interactions in HRI: The Robot View, R. Murphy
and E. Rogers (eds.), IEEE SMC Transactions, Part C, 2004
[3] C.L. Sidner, C. Lee, C.D. Kidd, N. Lesh, C. Rich, Explorations in
Engagement for Humans and Robots, Artificial Intelligence, May 2005
[4] J. Cassell, J. Sullivan, S. Prevost, and E. Churchill (eds). Embodied
Conversational Agents. MIT Press, 2000.
[5] E. Bevacqua, M. Mancini, and C. Pelachaud, A listening agent exhibiting variable behaviour, Intelligent Virtual Agents, IVA’08, Tokyo,
September 2008.
[6] H. Vilhjalmsson, N. Cantelmo, J. Cassell, N. E. Chafai, M. Kipp, S.
Kopp, M. Mancini, S. Marsella, A. N. Marshall, C. Pelachaud, Z.
Ruttkay, K. R. Thorisson, H. van Welbergen, R. van der Werf, The
Behavior Markup Language: Recent Developments and Challenges,
Intelligent Virtual Agents, IVA’07, Paris, September 2007.
[7] D. McNeill, Hand and Mind: What Gestures Reveal about Thought, Chicago, IL, The University of Chicago Press, 1992.
[8] R. L. Brennan, D. J. Prediger, Coefficient κ: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699, 1981.
[9] M. Kipp, Anvil - A Generic Annotation Tool for Multimodal Dialogue.
Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech), 1367-1370, 2001.
[10] R. M. Maatman, Jonathan Gratch, Stacy Marsella, Natural Behavior
of a Listening Agent. Intelligent Virtual Agents, IVA’05, 25-36, 2005.
[11] N. Ward, W. Tsukahara, Prosodic features which cue back-channel responses in English and Japanese. Journal of Pragmatics, 23, 1177-1207, 2000.
[12] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, H. Yan, Embodiment in Conversational Interfaces: Rea. Proceedings of the CHI'99 Conference, pp. 520-527. Pittsburgh, PA, 1999.
[13] J. Cassell and K. Thórisson, The Power of a Nod and a Glance: Envelope vs. Emotional Feedback in Animated Conversational Agents, Applied Artificial Intelligence, 13(3), 1999.
[14] M. Lohse, K. J. Rohlfing, B. Wrede, G. Sagerer, "Try something else!" - When users change their discursive behavior in human-robot interaction, IEEE Conference on Robotics and Automation, Pasadena, CA, USA, 3481-3486, 2008.
[15] J. Allwood, J. Nivre, and E. Ahlsen. On the semantics and pragmatics
of linguistic feedback. Semantics, 9(1), 1993.
[16] I. Poggi. Backchannel: from humans to embodied agents. In AISB.
University of Hertfordshire, Hatfield, UK, 2005.
[17] www.puredata.org
[18] The CMU Sphinx open source speech recognizer, http://cmusphinx.sourceforge.net
[19] A. De Cheveigné, H. Kawahara, YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111, 2002.
[20] P. Taylor. The Tilt Intonation model, ICSLP 98, Sydney, Australia.
1998.
[21] B.M. Streefkerk, L. C. W. Pols, L. ten Bosch, Acoustical features
as predictors for prominence in read aloud Dutch sentences used in
ANNs, Proc. Eurospeech’99, Vol. 1, Budapest, 551-554, 1999.
[22] J.M.B. Terken, Fundamental frequency and perceived prominence of
accented syllables. Journal of the Acoustical Society of America, 95(6),
3662-3665, 1994.
[23] N. Obin, X. Rodet, A. Lacheret-Dujour, French prominence: a probabilistic framework, in International Conference on Acoustics, Speech,
and Signal Processing (ICASSP08), Las Vegas, U.S.A, 2008.
[24] V. K. R. Sridhar, A. Nenkova, S. Narayanan, D. Jurafsky, Detecting
prominence in conversational speech: pitch accent, givenness and
focus. In Proceedings of Speech Prosody, Campinas, Brazil. 380-388,
2008.
[25] D. Wang, S. Narayanan, An Acoustic Measure for Word Prominence
in Spontaneous Speech. IEEE Transactions on Audio, Speech, and
Language Processing, Volume 15, Issue 2, 690-701, 2007.
[26] P. Viola, M.J. Jones, Robust Real-Time Face Detection, International
Journal of Computer Vision, 137-154, 2004.
[27] C.G. Harris, M.J. Stephens, A combined corner and edge detector,
Proc. Fourth Alvey Vision Conf., Manchester, 147-151, 1988
[28] B. Lucas, T. Kanade, An Iterative Image Registration Technique with
an Application to Stereo Vision, Proc. of 7th International Joint
Conference on Artificial Intelligence (IJCAI), pp. 674-679, 1981.
[29] B. Baillie, URBI: Towards a Universal Robotic Low-Level Programming Language, Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems - IROS05, 2005.
[30] D.M. Dehn, S. van Mulken, The impact of animated interface agents: a review of empirical research. International Journal of Human-Computer Studies, 52: 1-22, 2000.
[31] Z. Ruttkay, C. Pelachaud, From Brows to Trust - Evaluating Embodied
Conversational Agents, Kluwer, 2004.
[32] S. Buisine, J.-C. Martin, The effects of speech-gesture co-operation in animated agents' behaviour in multimedia presentations. Interacting with Computers: The interdisciplinary journal of Human-Computer Interaction, 19: 484-493, 2007.
[33] Dan R. Olsen, Michael A. Goodrich, Metrics for Evaluating Human-Robot Interactions. Performance Metrics for Intelligent Systems Workshop held in Gaithersburg, 2003.
Bibliographie
S. Al Moubayed, M. Baklouti, M. Chetouani, T. Dutoit, A. Mahdhaoui, J. C.
Martin, S. Ondas, C. Pelachaud, J. Urbain, and M. Yilmaz. Generating
robot/agent backchannels during a storytelling experiment. Robotics and
Automation, 2009. ICRA '09. IEEE International Conference on, pages
37493754, 2009. (Cité pages 7, 95, 96, 97 et 103.)
J. Allwood, J. Nivre, and E. Ahlsen. On the semantics and pragmatics of
linguistic feedback. Journal of Semantics, 9(1) :126, 1992. (Cité page 94.)
R. Andre-Obrecht. A new statistical approach for the automatic segmentation
of continuous speech signals. Acoustics, Speech and Signal Processing, IEEE
Transactions on, 36(1) :29 40, jan 1988. (Cité page 30.)
R. Andre-Obrecht and B. Jacob. Direct identication vs. correlated models
to process acoustic and articulatory informations in automatic speech recognition. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97.,
1997 IEEE International Conference on, volume 2, pages 999 1002, apr
1997. (Cité page 30.)
M. Argyle. Bodily communication. Methuen, 1987. (Cité page 1.)
M. Argyle and M. Cook. Gaze and Mutual Gaze. Cambridge University Press,
1976. (Cité page 91.)
B. S. Atal and Suzanne L. Hanauer. Speech analysis and synthesis by linear
prediction of the speech wave. The Journal of the Acoustical Society of
America, 50(2B) :637655, 1971. (Cité pages 20 et 21.)
A. Batliner, C. Hacker, M. Kaiser, H. Mögele, and E. Nöth. Taking into
account the user's focus of attention with the help of audio-visual information : towards less articial human-machine-communication. In International Conference on Auditory-Visual Speech Processing, 2007. (Cité page 98.)
A. Batliner, D. Seppi, S. Steidl, and B. Schuller. Segmenting into adequate
units for automatic recognition of emotion-related episodes : A speech-based
approach. Advances in Human-Computer Interaction, 2010. (Cité page 25.)
A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir. Whodunnit searching for the most important feature types signalling emotion-related
user states in speech. Comput. Speech Lang., 25 :428, January 2011. (Cité
page 37.)
206
Bibliographie
F. Bernieri. Coordinated movement and rapport in teacher-student interactions. Journal of Nonverbal Behavior, 12 :120138, 1988. (Cité pages 79
et 82.)
F. Bernieri and R. Rosenthal. Interpersonal coordination : Behavior matching
and interactional synchrony. Fundamentals of nonverbal behavior. Cambridge University Press, 1991. (Cité page 62.)
Ph. Bidaud, M. Bouzit, and M. Chetouani. Support robotisé de dispositif
multimédia. Brevet N ◦ 10 54317 du 02 juin 2010, 2010. (Cité page 110.)
A. Blum and T. Mitchell. Combining labeled and unlabeled data with cotraining. In Conference on computational learning theory, 1998. (Cité
page 55.)
J. Broekens, M. Heerink, and H. Rosendal. Assistive social robots in elderly
care : a review. Gerontechnology, 8(2) :94103, 2009. (Cité page 103.)
N. Campbell. On the use of nonverbal speech sounds in human communication. In A. Esposito, M. Faundez-Zanuy, E. Keller, and M. Marinaro,
editors, Verbal and Nonverbal Communication Behaviours, volume 4775 of
Lecture Notes in Computer Science, pages 117128. Springer, 2007. (Cité
page 15.)
N. Campbell. Individual traits of speaking style and speech rhythm in a
spoken discourse. In A. Esposito, N. Bourbakis, N. Avouris, and I. Hatzilygeroudis, editors, Verbal and Nonverbal Features of Human-Human and
Human-Machine Interaction, volume 5042 of Lecture Notes in Computer
Science, pages 107120. Springer, 2008. (Cité page 6.)
N. Campbell. An audio-visual approach to measuring discourse synchrony in
multimodal conversation data. In Interspeech 2009, 2010. (Cité page 6.)
J. N. Cappella. Behavioral and judged coordination in adult informal social
interactions : Vocal and kinesic indicators. Journal of Personality and Social
Psychology, 72(1) :119131, 1997. (Cité page 77.)
G. Castellano, A. Pereira, I. Leite, A. Paiva, and P. W. McOwan. Detecting
user engagement with a robot companion using task and social interactionbased features. In Proceedings of the 2009 international conference on Multimodal interfaces, ICMI-MLMI '09, pages 119126, 2009. (Cité pages 91
et 92.)
Bibliographie
207
C. Charbuillet, B. Gas, M. Chetouani, and J.-L Zarader. Optimizing feature
complementarity by evolution strategy : Application to automatic speaker
verication. Speech Communication, 51(9) :724731, September 2009. (Cité
page 25.)
T. L Chartrand and J. A Bargh. The chameleon eect : the perceptionbehavior link and social interaction. Journal of Personality and Social Psychology, 76(6) :893910, 1999. (Cité page 63.)
M. Chetouani. Codage neuro-prédictif pour l'extraction de caractéristiques de
signaux de parole. PhD thesis, Université Pierre et Marie Curie, Décembre
2004. (Cité pages 17 et 21.)
M. Chetouani, M. Faundez-Zanuy, B. Gas, and J. L. Zarader. Investigation on
lp-residual representations for speaker identication. Pattern Recognition,
42(3) :487494, 3 2009a. (Cité pages 19, 20, 22, 23 et 24.)
M. Chetouani, M. Faundez-Zanuy, A. Hussain, B. Gas, J. L. Zarader, and
K. Paliwal. Special issue on non-linear and non-conventional speech processing (guest editorial). Speech Communication, 51(9) :713713, 9 2009b.
(Cité page 19.)
M. Chetouani, A. Hussain, B. Gas, M. Milgram, and J. L. Zarader, editors.
Advances in Nonlinear Speech Processing, volume 4885 of Lecture Notes in
Computer Science. Springer, 2009c. (Cité page 19.)
M. Chetouani, A. Mahdhaoui, and F. Ringeval. Time-scale feature extractions
for emotional speech characterization. Cognitive Computation, 1(2) :194
201, 2009d. (Cité pages 10 et 25.)
M. Chetouani, Y. Wu, C. Jost, B. LE Pevedic, C. Fassert, V. CristianchoLacroix, S. Lassiaille, C. Granata, A. Tapus, D. Duhaut, and A.S. Rigaud.
Cognitive services for elderly people : The robadom project. In ECCE 2010
Workshop : Robots that Care, European Conference on Cognitive Ergonomics 2010, 2010. (Cité pages x, 98 et 99.)
G. Chittaranjan, O. Aran, and D. Gatica-Perez. Exploiting observers' judgements for nonverbal group interaction analysis. In IEEE Conference on
Automatic Face and Gesture Recognition, 2011. (Cité page 88.)
C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette. Fear-type
emotion recognition for future audio-based surveillance systems. Speech
Communication, 50(6) :487 503, 2008. (Cité pages 28 et 29.)
208
Bibliographie
A. Clodic, H. Cao, S. Alili, V. Montreuil, R. Alami, and R. Chatila. Shary :
A supervision system adapted to human-robot interaction. In O. Khatib,
V. Kumar, and G. Pappas, editors, Experimental Robotics, volume 54, pages
229238. Springer Berlin / Heidelberg, 2009. (Cité page 91.)
J. F. Cohn. Advances in behavioral science using automated facial image
analysis and synthesis [social sciences]. Signal Processing Magazine, IEEE,
27(6) :128133, 2010. (Cité page 8.)
Z. Cong and M. Chetouani. Hilbert-huang transform based physiological signals analysis for emotion recognition. In Signal Processing and Information
Technology (ISSPIT), 2009 IEEE International Symposium on, pages 334
339, 2009. (Cité pages 106 et 107.)
Z. Cong, M. Chetouani, and A. Tapus. Automatic gait characterization
for a mobility assistance system. In Control Automation Robotics Vision
(ICARCV), 2010 11th International Conference on, pages 473 478, 2010.
(Cité pages 107 et 108.)
Z. Cong, X. Clady, and M. Chetouani. An embedded human motion capture
system for an assistive walking robot. In Rehabilitation Robotics (ICORR),
2011 IEEE International Conference on, pages 1 6, 2011. (Cité page 108.)
A. Couture-Beil, R.T. Vaughan, and G. Mori. Selecting and commanding
individual robots in a multi-robot system. In Computer and Robot Vision
(CRV), 2010 Canadian Conference on, pages 159 166, 2010. (Cité page 91.)
F. Cummins. Speech rhythm and rhythmic taxonomy. In Speech Prosody,
volume 121-126, 2002. (Cité page 38.)
F. Cummins. Rhythm as entrainment : The case of synchronous speech. Journal of Phonetics, 37(1) :1628, 2008. (Cité page 38.)
J. Curhan and A. Pentland. Thin slices of negotiation : Predicting outcomes
from conversational dynamics within the rst 5 minutes. Journal of Applied
Psychology, 92(3) :802811, May 2007. (Cité page 5.)
J. Dauwels, F. Vialatte, T. Musha, and A. Cichocki. A comparative study of
synchrony measures for the early diagnosis of alzheimer's disease based on
eeg. NeuroImage, 49(1) :668 693, 2010. (Cité page 79.)
E. Delaherche and M. Chetouani. Multimodal coordination : exploring relevant features and measures. In Proceedings of the 2nd international workshop on Social signal processing, ACM Multimedia 2010, SSPW '10, pages
4752. ACM, 2010. (Cité pages x, 78, 79 et 80.)
Bibliographie
209
E. Delaherche and M. Chetouani. Characterization of coordination in an
imitation task : human evaluation and automatically computable cues. In
International Conference on Multimodal Interaction (ICMI 2011), 2011a.
(Cité pages x, 82, 83, 84 et 86.)
E. Delaherche and M. Chetouani. Automatic recognition of coordination level
in an imitation task. In Proceedings of the 3rd international workshop on
Social signal processing, ACM Multimedia 2010, 2011b. (Cité pages 85
et 86.)
E. Delaherche, M. Chetouani, A. Mahdhaoui, C. Saint-Georges, S. Viaux,
and D. Cohen. Evaluation of interpersonal synchrony : multidisciplinary
approaches. Soumis, 2011. (Cité pages 62, 63 et 65.)
J. Demouy, M. Plaza, J. Xavier, F. Ringeval, M. Chetouani, D. Perisse,
D. Chauvin, S. Viaux, B. Golse, D. Cohen, and L. Robel. Dierential language markers of pathology in autism, pervasive developmental disorder not
otherwise specied and specic language impairment. Research in Autism
Spectrum Disorders, 5(4) :14021412, 2011. (Cité pages 46, 47 et 49.)
L. Devillers, L. Vidrascu, and L. Lamel. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4) :407
422, 2005. (Cité page 17.)
R. M. Diaz and L. E. Berk, editors. Private speech : From social interaction
to self-regulation. Lawrence Erlbaum, 1992. (Cité page 100.)
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classication (2nd Edition).
Wiley-Interscience, 2000. (Cité page 16.)
G. Dumas, J. Nadel, R. Soussignan, J. Martinerie, and L. Garnero. Interbrain synchronization during social interaction. PLoS ONE, 5(8) :e12166,
08 2010. (Cité pages 62 et 77.)
S. Duncan. Some signals and rules for taking speaking turns in conversations.
Journal of Personality and Social Psychology, 23(2) :283 292, 1972. (Cité
page 91.)
N. Eagle and A. Pentland. Eigenbehaviors : identifying structure in routine.
Behavioral Ecology and Sociobiology, 63 :10571066, 2009. (Cité pages 5,
71 et 72.)
K. Farrahi and D. Gatica-Perez. Probabilistic mining of socio-geographic
routines from mobile phone data. Selected Topics in Signal Processing,
IEEE Journal of, 4(4) :746755, Aug. 2010. (Cité pages 6 et 76.)
210
Bibliographie
M. Faundez-Zanuy. Data fusion in biometrics. Aerospace and Electronic Systems Magazine, IEEE, 20(1) :34 38, January 2005. (Cité page 10.)
M. Faundez-Zanuy. On the usefulness of linear and nonlinear prediction residual signals for speaker recognition. In M. Chetouani and al., editors,
Proceedings of the 2007 international conference on Advances in nonlinear
speech processing, pages 95104. Springer, 2007. (Cité page 20.)
M. Faundez-Zanuy, U. Laine, G. Kubin, B. McLaughlin, S.and Kleijn, G. Chollet, B. Petek, and A. Hussain. The cost-277 european action : An overview,
2005. (Cité page 19.)
D. Feil-Seifer and M. J. Mataric. Dening socially assistive robotics. International Conference on Rehabilitation Robotics (ICORR), pages 465468,
2005. (Cité page 103.)
R. Feldman. Infant-mother and infant-father synchrony : The coregulation
of positive arousal. Infant Mental Health Journal, 24(1) :123, 2003. ISSN
1097-0355. (Cité page 80.)
R. Feldman. Parent-infant synchrony and the construction of shared timing ;
physiological precursors, developmental outcomes, and risk conditions. The
Journal of Child Psychology and Psychiatry and Allied Disciplines, 48(3-4) :
329354, 2007. (Cité pages 63, 68, 69 et 80.)
A. Fernald and P. Kuhl. Acoustic determinants of infant preference for motherese speech. Infant Behavior and Development, 10 :279293, 1987. (Cité
page 52.)
C. Fernyhough and E. Fradley. Private speech on an executive task : relations
with task diculty and task performance. Cognitive Development, 20(1) :
103 120, 2005. (Cité pages 100 et 102.)
H. Fujisaki. Information, prosody, and modeling - with emphasis on tonal
features of speech. In Speech Prosody, 2004. (Cité pages ix et 15.)
E. Goman. Behavior in Public Places : Notes on the Social Organization of
Gatherings. The Free Press, 1963. (Cité pages 91 et 92.)
M. H Goldstein, A. P King, and M. J West. Social interaction shapes babbling :
Testing parallels between birdsong and speech. Proceedings of the National
Academy of Sciences of the United States of America, 100(13) :80308035,
2003. (Cité page 63.)
Bibliographie
211
C. Goodwin. Gestures as a resource for the organization of mutual attention.
Semiotica, 62(1/2) :2949, 1986. (Cité page 91.)
E. Grabe and E. L. Low. Durational variability in speech and the rhythm
class hypothesis. In de Gruyter, editor, Papers in Laboratory Phonology,
volume 7, pages 515546. The Hague, Mouton, 2002. (Cité pages 39, 40
et 42.)
C. Granata, M. Chetouani, A. Tapus, P. Bidaud, and V. Dupourque. Voice and
graphical -based interfaces for interaction with a robot dedicated to elderly
and people with cognitive disorders. In RO-MAN, 2010 IEEE, pages 785
790, 2010. (Cité page 104.)
S. Guionnet, J. Nadel, E. Bertasi, M. Sperduti, P. Delaveau, and P. Fossati.
Reciprocal imitation : Toward a neural basis of social interaction. Cerebral
Cortex, 2011. (Cité page 62.)
H. Gunes and M. Pantic. Automatic, dimensional and continuous emotion
recognition. Int'l Journal of Synthetic Emotion, 1(1) :6899, 2010. (Cité
page 85.)
H. Gunes, B. Schuller, M. Pantic, and R. Cowie. Emotion representation,
analysis and synthesis in continuous space : A survey. In Proceedings of
IEEE International Conference on Automatic Face and Gesture Recognition (FG'11), EmoSPACE 2011 - 1st International Workshop on Emotion
Synthesis, rePresentation, and Analysis in Continuous spacE, Santa Barbara, CA, USA, March 2011. (Cité pages 83, 85 et 86.)
C. Hacker, A. Batliner, and E. Nöth. Are you looking at me, are you talking
with me: Multimodal classification of the focus of attention. In P. Sojka,
I. Kopecek, and K. Pala, editors, Text, Speech and Dialogue, volume 4188,
pages 581-588. Springer Berlin / Heidelberg, 2006. (Cited on pages 100 and 101.)
W. A. Harrist and R. M. Waugh. Dyadic synchrony: Its structure and function
in children's development. Developmental Review, 22(4):555-592, 2002.
(Cited on page 63.)
H. Hermansky and S. Sharma. Temporal patterns (traps) in asr of noisy
speech. In Acoustics, Speech, and Signal Processing, 1999. ICASSP '99.
Proceedings., 1999 IEEE International Conference on, volume 1, pages 289-292,
March 1999. (Cited on page 16.)
H. Hung and D. Gatica-Perez. Estimating cohesion in small groups using
audio-visual nonverbal behavior. IEEE Transactions on Multimedia,
12(6):563-575, 2010. (Cited on pages 64, 77, 82 and 85.)
H. Hung, Y. Huang, G. Friedland, and D. Gatica-Perez. Estimating dominance
in multi-party meetings using speaker diarization. IEEE Transactions
on Audio, Speech and Language Processing, 19(4):847-860, 2011. (Cited on
page 77.)
R. Ishii, Y. Shinohara, T. Nakano, and T. Nishida. Combining multiple types
of eye-gaze information to predict user's conversational engagement. In 2nd
Workshop on Eye Gaze on Intelligent Human Machine Interaction, 2011.
E. Keller. The analysis of voice quality in speech processing. In G. Chollet,
A. Esposito, M. Faundez-Zanuy, and M. Marinaro, editors, Summer School
on Neural Networks, volume 3445 of Lecture Notes in Computer Science,
pages 54-73. Springer, 2004. (Cited on page 15.)
A. Kendon. Some functions of gaze-direction in social interaction. Acta
Psychologica, 26(1):22-63, 1967. (Cited on page 91.)
A. Kendon, R.M. Harris, and M.R. Key. Organization of behavior in face to
face interactions. The Hague, Mouton, 1975. (Cited on pages 1 and 2.)
J. Kim and E. André. Emotion recognition based on physiological changes in
listening music. IEEE Trans. on Pattern Analysis and Machine Intelligence,
30(12):2067-2083, December 2008. (Cited on pages 105 and 107.)
S. Kim, P.G. Georgiou, Sungbok Lee, and S. Narayanan. Real-time emotion
detection system using speech: Multi-modal fusion of different timescale
features. In Multimedia Signal Processing, 2007. MMSP 2007. IEEE 9th
Workshop on, pages 48-51, October 2007. (Cited on page 29.)
G. Klein, D. D. Woods, J. M. Bradshaw, R. R. Hoffman, and P. Feltovich.
Ten challenges for making automation a "team player" in joint human-agent
activity. IEEE Intelligent Systems, 19(6):91-95, 2004. (Cited on page 91.)
G. Kubin. Nonlinear processing of speech. In W. Kleijn and K.K. Paliwal,
editors, Speech Coding and Synthesis, pages 557-610. Elsevier, 1995. (Cited on
page 19.)
P. Kuhl. Early language acquisition: cracking the speech code. Nature Reviews
Neuroscience, 5(11):831-843, November 2004. (Cited on pages 63 and 68.)
Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms.
Wiley-Interscience, 2004. (Cited on pages 10 and 48.)
G. Lacey and S. MacNamara. User involvement in the design and evaluation of
a smart mobility aid. Journal Of Rehabilitation Research And Development,
37(6):709-723, 2000. (Cited on page 105.)
D. Lakens. Movement synchrony and perceived entitativity. Journal of
Experimental Social Psychology, 46(5):701-708, 2010. (Cited on pages 63 and 77.)
J. Le Maitre and M. Chetouani. Self-talk discrimination in human-robot
interaction situations for engagement characterization. Submitted, 2011. (Cited on
pages 91, 100 and 102.)
C. M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazemzadeh, S. Lee, and
S. Narayanan. Effects of emotion on different phoneme classes. The Journal of
the Acoustical Society of America, 116(4):2481-2481, 2004. (Cited on pages 28
and 32.)
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755):788-791, October 1999. (Cited on page 71.)
L. Leinonen, T. Hiltunen, I. Linnankoski, and M. J. Laakso. Expression of
emotional-motivational connotations with a one-word utterance. J Acoust
Soc Am, 102(3):1853-63, September 1997. (Cited on pages 28 and 32.)
M. Little. Mathematical foundations of nonlinear, non-gaussian, and
time-varying digital speech signal processing. In Nonlinear Speech Processing
NOLISP 2011, Lecture Notes in Computer Science. Springer, 2011. (Cited on
page 19.)
R. Lunsford, S. Oviatt, and R. Coulston. Audio-visual cues distinguishing
self- from system-directed speech in younger and older adults. In Proceedings of
the 7th international conference on Multimodal interfaces, pages 167-174,
2005. (Cited on pages 98 and 100.)
R. M. Maatman, Jonathan Gratch, and Stacy Marsella. Natural behavior of
a listening agent, pages 25-36. Springer-Verlag, 2005. (Cited on page 95.)
M. S. Magnusson. Discovering hidden time patterns in behavior: T-patterns
and their detection. Behav Res Methods Instrum Comput, 32(1):93-110,
2000. (Cited on page 65.)
S. R. Mahadeva Prasanna, Cheedella S. Gupta, and B. Yegnanarayana.
Extraction of speaker-specific excitation information from linear prediction
residual of speech. Speech Communication, 48(10):1243-1261, October 2006. (Cited on
pages 20, 21 and 25.)
A. Mahdhaoui. Analyse de Signaux Sociaux pour la Modélisation de
l'interaction face à face. PhD thesis, Université Pierre et Marie Curie, 2010. (Cited on
pages 52, 62, 70, 71, 72 and 73.)
A. Mahdhaoui and M. Chetouani. Supervised and semi-supervised
infant-directed speech classification for parent-infant interaction analysis. Speech
Communication, 53(9-10):1149-1161, 2011. (Cited on pages 55, 56 and 57.)
A. Mahdhaoui, M. Chetouani, and Cong Zong. Motherese detection based
on segmental and supra-segmental features. In Pattern Recognition, 2008.
ICPR 2008. 19th International Conference on, pages 1-4, December 2008. (Cited on
pages 29 and 53.)
A. Mahdhaoui, M. Chetouani, R. S. Cassel, C. Saint-Georges, E. Parlato,
M.-C. Laznik, F. Apicella, F. Muratori, S. Maestro, and D. Cohen.
Computerized home video detection for motherese may help to study impaired
interaction between infants who become autistic and their parents.
International Journal of Methods in Psychiatric Research, 20(1):e6-e18, 2011.
(Cited on page 53.)
D. McNeill. Hand and mind: what gestures reveal about thought. University
of Chicago Press, 1992. (Cited on page 95.)
A. N. Meltzoff, P. K. Kuhl, J. Movellan, and T. J. Sejnowski. Foundations for
a new science of learning. Science, 325(5938):284-288, 2009. (Cited on pages 8,
68 and 88.)
A. N. Meltzoff, R. Brooks, A. P. Shon, and R. P. N. Rao. "Social" robots are
psychological agents for infants: A test of gaze following. Neural Networks,
23(8-9):966-972, 2010. (Cited on pages 8 and 90.)
D. Messinger, P. Ruvolo, V. N. Ekas, and A. Fogel. Applying machine learning
to infant interaction: The development is in the details. Neural Networks,
23(8-9):1004-1016, 2010. (Cited on page 65.)
M. P. Michalowski, S. Sabanovic, and H. Kozima. A dancing robot for rhythmic
social interaction. Proceeding of the ACM IEEE international conference
on Human-robot interaction HRI 07, page 89, 2007. (Cited on page 64.)
M.P. Michalowski, S. Sabanovic, and R. Simmons. A spatial model of
engagement for a social robot. In Advanced Motion Control, 2006. 9th IEEE
International Workshop on, pages 762-767, 2006. (Cited on page 92.)
E. Monte-Moreno, M. Chetouani, M. Faundez-Zanuy, and J. Sole-Casals.
Maximum likelihood linear programming data fusion for speaker recognition.
Speech Communication, 51(9):820-830, 2009. (Cited on pages 23 and 24.)
L.-P. Morency. Modeling human communication dynamics [social sciences].
Signal Processing Magazine, IEEE, 27(5):112-116, September 2010. (Cited on pages 61
and 94.)
L.-P. Morency, I. de Kok, and J. Gratch. Context-based recognition during
human interactions: automatic feature selection and encoding dictionary. In
Proceedings of the 10th international conference on Multimodal interfaces,
pages 181-188, 2008. (Cited on page 7.)
E. Mower, D.J. Feil-Seifer, M.J. Mataric, and S. Narayanan. Investigating
implicit cues for user state estimation in human-robot interaction using
physiological measurements. In Robot and Human interactive Communication,
2007. RO-MAN 2007. The 16th IEEE International Symposium on,
pages 1125-1130, 2007. (Cited on pages 92, 104 and 105.)
F. Muratori, F. Apicella, P. Muratori, and S. Maestro. Intersubjective
disruptions and caregiver-infant interaction in early autistic disorder. Research in
Autism Spectrum Disorders, 5(1):408-417, 2011. (Cited on page 68.)
L. Murray and C. Trevarthen. Emotional regulation of interactions between
two-month-olds and their mothers, pages 177-197. Ablex, 1985. (Cited on
page 63.)
B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita. Footing in
human-robot conversations: how robots might shape participant roles using gaze
cues. In Proceedings of the 4th ACM/IEEE international conference on
Human robot interaction, pages 61-68, 2009. (Cited on page 92.)
J. Nadel, I. Carchon, C. Kervella, D. Marcelli, and D. Reserbat-Plantey.
Expectancies for social contingency in 2-month-olds. Developmental Science,
2(2):164-173, 1999. (Cited on page 63.)
Y. I. Nakano and R. Ishii. Estimating user's engagement from eye-gaze
behaviors in human-agent conversations. In Proceedings of the 15th international
conference on Intelligent user interfaces, pages 139-148, 2010. (Cited on
page 91.)
M.A. Nicolaou, H. Gunes, and M. Pantic. Output-associative rvm regression
for dimensional and continuous emotion prediction. In Automatic Face
Gesture Recognition and Workshops (FG 2011), 2011 IEEE International
Conference on, pages 16-23, March 2011. (Cited on page 87.)
D. Olsen and M. Goodrich. Metrics for evaluating human-robot interactions.
In Proc. NIST Performance Metrics for Intelligent Systems Workshop, 2003.
(Cited on page 102.)
D. Oppermann, F. Schiel, S. Steininger, and N. Beringer. Off-talk, a problem
for human-machine-interaction? In Proc European Conf on Speech
Communication and Technology, pages 25, 2001. (Cited on page 98.)
J. Ortega-Garcia, J. Gonzalez-Rodriguez, and V. Marrero-Aguiar. Ahumada:
A large speech corpus in spanish for speaker characterization and
identification. Speech Communication, 31(2-3):255-264, 2000. (Cited on page 23.)
O. Oullier, G. C. de Guzman, K. J. Jantzen, J. Lagarde, and J. A. Scott Kelso.
Social coordination dynamics: Measuring human bonding. Social
Neuroscience, 3(2):178-192, 2008. (Cited on page 63.)
K.K. Paliwal and M.M. Sondhi. Recognition of noisy speech using
cumulant-based linear prediction analysis. In Acoustics, Speech, and Signal Processing,
1991. ICASSP-91., 1991 International Conference on, pages 429-432, vol. 1,
April 1991. (Cited on pages 21 and 22.)
C. Pelachaud. Modelling multimodal expression of emotion in a virtual agent.
Philosophical Transactions of the Royal Society B: Biological Sciences,
364(1535):3539-3548, 2009. (Cited on page 3.)
F. Pellegrino. Rhythm. In P. Hogan, editor, The Cambridge Encyclopedia of
the Language Sciences. Cambridge University Press, 2011. (Cited on page 38.)
A. Pentland. Social dynamics: Signals and behavior. In International
Conference on Developmental Learning, 2004. (Cited on page 3.)
A. Pentland. Social signal processing (exploratory dsp). Signal Processing
Magazine, IEEE, 24(4):108-111, 2007. (Cited on pages 1 and 3.)
A. Pentland. Honest Signals: how they shape our world. MIT Press, 2008.
C. Pereira and C. Watson. Some acoustic characteristics of emotion. In Fifth
International Conference on Spoken Language Processing, 1998. (Cited on pages 28
and 32.)
C. Peters, G. Castellano, and S. de Freitas. An exploration of user engagement
in hci. In Proceedings of the International Workshop on Affective-Aware
Virtual Agents and Social Robots, pages 9:1-9:3, 2009. (Cited on page 92.)
R. W. Picard. Affective computing. MIT Press, Cambridge, MA, USA, 1997.
(Cited on pages 1, 2 and 26.)
K. Prepin and P. Gaussier. How an agent can detect and use synchrony
parameter of its own interaction with a human? Development of Multimodal
Interfaces Active Listening and Synchrony, pages 50-65, 2010. (Cited on
pages 64 and 85.)
K. Prepin and C. Pelachaud. Shared understanding and synchrony emergence:
Synchrony as an indice of the exchange of meaning between dialog
partners. Third International Conference on Agents and Artificial
Intelligence ICAART2011, pages 37-39, 2011. (Cited on pages 64 and 85.)
F. Ramseyer and W. Tschacher. Nonverbal synchrony in psychotherapy:
Coordinated body movement reflects relationship quality and outcome. Journal
of Consulting and Clinical Psychology, 79(3):284-295, 2011. (Cited on
pages 64, 77, 79 and 80.)
F. Ramus, M. Nespor, and J. Mehler. Correlates of linguistic rhythm in the
speech signal, 1999. (Cited on pages 39 and 40.)
D. Reynolds, W. Andrews, J. Campbell, J. Navratil, B. Peskin, A. Adami, Qin
Jin, D. Klusacek, J. Abramson, R. Mihaescu, J. Godfrey, D. Jones, and Bing
Xiang. The supersid project: exploiting high-level information for
high-accuracy speaker recognition. In Acoustics, Speech, and Signal Processing,
2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on,
volume 4, April 2003. (Cited on pages 18 and 25.)
C. Rich, C. L. Sidner, and N. Lesh. Collagen: Applying collaborative discourse
theory to human-computer interaction. AI Magazine, 22(4):15-26, 2001.
(Cited on page 91.)
C. Rich, B. Ponsleur, A. Holroyd, and C. L. Sidner. Recognizing engagement in
human-robot interaction. In Proceeding of the 5th ACM/IEEE international
conference on Human-robot interaction, pages 375-382, 2010. (Cited on pages 91
and 92.)
F. Ringeval. Ancrages et modèles dynamiques de la prosodie : application à la
reconnaissance des émotions actées et spontanées. PhD thesis, Université
Pierre et Marie Curie, 2011. (Cited on pages ix, xi, 15, 28, 29, 30, 32, 34, 38,
39, 40, 41, 43 and 46.)
F. Ringeval and M. Chetouani. A vowel based approach for acted emotion
recognition. In Interspeech 2008, pages 2763-2766, 2008. (Cited on page 29.)
F. Ringeval and M. Chetouani. Hilbert-Huang transform for non-linear
characterization of speech rhythm. In ISCA Tutorial and Research Workshop
on Non-Linear Speech Processing, 2009. (Cited on page 41.)
F. Ringeval, J. Demouy, G. Szaszak, M. Chetouani, L. Robel, J. Xavier,
D. Cohen, and M. Plaza. Automatic intonation recognition for the prosodic
assessment of language-impaired children. IEEE Transactions on Audio, Speech
& Language Processing, 19(5):1328-1342, 2011. (Cited on pages 46, 47 and 48.)
M. Rolf, M. Hanheide, and K. J. Rohlfing. Attention via synchrony: Making
use of multimodal cues in social learning. IEEE Transactions on
Autonomous Mental Development, 1(1):55-67, 2009. (Cited on page 64.)
J.-L. Rouas, J. Farinas, F. Pellegrino, and R. André-Obrecht. Rhythmic
unit extraction and modelling for automatic language identification. Speech
Communication, 47(4):436-456, 2005. (Cited on page 30.)
C. Saint-Georges. Dynamique, synchronie, réciprocité et mamanais dans les
interactions des bébés autistes à travers les films familiaux. PhD thesis,
Université Pierre et Marie Curie, 2011. (Cited on pages x, 52, 62, 67, 68, 70, 74,
75 and 76.)
C. Saint-Georges, R. Cassel, D. Cohen, M. Chetouani, M.C. Laznik,
S. Maestro, and F. Muratori. What studies of family home movies can teach us
about autistic infants: A literature review. Research in Autism Spectrum
Disorders, 4(3):355-366, 2010. (Cited on page 67.)
C. Saint-Georges, M. Chetouani, R. Cassel, A. Mahdhaoui, F. Muratori, M.-C.
Laznik, and D. Cohen. Motherese, an emotion and interaction based
process, impacts infant's cognitive development. Submitted, 2011a. (Cited on page 52.)
C. Saint-Georges, A. Mahdhaoui, M. Chetouani, M.C. Laznik, F. Apicella,
P. Muratori, S. Maestro, F. Muratori, and D. Cohen. Do parents recognize
autistic deviant behavior long before diagnosis? Taking into account
interaction using computational methods. PLOS ONE, 6(7):e22393, July 2011b.
(Cited on pages 53, 68, 69, 70, 72, 73 and 74.)
J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P. W. McOwan, and A. Paiva.
Automatic analysis of affective postures and body motion to detect
engagement with a game companion. In Proceedings of the 6th international
conference on Human-robot interaction, pages 305-312, 2011. (Cited on page 92.)
M. Schroder, S. Pammi, H. Gunes, M. Pantic, M.F. Valstar, R. Cowie,
G. McKeown, D. Heylen, M. ter Maat, F. Eyben, B. Schuller, M. Wollmer,
E. Bevacqua, C. Pelachaud, and E. de Sevin. Come and have an emotional
workout with sensitive artificial listeners! In Automatic Face Gesture
Recognition and Workshops (FG 2011), 2011 IEEE International Conference
on, page 646, 2011. (Cited on pages 8 and 94.)
B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers,
L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson. The relevance of
feature type for the automatic classification of emotional user states: low
level descriptors and functionals. In INTERSPEECH, pages 2253-2256,
2007a. (Cited on pages 25 and 37.)
B. Schuller, B. Vlasenko, R. Minguez, G. Rigoll, and A. Wendemuth.
Comparing one and two-stage acoustic modeling in the recognition of emotion
in speech. In Automatic Speech Recognition Understanding, 2007. ASRU.
IEEE Workshop on, pages 596-600, December 2007b. (Cited on pages 26, 27 and 28.)
B. Schuller, S. Steidl, and A. Batliner. The Interspeech 2009 emotion challenge.
In Interspeech 2009, 2009. (Cited on page 37.)
B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz,
A. Wendemuth, and G. Rigoll. Cross-corpus acoustic emotion recognition: Variances
and strategies. Affective Computing, IEEE Transactions on, 1(2):119-131,
July-December 2010. (Cited on page 17.)
B. Schuller, A. Batliner, S. Steidl, and D. Seppi. Recognising realistic emotions
and affect in speech: State of the art and lessons learnt from the first
challenge. Speech Communication, 53(9-10):1062-1087, 2011. Sensing
Emotion and Affect - Facing Realism in Speech Processing. (Cited on pages 25,
43 and 52.)
M. Shami and W. Verhelst. An evaluation of the robustness of existing
supervised machine learning approaches to the classification of emotions in
speech. Speech Communication, 49(3):201-212, 2007. (Cited on pages 26, 27
and 35.)
C. Shi, M. Shimada, T. Kanda, H. Ishiguro, and N. Hagita. Spatial formation
model for initiating conversation. In Proceedings of Robotics: Science and
Systems, 2011. (Cited on page 92.)
M. Shimada, Y. Yoshikawa, M. Asada, N. Saiwaki, and H. Ishiguro. Effects of
observing eye contact between a robot and another person. International
Journal of Social Robotics, 3:143-154, 2011. (Cited on page 92.)
C. L. Sidner, C. D. Kidd, C. Lee, and N. Lesh. Where to look: a study of
human-robot engagement. In Proceedings of the 9th international
conference on Intelligent user interfaces, IUI '04, pages 78-84. ACM, 2004. (Cited on
page 91.)
A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and trecvid.
In MIR '06: Proceedings of the 8th ACM International Workshop on
Multimedia Information Retrieval, pages 321-330, 2006. (Cited on page 13.)
C. Song, Z. Qu, N. Blumm, and A.-L. Barabási. Limits of predictability in
human mobility. Science, 327(5968):1018-1021, 2010. (Cited on page 5.)
M. Spenko, Y. Haoyong, and S. Dubowsky. Robotic personal aids for
mobility and monitoring for the elderly. IEEE Transactions on neural systems
and rehabilitation engineering, 14(3):344-351, 2006. (Cited on page 105.)
A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework
for combining multiple partitions. Machine Learning Research, 3:583-617,
2002. (Cited on page 73.)
X. Sun, K. Truong, A. Nijholt, and M. Pantic. Automatic visual mimicry
expression analysis in interpersonal interaction. In Proceedings of IEEE Int'l
Conf. Computer Vision and Pattern Recognition (CVPR-W'11), Workshop
on CVPR for Human Behaviour Analysis, pages 40-46, Colorado Springs,
USA, June 2011. (Cited on pages 79 and 85.)
W. Swartout, J. Gratch, R. W. Hill, E. Hovy, S. Marsella, J. Rickel, and
D. Traum. Toward virtual humans. AI Magazine, 27:96-108, July 2006.
C. Tao, J. Mu, X. Xu, and G. Du. Chaotic characteristics of speech signal
and its lpc residual. Acoustical Science and Technology, 25(1):50-53, 2004.
(Cited on page 20.)
A. Tapus, M.J. Mataric, and B. Scasselati. Socially assistive robotics [grand
challenges of robotics]. Robotics Automation Magazine, IEEE, 14(1):35-42,
March 2007. (Cited on pages 103 and 104.)
P. Thévenaz and H. Hugli. Usefulness of the lpc-residue in text-independent
speaker verification. Speech Communication, 17(1-2):145-157, August 1995. (Cited on
page 20.)
J. Thyssen, H. Nielsen, and S.D. Hansen. Non-linear short-term prediction in
speech coding. In Acoustics, Speech, and Signal Processing, 1994.
ICASSP-94., 1994 IEEE International Conference on, volume 1, pages 185-188,
April 1994. (Cited on page 20.)
S. Tilsen and K. Johnson. Low-frequency Fourier analysis of speech rhythm.
The Journal of the Acoustical Society of America, 124(2):EL34-EL39, 2008.
(Cited on pages ix, 33, 34, 38 and 40.)
E. Tognoli, J. Lagarde, G. DeGuzman, and J. A. Scott Kelso. The phi complex
as a neuromarker of human social coordination. Proceedings of the National
Academy of Science (PNAS), 104(19):8190-8195, May 2007. (Cited on page 62.)
G. Varni, A. Camurri, P. Coletta, and G. Volpe. Toward a real-time automated
measure of empathy and dominance. In CSE (4), pages 843-848, 2009. (Cited on
page 77.)
H. Vilhjálmsson, N. Cantelmo, J. Cassell, E. N. Chafai, M. Kipp, S. Kopp,
M. Mancini, S. Marsella, A. N. Marshall, C. Pelachaud, Z. Ruttkay, K. R.
Thórisson, H. Welbergen, and R. J. Werf. The behavior markup language:
Recent developments and challenges. In Proc. of the 7th inter. conference
on Intelligent Virtual Agents, IVA '07, pages 99-111, 2007. (Cited on page 97.)
A. Vinciarelli. Capturing order in social interactions. Signal Processing
Magazine, IEEE, 26(5):133-152, September 2009. (Cited on pages 7, 13, 14, 64,
77 and 85.)
A. Vinciarelli, M. Pantic, H. Bourlard, and A. Pentland. Social signal
processing: state-of-the-art and future perspectives of an emerging domain. In
Proceeding of the 16th ACM international conference on Multimedia, pages
1061-1070, 2008. (Cited on page 2.)
A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey
of an emerging domain. Image and Vision Computing, 27(12):1743-1759,
November 2009. (Cited on pages ix, 3, 4 and 82.)
A. Vinciarelli, M. Pantic, D. Heylen, C. Pelachaud, I. Poggi, F. D'Errico, and
M. Schroeder. Bridging the gap between social animal and unsocial
machine: A survey of social signal processing. IEEE Transactions on Affective
Computing, 2011. (Cited on pages 9 and 94.)
B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll. Frame vs. turn-level:
Emotion recognition from speech considering static and dynamic processing.
In Proc. of the 2nd int. conference on Affective Computing and Intelligent
Interaction, pages 139-147, 2007. (Cited on pages 26, 29 and 35.)
L. S. Vygotsky. Thought and Language. MIT Press, 1986. (Cited on pages 100
and 102.)
N. Ward and W. Tsukahara. Prosodic features which cue back-channel
responses in English and Japanese. Journal of Pragmatics, 32(8):1177-1207,
2000. (Cited on page 95.)
P.B. Wieber, F. Billet, L. Boissieux, and R. Pissard-Gibollet. The HuMAnS
toolbox, a homogenous framework for motion capture, analysis and
simulation. In 9th International Symposium on the 3D Analysis of Human
Movement, AHM 2006, June, 2006, 2008. (Cited on page 108.)
S. F. Worgan and R. K. Moore. Towards the detection of social dominance in
dialogue. Speech Communication, In Press, 2011. (Cited on page 77.)
B. Wrede, S. Kopp, K. Rohlfing, M. Lohse, and C. Muhl. Appropriate feedback
in asymmetric interactions. Journal of Pragmatics, 42(9):2369-2384, 2010.
(Cited on page 94.)
Z-l. Wu, C.-W. Cheng, and C.-h. Li. Social and semantics analysis via
non-negative matrix factorization. In Proceeding of the 17th international
conference on World Wide Web, pages 1245-1246, 2008. (Cited on page 71.)
J. Xavier, L. Vannezel, S. Viaux, A. Leroy, M. Plaza, S. Tordjman, C. Mille,
C. Bursztejn, D. Cohen, and J. M. Guile. Reliability and diagnostic efficiency
of the diagnostic inventory for disharmony (did) in youths with pervasive
developmental disorder and multiple complex developmental disorder.
Research in Autism Spectrum Disorders, 5:1493-1499, 2011. (Cited on page 50.)
J.J. Yanguas, C. Buiza, I. Etxeberria, E. Urdaneta, N. Galdona, and M.F.
González. Effectiveness of a non pharmacological cognitive intervention on
elderly factorial analisys of donostia longitudinal study. Adv. Gerontol.,
3:30-41, 2008. (Cited on page 99.)
B. Yegnanarayana, K. Sharat Reddy, and S.P. Kishore. Source and system
features for speaker recognition using aann models. In Acoustics, Speech,
and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE
International Conference on, volume 1, pages 409-412, 2001. (Cited on page 20.)
N. Zheng, T. Lee, and P. C. Ching. Integration of complementary acoustic
features for speaker recognition. Signal Processing Letters, IEEE,
14(3):181-184, March 2007. (Cited on page 20.)
E. Zwicker and H. Fastl. Psychoacoustics: facts and models. Springer Berlin
/ Heidelberg, 1990. (Cited on page 37.)
