Discrete/Continuous Modelling of Speaking Style in HMM
Transcription
Discrete/Continuous Modelling of Speaking Style in HMM
Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis: Design and Evaluation Nicolas Obin 1,2 , Pierre Lanchantin 1 Anne Lacheret 2 , Xavier Rodet 1 Analysis-Synthesis Team, IRCAM, Paris, France Modyco Lab., University of Paris Ouest - La Défense, Nanterre, France 1 2 [email protected], [email protected], [email protected], [email protected] Abstract This paper presents an average discrete/continuous HMM which is applied to the speaking style modelling of various discourse genres in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of the HMMbased speech synthesis is evaluated in the conditions of realworld applications. The paper is organized as follows: the speaking style corpus MODELLING design is OF described section 2; the averDISCRETE/CONTINUOUS SPEAKING in STYLE IN HMM-BASED SPEECH SYNTHESIS: age discrete/continuous HMM model is presented in section 3; DESIGN AND EVALUATION the evaluation is presented and discussed in sections 4 and 5. 1,2 1 This paper assesses the ability of a HMM-based speech synthesis systems to model the speech characteristics of various speaking styles1 . A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, and compared to a similar experiment on natural speech. The comparison is discussed and reveals that discrete/continuous HMM consistently models the speech characteristics of a speaking style. Index Terms: speaking style, speech synthesis, speech prosody, average modelling. Nicolas Obin , Pierre Lanchantin Anne Lacheret 2 , Xavier Rodet 1 1 2 log(1/speech rate) The following is a description of the four selected DG’s: −1.2 Fig. 1. Prosodic description of the speaking styles depending on the speaker. Mean and variance of f0 and speech rate. P2 −1.4 −1.6 log syllable duration P3 study was supported by ANR Rhapsodie 07 Corp-030-01; reference prosody corpus of spoken French; French National Agency of research; 2008-2012. 1 This study was supported by ANR Rhapsodie 07 Corp-030-01; reference prosody corpus of spoken French; French National Agency of research; 2008-2012. log f0 −2 −2.2 4.5 log logf0 f0 −1.8 Each speaker has his own speaking style which constitutes his vocal S1 5.4 signature, and a part of his identity. Nevertheless, a speaker continu5.5 ously adapt his speaking style according S3 S4 5.3to specific communication 5.4 S1 S5S3 S4 situations, and to his emotional state. In particular, each situational 5.3 S5 S2 context determines a specific mode of 5.2 production associated with itS6 S2 S6 5.2 M2 - a genre - which is defined by a set of conventions of form and conM2 5.1 M7 M6 tent that are shared among all of its productions [1]. In particular, 5.1 5 M5 M7 J5 J4 M6 a specific discourse genre (DG) relates to a specific speaking style. J3 M3 4.9 M1 Consequently, a speaker adapts his speaking style to each specific J1 5 P5 M4M5 P4 4.8 J4 4.7 situation depending on the formal conventions that are associated J2 J5 J3 with the situation, his a-priori knowledge about these conventions, P3 M3 4.6 4.9 and his competence to adapt his speaking style. Thus, each comM1 P1 P2 J1 4.5 munication act instantiates a style which is composed of a style that P5 4.8 −2.2 −2 −1.8 −1.6 M4 P4−1.4 −1.2 depends on the speaker identity, and a conventional speaking style log syllable duration that is conditioned by a specific situation. J2 In speech synthesis, methods have been4.7 proposed to model and adapt Fig. 1. Prosodic description of the speaking styles dethe symbolic [3, 4] and acoustic speech characteristics of a speaking pending on the speaker. P3 Mean and variance of f0 and 4.6 synthesis [2]. However, style, with application to emotional speech speech rate. no study exists on the joint modelling of the symbolic and acoustic P2 4.5 characteristics of speaking style, and speaking style acoustic modP1 log f0 elling generally limits to the modelling of emotion, with rare extensions to other sources of speaking styles variations [5]. −2.2 −2 −1.8 −1.6 −1.4 −1.2 This paper presents an average discrete/continuous HMM which is log syllable duration applied to the speaking style modelling of various discourse genres log(1/speech rate) 4.6 4.8 5 4.9 5.1 P1 M5 M3 M1 P4 P5 M4 J2 J1 M2 M7 M6 J5 J4 J3 S5 S2 S1 S4 S6 S3 5.2 5.3 5.4 5.5 1 This study was partially funded by “La Fondation Des Treilles”, and supported by ANR Rhapsodie 07 Corp-030-01; reference prosody corpus of spoken French; French National Agency of research; 20082012. sists of four different DG’s: catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intravariability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only. The following is a description of the four Figure 1: Prosodicselected description of the speaking DG’s: styles depending on the speaker. Mean and variance of f0 and speech rate (syllable per second). 1 This 1. INTRODUCTION 5.5 1. INTRODUCTION Each speaker has his own speaking style which constitutes his vocal signature, and a part of his identity. Nevertheless, a speaker continuously adapt his speaking style according to specific communication situations, and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a-priori knowledge about these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity, and a conventional speaking style that is conditioned by a specific situation. In speech synthesis, methods have been proposed to model and adapt the symbolic [3, 4] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [2]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and speaking style acoustic modelling generally limits to the modelling of emotion, with rare extensions to other sources of speaking styles variations [5]. This paper presents an average discrete/continuous HMM which is applied to the speaking style modelling of various discourse genres 2.1. Corpus Design For the purpose of speaking style speech synthesis, a 4-hour ABSTRACT in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidenmulti-speakers speech database was designed. tally, the robustness of the HMM-based speech synthesis isThe evaluated speech the conditions of real-world applications. The paper is organized database consists of four inasdifferent DG’s: catholic cerefollows: the speaking style corpus design is described inmass section 2; the average discrete/continuous HMM model is presented in secmony, political, journalistic, and sport commentary. In order tion 3; the evaluation is presented and discussed in sections 4 and 5. to reduce the DG intra-variability, the&different 2. SPEECH TEXT MATERIALDGs were re2.1. Corpus Design stricted to specific situational contexts (see list below) and to For the purpose of speaking style speech synthesis, a 4-hour multimale speakers only. speakers speech database was designed. The speech database con- This paper assesses the ability of a HMM-based speech synthesis systems to model the speech characteristics of various speaking styles1 . A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, and compared to a similar experiment on natural speech. The comparison is discussed and reveals that discrete/continuous HMM consistently models the speech characteristics of a speaking style. Index Terms: speaking style, speech synthesis, speech prosody, average modelling. 4.7 For the purpose of speaking style speech synthesis, a 4-hour multispeakers speech database was designed. The speech database consists of four different DG’s: catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intravariability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only. in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of the HMM-based speech synthesis is evaluated in the conditions of real-world applications. The paper is organized as follows: the speaking style corpus design is described in section 2; the average discrete/continuous HMM model is presented in section 3; the evaluation is presented and discussed in sections 4 and 5. 2. SPEECH & TEXT MATERIAL ABSTRACT This paper assesses the ability of a HMM-based speech synthesis systems to model the speech characteristics of various speaking styles1 . A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment of 4 speaking styles based on delexicalized speech, and compared to a similar experiment on natural speech. The comparison is discussed and reveals that discrete/continuous HMM consistently models the speech characteristics of a speaking style. Index Terms: speaking style, speech synthesis, speech prosody, average modelling. [email protected], [email protected], [email protected], [email protected] 1 Analysis-Synthesis Team, IRCAM, Paris, France Modyco Lab., University of Paris Ouest - La Défense, Nanterre, France 2 Nicolas Obin 1,2 , Pierre Lanchantin 1 Anne Lacheret 2 , Xavier Rodet 1 log f0 DISCRETE/CONTINUOUS MODELLING OF SPEAKING STYLE IN HMM-BASED SPEECH SYNTHESIS: DESIGN AND EVALUATION Modyco Lab., University of Paris Ouest - La Défense, Nanterre, France 2.1. Corpus Design [email protected], [email protected] [email protected], [email protected], 1. Introduction Each speaker has his own speaking style which constitutes his vocal signature, and a part of his identity. Nevertheless, a speaker continuously adapt his speaking style according to specific communication situations, and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a-priori knowledge about these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity, and a conventional speaking style that is conditioned by a specific situation. In speech synthesis, methods have been proposed to model and adapt the symbolic [2, 3] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [4]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and speaking style acoustic modelling generally limits to the modelling of emotion, with rare extensions to other sources of speaking styles variations [5]. 2. Speech & Text Material Analysis-Synthesis Team, IRCAM, Paris, France The following is a description of the four selected DG’s: mass: Christian church sermon (pilgrimage and Sunday highmass sermons); single speaker monologue, no interaction. political: New Year’s French president speech; single speaker monologue; no interaction. journal: radio review (press review; political, economical, technological chronicles); almost single speaker monologue with a few interactions with a lead journalist. sport commentary: soccer; two speakers engaged in monologues with speech overlapping during intense soccer sequences and speech turn changes; almost no interaction. The speech database consists of natural speech multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech databased are illustrated in figure 1. 3. Speaking Style Model A speaking style model λ(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style. “ ” (style) (style) λ(style) = λsymbolic , λacoustic (1) During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [6] and used to refine the context-dependent HMM modelling (see [7] and [8] for a detailed description of the enriched linguistic contexts). 3.1. Training of the Discrete/Continuous Models 3.1.1. Discrete HMM For each speaking style, an average context-dependent discrete (style) HMM λsymbolic is estimated from the pooled speakers associated with the speaking style. The prosodic grammar consists of a hierarchical prosodic representation that was experimented as an alternative to ToBI [9] for French prosody labelling [10]. The prosodic grammar is composed of major prosodic boundaries (FM , a boundary which is right bounded by a pause), minor prosodic boundaries (Fm , an intermediate boundary), and prosodic prominences (P). Let R be the number of speakers from which an average model (style) λsymbolic is to be estimated. Let l = (l(1) , . . . , l(R) ) the total set of prosodic symbolic observations, and l(r) = [l(r) (1), . . . , l(r) (Nr )] the prosodic symbolic sequence associated with speaker r, where l(r) (n) is the prosodic label associated with the n-th syllable. Let q = (q(1) , . . . , q(R) ) the total set of linguistic contexts observations, and q(r) = [q(r) (1), . . . , q(r) (Nr )] the linguistic context sequence associated with speaker r, where (r) (r) q(r) (n) = [q1 (n), . . . , qL (n)]> is the (Lx1) linguistic context vector which describes the linguistic characteristics associated with the n-th syllable. (style) An average context-dependent discrete HMM λsymbolic is estimated from the pooled speakers observations. Firstly, an av(style) erage context-dependent tree Tsymbolic is derived so as to minimize the information entropy of the prosodic symbolic labels l conditionally to the linguistic contexts q . Then, a context(style) dependent HMM model λsymbolic is estimated for each termi(style) nal node of the context-dependent tree Tsymbolic . 3.1.2. Continuous HMM (style) For each speaking style, an average acoustic model λacoustic that includes source/filter variations, f0 variations, and statedurations, is estimated from the pooled speakers associated with the speaking style based on the conventional HTS system [11]. Let R be the number of speakers from which an average model is to be estimated. Let o = (o(1) , . . . , o(R) ) the total set of observations, and o(r) = [o(r) (1), . . . , o(r) (Tr )] the observation sequences associated with speaker r, where (r) (r) o(r) (t) = [ot (1), . . . , ot (D)]> is the (Dx1) observation vector which describes the acoustical property at time t. Let q = (q(1) , . . . , q(R) ) the total set of linguistic contexts observations, and q(r) = [q(r) (1), . . . , q(r) (Tr )] the linguistic context sequence associated with speaker r, where (r) (r) q(r) (t) = [q1 (t), . . . , qL (t)]> is the (Lx1) linguistic context vector which describes the linguistic properties at time t. (style) An average context-dependent HMM acoustic model λsymbolic is estimated from the pooled speakers observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree (style) Tacoustic is derived so as to minimize the description length of (style) the context-dependent HMM model λacoustic . The acoustic module models simultaneously source/filter variations, f0 variations, and the temporal structure associated with a speaking style. Speakers f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors (style) are used to estimate context-dependent HMM models λacoustic . Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context-clustering (MLMDL [11]). Multi-Space probability Distributions (MSD) [12] are used to model continuous/discrete parameter f0 sequence to manage voiced/unvoiced regions properly. Each contextdependent HMM is modelled with a state duration probability density functions (PDFs) to account for the temporal structure of speech [13]. Finally, speech dynamic is modelled according to the trajectory model and the global variance (GV) that model local and global speech variations over time [14]. 3.2. Generation of the Speech Parameters During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models (style) λsymbolic associated with the linguistic context sequence q = [q1 , . . . , qN ], where qn = [q1 , . . . , qL ]> denotes the (Lx1) linguistic context vector associated with the n-th phoneme. Firstly, the prosodic symbolic sequence bl is determined so as to maximize the likelihood of the prosodic symbolic sequence l conditionally to the linguistic context sequence q and the model (style) λsymbolic . bl = argmax p(l|q, λ(style) ) symbolic l (2) Then, the linguistic context sequence q augmented with the prosodic symbolic sequence bl is converted into a concatenated (style) 4 1 Exper men a Se up sequence of context-dependent models λacoustic . m m m • journal: radio review (press review; political, economical, tech- 40 mation conditionof the10 prosodic symbolic speech u e mances pe DG we e labels se ec qedproso n MM he speak m entropy b is inferred so as to maximize the likeThe acoustic sequence o ngally s y to e co and contexts emovedq .om hea context-dependent a n ng se Lex ca ac single speaker monologue the pus linguistic Then, HMM lihood of the nological acoustic chronicles); sequence oalmost conditionally to the model with a m fi e mha nsu ed ha cess was emoved us ng am band pass he (style) (style) few interactions with a lead journalist. λacoustic . for eacha terminal nodeand of thehecontextmodel equency λsymbolic ois estimated equency h ghes mmowes mhe undamen (style) (style) (style) s ha mon c was nc uded • sport commentary: speakers engaged in mono- equency b = argmax o max p(o|q, λsoccer; )p(q|λ ) (3) dependento trees fiTsymbolic . acoustictwo acoustic q o m m m logues with speech overlapping during intense soccer sequences 4 2 Sub ec ve Evamua on MM m m MM b First, the state sequence q is determined so as to maximize • journal: radio reviewturn (press review;almost political, economical, tech- mation entropy of the prosodic m m symbolic labels qproso conditionand speech changes; no interactions. the likelihood of the state sequence conditionally to the model m m m The eva ua on cons s s o a mu p e cho ce den fica on ask Continuous nological chronicles); almost single speaker monologue with a ally to the 3.1.2. linguistic contexts HMM q . Then, a context-dependentm HMM (style) om speech p osody pe cep on The eva ua on was conduc ed b λacoustic . Then, the observation sequence c is determined so The •speech database consists of natural speech multi-media audio m symbolic labels q HAreview; political, O ODY Xeconomical, A ON journal: (press tech- mm mation (style) conditionentropy of the prosodic interactions with aradio leadreview journalist. (style) is estimated for each terminal node of the contextmodel λ acco d ng o c owd sou c ng echn que us ng soc a ne wothat ks as tofew maximize the likelihood of the observation sequence connological chronicles); almost single speaker monologue with a ally to the linguistic contexts q . Then, a context-dependent HMM symbolic For each speaking style, an average acoustic model λ contentsfew with strongly variable audio quality (background noise: mbveqnode mFq ench acoustic (style) interactions with a(press lead journalist. ou na d o v w p v w po onom h ond on m on n opy o h p o od ymbo 4 5 P o od Pa am E ma on 50 sub ec s nc ud ng 25 na speake s 15 non na ve is estimated for each terminal of the contextmodel λ • journal: radio review review; political, economical, techconditionmation entropy of the prosodic symbolic labels (style) HA O ODY X A ON b, speakers ditionally the nostate sequence q the ng model λ mono under • sport to commentary: soccer; two engaged in ogu monodependent tree og commentary: h on mo p reverberation). k acoustic wmonoha speech ytoothehmlinguistic nguT on . x. qq. Then, Thvariations, na context-dependent on x fd0 pvariations, nd nHMM HMM includes source/filter state-durations, chronicles); almost single speaker monologue ally contexts crowd, audience, recording noise, The m and • nological sport soccer; twoand speakers engaged inwith symbolic dependent tree Tspeake F ench s 10 non F ench speake s 34 expe and 16 Fund m n F u n dynamic constraint o = Wc. MMnodofothehcontextw interactions w hoverlapping ou nduring intense soccer sequences model few a lead m dforoeach hterminal m n node on x mod λλ is estimated logues with speech 4n5 overlapping P oon odwith Pa amd journalist. E databased masoccer on sequences logues with speech during intense is estimated with the na ve s ene sfrom pathecpooled pa ed speakers n h s associated expe men Pa speaking c pan s prosody characteristics ofsoccer; the speech are illustrated independent figCHAPTER 4. turn SPEECH MODELLING and nomspeakers interactions. pom speech omm n achanges; yPROSODY o almost wo p & k SYNTHESIS: ng gA dGHinn monomono • sport commentary: two engaged d p nd n tree TT m . m 3.1.2. Continuous HMM m K m m b (4) c = r R m hnoverlapping F pp u nng 52 speech turn STATE-OF-THE-ART CHAPTER 4.Fund PROSODY MODELLING &&&SYNTHESIS: bno bSPEECH q q we e g ven a b e desc p on o he d e en speak ng s y es ogu w h p ov du ng n n o qu n CHAPTER 4. SPEECH PROSODY MODELLING SYNTHESIS: and changes; almost interactions. R m m m CHAPTER 4. SPEECH PROSODY MODELLING SYNTHESIS: logues with speech during intense soccer sequences The speech database consists of natural speech multi-media audio style conventional HTS system ([14]). urem 1. and m K Gbased MOonantheaverage 52 STATE-OF-THE-ART 5252 ndCHAPTER p with h turn n h ngPROSODY mo MODELLING nointeractions. nSTATE-OF-THE-ART STATE-OF-THE-ART speech changes; almost no SPEECH & on SYNTHESIS: m4.ustrongly m A GH 3.1.2. For each speaking style, acoustic model λa speakthat Continuous HMM contents variable audio quality (background noise: Then hey we e asked o assoc a e ng s y e o each o 3 1 2 Con nuou HMM m = A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier 3.1.2. Continuous HMM where:AA52 m4.mKinput m m CHAPTER PROSODY MODELLING SYNTHESIS: m a asequence mhAmulti-media m mThe includes M variations, f variations, and state-durations, crowd, audience, recording noise, reverberation). speech phonetizer isisused to convert text into A syllabifier Th p h the dthe SPEECH b into on omsyllables. naofand uphonemes. aAtSTATE-OF-THE-ART p &prosodic mu d audio ud o phonetizer used to convert input text into sequence of phonemes. syllabifier MM he source/filter The speech database consists of natural speech The52isisisspeech database consists of natural speech multi-media used to merge the sequence phonemes am sequence of the level, audio speech u e ances Fo h s pu pose pa c pan s we e g ven m m A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier STATE-OF-THE-ART M used to the sequence into ayreview; of At level, is MMestimated the speakers associated with used tomerge mergeiscontents the sequence phonemes intotext asequence sequence ofsyllables. syllables. Atthe theprosodic level, prosody characteristics of speech are illustrated in fig- entropy Fo hspeaking p from k ngstyle, y pooled v g acoustic ou h on n convert wphonemes hthe ong vintothe ud ofodatabased qu yprosodic kg ound noise: no For eachof an naverage λ λthe speaking that =mod with strongly variable audio quality (background mspeakers journal: radio review (press political, economical, techconditionmation the prosodic symbolic labels q m model (style) A phonetizer used to input absequence phonemes. Absyllabifier mMMthe −1 > pauses are to m level, is •used toidentified mergeure the sequence into adsequence of syllables. Atv theb prosodic Meach Let Rthefibeons number ofvsystem from which an average pauses are identified to1. pauses are toto hmsource/filter ee op W Σ W. (5) For R style on conventional HTS nlinguistic udbased ou v. Then, on f variations, on([14]). du λon owd ud phonemes n= o maudio ngaq no on speaking style, anafaverage acoustic model thatmodel is to b q contents with strongly variable quality includes andnd state-durations, crowd, audience, recording noise, and reverberation). The speech used toidentified merge the sequence phonemes into ainto sequence of msyllables. At theMODEL prosodic level, A isphonetizer is used convert the input text sequence ofnd phonemes. A syllabifier nological chronicles); almost single speaker monologue with anoise: contexts qvariations, context-dependent HMM b m(background mTh m pally hto the 3. SPEAKING STYLE acoustic M pauses are identified to M m MM m m • journal: review; political, techmation ofm the prosodic symbolic q q(1)oofconditionD dfrom om hfor poo dspeakers plabels kb associated dond w (R) hthe hspeaking p k =ng is pauses used to merge intojournalist. a of sequence of Atbonom the prosodic level, ptheradio osequence ov the h speech pmsyllables. hdatabased d economical, dare illustrated u h din nmodel figonentropy are identified too review is opy estimated the pooled with prosody characteristics figfew interactions with • ou na dody vhphonemes w a(press plead w po on m n o h p o od ymbo is estimated each terminal node the contextλ −1reverberation). The speech > estimated. Let ose =from (o , .nd .one ,HMM oHMM ) model thestate-durations, total observations, MR pausesaudience, are identified to includes variations, variations, and crowd, noise, and nological almost single speaker monologue linguistic contexts qq . of Then, context-dependent Let the number is to yngu bobesource/filter onconfidence hxconventional onv nspeakers on HTS m 14 u chronicles); 1recording adbe ec y.an speak ng sset y eof when ce a n (6) rqb = W Σ style based on the system ure 4.6• no Abstract Model: Text To Prosodic Structure og commentary: h1. on mo ng pµSTYLE k .Structure mono oguinwith wmonoh a mally ytoothe h on Th na HTS on xfy0which don p([14]). naverage 3. SPEAKING MODEL b q 4.6 Abstract Structure bspeakers q Model: Text To Prosodic sport soccer; two engaged dependent tree T . few interactions with a lead journalist. (r) (r) (r) be estimated. Let o = (o , . . . , o ) the total set of observations, each hterminal ofothehcontextmodel (style) 4.6 w Abstract Model: Text To Prosodic Structure Structure n withon w hoverlapping d ouTo nduring 4.6 Abstract Model: Text Prosodic m dfor oce m n node nod on xis with mod λλis estimated oisbestimated he cho logues speech intense soccer sequences = [o (1), . . . , o (T )] the observation sequences and o 4.6 Abstract Structure from the pooled speakers associated the speaking Model: Text To Prosodic Structure prosody characteristics of the speech databased are illustrated in figr A speaking style model λ is composed of discrete/continuous The prosodic structure model is to infer a sequence of prosodic labels that are associated ⇓ L R h numb o p k om wh h n v g mod o R m m m = [o (1), . . . , o (T )] is the observation sequences and o The prosodic structure model is to infer a sequence of prosodic labels that are associated Let R be the number of speakers from which an average model is to A are speaking λthe is composed of discrete/continuous m = •prosodic sport commentary: soccer; two engaged inassociated and µ the tree TT K G⇓ . MO and Σ4.6 Abstract Model: Text To Prosodic Structure 3 model SPEAK NG MODEL The structure model isytostyle infer aalmost sequence prosodic labels 3. SPEAKING STYLE MODEL b• q and speech turn noofspeakers interactions. b q po omm n respectively achanges; o wo p covariance kSTYLE ngthat g are d matrix n monomono anddependent with relevant prosodic events. d p nd nbe (r)es when wo (r) speak with relevant prosodic events. bassociated m HMM L on =(o or,ec o d) (t) o set oobservations, (r) � estimated. Let oo=conventional ,where . . . ,wo oo thehe= total of context-dependent HMMs that modelStructure the symbolic/acoustic speech con us sewith en speak yon 3.1.2. speech Continuous with speaker [o (1), . .ob . ,ng o vs(D)] 4.6 Model: Text To Prosodic Structure logues with speech overlapping during intense soccer sequences Abstract Text To Prosodic The prosodic structure model toText infer sequence of.prosodic prosodic labels thatsymbolic/acoustic arequ associated ⇓⇓⇓don with relevant prosodic Abstract Model: To Prosodic Structure style based the HTS system ([14]). ure 4.6 1.vector ogu w h events. pModel: his consists pp ng du nmodel nlabels othe n context-dependent HMMs that m = The prosodic structure model isisov infer aasequence sequence labels that are associated bofng mean for the state q associated speaker r, where o (t) =qu[o=propThe prosodic structure model toytosequence infer a speaking of prosodic that are associated The speech database of natural speech multi-media audio = o 1 o T h ob v on nt (1), . . . , ot (D)] nd o characteristics of a style. = [o (1), . . . , o (T )] is the observation sequences and o is the (Dx1) observation vector which describes the acoustical A p k ng mod λ ompo d o d on nuou with relevant prosodic events. and speech turn changes; almost no interactions. speaking style model λ is composed of discrete/continuous ⇓ ng s y es a e poss b e 4.6 Abstract Model: Text To Prosodic Structure nd p h u n h ng mo no n on ⇓ M with relevant prosodic events. with relevant prosodic events. speaking an average acoustic model that ⇓=w⇓the contents Longtemps with strongly quality (background 3.1.2. Continuous , variable me audio suis h couché deh symbolic/acoustic bonne heurenoise: . sentence erty time Let q =robservation (q ,o.o. . (t) ,q the total ofo linguistic on x d pconsists HMM mod dethe ymbo pFor 3 1 2heach Con HMM oM at HMM dwith p(Dx1) k r, wh oλ (1), 1 . .set D the acoustical propcontext-dependent HMMs model speech associated where = . , odescribes (D)] ==)[ostate-durations, characteristics ofojejean speaking style. Longtemps , nd me suis that couché bonne heure ou . sentence isstyle, vector which mnuou mt.hspeaker The speech database of speech multi-media ⇓observation includes source/filter variations, fon variations, crowd, noise, reverberation). speech Th p audience, hcharacteristics b recording op natural na u astyle. mThe ud. o “ ”d audio MM v on hd strongly k and ng y p h(background oban vand owhich hdescribes d” [q b the hsacoustical p)]opcomp of speaking oDx1 aobservations, ndec svector se ndec on” e e y unsu e =and (1), .that . .ou , q when (T is qwhec Longtemps , on je jeame suis couché demu bonne heure sentence is contexts theh (Dx1) propFor each speaking style, average acoustic model λ contents with variable audio quality noise: (1) (R) Longtemps , me suis couché de bonne heure . sentence is estimated from the pooled speakers associated with the speaking M prosody characteristics of the speech databased are illustrated in figλ = λ , λ (1) Fo h p k ng y n v g ou mod λ h on n w h ong y v b ud o qu y b kg ound no Longtemps , je me suis couché de bonne heure . sentence ⇓suis merty mqwe Let RyMMM be the number speakers from anon model m L q= =of q q o. . set olinguistic =)r,ngu the context sequence with speaker where erty atlinguistic time t.ec Let (q , . . .q ,associated qand )(q thehwhich total ofaverage =h,du myDsetisasto at time t. Let = . , q the total ofa linguistic couché deonbonne heure . sentence includes source/filter variations, fHTS variations, state-durations, crowd, recording and reverberation). The speech Sub s e asked o use s poss b y on ve y as 3. STYLE MODEL style based on the conventional system ([14]). ure 1. audience, n ud ou fi v on f v on nd owd udLongtemps n SPEAKING o d, ngjenoise, nome⇓ nd v b Th p h M “ ” “ ” qthe 1(r) T)](r)is on (t) xtheobservations, on. . . and q is (1) (R) =ob [q v(t), , qnd the (L’x1) augmented qfrom == [q (1), . . . , q q (Tlinguistic contexts q(t)] ⇓ is estimated pooled speakers associated with speaking the databased are illustrated in”nfig(r) “ m d om h poo d p k o d w h h p k ng m pprosody o ody characteristics h During the of oλ h speech p⇓ h d b d u d fig training, the discrete/continuous context-dependent λ = λ λ 1 be estimated. Let o = (o , . . . , o ) the total set of observations, = λ , λ (1) eso ⇓ ⇓ h ngu on x qu n o d w h p k r wh = [q (1), . . . , q (Tr )] is contexts observations, and q context vector which describes the linguistic properties at time t. the linguistic context sequence associated with speaker r, where style based on the conventional HTS system ([14]). ure 1. (style) (style)⇓⇓ m average fromy which model is to y R b beMM dthe onnumber h m onvof nspeakers on HTS m an 14 prosodic u 1 HMMs separately. During(style) the synthesis, the Let sym3. are SPEAKING STYLE MODEL (r) (r) m Lwe =[q=⇓ (o(t), x1observations, n linguistic d ngu om he pa c pan s ⇓ ⇓=To λdiscrete/continuous , λdiscrete/continuous λestimated =o theh (L’x1) augmented qq(1)(t)(r) Add ona ma ons e=ugm gobservation eaned (style) prosodic structure be estimated. ,.(1), .. .. ., ,qn.o.o.(t)] )otheistotal set of bolic/acoustic parameters are generated in cascade, symacoustic Du ng Model: h λtraining, nText ng isthe hcomposed dsymbolic on nuou on from x d the p nd n andAnoLet 4.6 Abstract Prosodic Structure = [o , (T )] is the sequences During the context-dependent the linguistic context sequence associated with speaker r, where r A speaking style model of m hhan mnguv acoustic context-dependent HMM λmt. ⇓STYLE on average x vector vmof ospeakers wh hdescribes dM b the p op is context linguistic at otime structure F *representation RRbe from which model to 3. SPEAKING MODEL bolic acoustic Additionally, aym rich Land h=number numb k (T om naverage gproperties mod prosodic [o speech (1), .p, oexpe )] se iswhtheexpe observation sequences ob the prosodic na vemodel HMM dtoSTYLE pcomposed y *During Du ng the h synthesis, yn** h h Let ⇓⇓o(o. .which A speaking style model λm NG isthe ofvariations. discrete/continuous HMMs are estimated separately. the sym= anguage MM a(r) m na �ve F ench SPEAK MODEL (r) F4.6 Abstract isLet estimated from the pooled speakers observations. Firstly, Fprosodic *3*Model: * (r) (r) (r) be estimated. o = , . . . , o ) the total set of observations, (r) � Text To Prosodic Structure structure structure linguistic description of the text characteristics is automatically exb m d L o = o o h o o ob v on bo *HMMs ouHMMs pthat mmodel g symbolic/acoustic nsymbolic/acoustic d inn cascade, d om h symym associated context-dependent that model the parameters generated from the context-dependent the speech 4.6 Abstract Text To are Prosodic associated with=speake speaker r, where (t) [o. .is .ou .speake . ,= o [o (D)] ⇓context-dependent speaker where o, (1), (t) (1), .F. ,the oaugmented P *bolic/acoustic *Model: (t) [qopna (t), .HMM qacoustic (t)] isfor (L’x1) linguistic qwith Fstructure * * Structure ****speech non ve F ench non ench speake age context-dependent model estimated each MMd)] An v(1), g on x=HMM nd n=HMM mod λ . of tnthe t (D)] An model λ F * model [ooaverage the observation sequences oho = 1isr, Fprosodic *a A speaking style composed mcontext-dependent tracted processing chain [9] and used to refine the bo pusing nλ on h ompo ou d* of vdiscrete/continuous on Add on y and 4.6 Model: Text To Prosodic =linguistic 1 ...,o o (TTwhich han ob the vL acoustical on qum propndthe characteristics speaking style. tooisthe acoustic Additionally, a is rich (Dx1) observation vector describes p During kAbstract ng y of mod λa linguistic ovariations. dStructure on nuou P * bolic the training, the discrete/continuous context-dependent Then, FA ***representation ***speech F ** m dcontexts. omthe ho pooled poo dspeakers p average k headphones ob v(D)] on oFirstly, Fno yatree is estimated from observations. m Fstructure * HMMs *u om context-dependent that model the symbolic/acoustic and s en ng cond on Exper pa c pan s associated with speaker r, where (t) = [o (1), . . . , o context-dependent HMM modelling (see [10] and [11] for a detailed characteristics of a speaking style. syllable Longtemps ## je me suis couché de bonne heure ## ngu d p on o h x h y x is the (Dx1) observation vector which describes the acoustical proplinguistic description of the text characteristics is automatically exerty at time t. Let q = (q , . . . , q ) the total set of linguistic context vector which describes the linguistic properties at time t. on x d * p* nd **n* HMM h mod h ymbo ou ** *p h o dcontext-dependent w r wh = ois estimated 1themdescription odforoDeach P MMlength mhofo ofthemthe onh xp d kis p derived nd n HMM HMM hm Fcharacteristics ** * ”contexts). P soo as describes tomod minimize T model mdescription =acoustical F of style. iscontexts the (Dx1) observation propthe enriched linguistic “ngu da*using uspeaking ngofa linguistic p mocoung chain hDuring nbonne 9m and nd u* synthesis, dtoorefine fin tracted [9] used the syllable Longtemps ## je me suisprocessing ché de heure ## omare pestimated the symhhthe Dx1 ob v contexts. vand o which h= d an(1) b vthe hom pd)]op (R) [q (1), . .context-dependent .ou ,va q (T isnd n tree observations, qwh eononvector ac ua yThcom ng ous doma ns speech and aud o MM we m k ng y separately. m ngu x n n g on x p P Fh HMMs * * * * linguistic Then, average * * * . context-dependent HMM model λ Table 4.2:temps text-to-prosodic-structure conversion. time Let qq==t.sequence (qq , .q . . ,associated λIllustration = , couλ ché (1) erty Let =qqngu (qm) the .cs.o. ,set q dofolinguistic )cr,the totalngPaset ofhpan linguistic on x d ## p ## ndofjethe njeλHMM HMM mod 10 and nd 11 for o derty ⇓suis syllable me dede bonne heure ## context-dependent modelling (see [10] [11] a detailed m t. Mat Ltime h,total ngu theyatdlinguistic context speaker where m s we e encou syllable LongLongtemps me suis cou- ng ché heure ## Wbonne d og v dsoes oas = nsmwith pans on hofothe “ c(r) isMMm derived to ominimize the description length TTm M echno P *” * m . .h.mus , q q (T contexts observations, bolic/acoustic are cascade, from d*3.1. p* on oparameters hthe n Discrete/Continuous h d linguistic ngugenerated on x in “ m of m” the contexts). Tabledescription 4.2: Illustration text-to-prosodic-structure conversion. (style) m(L’x1) Training ofenriched the Models (r) (r) = q (1), 1augmented T)] is x symob[q v(t),on. . . and q is “ofofmthe ” (t) = , qnd q (t)] the[q linguistic qonMMthe = λ , λλ (1) syllable Longtemps jethe me suisprosodic cou- structure ché deconversion. bonne heure ## Two main Table approaches can be## distinguished in modelling: on the one The acoustic module models simultaneously source/filter variations, on x d p nd n HMM mod λ m m m ⇓ m = [q (1), . . . , q (T contexts observations, and q An average context-dependent HMM acoustic model 4.2:λ Illustration text-to-prosodic-structure During the training, discrete/continuous context-dependent . context-dependent HMM model λ λ = λ 1 the linguistic context sequence associated with speaker r, where aged oquuse headphones ⇓formal models r )] is λsymbolic (style) (style) h nguvector on x describes n the o m d wproperties h p k at time r wht. Table(style) 4.2: Illustration of the text-to-prosodic-structure context which linguistic Rconversion. hand, expert approaches attempt at elaborating thatsynthesis, account for thethe observed m m 3.1.1. Discrete HMM bolic representation to the acoustic variations. Additionally, a rich f variations, and the temporal symbolic associated with a speaksyllable Longtemps ## je me suis couché de bonne heure ## HMMs are estimated separately. During the sym⇓ λlinguistic, , λacoustic λ3 1training, (1) Two main approaches can be distinguished in structure modelling: on the one is theh (L’x1) augmented qq (t) ==[q (t), . . . , q (t)] T arespect ng o D prosodic Con nuou Mod SModels bn= du nh Discrete/Continuous M Training of the Models prosodic with to para-linguistic, and extra-linguistic conL x1the ugm n respect dlinguistic ngu 4.6.1 variations Expert o3.1. symbolic m r, where Firstly, a During the the discrete/continuous context-dependent the linguistic context sequence with speaker isSpeakers estimated from pooled observations. Two main approaches can be distinguished informal prosodic structure modelling: onen ing style. fmod were normalized with toMM the Th ou describes modu muproperties nassociated ou ysource/filter ou speakers fiMMm v speaking on bolic/acoustic parameters generated in cascade, from sym4.2: Illustration ofhelaborating theare text-to-prosodic-structure conversion. The acoustic module models simultaneously variations, hand, expert approaches attempt atdistinguished models that account the observed oh For Du ng ncanng dmethods on nuou on xstatistical donthe pthethe nd context the linguistic at time straints. OnTable the other hand, statistical attempt at elaborating a for model each speaking style, an average context-dependent discrete Two main approaches be inDuring prosodic structure modelling: on one An context-dependent HMM acoustic model λmt. on average x vector v exo which whprior h d⇓to modelling. ngu pfilter, op and o 3.1.1. linguistic description of the text characteristics is automatically HMMs are separately. the synthesis, the symhand, expert approaches attempt atto elaborating formal models that account for the observed MMbtheh h style Source, normalized fha speakobserva3 estimated 1the 1t m D HMM prosodic variations with respect linguistic, para-linguistic, and extra-linguistic conf v on nd mpo ymbo o d w p k Discrete HMM bolic representation to the acoustic variations. Additionally, a rich f variations, and temporal symbolic associated with HMM d p y Du ng h yn h h ym (r) (r) ⇓ which accounts for prosodic variations from the observation of statistical regularities on m m (r) � Table 4.2: Illustration of the text-to-prosodic-structure conversion. hand, expert attempt at elaborating formal models that for the observed twith Results & Discussion is estimated from the pooled observations. Firstly, a context-linguistic HMM λstatistical estimated theaccount pooled speakers prosodic variations respect linguistic, para-linguistic, and extra-linguistic con- associated 4.6.1 Models context-dependent HMM isothe oapproaches bis nd in Models bolic/acoustic parameters are cascade, the symstraints. On the other hand, methods attempt atnfrom elaborating afrom statistical model tion vectors their dynamic vectors are used estimate l lExpert (t) [q⇓d1and .speakers .Mm,5isqnormalized (t)] (L’x1) augmented qing ng y =Speakers Sp k(t), w m dis wthe hmodel pto to hestimated p k ng for each of the am linguistic description ofStothe text characteristics ex- An average large speech corpora. style. f f.were with respect speaking Zgenerated bo ou gZdu d automatically om ym t pwith o ah egeis prosodic variations respect linguistic, para-linguistic, and extra-linguistic concontext-dependent HMM acoustic model λλ Two main approaches can be distinguished inny@ prosodic modelling: ondhthe one Lno straints. On the other hand, statistical methods attempt atvelaborating a statistical Fo pthe ktoacoustic ng n structure on x[9] pmodel nd n discrete d An context-dependent HMM model estimated for each of the m with the speaking style. tracted using a linguistic processing chain and used to refine the For each speaking style, an average context-dependent S v g on x p nd n HMM ou mod 4.6 Abstract Model: Text To Prosodic Structure l bolic representation to variations. Additionally, a rich S which accounts for the prosodic variations from the observation of statistical regularities on 4.6 Abstract Model: Text To Prosodic Structure Z MM m @ R W W i During the training, the discrete/continuous context-dependent yfrom p the o too mod models ng Source, Sou fi . Context-dependent noFirstly, mm df v 4.6.1 Expert Models tracted linguistic processing chain to refine the a ostatistical eand HMMs dependent HMM λ style prior modelling. filter, andndnormalized bo pusing ntaprosodic on helaborating ou vProsodic Add y model h dd used mattempt straints. On the other hand, methods at[9] elaborating aon statistical Hi on mob context-dependent uthat estimated pooled speakers a fobservawhich accounts for the variations fromTo the observation statistical regularities on smodels 4.6 Abstract Model: Text Structure u sH hand, expert approaches attempt atdistinguished formal for observed 4.6.1 Expert linguistic Then, an average treee tModels Then, ancontexts. average tree S of dcontexts. poo pdescribes k observations. obcontext-dependent onu was ymm m@ d ifrom om haccount dthe pdetailed oislinguistic d mcontext Two main approaches can beof inm prosodic modelling: thekexone den fica pe mance es ma us ng large speech @ λaon is the pooled speakers kstructure linguistic description the text characteristics is automatically k[10] vector which the linguistic properties atxaretime t. a measu onvectors vom o hinto nd h don dyn vo o vare dtoFof oestimate oned l l corpora. context-dependent HMM modelling [11] for aon @ 2poo 2 which accounts for thepprosodic from the observation regularities onassociated clustered acoustically similar models using decision-tree-based tion and their dynamic vectors used contextngu dHMM o variations hlinguistic, xestimated hpara-linguistic, uWom ycon-x[11] eofdstatistical m large speech corpora. ZZ H u ofeand context-dependent HMM model is m estimated the s(see context-dependent HMM modelling (see [10] and for prosodic variations with respect and extra-linguistic prosodic grammar consists aStructure hierarchical prosodic reprex d is p derived nd n MM HMM mdescription dforo each h o of hm wThe h the haModel: pseparately. kto ng y@@To so(style) ason tomod minimize the length the Tona detailed hand, expert approaches attempt atprocessing elaborating formal models that account for thefin observed with the speaking style. 4.6 Abstract Text Prosodic b ki [9] tracted using linguistic chain used to refine the HMMs are estimated During the synthesis, the symm bOO SSand large speech corpora. description ofaModels enriched linguistic contexts). @ 2 4.6.1 Expert context-clustering (ML-MDL [15]). Multi-Space probability Distrid based Cohen s Kappa s a s c 16 Cohen s Kappa s n d u ng ngu p o ng h n 9 nd u d o h i Con x d p nd n HMM d p nd n HMM mod λ n 9 model ms H .so Context-dependent are models λv g context-dependent linguistic contexts. Then, anis average mattempt u asd an a alternative s H [10] 4.6.1 Expert Models straints. On the other hand,HMM statistical methods atuelaborating statistical sentation that modelling was experimented to T O BI [12] dependent ngu onMMm x HMM Th model n nλ on pminimize nd n treeHMMs 9 R conderived asx using todulength themparamedescription length of athes THMM context-dependent (see [11] for prosodic variations respect to linguistic, para-linguistic, extra-linguistic kk10 and . models context-dependent O R d butions (MSD) [16] are model acoustic on xdescription d p ndwith nmostly HMM mod ng ndand 11 o a detailed d mod2contexts). 2b @@ linguistic u dinto nas oacoustically ou yused mpto mod continuous/discrete ng dof on b be d ween wo a e s nregularities clustered similar decision-tree-based of the enriched m m is derived so to minimize the description the T Expert approaches concern hierarchical prosodic structure c measu es he opo on o ag eemen for French prosody labelling [13]. The prosodic grammar is which accounts for the prosodic variations from the observation of statistical on 9 bolic/acoustic parameters are generated in cascade, from the symTh p o od g mm on o h h p o od p The prosodic grammar consists of a hierarchical prosodic repredescription of the enriched linguistic contexts). d v d o o m n m h d p on ng h o h T (style) straints. On the other hand, statistical methods attempt at elaborating a statistical model bbOO 4.6.1 R1980, (style) ter fx sequence manage voiced/unvoiced d3.1. Expert p oninoModels hof the n Discrete/Continuous hprosodic d ngufrontierson ([Cooper x Models 4.6.1 Expert Models onaverage u context-dependent ngto ML MDL 15 Multi-Space Mu Spregions obproperly. b model y D Each nn context-clustering (ML-MDL probability Distrielling, and particular Training composed major prosodic (F n9,9 avRboundary which HMM acoustic λmsymbolice mon o s HMM model λλon [15]). large speech corpora. on that hofvariations w experimented xp n boundaries d as ofand n Paccia-Cooper, oon Bcontext-dependent sentation was an alternative to TTOOBI [12] context-dependent HMM model λpduration which forn1983, the prosodic themobservation statistical The module simultaneously source/filter variations, hmodels co o. omodel andom ag on12 acoustic x An d pbu nd n(MSD) HMM mod meemen context-dependent HMM modelled with anuou state probabilR Gee andaccounts Grosjean, Ferreira, 1988, from Abney, 1992,prosodic Watson and regularities Gibson,mod2004] 4.6.1 Expert Models onw MSD 16ec u isdto mod on d p. Ou m measu [16] are used continuous/discrete parameMMm acoustic Expert approaches mostly concern structure bolicExpert representation to the acoustic variations. Additionally, a rich approaches mostly concern hierarchical prosodic structure modis right bounded by labelling ahierarchical pause), minor prosodic boundaries (Ff ,isvariations, an butions o F n h p o ody b ng 13 Th p o od g mm ⇓ for French prosody [13]. The prosodic grammar ⇓ 3.1.1. Discrete HMM large speech corpora. and the temporal symbolic associated with a speakfor English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, 3.1. Training of the Discrete/Continuous Models PEECH YNTHE ity density functions (PDFs) toween account fordregions theagtemporal structure of MM m m f qu n o m n g vo d unvo on p op y E h elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, ter f sequence to manage voiced/unvoiced properly. Each ⇓ elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, 3 1 T a n ng o h D Con nuou Mod he ag eemen be he ngs o he pa c mpana s and he is estimated from the pooled speakers observations. Firstly, 4.6.1 Expert Models intermediate and prominences (P). Expert approaches mostly hierarchical prosodic mod- y which mprobabilompo d of o boundary), mconcern o for p oFrench, od prosodic bound bound wh h ou Speakers The module[17]. models simultaneously variations, composed major prosodic boundaries (FFstructure , some a boundary Delais-Roussarie, 2000, Mertens, [Barbosa, 2006]and forGibson, other ingacoustic style. were normalized respect to the speaking Th muspeech ouwith ysource/filter ou vduration onon speech dynamic and Grosjean, 1983,particular Ferreira, 1988, Abney, Watson 2004] onmodu x d fpmod ndFinally, nMHMM HMM mod dwith wishafimodelled duaccording p obtobthe Gee Abney, 1992, 1992, Watson and Gibson, 2004] context-dependent isnmodelled state linguistic description ofassume the text characteristics automatically 3.1. of2004b] Discrete/Continuous Models For each style, ana aprosodic average context-dependent discrete elling, and inTraining prosodic frontiers ([Cooper and Paccia-Cooper, 1980, (Fex3.1.1. Discrete HMM gh bound dthe by pMonnin u structure mGrosjean, noresults p is o1993, od bound Fffstyle n prior languages). Expert models that from theboundaries integration and the temporal symbolic associated with athat speakisspeaking right bounded by pause), minor prosodic , variations, an ⇓ g ound u h The measu e va es om 1 o 1 s pe ec to modelling. Source, filter, and normalized f observa3 1 1 D HMM v on nd h mpo ymbo o d w h p k English; [Dell, 1984, Bailly, 1989, Monnin and Ladd, 1996, for and Grosjean, 1993, Ladd, 1996, trajectory model and the global variance (GV) model local and y d n y un on PDF o oun o h mpo u u o ity density The functions (PDFs) to account foristhe temporal of context-dependent HMM model for each of the 4.6.1 Expert Models 4.6.1 andExpert Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] MM1 variations, m m structure ofGee various and potentially constraints, in particular syntactic rhythmic motheestimated msMMpdaccording Let RMertens, beconflictual the number of speakers an average model HMM λ intermediate estimated from the pooled speakers associated 4.6.1 Expert Models acoustic module models source/filter n Models m dis bound y French, nd p[Barbosa, o od from p2006] omwhich n nand P ing ff were normalized with to boundary), and prosodic prominences (P). Delais-Roussarie, French, [Barbosa, for some other 2000, 2004b] for 2006] for some other tion vectors their dynamic vectors are used estimate contextngstyle. y Speakers Sp kd m w hrespect ptomodelled hspeaking global speech variations p and h[17]. 17 w Feemen n noyspeech p over dyn m[?]. mod d ngtoag otheheemen tracted using aspeaking used For each style, ann1989, average context-dependent discrete sag 0hdand stime chance 1simultaneously pek ongec Con us on speech Finally, dynamic is English; [Dell, 1984, Bailly, and[9] Grosjean, 1993, Ladd, 1996, the [?,for Dell, 1984] constraints. Fo hExpert plinguistic kModels ng yprocessing v Monnin gchain on x qdand p nd n integration dto refine 4.6.1 4.6.1 Expert Models with the speaking style. style prior to modelling. Source, filter, normalized f observaλ is to be estimated. Let = (q , . . . , q ) languages). prosodic structure results from the integration Expert models assume that a prosodic structure results from the Expert approaches mostly concern hierarchical prosodic y p linguistic otrajectory o ng Sou nd no m d that fHMMs vare MMmod m o massociated contexts. an context-dependent .gThen, Context-dependent dependent HMM λ ndthehfiglobal o models ymodel mod ob vthe n average GV hob mmodel m local Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] forassociated somemodother and variance (GV) The linguistic module mostly concerns the pooled extraction of structure prominent syn3.1.1. Discrete HMM HMM λ is estimated from the speakers fngs and symbolic withToaaspeakof constraints, particular syntactic and rhythmic 0hvariations, ahmm we cons de ed as equa ymod poss band enda ngstree nde Lthe bconflictual hprosodic numb oprosodic k[10] om hobservations, vdetailed mod EVALUAT ON tion vectors and their dynamic vectors used totemporal estimate HMM λ Expert m d of om hininpstructure poo dfrom pPaccia-Cooper, kwh on1980, d g model total set symbolic Let R be the number of speakers which an average of variousand and constraints, particular syntactic and rhythmic context-dependent HMM modelling (see and [11] amodelling, inpotentially particular ([Cooper and onand v oglobal ndacoustically ve over o are m contexton x clustered into similar models tactic boundaries from deep syntactic parsing, based onfrom syntactic conlanguages). models assume that afrontiers prosodic results thefor integration g ob p dyn vm on ov muusing ?d odecision-tree-based variations time [?]. Expert approaches mostly concern hierarchical prosodic structure with the speaking style. (style) [?, Dell, 1984] constraints. The prosodic grammar consists of a.dPaccia-Cooper, hierarchical prosodic repre3.2. speech Generation of the Speech ParametersHMMs Mare [?, Dell, 1984] constraints. w h h p k ng y M M m o b m L q = q q . Context-dependent dependent HMM models λ = [q (1), . . , q (N )] is the prosodic symq λ is to be estimated. Let q = (q , . . . , q ) stituency (Constituent-Depth [Cooper and 1980], φ-phrases of various and potentially conflictual constraints, in particular syntactic and rhythmic Geeelling, andlinguistic Grosjean, 1983,particular Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] c s on a ngs we e e a ve y a e 3% o he o a a ngs and context-clustering (ML-MDL [15]). Multi-Space probability DistriE p m n S up Con x d p nd n HMM d p nd n HMM mod λ is derived so as to minimize the description length of the T and in prosodic frontiers ([Cooper and Paccia-Cooper, 1980, ing style. Speakers f were normalized with respect to the speaking The module mostly concerns the extraction of prominent syn0 m The linguistic module mostly concerns the extraction of prominent synthat was experimented aso contexts). an alternative T O BI v[12] acoustic [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / toRight-hand-side h[Dell, osequence ostyle, pparsing, od ymbo ob onq clustered nddiscrete description of the the enriched similar models using decision-tree-based total setlinguistic of prosodic symbolic observations, and [?,sentation Dell, 1984] constraints. bolic associated with speaker r, where butions (MSD) [16] are to gu model paramefor English; 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, u(n) dinto nDuring oacoustically ou yused mF mod ng dconverted onhe b a concatenated dfica tactic boundaries from deep syntactic based on syntactic conFor each speaking an average context-dependent M into m mon con us on m ma x text isu first The prosodic grammar consists of a hierarchical prosodic repreGee Grosjean, 1983, Ferreira, 1988, Abney, Watson and Gibson, 2004] tactic boundaries from deep syntactic parsing, based on syntactic cone 3continuous/discrete pmm esen smproperly. den forand prosody labelling [13]. The prosodic grammar is (style) Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based Th pFrench o (Constituent-Depth od gthe mm on oand hq1992, h(N oprominent od psyn3 2 Generation Gemoved nthe asynthesis, onprior o[15]). h the Sp hParameters Pa am 3.2. ofvoiced/unvoiced the Speech context-clustering (ML-MDL Multi-Space probability DistriThe linguistic module mostly concerns the extraction ofpfor T m = 1 N h p o od ym stituency [Cooper Paccia-Cooper, 1980], φ-phrases = [q (1), . . . , )] is the prosodic symq is prosodic label associated with syllable n. Let ter f sequence to manage regions Each style to modelling. Source, filter, and normalized f observaDelais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] some other on x u ng ML MDL 15 Mu Sp p ob b y D 0 sentation that was experimented as ann alternative too T BI [12] for [Dell, 1984, Bailly, 1989, Monnin and(FGrosjean, 1993, 1996, stituency [Cooper Paccia-Cooper, φ-phrases .λprobabilcontext-dependent HMM model λacoustic MM (MSD) mOve maused sequence of context-dependent HMM associated (style) composed prosodic boundaries av on boundary n English; on(Constituent-Depth hqof1983, w= xp d and n , 1980], TOOLadd, B which 12 tactic boundaries from deep parsing, based syntactic con- contexts DG butions [16] are to continuous/discrete parame[Gee and Grosjean, Delais-Roussarie, 2000], Left-hand-side Right-hand-side mfica sconverted amodels onwhere ,m .an.estimated ,for q )structure the set of linguistic context-dependent modelled afinuou state duration bo qu(q n that o[13]. d Left-hand-side w[Barbosa, htotal p the k// pooled whspeakers non MSD bu 16 owmodel mod onfirst d = den m languages). Expert models assume prosodic results from the bolic sequence associated with speaker r,rintegration where (n) [Gee Grosjean, Delais-Roussarie, 2000], Right-hand-side for French prosody The prosodic grammar HMM λomajor is.syntactic from associated Du hHMM ynu voiced/unvoiced hisdsco hetextevea xwith onv dinto npmEach n pe dm o mance κ = Delais-Roussarie, 2000, 2006] for other fisome the synthesis, the issequence withng linguistic context qproperly. [q ,oa. .concatenated . , on qare], used is by alabelling pause), minor prosodic (F , isanqpater stituency [Cooper and 1980], φ-phrases oand F2 nbounded h(Constituent-Depth p1983, ody b2004b] ng 13French, Th p oboundaries od g speech mm mDuring Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based F gu eright Gene aMertens, on oand dlabel sc einassociated ePaccia-Cooper, con nuous symbolic fLlinsequence tomthe manage regions tion vectors and their dynamic vectors to estimate contextity density functions (PDFs) to account for the temporal structure of h p o od b o d w h y b n = [q (1), . . . , q (N )] is the observations, q f qu n o m n g vo d unvo d g on p op y E h Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based is the prosodic with syllable n. Let of [Gee various and potentially conflictual constraints, particular syntactic and rhythmic composed of major prosodic boundaries (F , a boundary which 0 38 ± 0 04 wh ch s compa ab e o ha obse ved o den languages). models assume abound prosodic structure resultsbound from the yintegration MMcontext m o vectord MM and Grosjean, 1983, Delais-Roussarie, 2000], prominences Left-hand-side / Right-hand-side n HMM on nd nHMM HMM mod λ qqu = [qo context-dependent , . is . . modelled , qx ]d pdenotes the (Lx1) linguistic intermediate sequence of models associated ompo dExpert oofboundary), m o Discrete/Continuous p o and odthatprosodic F mModels h context-dependent state duration MM(style) mλprobabil3.1.ame Training the m,couche speech speech dynamic modelled q )dependency hassociated o boundaries o su ngu on context with speaker where onx x d[17]. passociated ndFinally, n HMM mod dwith wishan. duaccording on p obtobthe the eright swith oconstraints. he sen emps e (P). me swh qguistic =speaking (q , . .style. .sequence , “Long q the total set of linguistic contexts [?,Boundary Dell, 1984] mlinguistic bounded by aqence pause), minor (F ann r, [Watson and ofis various and potentially constraints, syntactic and with unit w h h ngu on x qu n q = q q wh gh bound d Gibson, by conflictual p2004]), u syntactic m no prosodic pinoparticular od (Dependency-Grammar-based bound Frhythmic with the linguistic context sequence q = [q , . . . , q ], where ity density functions (PDFs) to account for the temporal structure of fica on om na u a speech κ = 0 45 ± 0 03 m m L na u a trajectory anddependent the global model ulocal and HMMs The are HMM models λlinguistic m y dn n The y model unacoustic on PDF o variance oun o (GV) h that mpo u module models simultaneously source/filter = qwhich 1isofan N hcontext ob v mostly on nd q . . the (n) = qprominences (n)] the (Lx1) linguistic q [q .(P). . P.prominent ,q )] oismbed the linobservations, and intermediate boundary), and prosodic linguistic module concerns synm.theContext-dependent mo x v ovariations, acoustic fiqo (N [?, constraints. Let R1984] the number speakers from average model de The bonne heure” “For a(n), me go = [q ,speech nois modelled h (Lx1) Lx1 ngu tocontext on nDell, m dbe bound y [qof nd p qoong od. ,= p extraction om n (1), n used qmq Finally, . ,over q hdynamic ]dyn denotes the speech m dfimpe global speech time [?]. 4.6.1 Expert Models pwithhw[17]. 17 F=variations nden y . .pfica mod daccording o gn d ngfican o h yvector gu on xdescribes qu nparsing, obased characteristics donof w syntactic h speaker p conkassociated r where wh mm on o mance s depends on he speak 4.6.1 Expert Models vector which thethelinguistic guistic context sequence associated with r, tactic boundaries from deep syntactic m Firstly, the prosodic symbolic q̄ is inferred so as to maximize The linguistic module mostly concerns extraction prominent syno d w h ngu un n m m associated with linguistic unit n. λDiscrete isHMM to be estimated. Let wq = (q , . . F. , q ) trajectory 4.6.1 Expert Models model and the global variance (GV) that model local andndM using 3.1.1. clustered into acoustically similar models decision-tree-based f variations, and the temporal symbolic associated with a speakM M earstituency y” o y mod nd h g ob v n GV h mod o 0 syllable n. deep Let be the number speakers from which an model nfrom= nwhbased ngu on x (n) (n), . ,consists qparsing, the (Lx1) linguistic context q [Cooper Paccia-Cooper, 1980], the log-likelihood of the symbolic sequenceaqy s subs condi- an a y den fied L RThe b (Constituent-Depth hprosodic numb oprosodic pn k. . and om(n)] vLx1 gφ-phrases mod ng s yove time figu eprosodic 4 spo commen tactic boundaries syntactic syntactic conthe R total set of [qof symbolic andprosodic global grammar ofhisobservations, anh/onaverage hierarchical repreob hspeech p hvariations v m onwover m [?]. ? 4.6.1 Expert mdsotheoasrespect m v(Constituent-Depth o bewhich whestimated. h describes d b 2000], hq ngu = (q h 1980], dgwith w vector linguistic associated λqand Grosjean, is tooModels ,Right-hand-side . . . ,w qq )o [Gee 1983, Left-hand-side 3.2. Generation of the Parameters M nq and mMulti-Space mqthe Ftionally yκ hto p0 oSpeech od ymbo omaximize m x mtheκ stituency and Paccia-Cooper, φ-phrases context-clustering (ML-MDL [15]). probability Firstly, the prosodic symbolic q̄q̄ is na inferred to .speaking the linguistic context sequence model λfied λ m . .d.[Cooper L the q(N =is characteristics ing style. Speakers f were normalized with to = [qb Delais-Roussarie, (1), , qLet )]discrete prosodic sym0 = 68 ± 0 05 ou a y den =m 0 50 ±Distri0 06 R m m m An average context-dependent HMM λ is estimated y b n mm syllable n. the total set of prosodic symbolic observations, and m [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based sentation that was experimented as an alternative to T O BI [12] h og k hood o h p o od ymbo qu n q ond For Boundary each speaking style, an average context-dependent discrete [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side the log-likelihood of the prosodic symbolic sequence q condi4 Evaluation h o o p o od ymbo ob v on nd m bolic sequence associated with speaker r, where qan average (n) m contextm hden from pooled observations. Firstly, During the asynthesis, the text isPa firstam converted into a concatenated 3.2. Generation ofo the Speech Parameters po ca d scou se mode a e y fied κ = 0 28±0 07 and butions (MSD) [16] are used to model continuous/discrete parame= [q the (1), . . .speakers ,syntactic q (N )] is (Dependency-Grammar-based the prosodic symq style prior to modelling. Source, filter, and normalized f observa3 2 G n on h Sp h Boundary [Watson and Gibson, 2004]), dependency 0 on y o h ngu on x qu n q nd mod λ . tionally to the linguistic context sequence q and the model λ = 1 N h p o od ym q m = m m is (style) the prosodic syllable Let fi q̄ m = argmax wHMM models P(q λ |q, λ associated ) (2) v associated g label on Txassociated d labelling p nd d [13]. HMM λn. d isn with derived so asThe to minimize them infordependent tree An average context-dependent discrete HMM λ sequence ofiscontext-dependent forqu(q French prosody prosodic grammar mon wisgestimated bolic with speaker r, where qon MMκused mn to HMM from the speakers associated The p oposed mode eva ed av(n)speak ng mass yis sfirst ghconverted yto manage den fied = 0d 12 ± 0 context06 properly. n compa qλsymbolic =sequence .estimated . .pooled ,has the set of linguistic contexts During the synthesis, thehon textxsequence into a concatenated bo n is,the oq d)been dspeakers wk htotal p ua kpooled r based wh n om h poo p ob v F y n on x Du ng h yn h fi onv d n o on ter f voiced/unvoiced regions Each from observations. Firstly, an average contexttion vectors and their dynamic vectors are estimate m with the linguistic context sequence q = [q , . . . , q ], where 0 m isobservations, the label associated Let hcomposed pdoon bmajor d(1),with w h.v, qsyllable b )] n.nis = o[qexpe .boundaries the and qcep aHMM gmax P qon of models λλ|q,q associated P(q λλcontext (2) (Fmcontexts alin-boundary h==argmax fica om u)o vector a speech s ywith e den ficaprosodic and ed oqnquoa =n which m 2 he den fica on s M d.men d ysoo(N n ,mL h sequence podnd nof nd nden HMM mod d [qo context-dependent , . on . .son , qxq̄q̄ ]dwpdenotes the (Lx1) linguistic is set derived as too compa minimize the infordependent tree TTua prosodic the speaking qguistic = (q . .pe .sequence ,q total of (style) Gna mstate q = context q ,style. q ) the hassociated o o linguistic ngu x with speaker r, onwhere context-dependent HMM modelled with a m m with the linguistic context sequence qλ == [qmq, ... is .Context-dependent ,q ], where = N associated with linguistic unit n. are probabildependent HMM models w h h ngu on x qu n q q wh compa ab e n he case o he spo commen aduration y and he ou speak ng s y e den fica on expe men w h na u a speech w (1), . . ,minor q )] is the lin-n observations, q mo m HMMs acoustic bounded aq pause), prosodic boundaries (F[qm ,, .an = 1is .the q (N N hcontext ob vis right on nd(n), q . . . ,= = [qand qby [q (n)] (Lx1) linguistic q (n) qq == . . , q ] denotes theh (Lx1) linguistic context vector d no Lx1 ngu on x v guistic context sequence associated with speaker r, where ity density functions (PDFs) to account for the temporal structure na speak ng s y es κ = 0 70 ± 0 03 and κ 15 Fo hewhich pu xdescribes posequonthesuch necessa y gu on o a compa d w h son pDGkassociated rwas whwith vector linguistic characteristics associated with linguistic unit n. na u a na u a of= fi clustered into acoustically similar models using decision-tree-based Firstly, the o Sub d wprosodic h nguu symbolic m nun nq̄ mis inferred som as to maximize intermediate prosodic prominences [q (n), . .consists . , q (n)] isa the (Lx1) linguistic context qsyllable n. =ng n boundary), h Lx1 ngu onmen x (P). q grammar ofand hierarchical prosodic reprem 0E 54 ±mprosodic 0 [17]. 05 symbolic espec ve y qHoweve he e s aaccording d op n to den o The p ovprosodic de(n) anwhich s= e eva ua on nscheme o wbo h expe s thenRlog-likelihood ofspeech the sequence condiFinally, speech dynamic is modelled the w fi vector describes the linguistic characteristics associated with m . MM m Districontext-clustering [15]). Firstly, the prosodic symbolic q̄(ML-MDL is inferred so oasMulti-Space to omaximize v o wh h d b h ngu h o dw h Ftionally p olinguistic ymbo m m probability context and model λ xmass n. fica on o q̄sequence he po nqsequence cadtheand speak ng s y es wh ch s pasentation cu syllable aAn was no experimented poss b e discrete o con oTalternative he ngumistosestimated o y hto=the asDGan Tc Ocon BI en [12] context-dependent HMM λ mod the ofo the symbolic q q he condiy average bthat n was hTlog-likelihood og k hood h prosodic p o= odm model ymboand the qu nglobal ond (GV) that model local and trajectory variance butions (MSD) used totheomodel from Let the pooled speakers observations. Firstly, an average contextespec a [16] yxsequence s are gn he continuous/discrete mass s .y e κna =u paramenafor u aFrench speech uR ebeances wh chof p ov desλ ev den cues DG tionally linguistic context qqand model λλ a = 0 34±0 05 the number speakers from anod average othe qu fican nN |q, An average discrete iswhich estimated labelling grammar isons ytomodel mThe mP(q q̄h fi ngu = argmax λnd h mod ) (2) =mon tospeech An v prosody g context-dependent on Tx d p nd isn derived d [13]. HMM HMM λ prosodic minforso as to minimize the dependent tree global variations over time [?]. and κ = 0 38 ± 0 04 espec veproperly. y Th sEach nd ca es ha den fica on a s ng e keywo d wou d be su fic en o den y a = ter f sequence manage voiced/unvoiced regions (style) (1) (R) from pooled speakers observations. Firstly, ann average contextna u a 0 om the h poo d p k ob v on F y v g on x G m Tgmax w |q,qmλλ N P(q λofsymbolic isprosodic to beisson estimated. =hexinfor(q , . . . , q=proso q̄q̄ ) ==argmax ) (2)2 composed major boundaries a boundary which proso Mmq,nproso derived as Let to(F T P somehow q hea mode a ed o cap u e he e evan DG Thus a compa d vequ d soo ed oominimize memove m the nca o access context-dependent ddependent p ndsuch n tree T HMM is modelled with a state duration probabil-cues o he theon total set cofdminor prosodic and= co espond ng speak ng s y e Neve he ess a a ge con us on is right bounded pause), prosodic (Fm , an and o ocus hebypaosod mens on onsymbolic yboundariesobservations, ity density3.2. functions (PDFs)ofto the account for Parameters the temporal structure of (r) (r) (r) Generation Speech intermediate boundary), prosodic = and [qproso (1), . . prominences . , qproso (Nr )](P).is the prosodic speech sym- [17]. Finally, qproso mMM m according to the m m speech dynamic is modelled =(r) m = bolic sequence associated with speaker r, where qproso (n) = model trajectory global variance model localaand Duringand theN the synthesis, the text is(GV) firstthat converted into concatenated Let R be isthe the number of speakers from which an average model = m (style) prosodic label associated with syllable n. global Let speech variations over time [?]. MM m sequence of context-dependent HMM models λsymbolic associated (style) (1) (R) (R) q λsymbolic qis to=be estimated. , qproso ) contexts (q(1) , . . . , qLet ) proso the = total(qproso set ,of. . .linguistic =m with the linguistic context sequence = m q = [q1 , . . . , qN ], where the totalobservations, set of prosodic observations, = [q(r) (1), . . . , q(r) (Nr )]andis the lin- q = [q , .m. .m, q ]� denotes the and q(r) symbolic m m m context vector (r) (r) (r) n 1 L 3.2. Generation of the Speech Parameters (Lx1) linguistic [qproso (1), . . .sequence , qproso (Nrassociated )] is the with prosodic sym-r, where qproso =guistic context speaker m associated with linguistic unit n. (r) (r) (r) bolic sequence where (n) context = [q (n),with . . . , speaker q (n)]�r,is the (Lx1)qproso linguistic q(r) (n)associated proso (style) symbolic proso (style) symbolic (style) m symbolic (style) m symbolic 52 52 (style) acoustic CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART 0 STATE-OF-THE-ART 0 (style) symbolic (style) symbolic m (style) is used to merge the sequence phonemes into a sequence of syllables. At the prosodic level, The prosodic structure model is to infer a sequence of prosodic labels that are associated proso (style) symbolic (r) (r) (style) (1) (r) (style) The prosodic structure structure model model is to infer infer aa sequence sequence of of prosodic prosodic labels labelsthat thatare areassociated associated Longtemps isisto,toinfer je a sequence me suis of couché labels de that bonne heure . sentence structure The prosodic prosodic are associated with relevant prosodic prosodicmodel events. events. with relevant prosodic events. Longtemps mea sequence suis couché . sentence The prosodic structure model, is toje infer of prosodic de labelsbonne that areheure associated CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: with relevant prosodic events. (style) (style) (style) 52 Longtemps je me me suis suis couché de STATE-OF-THE-ART bonne heure heure sentence Longtemps ,, je couché de bonne .. (r) (r) 0 (r) 1 (r) (r) (1) 1 (r) (r) (r) (r) (r) (r) (1) (style) (r) acoustic 1 (r) (r) � L (r) 0 1 (r) L L (style) symbolic m (r) t (r) (R) (r) r � � (style) m symbolic � r r (style) acoustic (style) symbolic 0 0 (style) symbolic m (style) acoustic 0 (style) (style) acoustic acoustic 0 (style) acoustic 0 0 0 0 (style) acoustic 0 (style) acoustic (style) symbolic n 1 nn L 1 (style) m symbolic � 1 �� LL 1 N NN proso proso (style) symbolic proso (style) (style) (style) m symbolic proso symbolic symbolic proso proso n n n 1 L � proso qproso qp qproso � L � L 1 1 (style) N (style) prososymbolic m symbolic m 1 N N proso proso (style) symbolic (style) symbolic m (r) t 0 L L (style) symbolic r (style) symbolic (r) t (r) (style) (R) R acoustic (r) 0 1 (r) (r) t � � 0 f0 [Hz] f [Hz] 0 f0 [Hz] f0 [Hz] f [Hz] (R) proso (r) � �� (r) t (r) �� r (r) (r) t (style) (r)acoustic t (r) (style) acoustic (R)R r (R) R (r) (1) (style) acoustic amplitudeamplitude amplitude amplitude amplitude proso (r) (r) (r)� L (r) (R) LL (1) (r) t (R) (1) acoustic �prosodic (r) � structure modExpert approaches mostly mostly(r)concern concern hierarchical hierarchical prosodic structure modL L elling, ([Cooper and 1980, Expert approaches mostly 1prosodic concern hierarchical prosodic structure (R) modelling, and inin particular particular prosodic frontiers frontiers ([Cooper and Paccia-Cooper, Paccia-Cooper, 1980, (style)and (1) R2004] Gee and Grosjean, 1988, Abney, 1992, Watson and (r) (r) Ferreira, (r) elling, and in 1983, particular prosodic frontiers ([Cooper Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, and Watson and Gibson, Gibson, 2004] proso proso proso symbolic (style) r proso proso proso for [Dell, 1984, Bailly, 1989, Monnin and Grosjean, Ladd, 1996, mEnglish; Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson 1993, and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin andprosodic Grosjean, 1993, Ladd, 1996, Expert approaches mostly concern hierarchical structure modsymbolic (r)1996, Delais-Roussarie, 2000, Mertens, 2004b] French, [Barbosa, 2006] some other for English; 1984, Bailly, 1989,for Monnin Grosjean, 1993,for Delais-Roussarie, 2000, Mertens, 2004b] for French,and [Barbosa, 2006] forLadd, some other elling, and in[Dell, particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, proso (r) (r) (r) languages). Expert models assume that aa prosodic structure results integration Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006]from for the some other languages). Expert models assume thatproso prosodic results from the integration Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] rinstructure proso proso ofof various and conflictual constraints, particular and rhythmic languages). Expert models assume that(style) aconstraints, prosodic structure resultssyntactic from the(style) integration various andpotentially potentially conflictual inand particular syntactic and rhythmic for English; [Dell, 1984, Bailly, 1989, Monnin Gros (r) [?, Dell, 1984] constraints. m of [?, various and potentially constraints, in particular syntactic and rhythmic symbolic (1) (R) symbolic Dell, constraints. D R 1984] M conflictual B m proso The module concerns [?, Dell,linguistic 1984] constraints. m m The linguistic module mostly mostly concerns the the extraction extraction of of mprominent prominent synsyn(r) (r)the (r) of on tactic boundaries from deep syntactic parsing, based syntactic synconThe linguistic module concerns extraction prominent m tactic boundaries frommostly deep (style) syntactic parsing, based on conr syntactic (1) (R) syntactic stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], tactic boundaries from deep parsing, based on syntacticφ-phrases conD stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases m R symbolic [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side T m m m (r) 2004]), syntactic (r) dependency (r) Boundary [Watson and (Dependency-Grammar-based [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side (r) (r)syntactic (r) � dependency Boundary [Watson andGibson, Gibson, 2004]), (Dependency-Grammar-based m r/ Right-hand-side Boundary [Watson dependencyC(Dependency-Grammar-based 1 Gibson, C and D 2004]),Lsyntactic C G G D R R (r) (r) � B (r) W G D G mm � r (r) 0 CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: 52 STATE-OF-THE-ART isprosodic used to merge the sequence phonemes into a sequence of syllables. At the prosodic level, (style) (style) (style) 52structure STATE-OF-THE-ART pauses are identified toCHAPTER 4. SPEECH PROSODY m MODELLING acoustic& SYNTHESIS: symbolic AF phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier * * STATE-OF-THE-ART M A 52 phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier prosodic is used phonemes into a sequence of syllables. At the prosodic* level, Fm to merge the sequence * * isprosodic used to merge the sequence phonemes into a sequence of syllables. At the prosodic level, structure A phonetizer is pauses are identified to to* convert the input text into a sequence P * used * of phonemes. A syllabifier * structure pauses are identified to used to merge the sequence phonemes into a sequence of syllables. At the prosodic FM ** * level, (r) M (style) FisM prosodic * ** M prosodic pauses are identified prosodic structure is to infer a sequence prosodic that are heure Longtemps je me suis of coude bonne ## FThe *to * labels Fsyllable * ** m **model ## *ché *associated m mstructure Mm structure with relevant prosodic events. P P FM ** *** * ** **** (r) (style) m FM Table 4.2:* Illustration of the text-to-prosodic-structure conversion. * * * M MFprosodic m The structure model is to infer a sequence of prosodic* labels that are associated The structure model is to infer a sequence of prosodic that are heure associated temps ## je me me suis coucouché de bonne ## Fsyllable *events. * * labels * * ## Pprosodic * temps * ## Longtemps ## je ché de m Longsuis ché de bonne heure ## with relevant prosodic Msyllable m Longtemps , je je me me suis suis coucouché de bonne bonne heure heure . sentence m relevant Two approaches can be distinguished in prosodic structure modelling: the one Pwith *prosodic *events. * *associated The main prosodic structure model is to infer a sequence of prosodic labels that areon mhand, syllable Longtemps ## je text-to-prosodic-structure me formal suis coude bonne heure ## approaches attempt at modelsché thatconversion. account for the observed 4.2: Illustration Illustration ofelaborating the text-to-prosodic-structure conversion. Table 4.2: of the with relevant prosodic events. M expert Table 4.2: Illustration of the text-to-prosodic-structure conversion. prosodic with respect to and bonne extra-linguistic conLongtemps , je linguistic, me suis suispara-linguistic, couché de heure . sentence variations (style) (style) (style) syllable Longtemps ## je me couché de bonne heure ## Longtemps , je me suis couché de bonne heure . sentence m mainOn straints. the other hand, methods attempt at elaborating a statistical Table 4.2:can Illustration of the text-to-prosodic-structure conversion. acoustic symbolic Two approaches can be statistical distinguished in prosodic prosodic structure modelling: onthe themodel one approaches be distinguished in structure modelling: on one which accounts for the prosodic variations from the observation ofaccount statistical regularities on. Longtemps , elaborating je me suis couché de bonne heure sentence hand, approaches attempt at elaborating formal models that account forthe the observed expert approaches attempt at models that for observed Two main approaches can be distinguished informal prosodic structure modelling: on the one (r) (style) (style) (style) large speech corpora. Table 4.2: with Illustration thelinguistic, text-to-prosodic-structure conversion. prosodic prosodic variations with respect to para-linguistic, and extra-linguistic conTwo main approaches can beatof distinguished inPROSODY prosodic structure modelling: the one respect to para-linguistic, and extra-linguistic conCHAPTER 4.4.linguistic, SPEECH MODELLING & SYNTHESIS: hand, expert approaches attempt elaborating formal models that account for theonobserved CHAPTER SPEECH PROSODY MODELLING & SYNTHESIS: acoustic symbolic structure straints. other hand, hand, statistical methods attempt at elaborating elaborating statistical model hand, expert approaches attempt at SPEECH elaborating formal models that account for the observed On the other statistical methods attempt at aastatistical model m PROSODY 52 STATE-OF-THE-ART CHAPTER 4. MODELLING & SYNTHESIS: prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con52 STATE-OF-THE-ART F * respect * which the with prosodic variations from the thepara-linguistic, observationof ofstatistical statistical regularities on M accounts (r) prosodic variations to linguistic, and extra-linguistic confor the prosodic variations from observation regularities on 52 STATE-OF-THE-ART Two main approaches can be distinguished in prosodic structure modelling: on the one prosodic straints. On On the other hand, statistical methods attempt atatelaborating aastatistical F * of phonemes. * model large corpora. m speech straints. other statistical methods attempt elaborating statistical model corpora. A isthe toto*hand, convert the input text into aa sequence A syllabifier prosodic Aphonetizer phonetizer isused used convert theSPEECH input text into sequence of phonemes. A observed syllabifier 120 structure CHAPTER 4. PROSODY MODELLING & regularities SYNTHESIS: hand, expert approaches attempt at elaborating formal models that for the P * for *prosodic *of * level, which accounts the prosodic variations from the observation ofofaccount statistical on which accounts the variations from the statistical regularities on used the sequence phonemes ainto of At 120 structure Ais phonetizer isfor used to the input into text a observation sequence phonemes. A syllabifier usedto tomerge merge the sequence phonemes into asequence sequence of syllables. syllables. At the the prosodic prosodic level, prosodic Fis *convert * 52 M prosodic variations with to concern linguistic, para-linguistic, andSTATE-OF-THE-ART extra-linguistic conapproaches mostly hierarchical prosodic mod120 speech corpora. large speech corpora. pauses are identified toto Flarge *respect * level, isExpert used to merge the sequence phonemes into a sequence of syllables. At thestructure prosodic M pauses are identified structure F * * * (style) m elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, syllable Longtemps ## je me suis couché de bonne heure ## straints. hand, statistical methods at elaborating a statistical Fphonetizer * * model pauses are the identified to m100On A isother to**convert the input text attempt into a sequence phonemes. A *syllabifier FM * symbolic * 2004] P100 *used1983, * ofWatson 120 Gee and Grosjean, Ferreira, 1988, Abney, 1992, and Gibson, Pused * * of statistical * level, which accounts for* the prosodic variations fromathe observation on is120 to merge the sequence phonemes into sequence of syllables. At the regularities prosodic F * * * 1996, m English; for100 1984, Bailly, 1989, Monnin andprosodic Grosjean, 1993, Ladd, Expert approaches mostly concern hierarchical prosodic structure modTable [Dell, 4.2: Illustration of the text-to-prosodic-structure conversion. mostly structure large speech corpora. pauses identified P areapproaches * to * ## concern * modsyllable Longtemps je meforhierarchical suis cou- [Barbosa, ché * de 2006] bonne ## Delais-Roussarie, Mertens, 2004b] French, for heure some other (style) elling, particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, syllable temps ##prosodic je mefrontiers suis couché and de Paccia-Cooper, bonne heure1980, ## 8080 and Longin 2000, particular ([Cooper 100 languages). Expert models assume that aaprosodic structure results from the integration 100 The prosodic structure model is to infer sequence of prosodic labels that are associated m Gee and 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] symbolic The prosodic structure model is to infer sequence of prosodic labels that are associated Expert approaches mostly concern hierarchical prosodic structure modGrosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] Two main approaches can be distinguished prosodic structure modelling: on the one (style) 80 syllable Longtemps ## je text-to-prosodic-structure me in suis couché conversion. de bonneand heure ## Table 4.2: Illustration of the of various and potentially conflictual constraints, particular syntactic rhythmic with relevant prosodic events. The prosodic structure model isBailly, to infer a sequence ofinmodels prosodic labels that are associated for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Expert approaches mostly concern hierarchical prosodic structure modwith relevant prosodic events. elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, [Dell, 1984, 1989, Monnin and Grosjean, 1993, 1996, hand, expert approaches attempt at elaborating formal that account forLadd, the observed Table 4.2: Illustration of the text-to-prosodic-structure conversion. acoustic 60 60 [?, Dell, 1984] constraints. with relevant prosodic events. Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa,and 2006] for some other other Gee Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson andsome Gibson, 2004] Mertens, 2004b] for French, [Barbosa, 2006] for prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic conelling, and in 2000, particular prosodic frontiers ([Cooper Paccia-Cooper, 1980, 8080 and Tablemodule 4.2: Illustration of the conversion. The linguistic mostly concerns the extraction of prominent synTwo main approaches can be distinguished in prosodic structure modelling: on the model one languages). models assume that a text-to-prosodic-structure prosodic structure results from the integration for English; [Dell, 1984, Monnin and Grosjean, 1993, Ladd, 1996, The prosodic structure model is toBailly, infer sequence of1992, prosodic labels that are associated Expert models assume that a1989, prosodic structure results from the integration (style) straints. On the other hand, statistical methods attempt at elaborating a statistical 60 Gee and Grosjean, 1983, Ferreira, 1988, Abney, Watson and Gibson, 2004] Two main approaches can be distinguished in prosodic structure modelling: on the one Longtemps , , elaborating je me suis couché de heure sentence tactic boundaries from deep syntactic parsing, based onbonne syntactic con-.. Longtemps je me suis couché de bonne heure sentence hand, expert approaches attempt at formal models that account for the observed of various potentially conflictual constraints, in particular syntactic and rhythmic 40 Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some other with relevant prosodic events. and potentially conflictual constraints, in particular syntactic and rhythmic acoustic which accounts for the prosodic variations from the observation of statistical regularities on Expert approaches mostly concern hierarchical prosodic structure modexpert approaches attemptBailly, elaborating formal models thatde account for the observed 40 forhand, English; [Dell, 1984, 1989, Monnin andstructure Grosjean, 1993, Ladd, 1996, Longtemps , at me suis couché bonne heure . sentence stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases 60 Two main approaches can be distinguished in prosodic the one prosodic variations with respect to jelinguistic, para-linguistic, and modelling: extra-linguistic con60 [?, Dell, constraints. languages). Expert models assume that a prosodic structure results from the on integration 1984] constraints. large speech corpora. prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic conelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side Mand Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] forRight-hand-side some other hand, expert approaches attempt at elaborating formal models that account for the rhythmic observed straints. On the other hand, statistical methods attempt at elaborating statistical model The module mostly concerns the in extraction of a/prominent prominent synof40various and potentially conflictual constraints, particular syntactic linguistic module mostly concerns the extraction of synstraints. On[Watson the other hand, statistical methods attempt at elaborating a statistical model Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] Boundary and Gibson, syntactic dependency (Dependency-Grammar-based Longtemps , 2004]), je suis couché de bonne heure . sentence prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic conwhich accounts for the prosodic variations from the observation of statistical regularities on languages). Expert models assume that ame prosodic structure results from the integration tactic boundaries from deep syntactic parsing, based on syntactic con[?, 1984] constraints. from deep syntactic parsing, based on syntactic con40 Dell, m which accounts for the1984, prosodic variations fromMonnin the observation of statistical regularities on 40 for English; [Dell, Bailly, 1989, and Grosjean, 1993, Ladd, 1996, straints. On the other hand, statistical methods attempt at elaborating statistical model 0 large speech corpora. stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases The linguistic module mostly concerns the extraction of a prominent syn(Constituent-Depth [Cooper and Paccia-Cooper, 1980], oflarge various and potentially conflictual constraints, in particular syntactic andφ-phrases rhythmic 0.4speech corpora. which accounts for1983, the prosodic variations the observation of2006] statistical regularities on and 1983, Delais-Roussarie, 2000], Left-hand-side / for Right-hand-side M Delais-Roussarie, 2000, Mertens, 2004b] for from French, [Barbosa, some other tactic boundaries from deep syntactic parsing, based syntactic conGrosjean, Delais-Roussarie, 2000], Left-hand-side /on Right-hand-side M [?,[Gee Dell, 1984] constraints. 0.4 prosodic prosodic large speech corpora. Boundary and Gibson, Gibson, 2004]), syntactic dependency (Dependency-Grammar-based stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Watson and 2004]), syntactic dependency (Dependency-Grammar-based languages). Expert models assume that a prosodic structure results from the integration Expert approaches mostly concern hierarchical prosodic structure modstructure Theprosodic linguistic module mostly concerns the extraction of prominent syn0 structure mm 0.4 [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side (style) 0.2 elling, and potentially in particular prosodic frontiers ([Cooper and Paccia-Cooper, oftactic various and constraints, in particular and rhythmic FF **conflictual ** 1980, structure MM boundaries 0.4 from deep 2004]), syntactic parsing, basedsyntactic on syntactic conBoundary [Watson and syntactic dependency symbolic Gee and Grosjean, 1983, 1988, Abney, 1992, **(Dependency-Grammar-based Watson and Gibson, FF **Gibson, * ** 2004] 0.2 m [?, Dell, constraints. M 0.4 F * Ferreira, prosodic m1984] stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases (style) (1) (R) Expert approaches concern structure for English; [Dell,*mostly 1984, 1989, hierarchical Monnin andprosodic Grosjean, 1993, Ladd, 1996, *module **mostlyBailly, ** proso **modFPmPlinguistic *prosodic * proso Expert approaches mostly concern hierarchical structure mod*symbolic structure proso The concerns the extraction of synelling, and(style) in particular prosodic frontiers ([Cooper and2006] Paccia-Cooper, 1980, 0 Grosjean, 0.2 [Gee and 1983, Delais-Roussarie, 2000], / prominent Right-hand-side Delais-Roussarie, 2000, Mertens, 2004b] for French,Left-hand-side [Barbosa, for some other P * * * * 0.2 elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, F * * Mand Expert approaches mostly concern hierarchical prosodic structure modsymbolic tactic boundaries from deep syntactic parsing, based onfrom syntactic conGee Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] languages). Expert models assume that a prosodic structure results the integration Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based syllable Longtemps ## je me suis couché de bonne heure ## m Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] syllable Longtemps ##prosodic je me frontiers suis coude1993, bonneLadd, heure ## 0 Felling, * *ché and * 1996, 0.2 m and in particular ([Cooper Paccia-Cooper, 1980, for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases R of various and potentially conflictual constraints, in particular syntactic and rhythmic (r) (r) (r) (style) (1) (R) syllable temps ## je 1989, me suis cou-andché bonne heure ## for English;Long[Dell, 1984, Bailly, Monnin Grosjean, 1993, Ladd, 1996, P * 2000, *Mertens, * de * other −0.2 Gee Grosjean, Ferreira, 1988, Abney, [Barbosa, 1992, Watson and Gibson, 2004] Delais-Roussarie, 2004b] for2000], French, 2006] some r / for proso proso prosoproso proso proso 0 and [?, 1984] constraints. m1983, 0Dell, [Gee and Grosjean, 1983, Delais-Roussarie, Right-hand-side symbolic Delais-Roussarie, 2000, Mertens, 2004b] for French,Left-hand-side [Barbosa, 2006] for some other Table 4.2: of text-to-prosodic-structure conversion. Table 4.2:Illustration Illustration ofthe the text-to-prosodic-structure conversion. for linguistic English; [Dell, 1984, Bailly, Monnin and results Grosjean, 1993, Ladd, 1996, languages). Expert models assume that a1989, prosodic structure from the integration (r) The module mostly concerns the extraction of prominent synlanguages). Expert models assume that a prosodic structure results from the integration Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based syllable Longtemps ## je me suis couché de bonne heure ## Table 4.2: Illustration of the text-to-prosodic-structure conversion. −0.2 0 2000, Mertens, for French, [Barbosa, 2006] other of Delais-Roussarie, variousboundaries and potentially conflictual constraints, in particular syntactic and some rhythmic proso from deep 2004b] syntactic parsing, based on for syntactic conoftactic various and potentially conflictual constraints, in particular syntactic and rhythmic −0.4 (r) models (r) (r)structure Two main approaches can distinguished in prosodic modelling: the languages). Expert assume that a time prosodic structure results from the on integration [?, Dell, 1984] constraints. Two main approaches canbe be distinguished inand prosodic structure modelling: on the one one −0.2 [s] stituency (Constituent-Depth [Cooper Paccia-Cooper, 1980], φ-phrases −0.2 [?, Dell, 1984] constraints. raccount proso proso 0 hand, expert approaches attempt atofproso elaborating formal models that for the observed Table 4.2: Illustration the text-to-prosodic-structure conversion. Two main approaches can be distinguished in2000], prosodic structure on one of various and potentially conflictual constraints, particular syntactic rhythmic The linguistic module mostly concerns the in extraction of prominent synhand, expert approaches attempt elaborating models thatmodelling: account forand thethe observed [Gee and Grosjean, 1983, Delais-Roussarie, Left-hand-side Right-hand-side (1)at (R)formal M (r) The linguistic module mostly concerns the extraction of /prominent synprosodic variations with respect to linguistic, para-linguistic, and conhand, expert approaches attempt at elaborating formal models that account for the observed −0.4 [?, Dell, 1984] constraints. tactic boundaries from deep syntactic parsing, based on extra-linguistic syntactic con−0.2 prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con- proso Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based time [s] tactic boundaries from deep parsing, based on syntactic constraints. On hand, statistical methods attempt at elaborating statistical model prosodic variations with respect to syntactic linguistic, para-linguistic, and1980], extra-linguistic con−0.4 m The linguistic module mostly concerns the extraction of aaprominent synstituency (Constituent-Depth [Cooper and Paccia-Cooper, φ-phrases straints. Onthe theother other hand, statistical methods attempt at elaborating statistical model Two main approaches can be distinguished in prosodic structure modelling: on the one (r) (r) (r) 0 timeand [s] Paccia-Cooper, 1980], −0.4 accounts stituency (Constituent-Depth [Cooper φ-phrases which for the prosodic variations from the observation of statistical regularities on r constraints. On the other hand, statistical methods attempt at elaborating a statistical model tactic boundaries from deep syntactic parsing, based syntactic [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side /on Right-hand-side which accounts for the prosodic variations from the observation ofaccount statistical regularities on time [s] hand, expert approaches attempt at elaborating formal models for the observed Mthat [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Rthe Mof and (1) (R) large speech corpora. which for the prosodic variations from observation statistical regularities on stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based largeaccounts speech corpora. prosodic variations with respect to linguistic, para-linguistic, extra-linguistic con−0.4 Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based mmodel time [s] large speech corpora. [Gee and 1983, Delais-Roussarie, 2000], Right-hand-side straints. OnGrosjean, the other at elaborating a/ statistical m (r) methods (r) Left-hand-side (r) hand, statistical (r) attempt (r)� (r) Boundary [Watson Gibson, 2004]), syntactic (Dependency-Grammar-based which accounts for theand prosodic variations from thedependency observation of statistical regularities on r 1 L large speech corpora. (1) proso r R (R) (r) (1) (1) Longtemps , 4. je input me text suis couché dephonemes. bonne heure . sentence CHAPTER PROSODY MODELLING & SYNTHESIS: A phonetizer to convert the into of A syllabifier Longtemps , SPEECH je me suis a sequence couché de bonne heure . sentence is used (style) symbolic (r) Expert approaches (R) (r) symbolic m pauses are structure identified model to The prosodic with relevant prosodic events.is to infer a sequence of prosodic labels that are associated with relevant prosodic events. symbolic (style) acoustic proso 4. SPEECH MODELLING & SYNTHESIS: A phonetizer is usedCHAPTER to convert the input textPROSODY into a sequence of phonemes. A syllabifier A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier CHAPTER 4. SPEECH SPEECH PROSODY MODELLING SYNTHESIS: CHAPTER 4. &&SYNTHESIS: 52 STATE-OF-THE-ART is used to to merge thethe sequence phonemes into sequence syllables. At prosodic is used merge sequence phonemes intoa aPROSODY sequenceofofMODELLING syllables. At the the prosodic level, level, 52 STATE-OF-THE-ART STATE-OF-THE-ART pauses areare identified toto pauses identified CHAPTER 4. SPEECH PROSODY MODELLING SYNTHESIS: A phonetizer is used to convert the input text into a sequence of phonemes.& A syllabifier A 52 phonetizer is used to to convert convert the the input input text text into into aa sequence sequence of ofphonemes. phonemes. syllabifier AAsyllabifier STATE-OF-THE-ART is used to to merge the sequence phonemes At the prosodic level, is used sequence phonemesinto intoaaasequence sequenceof ofsyllables. syllables. At Atthe theprosodic prosodiclevel, level, merge the sequence phonemes into sequence of syllables. pauses identified toto pauses to to convert the input text into a sequence of phonemes. A syllabifier A are phonetizer is used are identified proso proso qproso qproso qp proso proso proso (style) symbolic proso (style) (style) symbolic m symbolic (style) symbolic m S 83 7 AL 237 357 64 2 28 116 460 47 T 43 38 73 470 NATURAL SPEECH SYNTHESIZED SPEECH 0.7 M AS 166 AL 0.8 390 Cohen’s Kappa 0.5 0.4 0.3 JO U R N PO LI TI C 0.6 AS S S M T O 53 390 166 83 209 237 245 357 54 64 3 18 28 87 116 348 460 28 53 43 32 38 41 73 365 470 7 2 47 SP SP OR O T R T J JO OU U RN R A N L AL MMA AS S SS P PO OL LI ITI TI C C AL AL PO PO L LI TI ITIC C AL AL JO JO U UR R N NA AL L SP SP O OR R T T JOURNAL 0.68 0.70 0.51 0.54 0.28 0.12 0.38 0.34 POLITICAL SPORT SP 131 M AS 154 MASS Figure 4: Mean identification scores and 95% confidence interval obtained for natural and synthesized speech. (a) natural speech 165 0.1 0 R AL U JO PO M LI TI R N C AS S AL SP O R 0.2 (b) synthetic speech Figure 3: Identification confusion matrices. Rows represent synthesized speaking style. Columns represent identified speaking style. exists between the political and the mass speech that is inherent to a similarity in the speaking style and the formal situation in which the speech occurs. Additionally, the conventional HMMbased speech synthesis system failed into modelling adequately the breathiness and the creakiness that is specific to the political speaking style, especially within unvoiced segments. ANOVA analysis was conducted to assess whether the identification performance depends on the language of the participants. Analysis reveals a significant effect of the language (F(2, 59) = 15, p < 0.001) (F(48,2)=5.9, p-value=0.005), and confirms results obtained for natural speech. This confirms evidence that there exists variations of a speaking style depending on the language and/or cultural background. Finally, an informal evaluation of the quality of the synthesized speech suggests that the speaking style modelling is robust to the large variety of audio quality. 6. Conclusion In this study, the ability and the robustness of a HMM-based speech synthesis system to model the speech characteristics of various speaking styles were assessed. A discrete/continuous HMM was presented to model the symbolic and acoustic speech characteristics of a speaking style, and used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consisted of an identification experiment of 4 speaking styles based on delexicalized speech, and compared with a similar experiment on natural speech. The evaluation showed that the discrete/continuous HMM consis- tently models the speech characteristics of a speaking style, and is robust to the differences in audio quality. This proves evidence that the discrete/continuous HMM speech synthesis system successfully models the speech characteristics of a speaking style in the conditions of real-world applications. 7. References [1] A.-C. Simon, A. Auchlin, M. Avanzi, and J.-P. Goldman, Les voix des Français. Peter Lang, 2009, ch. Les phonostyles: une description prosodique des styles de parole en français. [2] H. Schmid and M. Atterer, “New statistical methods for phrase break prediction,” in International Conference On Computational Linguistics, Geneva, Switzerland, 2004, pp. 659–665. [3] P. Bell, T. Burrows, and P. Taylor, “Adaptation of prosodic phrasing models,” in Speech Prosody, Dresden, Germany, 2006. [4] J. Yamagishi, T. Masuko, and T. Kobayashi, “HMM-based expressive speech synthesis - Towards TTS with arbitrary speaking styles and emotions,” in Special Workshop in Maui, Maui, Hawaı̈, 2004. [5] S. Krstulović, A. Hunecke, and M. Schröder, “An HMM-based speech synthesis system applied to german and its adaptation to a limited set of expressive football announcements,” in Interspeech, 2007. [6] E. Villemonte de La Clergerie, “From metagrammars to factorized TAG/TIG parsers,” in International Workshop On Parsing Technology, Vancouver, Canada, Oct. 2005, pp. 190–191. [7] N. Obin, P. Lanchantin, A. Lacheret, and X. Rodet, “Towards improved HMM-based speech synthesis using high-level syntactical features,” in Speech Prosody, Chicago, U.S.A., 2010. [8] A. Lacheret, N. Obin, and M. Avanzi, “Design and Evaluation of Shared Prosodic Annotation for Spontaneous French Speech: From Expert Knowledge to Non-Expert Annotation,” in Linguistic Annotation Workshop, Uppsala, Sweden, 2010. [9] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, “ToBI: a standard for labeling english prosody,” in International Conference of Spoken Language Processing, Banff, Canada, 1992, pp. 867–870. [10] N. Obin, V. Dellwo, A. Lacheret, and X. Rodet, “Expectations for Discourse Genre Identification: a Prosodic Study,” in Interspeech, Makuhari, Japan, 2010, pp. 3070–3073. [11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” in European Conference on Speech Communication and Technology, Budapest, Hungary, 1999, pp. 2347–2350. [12] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Hidden Markov models based on multi-space probability distribution for pitch pattern modeling,” in International Conference on Audio, Speech, and Signal Processing, Phoenix, Arizona, 1999, pp. 229–232. [13] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Hidden semi-Markov model based speech synthesis,” in International Conference on Speech and Language Processing, Jeju Island, Korea, 2004, pp. 1397– 1400. [14] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Transactions on Information and Systems, vol. 90, no. 5, pp. 816–824, 2007. [15] N. Obin, A. Lacheret, and X. Rodet, “HMM-based prosodic structure model using rich linguistic context,” in Interspeech, Makuhari, Japan, 2010, pp. 1133–1136. [16] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.