Text-to-Speech Synthesis - TCTS Lab
Transcription
Text-to-Speech Synthesis - TCTS Lab
TTS: What for ? PART V Text-to-Speech Synthesis (TTS) • Telephone-based applications – Telecommunications ($) • Who’s calling • Integrated messaging (fax, email, answering machine) • Automatic reverse directory • Personal telephone attendant – Voice access to databases (70% of calls require very little interactivity) • Price lists • Cultural events • Weather report TTS: What for ? • Multimedia – CDRoms – Talking books – Interactive games Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit • Man-machine communication TTS: What for ? TTS = NLP + DSP TEXT-TO-SPEECH SYNTHESIZER • Help to the disabled – Speech impairment • Artificial voice – Sight impairment • Automatic reading of electronic documents • Automatic reading of paper documents (with OCR) TTS: What for ? • Fundamental research T E X T NATURAL LANGUAGE PROCESSING Phonetization Intonation/ Duration Generation (a) (b) To be or not to be, that is the question. Narrow Phonetic Transcription DIGITAL SIGNAL PROCESSING Phones Int/Dur Speech Synthesis (c) _ 210 t 40 U 55 0 173 75 173 b 80 10 160 i: 198 5 173 75 235 … Automatic phonetization • Dictionnary look-up ? To be or not to be, that is the question. _ t U b i: Q r n Q t t U b i: _ D { t s D @ k w e s tS @ n _ Be Not b i: nQt Or Question That The Or k w e s tS @ n D{t D@ To ‘s tU s Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit S P E E C H Automatic phonetization Intonation • More complex than that! Problem Assimilation Example Level Information nasality or sonority word/sentence reading style, pronunciation assimimation, vocalic of neighbors harmonization the, record, contrast , read, est, couvent, portions, etc. word homographs Schwa deletion table rouge, je ne te le redirai pas sentence Phonetic liaisons très utile, deux à deux, plat exquis sentence syntactic articulation, New words proopiomelancortin word Proper names your name here ... word spelling analogy morphology, analogy Heterophonic part-ofspeech, meaning (rare) syntactic articulation, pronunciation of neighbors, speaking style I saw him yesterday. I saw him yesterday. I saw him yesterday. I saw him yesterday. I saw him yesterday. I saw him yesterday. I saw him yesterday. I saw him yesterday. a. b. The term 'prosody' refers to certain properties of the speech signal. d. (a,b) Focus (c) Finality/continuity (d) Grouping, using phrase-level accent Intonation time (s) 0 0.2 freq. (Hz) 200 0.4 0.6 0.8 1 1.2 Phoneme Duration 1.4 1.6 1.8 100 l e t c. E k n i g d ´ t { E t m a) n y m e { i g d ´ l a p a { ç l • Why ups and downs? – Stress (word level) Accent (phrase level) (Phonetization↑) 2 Q l I s I z ´ d v e n tS • Not constant • Not fixed for a given phoneme • Linked to intonation (longer on accented syllables) • Modify slightly unnatural ! Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit ´ z Intonation/Duration Challenges : a summary Intelligible – Natural – Cost effective 'Twas brillig, and the slithy toves Did gyre and gimble in the wabe All mimst were the borogroves, And the mome raths outgrabe. Lewis Carroll, Jabberwocky • Accurate automatic phonetization (≠dictionnary look-up) • Prosody generation(i.e., intonation and phoneme durations) must be “coherent”; easy to produce unnatural prosody • Synthesis of phoneme sequences with corresponding prosody – Coarticulation! – Segmental quality should be maintained after pitch and duration modification • Engineering – Low design and maintenance cost – Low computational and memory cost – Easy adaptation to other languages Coarticulation !!! Contents F1 (Hz) • Introduction 800 a • Acoustic speech synthesis (DSP) 600 400 ç ø o 200 u – Model-based (rule-based) approach – Instance-based (concatenative) approach E O y e • From text to phonemes and prosody (NLP) i F2 (Hz) 0 0 1000 2000 3000 • Synthesis : be able to mimic coarticulation! • (Recognition : be able to overcome it!) – – – – Preprocessing Morpho-syntactic analysis Phonetization Prosody Generation Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit Von Kempelen’s talking machine (1791) 1. John Holmes’ formant synthesizer (1964) Main bellows Nostrils Mouth Small bellows 'S' pipe 'S' lever 'Sh' lever Haskins Labs (1968) InfoVox (1983-95) DecTalk (1983) Rule-based Synthesis (J.S. Liénard, LIMSI) 'Sh' pipe Omer Dudley’s Voder (Bell Labs, 1936) Noise Source UV Resonnance Control Oscillator 1. John Holmes’ formant synthesizer (1964) Amplifier F0 V Glottal pulses 1 2 3 4 6 5 7 8 10 AV 9 B-Q resonators F1 F2 F3 F Z ,n F P ,n "Quiet" Energy switch wrist bar t-d p-b k-g Voder Console Keyboard Pitch-control pedal Noise B-Q resonators A NV Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit + F1 + F2 + FZ + x(n) 2. Diphone concatenation (1977) Contents F0 • Introduction • Acoustic speech synthesis (DSP) _ d 50ms 80ms – Model-based (rule-based) approach – Instance-based (concatenative) approach • From text to phonemes and prosody (NLP) – – – – _d o g 160ms _ 70ms 50ms do og g_ Prosody Diphone Database Modification Preprocessing Morpho-syntactic analysis Phonetization Prosody Generation Smooth joints 1 x 10 4 0.5 0 -0.5 -1 2. Diphone concatenation (1977) 0 1000 2000 3000 4000 5000 6000 7000 8000 Joe Olive’s LPC synthesizer (1977) P V/UV σ coefficients σ 1 A (z) P V UV Olive(1980) FPMs (1989) Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit p Christian Hamon’s PSOLA (1988) The MBROLA project L Cnet (1990) Limsi (Paris , 1992) T. Dutoit’s MBROLA (1993) • Based on the same Poisson’s sum formula as PSOLA, but using edited diphones • Similar overal quality as PSOLA • Same computatinoal load • Completely automatic! ⇒ can be used to create lots of compatible synthesizers 3. Automatic unit selection (1997) DiphoneDiphone-based synthesis F0 _ d 50ms 80ms _d Diphone Database o g 160ms do _ 70ms 50ms og g_ Prosody Modification Smooth joints J’ai été conçu... 1 x 10 4 0.5 Ma voix... 0 -0.5 -1 Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit 0 1000 2000 3000 4000 5000 6000 7000 8000 3. Automatic unit selection (1997) Unit selectionselection-based synthesis _ d 80ms o _d VERY LARGE CORPUS T A R G E T F0 50ms g 160ms do _ 70ms 50ms og 3. Automatic unit selection g_ sent : phonet : stress : tone : dur : f0 : “To _ t U l 210 40 55 be b i: ^ H 80 198 …” … … … … … Target j Target cost tc(j,i) Prosody Modification (ATR, 1996) (Univ. Edinburgh, 1997) Very Large corpus Smooth joints 1 x 10 4 0.5 (AT&T, 1998) 0 -0.5 (L&H, 1999) -1 0 1000 (Loquendo, 2001) (Babel Technologies, 2002) 2000 3000 4000 5000 6000 7000 sent : phonet : stress : tone : dur : f0 : “… bear.“ b E@ ^ l L 150 50 85 90 150 t Target j 3. Automatic unit selection Unit i Very Large corpus target cost tc(j,i) Concatenation cost cc(i-1,i) Unit i Formants: How to get the best sequence of units for a given utterance? Viterbi search Unit i … … … … 8000 3. Automatic unit selection (1997) Unit i-1 to U Unit i+1 Concatenation cost cc(i,i+1) • Target cost ? How to predict which units will sound as they would naturally connected? (should be perceptual) • Concatenation cost ? How to predict which sequences of units will sound naturally connected? (should be perceptual) sent : phonet : stress : tone : dur : f0 : Unit I+1 Concatenation cost cc(i,i+1) “… bear.“ b E@ ^ l L 150 50 85 90 150 t to U Formants: Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit … … … … sent : phonet : stress : tone : dur : f0 : Formants: “… teletubbies …“ @ b i: s … ^ … L l … 80 95 90 130 … t 3. Automatic unit selection From text to phonetics Text The NLP module Unit i-1 Very Large corpus Unit i Concatenation cost cc(i-1,i) Text Analyzer 1 =0 in case of successive units Pre-Processor Morphological Analyzer 2 Contextual Analyzer sent : phonet : stress : tone : dur : f0 : “… bear.“ b E@ ^ l L 150 50 85 90 150 t to U … … … … sent : phonet : stress : tone : dur : f0 : “… to t bear.“ b E@ ^ l L 150 50 85 90 150 U … … … … 3 SyntacticProsodic Parser 4 Letter-ToSound module 5 Prosody generator 6 Formants: Formants: … Contents M L D S or F S s to the DSP block 1. Pre-processing • Introduction • Acoustic speech synthesis (DSP) – Model-based (rule-based) approach – Instance-based (concatenative) approach • From text to phonemes and prosody (NLP) – – – – Preprocessing Morpho-syntactic analysis Phonetization Prosody Generation 'tgl.' = 'täglich', 'tägliche', 'täglichem', 'täglichen', 'täglicher', 'tägliches' 'Dr. Jones lives at the corner of Jones Dr. and St. James St. Recognizing acronyms 'L.T.S.', 'UFO', FPMs, ... Processing numbers '3.14', '2.16 pm', '13:26', '08.11.94', 'the 16th' Simple Regular grammars do most of the job (Lex – FSTs) Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit 2. Morphological analysis 3. Contextual analysis : n-grams Sentence W=(w1, w2, ..., wN) St{ ↔ All possible sequences T=(t1, t2, ..., tN) ^ Best sequence of tags T : T = arg max P(T |W ) st{ T P (W |T )P (T ) Tˆ = arg max P (W ) T ( Bayes) = arg max P (W |T )P(T ) T P(W |T ) = P (w1 ,w2 ,...,wN | t1 ,t2 , ...,t N ) = P (w1 | t1 ) P (w2 | w1 , t1 ,t2 ) P(w3 | w1 ,w2 ,t1 ,t2 ,t3 )... N Increasingly : use of graphotactic trained systems (ex : TNT http://www.coli.uni-sb.de/~thorsten/tnt/) Or even : brute force : inflected dictionnary = ∏ P ( w |t ) i i (Strong hypothesis 1) i=1 P(T )=P(t1 ,t2 , ...,t N ) =P(t1 ) P(t2 |t1 ) P(t3 |t2 ,t1 )... N = ∏ P (ti | ti −1 ,ti − 2 ,...,ti − n ) (Strong hypothesis 2) i=1 3. Contextual analysis Dogs like to bark 3. Contextual analysis : n-grams Example : « Dogs like to bark », using bi-grams (n=2) … (all possible paths) … N_sg P(N_pl,verb,Inf_marker,verb|dogs,like,to,bark) = P(dogs|N_pl) Verb Infinitive marker N_pl Prep V_3sg Verb P(N_pl|_) P(verb|N_pl) P(Inf_marker|verb) Adj ∏ P ( w |t ) i P(to|Inf_marker) i i=1 P(bark|verb) Prep Conj N P(like|verb) N_sg P(verb|Inf_marker) Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit N ∏ P (t i i=1 | ti −1 ,ti − 2 ,...,ti − n ) 4. Syntactic-Prosodic Phrasing chink Classification and Regression trees (CARTs) 1 chunk 2 3 4 5 6 7 8 9 10 11 12 13 14 • Chinks’n chunks Predict Color(n) SHapes, Sizes (n,n-1,n+1,n-2,n+2,…) ? a prosodic phrase = chink chunk a sequence of chinks (≈function words) followed by a sequence of chunks (≈content words) SOL : if SHape(n-1)=SHape(n) Color(n)=White else Color(n)=Black • Example : This can be seen as a classification problem : I asked them if they were going home to Idaho and they said yes and anticipated one more stop before getting home 4. Syntactic-Prosodic Phrasing No sentence final • CART tree No 1108/1118 No 327/361 Yes 82/120 No tr 27/46 … B S C S S B C … W B C S C S C … W S C B C B T … … … … … … … … … Classification and Regression trees (CARTs) « impure » set Yes 16/24 Q No Question which splits into best « purified » sets No fr 156/223 No j3 156/216 Yes 9/11 « pure » sets No j3 154/205 Yes 8/11 No fl 19/38 No 11/14 … C Yes Yes 7/7 Yes st 101/166 No 1108/1118 SH(n+1) S … No tr 164/252 Yes 21/29 Yes type 112/201 No 24/35 S(n+1) - No 1126/1225 No fl 491/613 No 151/198 Yes 15/15 SH(n-1) - tr= utterance rate (in words/second) No j2 1617/1838 Yes tr 127/216 S(n-1) S j2 = tag of word on the left No tr 249/423 No 9/9 SH(n) S j3= tag of wor on the right No j3 1866/2261 Yes et 127/225 S(n) B st = time to end of sentence Yes 297/298 No st 2974/3379 C(n) No j4 151/194 Yes 5/6 No 150/188 Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit Classification and Regression trees (CARTs) 5. Automatic Phonetization • Decision trees for predicting phonemes+stress « Purity » of a set? = « Entropy » (bits) = TRIE = -P(Black) log2[P(Black)] -P(White) log2[P(White)] Entropy = -(1/2 x -1) -(1/2 x -1) =1 bit Total entropy after split = 0+0=0 Entropy = -(1 x 0) -(0 x …) =0 bit Entropy = -(1 x 0) -(0 x …) =0 bit d b c u v w first character on the right h i first character on the left e i o second character on the right _ e s second character on the left (+ part-of-speech!) [eI] 5. Automatic Phonetization • Rule-based, usually formalized as in generative phonology : a [b] / l _ r 6. Prosody generation : patterns - Si ces oeufs minor continuation étaient frais major continuation j'en prendrais finality Qui les vend ? wh-question • Example : s -> [s] or [z] C'est bien toi, question s [z] / [éanti|hanti] _ [<V>] ma jolie ? s [s]/[anti|contre|impr‚|prime|tourne|ultra|psycho|télé] _ [] s s s s [s] [s] [z] [z] / [vrai]_[em] / [_a|para|sinu]_[e|o|y] / [tran] _ [a|h|i] /[<V>] _ [<V>] e focus character a echo - Evidemment, implication Monsieur. parenthesis - Allons donc ! exclamation Prouve-le-moi. order Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit 6.Prosody generation : tones 6.Prosody generation : tones • From tones to F0 (Hz) by rule : H+ L L H L L H L L H L L c e ilin g L- H ra n g e lo w L flo o r L- s lo p e H+ H L L- /L L s ´ p E { s o n a HH Z g { o s j e HLn u d e { A) lZ f { A) S m A) NB : more theories than researchers … 6.Prosody generation : tones • From text to tones • From tones to F0 (Hz), corpus-based • Large speech dba, with known intonation groups, F0 and tones ce personnage grossier, te dérange-t-il • For each target intonation group : WS . . . o . o . . o . SG (. . . -)( . -) ( . . . - IG 1 (. . . /LL)( . HH) ( . . . H/H) IG 2 (. . . . HH) ( . . . H/H) - 6.Prosody generation : tones ) WS = word stress = lexical stress Phonetization SG = stress group IG = intonation group Synt.-Pros. Phrasing (only one stressed syllable) – find a list of similar intonation groups (in terms of tones, number of syllables, position in sentence, etc.) in the dba • Select the sequence of intonation groups in the dba which : – best represents the target groups – AND minimizes intonative discontinuities Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit 6.Prosody generation : tones If T A R G E T H,L, H+, L- = stressed syllables h,l = unstressed syllables) (l l l H) (l l H) Towards corpus-based techniques 1995-?: The database years – For automatic phonetization (L&H, ENST, Univ. Edinburgh, FPMs) (l l L-) V I T E R B I F0 (Hz) Very Large corpus – For automatic generation of intonation and phoneme duration (AT&T, FPMs, Univ. Aix, Univ. Edinburgh) – For automatic selection of units for concatenative synthesis (ATR, Univ. Edinburgh, AT&T, FPMs?) Bad Better Conclusion • Introduction • Acoustic speech synthesis (DSP) – Model-based (rule-based) approach – Instance-based (concatenative) approach • From text to phonemes and prosody (NLP) – – – – Preprocessing Morpho-syntactic analysis Phonetization Prosody Generation • Towards... Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit