Text-to-Speech Synthesis - TCTS Lab

Transcription

Text-to-Speech Synthesis - TCTS Lab
TTS: What for ?
PART V
Text-to-Speech
Synthesis (TTS)
• Telephone-based applications
– Telecommunications ($)
• Who’s calling
• Integrated messaging (fax, email, answering
machine)
• Automatic reverse directory
• Personal telephone attendant
– Voice access to databases (70% of calls require
very little interactivity)
• Price lists
• Cultural events
• Weather report
TTS: What for ?
• Multimedia
– CDRoms
– Talking books
– Interactive games
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
• Man-machine
communication
TTS: What for ?
TTS = NLP + DSP
TEXT-TO-SPEECH SYNTHESIZER
• Help to the
disabled
– Speech impairment
• Artificial voice
– Sight impairment
• Automatic reading of
electronic documents
• Automatic reading of
paper documents
(with OCR)
TTS: What for ?
• Fundamental
research
T
E
X
T
NATURAL LANGUAGE
PROCESSING
Phonetization
Intonation/
Duration
Generation
(a)
(b)
To be or not to
be, that is the
question.
Narrow
Phonetic
Transcription
DIGITAL SIGNAL
PROCESSING
Phones
Int/Dur
Speech
Synthesis
(c)
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…
Automatic phonetization
• Dictionnary look-up ?
To be or not to
be, that is the
question.
_ t U b i: Q r n Q t
t U b i: _ D { t s D
@ k w e s tS @ n _
Be
Not
b i:
nQt
Or
Question
That
The
Or
k w e s tS @ n
D{t
D@
To
‘s
tU
s
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
S
P
E
E
C
H
Automatic phonetization
Intonation
• More complex than that!
Problem
Assimilation
Example
Level
Information
nasality or sonority word/sentence reading style,
pronunciation
assimimation, vocalic
of neighbors
harmonization
the, record, contrast
, read, est, couvent,
portions, etc.
word
homographs
Schwa
deletion
table rouge, je ne te
le redirai pas
sentence
Phonetic
liaisons
très utile, deux à
deux, plat exquis
sentence
syntactic
articulation,
New words
proopiomelancortin
word
Proper names
your name here ...
word
spelling
analogy
morphology,
analogy
Heterophonic
part-ofspeech,
meaning
(rare)
syntactic
articulation,
pronunciation
of neighbors,
speaking style
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
I saw him yesterday.
a.
b.
The term 'prosody' refers to certain properties of the speech signal.
d.
(a,b) Focus (c) Finality/continuity
(d) Grouping, using phrase-level accent
Intonation
time (s)
0
0.2
freq. (Hz)
200
0.4
0.6
0.8
1
1.2
Phoneme Duration
1.4
1.6
1.8
100
l
e
t
c.
E k n i g d ´ t { E t m a) n y m e { i g d ´ l a p a { ç l
• Why ups and downs?
– Stress (word level) Accent (phrase level)
(Phonetization↑)
2
Q
l
I
s
I
z
´
d
v
e
n
tS
• Not constant
• Not fixed for a given phoneme
• Linked to intonation
(longer on accented syllables)
• Modify slightly unnatural !
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
´
z
Intonation/Duration
Challenges : a summary
Intelligible – Natural – Cost effective
'Twas brillig, and the slithy toves Did gyre and gimble in the wabe
All mimst were the borogroves, And the mome raths outgrabe.
Lewis Carroll, Jabberwocky
• Accurate automatic phonetization (≠dictionnary look-up)
• Prosody generation(i.e., intonation and phoneme durations)
must be “coherent”; easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
– Coarticulation!
– Segmental quality should be maintained after pitch and
duration modification
• Engineering
– Low design and maintenance cost
– Low computational and
memory cost
– Easy adaptation to
other languages
Coarticulation !!!
Contents
F1 (Hz)
• Introduction
800
a
• Acoustic speech synthesis (DSP)
600
400
ç
ø
o
200
u
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
E
O
y
e
• From text to phonemes and prosody (NLP)
i
F2 (Hz)
0
0
1000
2000
3000
• Synthesis : be able to mimic coarticulation!
• (Recognition : be able to overcome it!)
–
–
–
–
Preprocessing
Morpho-syntactic analysis
Phonetization
Prosody Generation
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
Von Kempelen’s talking
machine (1791)
1. John Holmes’ formant
synthesizer (1964)
Main bellows
Nostrils
Mouth
Small bellows
'S' pipe
'S' lever
'Sh' lever
Haskins Labs (1968)
InfoVox (1983-95)
DecTalk (1983)
Rule-based Synthesis
(J.S. Liénard, LIMSI)
'Sh' pipe
Omer Dudley’s Voder
(Bell Labs, 1936)
Noise
Source
UV
Resonnance Control
Oscillator
1. John Holmes’ formant
synthesizer (1964)
Amplifier
F0
V
Glottal
pulses
1
2 3 4
6
5
7
8
10
AV
9
B-Q
resonators
F1
F2
F3
F Z ,n
F P ,n
"Quiet"
Energy switch
wrist bar
t-d
p-b
k-g
Voder
Console
Keyboard
Pitch-control
pedal
Noise
B-Q
resonators
A NV
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
+
F1
+
F2
+
FZ
+
x(n)
2. Diphone concatenation
(1977)
Contents
F0
• Introduction
• Acoustic speech synthesis (DSP)
_
d
50ms
80ms
– Model-based (rule-based) approach
– Instance-based (concatenative)
approach
• From text to phonemes and prosody (NLP)
–
–
–
–
_d
o
g
160ms
_
70ms 50ms
do
og
g_
Prosody
Diphone
Database
Modification
Preprocessing
Morpho-syntactic analysis
Phonetization
Prosody Generation
Smooth joints
1
x 10
4
0.5
0
-0.5
-1
2. Diphone concatenation
(1977)
0
1000
2000
3000
4000
5000
6000
7000
8000
Joe Olive’s LPC synthesizer
(1977)
P
V/UV
σ
coefficients
σ
1
A (z)
P
V
UV
Olive(1980)
FPMs (1989)
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
p
Christian Hamon’s PSOLA
(1988)
The MBROLA project
L
Cnet (1990)
Limsi (Paris , 1992)
T. Dutoit’s MBROLA (1993)
• Based on the same Poisson’s sum formula as
PSOLA, but using edited diphones
• Similar overal quality as PSOLA
• Same computatinoal load
• Completely automatic!
⇒ can be used to create lots of compatible
synthesizers
3. Automatic unit selection
(1997)
DiphoneDiphone-based
synthesis
F0
_
d
50ms
80ms
_d
Diphone
Database
o
g
160ms
do
_
70ms 50ms
og
g_
Prosody
Modification
Smooth joints
J’ai été conçu...
1
x 10
4
0.5
Ma voix...
0
-0.5
-1
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
0
1000
2000
3000
4000
5000
6000
7000
8000
3. Automatic unit selection
(1997)
Unit selectionselection-based
synthesis
_
d
80ms
o
_d
VERY
LARGE
CORPUS
T
A
R
G
E
T
F0
50ms
g
160ms
do
_
70ms 50ms
og
3. Automatic unit selection
g_
sent :
phonet :
stress :
tone :
dur :
f0 :
“To
_
t
U
l
210 40 55
be
b i:
^
H
80 198
…”
…
…
…
…
…
Target j
Target cost
tc(j,i)
Prosody
Modification
(ATR, 1996)
(Univ. Edinburgh, 1997)
Very
Large
corpus
Smooth joints
1
x 10
4
0.5
(AT&T, 1998)
0
-0.5
(L&H, 1999)
-1
0
1000
(Loquendo, 2001)
(Babel Technologies, 2002)
2000
3000
4000
5000
6000
7000
sent :
phonet :
stress :
tone :
dur :
f0 :
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
Target j
3. Automatic unit selection
Unit i
Very
Large
corpus
target cost
tc(j,i)
Concatenation cost
cc(i-1,i)
Unit i
Formants:
How to get the best sequence of units for a given
utterance? Viterbi search
Unit i
…
…
…
…
8000
3. Automatic unit selection
(1997)
Unit i-1
to
U
Unit i+1
Concatenation cost
cc(i,i+1)
• Target cost ?
How to predict which units will sound as they would
naturally connected? (should be perceptual)
• Concatenation cost ?
How to predict which sequences of units will sound
naturally connected? (should be perceptual)
sent :
phonet :
stress :
tone :
dur :
f0 :
Unit I+1
Concatenation
cost cc(i,i+1)
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
to
U
Formants:
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
…
…
…
…
sent :
phonet :
stress :
tone :
dur :
f0 :
Formants:
“…
teletubbies
…“
@ b i: s …
^
…
L
l
…
80 95 90 130 …
t
3. Automatic unit selection
From text to phonetics
Text
The NLP module
Unit i-1
Very
Large
corpus
Unit i
Concatenation
cost cc(i-1,i)
Text Analyzer
1
=0 in case of
successive units
Pre-Processor
Morphological
Analyzer
2
Contextual
Analyzer
sent :
phonet :
stress :
tone :
dur :
f0 :
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
to
U
…
…
…
…
sent :
phonet :
stress :
tone :
dur :
f0 :
“… to
t
bear.“
b E@
^
l
L
150 50 85 90 150
U
…
…
…
…
3
SyntacticProsodic
Parser
4
Letter-ToSound
module
5
Prosody
generator
6
Formants:
Formants:
…
Contents
M
L
D
S
or
F
S
s
to the DSP block
1. Pre-processing
• Introduction
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• From text to phonemes and prosody
(NLP)
–
–
–
–
Preprocessing
Morpho-syntactic analysis
Phonetization
Prosody Generation
'tgl.' = 'täglich', 'tägliche', 'täglichem', 'täglichen', 'täglicher', 'tägliches'
'Dr. Jones lives at the corner of Jones Dr. and St. James St.
Recognizing acronyms
'L.T.S.', 'UFO', FPMs, ...
Processing numbers
'3.14', '2.16 pm', '13:26', '08.11.94', 'the 16th'
Simple Regular grammars do most of the job
(Lex – FSTs)
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
2. Morphological analysis
3. Contextual analysis :
n-grams
Sentence W=(w1, w2, ..., wN)
St{ ↔
All possible sequences T=(t1, t2, ..., tN)
^
Best sequence of tags T : T = arg max P(T |W )
st{
T
P (W |T )P (T )
Tˆ = arg max
P (W )
T
( Bayes) = arg max P (W |T )P(T )
T
P(W |T ) = P (w1 ,w2 ,...,wN | t1 ,t2 , ...,t N )
= P (w1 | t1 ) P (w2 | w1 , t1 ,t2 ) P(w3 | w1 ,w2 ,t1 ,t2 ,t3 )...
N
Increasingly : use of graphotactic trained systems
(ex : TNT http://www.coli.uni-sb.de/~thorsten/tnt/)
Or even : brute force : inflected dictionnary
=
∏ P ( w |t )
i
i
(Strong hypothesis 1)
i=1
P(T )=P(t1 ,t2 , ...,t N ) =P(t1 ) P(t2 |t1 ) P(t3 |t2 ,t1 )...
N
= ∏ P (ti | ti −1 ,ti − 2 ,...,ti − n )
(Strong hypothesis 2)
i=1
3. Contextual analysis
Dogs
like
to
bark
3. Contextual analysis :
n-grams
Example : « Dogs like to bark », using bi-grams (n=2)
… (all possible paths) …
N_sg
P(N_pl,verb,Inf_marker,verb|dogs,like,to,bark)
= P(dogs|N_pl)
Verb
Infinitive
marker
N_pl
Prep
V_3sg
Verb
P(N_pl|_)
P(verb|N_pl)
P(Inf_marker|verb)
Adj
∏ P ( w |t )
i
P(to|Inf_marker)
i
i=1
P(bark|verb)
Prep
Conj
N
P(like|verb)
N_sg
P(verb|Inf_marker)
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
N
∏ P (t
i
i=1
| ti −1 ,ti − 2 ,...,ti − n )
4. Syntactic-Prosodic Phrasing
chink
Classification and Regression
trees (CARTs)
1
chunk
2
3
4
5
6
7
8
9
10
11 12 13 14
• Chinks’n chunks
Predict Color(n) SHapes, Sizes (n,n-1,n+1,n-2,n+2,…) ?
a prosodic phrase =
chink
chunk
a sequence of chinks (≈function words)
followed by a sequence of chunks (≈content words)
SOL : if SHape(n-1)=SHape(n) Color(n)=White
else Color(n)=Black
• Example :
This can be seen as a classification problem :
I asked them
if they were going home
to Idaho
and they said yes
and anticipated one more stop
before getting home
4. Syntactic-Prosodic Phrasing
No
sentence
final
• CART tree
No
1108/1118
No
327/361
Yes
82/120
No
tr
27/46
…
B
S
C
S
S
B
C
…
W
B
C
S
C
S
C
…
W
S
C
B
C
B
T
…
…
…
…
…
…
…
…
…
Classification and Regression
trees (CARTs)
« impure »
set
Yes
16/24
Q
No
Question which
splits into best
« purified » sets
No
fr
156/223
No
j3
156/216
Yes
9/11
« pure »
sets
No
j3
154/205
Yes
8/11
No
fl
19/38
No
11/14
…
C
Yes
Yes
7/7
Yes
st
101/166
No
1108/1118
SH(n+1)
S
…
No
tr
164/252
Yes
21/29
Yes
type
112/201
No
24/35
S(n+1)
-
No
1126/1225
No
fl
491/613
No
151/198
Yes
15/15
SH(n-1)
-
tr= utterance rate (in
words/second)
No
j2
1617/1838
Yes
tr
127/216
S(n-1)
S
j2 = tag of word on the left
No
tr
249/423
No
9/9
SH(n)
S
j3= tag of wor on the right
No
j3
1866/2261
Yes
et
127/225
S(n)
B
st = time to end of sentence
Yes
297/298
No
st
2974/3379
C(n)
No
j4
151/194
Yes
5/6
No
150/188
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
Classification and Regression
trees (CARTs)
5. Automatic Phonetization
• Decision trees for predicting phonemes+stress
« Purity » of a set?
= « Entropy » (bits)
= TRIE
= -P(Black) log2[P(Black)] -P(White) log2[P(White)]
Entropy = -(1/2 x -1) -(1/2 x -1)
=1 bit
Total entropy
after split =
0+0=0
Entropy =
-(1 x 0) -(0 x …)
=0 bit
Entropy =
-(1 x 0) -(0 x …)
=0 bit
d
b
c
u
v
w
first character on the right
h
i
first character on the left
e
i
o
second character on the right
_
e
s
second character on the left
(+ part-of-speech!)
[eI]
5. Automatic Phonetization
• Rule-based, usually formalized as in
generative phonology :
a [b] / l _ r
6. Prosody generation : patterns
- Si ces oeufs
minor
continuation
étaient frais
major
continuation
j'en prendrais
finality
Qui les vend ?
wh-question
• Example : s -> [s] or [z]
C'est bien toi,
question
s [z] / [éanti|hanti] _ [<V>]
ma jolie ?
s [s]/[anti|contre|impr‚|prime|tourne|ultra|psycho|télé]
_ []
s
s
s
s
[s]
[s]
[z]
[z]
/ [vrai]_[em]
/ [_a|para|sinu]_[e|o|y]
/ [tran] _ [a|h|i]
/[<V>] _ [<V>]
e
focus character
a
echo
- Evidemment,
implication
Monsieur.
parenthesis
- Allons donc !
exclamation
Prouve-le-moi.
order
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
6.Prosody generation : tones
6.Prosody generation : tones
• From tones to F0 (Hz) by rule :
H+
L
L
H
L
L
H
L
L
H
L
L
c e ilin g
L-
H
ra n g e
lo w
L
flo o r
L-
s lo p e
H+
H
L
L-
/L L
s ´ p E { s o n
a
HH
Z g { o s j
e
HLn u d e {
A)
lZ f { A) S m A)
NB : more theories than researchers …
6.Prosody generation : tones
• From text to tones
• From tones to F0 (Hz), corpus-based
• Large speech dba, with known intonation
groups, F0 and tones
ce personnage grossier, te dérange-t-il
• For each target intonation group :
WS
.
.
.
o
.
o
.
.
o
.
SG
(.
.
.
-)(
.
-) ( .
.
.
-
IG 1
(.
.
. /LL)(
.
HH) ( .
.
.
H/H)
IG 2
(.
.
.
.
HH) ( .
.
.
H/H)
-
6.Prosody generation : tones
)
WS = word stress = lexical stress Phonetization
SG = stress group
IG = intonation group Synt.-Pros. Phrasing
(only one stressed syllable)
– find a list of similar intonation groups (in terms
of tones, number of syllables, position in
sentence, etc.) in the dba
• Select the sequence of intonation groups in the
dba which :
– best represents the target groups
– AND minimizes intonative discontinuities
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
6.Prosody generation : tones
If
T
A
R
G
E
T
H,L, H+, L- = stressed syllables
h,l = unstressed syllables)
(l l l H)
(l l H)
Towards corpus-based
techniques
1995-?: The database years
– For automatic phonetization (L&H,
ENST, Univ. Edinburgh, FPMs)
(l l L-)
V
I
T
E
R
B
I
F0 (Hz)
Very
Large
corpus
– For automatic generation of
intonation and phoneme duration
(AT&T, FPMs, Univ. Aix, Univ.
Edinburgh)
– For automatic selection of units
for concatenative synthesis (ATR,
Univ. Edinburgh, AT&T, FPMs?)
Bad
Better
Conclusion
• Introduction
• Acoustic speech synthesis (DSP)
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
• From text to phonemes and prosody (NLP)
–
–
–
–
Preprocessing
Morpho-syntactic analysis
Phonetization
Prosody Generation
• Towards...
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit