Text-to-Speech Synthesis - TCTS Lab

Transcription

TTS: What for ?
PART V
Text-to-Speech
Synthesis (TTS)
• Telephone-based applications
– Telecommunications ($)
• Who’s calling
• Integrated messaging (fax, email, answering
machine)
• Automatic reverse directory
• Personal telephone attendant
– Voice access to databases (70% of calls require
very little interactivity)
• Price lists
• Cultural events
• Weather report
TTS: What for ?
• Multimedia
– CDRoms
– Talking books
– Interactive games
Copyright (c)2002 Faculté Polytechnique de Mons - T. Dutoit
• Man-machine
communication
TTS: What for ?
TTS = NLP + DSP
TEXT-TO-SPEECH SYNTHESIZER
• Help to the
disabled
– Speech impairment
• Artificial voice
– Sight impairment
• Automatic reading of
electronic documents
• Automatic reading of
paper documents
(with OCR)
TTS: What for ?
• Fundamental
research
T
E
X
T
NATURAL LANGUAGE
PROCESSING
Phonetization
Intonation/
Duration
Generation
(a)
(b)
To be or not to
be, that is the
question.
Narrow
Phonetic
Transcription
DIGITAL SIGNAL
PROCESSING
Phones
Int/Dur
Speech
Synthesis
(c)
_ 210
t 40
U 55 0 173 75 173
b 80 10 160
i: 198 5 173 75 235
…
Automatic phonetization
• Dictionnary look-up ?
To be or not to
be, that is the
question.
_ t U b i: Q r n Q t
t U b i: _ D { t s D
@ k w e s tS @ n _
Be
Not
b i:
nQt
Or
Question
That
The
Or
k w e s tS @ n
D{t
D@
To
‘s
tU
s
S
P
E
E
C
H
Automatic phonetization
Intonation
• More complex than that!
Problem
Assimilation
Example
Level
Information
nasality or sonority word/sentence reading style,
pronunciation
assimimation, vocalic
of neighbors
harmonization
the, record, contrast
, read, est, couvent,
portions, etc.
word
homographs
Schwa
deletion
table rouge, je ne te
le redirai pas
sentence
Phonetic
liaisons
très utile, deux à
deux, plat exquis
sentence
syntactic
articulation,
New words
proopiomelancortin
word
Proper names
your name here ...
word
spelling
analogy
morphology,
analogy
Heterophonic
part-ofspeech,
meaning
(rare)
syntactic
articulation,
pronunciation
of neighbors,
speaking style
I saw him yesterday.
a.
b.
The term 'prosody' refers to certain properties of the speech signal.
d.
(a,b) Focus (c) Finality/continuity
(d) Grouping, using phrase-level accent
Intonation
time (s)
0
0.2
freq. (Hz)
200
0.4
0.6
0.8
1
1.2
Phoneme Duration
1.4
1.6
1.8
100
l
e
t
c.
E k n i g d ´ t { E t m a) n y m e { i g d ´ l a p a { ç l
• Why ups and downs?
– Stress (word level) Accent (phrase level)
(Phonetization↑)
2
Q
l
I
s
I
z
´
d
v
e
n
tS
• Not constant
• Not fixed for a given phoneme
• Linked to intonation
(longer on accented syllables)
• Modify slightly unnatural !
´
z
Intonation/Duration
Challenges : a summary
Intelligible – Natural – Cost effective
'Twas brillig, and the slithy toves Did gyre and gimble in the wabe
All mimst were the borogroves, And the mome raths outgrabe.
Lewis Carroll, Jabberwocky
• Accurate automatic phonetization (≠dictionnary look-up)
• Prosody generation(i.e., intonation and phoneme durations)
must be “coherent”; easy to produce unnatural prosody
• Synthesis of phoneme sequences with corresponding prosody
– Coarticulation!
– Segmental quality should be maintained after pitch and
duration modification
• Engineering
– Low design and maintenance cost
– Low computational and
memory cost
– Easy adaptation to
other languages
Coarticulation !!!
Contents
F1 (Hz)
• Introduction
800
a
• Acoustic speech synthesis (DSP)
600
400
ç
ø
o
200
u
– Model-based (rule-based) approach
– Instance-based (concatenative) approach
E
O
y
e
• From text to phonemes and prosody (NLP)
i
F2 (Hz)
0
0
1000
2000
3000
• Synthesis : be able to mimic coarticulation!
• (Recognition : be able to overcome it!)
–
–
–
–
Preprocessing
Morpho-syntactic analysis
Phonetization
Prosody Generation
Von Kempelen’s talking
machine (1791)
1. John Holmes’ formant
synthesizer (1964)
Main bellows
Nostrils
Mouth
Small bellows
'S' pipe
'S' lever
'Sh' lever
Haskins Labs (1968)
InfoVox (1983-95)
DecTalk (1983)
Rule-based Synthesis
(J.S. Liénard, LIMSI)
'Sh' pipe
Omer Dudley’s Voder
(Bell Labs, 1936)
Noise
Source
UV
Resonnance Control
Oscillator
1. John Holmes’ formant
synthesizer (1964)
Amplifier
F0
V
Glottal
pulses
1
2 3 4
6
5
7
8
10
AV
9
B-Q
resonators
F1
F2
F3
F Z ,n
F P ,n
"Quiet"
Energy switch
wrist bar
t-d
p-b
k-g
Voder
Console
Keyboard
Pitch-control
pedal
Noise
B-Q
resonators
A NV
+
F1
+
F2
+
FZ
+
x(n)
2. Diphone concatenation
(1977)
Contents
F0
• Introduction
_
d
50ms
80ms
– Instance-based (concatenative)
approach
–
–
–
–
_d
o
g
160ms
_
70ms 50ms
do
og
g_
Prosody
Diphone
Database
Modification
Preprocessing
Phonetization
Prosody Generation
Smooth joints
1
x 10
4
0.5
0
-0.5
-1
2. Diphone concatenation
(1977)
0
1000
2000
3000
4000
5000
6000
7000
8000
Joe Olive’s LPC synthesizer
(1977)
P
V/UV
σ
coefficients
σ
1
A (z)
P
V
UV
Olive(1980)
FPMs (1989)
p
Christian Hamon’s PSOLA
(1988)
The MBROLA project
L
Cnet (1990)
Limsi (Paris , 1992)
T. Dutoit’s MBROLA (1993)
• Based on the same Poisson’s sum formula as
PSOLA, but using edited diphones
• Similar overal quality as PSOLA
• Same computatinoal load
• Completely automatic!
⇒ can be used to create lots of compatible
synthesizers
3. Automatic unit selection
(1997)
DiphoneDiphone-based
synthesis
F0
_
d
50ms
80ms
_d
Diphone
Database
o
g
160ms
do
_
70ms 50ms
og
g_
Prosody
Modification
Smooth joints
J’ai été conçu...
1
x 10
4
0.5
Ma voix...
0
-0.5
-1
0
1000
2000
3000
4000
5000
6000
7000
8000
(1997)
Unit selectionselection-based
synthesis
_
d
80ms
o
_d
VERY
LARGE
CORPUS
T
A
R
G
E
T
F0
50ms
g
160ms
do
_
70ms 50ms
og
g_
sent :
phonet :
stress :
tone :
dur :
f0 :
“To
_
t
U
l
210 40 55
be
b i:
^
H
80 198
…”
…
…
…
…
…
Target j
Target cost
tc(j,i)
Prosody
Modification
(ATR, 1996)
(Univ. Edinburgh, 1997)
Very
Large
corpus
Smooth joints
1
x 10
4
0.5
(AT&T, 1998)
0
-0.5
(L&H, 1999)
-1
0
1000
(Loquendo, 2001)
(Babel Technologies, 2002)
2000
3000
4000
5000
6000
7000
sent :
phonet :
stress :
tone :
dur :
f0 :
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
Target j
Unit i
Very
Large
corpus
target cost
tc(j,i)
Concatenation cost
cc(i-1,i)
Unit i
Formants:
How to get the best sequence of units for a given
utterance? Viterbi search
Unit i
…
…
…
…
8000
(1997)
Unit i-1
to
U
Unit i+1
Concatenation cost
cc(i,i+1)
• Target cost ?
How to predict which units will sound as they would
naturally connected? (should be perceptual)
• Concatenation cost ?
How to predict which sequences of units will sound
naturally connected? (should be perceptual)
sent :
phonet :
stress :
tone :
dur :
f0 :
Unit I+1
Concatenation
cost cc(i,i+1)
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
to
U
Formants:
…
…
…
…
sent :
phonet :
stress :
tone :
dur :
f0 :
Formants:
“…
teletubbies
…“
@ b i: s …
^
…
L
l
…
80 95 90 130 …
t
From text to phonetics
Text
The NLP module
Unit i-1
Very
Large
corpus
Unit i
Concatenation
cost cc(i-1,i)
Text Analyzer
1
=0 in case of
successive units
Pre-Processor
Morphological
Analyzer
2
Contextual
Analyzer
sent :
phonet :
stress :
tone :
dur :
f0 :
“…
bear.“
b E@
^
l
L
150 50 85 90 150
t
to
U
…
…
…
…
sent :
phonet :
stress :
tone :
dur :
f0 :
“… to
t
bear.“
b E@
^
l
L
150 50 85 90 150
U
…
…
…
…
3
SyntacticProsodic
Parser
4
Letter-ToSound
module
5
Prosody
generator
6
Formants:
Formants:
…
Contents
M
L
D
S
or
F
S
s
to the DSP block
1. Pre-processing
• Introduction
• From text to phonemes and prosody
(NLP)
–
–
–
–
Preprocessing
Phonetization
Prosody Generation
'tgl.' = 'täglich', 'tägliche', 'täglichem', 'täglichen', 'täglicher', 'tägliches'
'Dr. Jones lives at the corner of Jones Dr. and St. James St.
Recognizing acronyms
'L.T.S.', 'UFO', FPMs, ...
Processing numbers
'3.14', '2.16 pm', '13:26', '08.11.94', 'the 16th'
Simple Regular grammars do most of the job
(Lex – FSTs)
2. Morphological analysis
3. Contextual analysis :
n-grams
Sentence W=(w1, w2, ..., wN)
St{ ↔
All possible sequences T=(t1, t2, ..., tN)
^
Best sequence of tags T : T = arg max P(T |W )
st{
T
P (W |T )P (T )
Tˆ = arg max
P (W )
T
( Bayes) = arg max P (W |T )P(T )
T
P(W |T ) = P (w1 ,w2 ,...,wN | t1 ,t2 , ...,t N )
= P (w1 | t1 ) P (w2 | w1 , t1 ,t2 ) P(w3 | w1 ,w2 ,t1 ,t2 ,t3 )...
N
Increasingly : use of graphotactic trained systems
(ex : TNT http://www.coli.uni-sb.de/~thorsten/tnt/)
Or even : brute force : inflected dictionnary
=
∏ P ( w |t )
i
i
(Strong hypothesis 1)
i=1
P(T )=P(t1 ,t2 , ...,t N ) =P(t1 ) P(t2 |t1 ) P(t3 |t2 ,t1 )...
N
= ∏ P (ti | ti −1 ,ti − 2 ,...,ti − n )
(Strong hypothesis 2)
i=1
3. Contextual analysis
Dogs
like
to
bark
3. Contextual analysis :
n-grams
Example : « Dogs like to bark », using bi-grams (n=2)
… (all possible paths) …
N_sg
P(N_pl,verb,Inf_marker,verb|dogs,like,to,bark)
= P(dogs|N_pl)
Verb
Infinitive
marker
N_pl
Prep
V_3sg
Verb
P(N_pl|_)
P(verb|N_pl)
P(Inf_marker|verb)
Adj
∏ P ( w |t )
i
P(to|Inf_marker)
i
i=1
P(bark|verb)
Prep
Conj
N
P(like|verb)
N_sg
P(verb|Inf_marker)
N
∏ P (t
i
i=1
| ti −1 ,ti − 2 ,...,ti − n )
4. Syntactic-Prosodic Phrasing
chink
Classification and Regression
trees (CARTs)
1
chunk
2
3
4
5
6
7
8
9
10
11 12 13 14
• Chinks’n chunks
Predict Color(n) SHapes, Sizes (n,n-1,n+1,n-2,n+2,…) ?
a prosodic phrase =
chink
chunk
a sequence of chinks (≈function words)
followed by a sequence of chunks (≈content words)
SOL : if SHape(n-1)=SHape(n) Color(n)=White
else Color(n)=Black
• Example :
This can be seen as a classification problem :
I asked them
if they were going home
to Idaho
and they said yes
and anticipated one more stop
before getting home
4. Syntactic-Prosodic Phrasing
No
sentence
final
• CART tree
No
1108/1118
No
327/361
Yes
82/120
No
tr
27/46
…
B
S
C
S
S
B
C
…
W
B
C
S
C
S
C
…
W
S
C
B
C
B
T
…
…
…
…
…
…
…
…
…
trees (CARTs)
« impure »
set
Yes
16/24
Q
No
Question which
splits into best
« purified » sets
No
fr
156/223
No
j3
156/216
Yes
9/11
« pure »
sets
No
j3
154/205
Yes
8/11
No
fl
19/38
No
11/14
…
C
Yes
Yes
7/7
Yes
st
101/166
No
1108/1118
SH(n+1)
S
…
No
tr
164/252
Yes
21/29
Yes
type
112/201
No
24/35
S(n+1)
-
No
1126/1225
No
fl
491/613
No
151/198
Yes
15/15
SH(n-1)
-
tr= utterance rate (in
words/second)
No
j2
1617/1838
Yes
tr
127/216
S(n-1)
S
j2 = tag of word on the left
No
tr
249/423
No
9/9
SH(n)
S
j3= tag of wor on the right
No
j3
1866/2261
Yes
et
127/225
S(n)
B
st = time to end of sentence
Yes
297/298
No
st
2974/3379
C(n)
No
j4
151/194
Yes
5/6
No
150/188
trees (CARTs)
5. Automatic Phonetization
• Decision trees for predicting phonemes+stress
« Purity » of a set?
= « Entropy » (bits)
= TRIE
= -P(Black) log2[P(Black)] -P(White) log2[P(White)]
Entropy = -(1/2 x -1) -(1/2 x -1)
=1 bit
Total entropy
after split =
0+0=0
Entropy =
-(1 x 0) -(0 x …)
=0 bit
Entropy =
-(1 x 0) -(0 x …)
=0 bit
d
b
c
u
v
w
first character on the right
h
i
first character on the left
e
i
o
second character on the right
_
e
s
second character on the left
(+ part-of-speech!)
[eI]
5. Automatic Phonetization
• Rule-based, usually formalized as in
generative phonology :
a [b] / l _ r
6. Prosody generation : patterns
- Si ces oeufs
minor
continuation
étaient frais
major
continuation
j'en prendrais
finality
Qui les vend ?
wh-question
• Example : s -> [s] or [z]
C'est bien toi,
question
s [z] / [éanti|hanti] _ [<V>]
ma jolie ?
s [s]/[anti|contre|impr‚|prime|tourne|ultra|psycho|télé]
_ []
s
s
s
s
[s]
[s]
[z]
[z]
/ [vrai]_[em]
/ [_a|para|sinu]_[e|o|y]
/ [tran] _ [a|h|i]
/[<V>] _ [<V>]
e
focus character
a
echo
- Evidemment,
implication
Monsieur.
parenthesis
- Allons donc !
exclamation
Prouve-le-moi.
order
6.Prosody generation : tones
• From tones to F0 (Hz) by rule :
H+
L
L
H
L
L
H
L
L
H
L
L
c e ilin g
L-
H
ra n g e
lo w
L
flo o r
L-
s lo p e
H+
H
L
L-
/L L
s ´ p E { s o n
a
HH
Z g { o s j
e
HLn u d e {
A)
lZ f { A) S m A)
NB : more theories than researchers …
• From text to tones
• From tones to F0 (Hz), corpus-based
• Large speech dba, with known intonation
groups, F0 and tones
ce personnage grossier, te dérange-t-il
• For each target intonation group :
WS
.
.
.
o
.
o
.
.
o
.
SG
(.
.
.
-)(
.
-) ( .
.
.
-
IG 1
(.
.
. /LL)(
.
HH) ( .
.
.
H/H)
IG 2
(.
.
.
.
HH) ( .
.
.
H/H)
-
)
WS = word stress = lexical stress Phonetization
SG = stress group
IG = intonation group Synt.-Pros. Phrasing
(only one stressed syllable)
– find a list of similar intonation groups (in terms
of tones, number of syllables, position in
sentence, etc.) in the dba
• Select the sequence of intonation groups in the
dba which :
– best represents the target groups
– AND minimizes intonative discontinuities
If
T
A
R
G
E
T
H,L, H+, L- = stressed syllables
h,l = unstressed syllables)
(l l l H)
(l l H)
Towards corpus-based
techniques
1995-?: The database years
– For automatic phonetization (L&H,
ENST, Univ. Edinburgh, FPMs)
(l l L-)
V
I
T
E
R
B
I
F0 (Hz)
Very
Large
corpus
– For automatic generation of
intonation and phoneme duration
(AT&T, FPMs, Univ. Aix, Univ.
Edinburgh)
– For automatic selection of units
for concatenative synthesis (ATR,
Univ. Edinburgh, AT&T, FPMs?)
Bad
Better
Conclusion
• Introduction
–
–
–
–
Preprocessing
Phonetization
Prosody Generation
• Towards...

Text-to-Speech Synthesis - TCTS Lab

Transcription

Documents pareils

The Faculté de Droit Virtuelle - The online Law School

Infotel

Identifying a language or an accent: from segments to - Hal-SHS

Audio guide FR-NL-EN Ville en poche FR

Introduction to Linguistics Sound System and Word

Offre de partenariat – programme Horizon 2020 Détails de l`appel à

Importance of Prosody in Swiss French Accent for Speech Synthesis

DSM: Thesis SL-DSM-16-0434 - instn

Plaquette bilingue Faculté des Sciences - Aix

abstract

Vente aux enchères silencieuse de la revue de droit d`Ottawa

IrcamCorpusExpressivity: Nonverbal Words and Restructurings

FRENCH CLEFT SENTENCES AND THE SYNTAX