Discrete/Continuous Modelling of Speaking Style
in HMM-based Speech Synthesis:
Design and Evaluation
Nicolas Obin 1,2, Pierre Lanchantin 1, Anne Lacheret 2, Xavier Rodet 1
1 Analysis-Synthesis Team, IRCAM, Paris, France
2 Modyco Lab., University of Paris Ouest - La Défense, Nanterre, France
Abstract
This paper assesses the ability of an HMM-based speech synthesis
system to model the speech characteristics of various speaking styles¹.
A discrete/continuous HMM is presented to model the symbolic and
acoustic speech characteristics of a speaking style. The proposed model
is used to model the average characteristics of a speaking style that is
shared among various speakers, depending on specific situations of
speech communication. The evaluation consists of an identification
experiment of 4 speaking styles based on delexicalized speech, and is
compared to a similar experiment on natural speech. The comparison is
discussed and reveals that the discrete/continuous HMM consistently
models the speech characteristics of a speaking style.
Index Terms: speaking style, speech synthesis, speech prosody,
average modelling.
¹ This study was partially funded by "La Fondation Des Treilles", and
supported by ANR Rhapsodie 07 Corp-030-01 (reference prosody corpus
of spoken French; French National Agency of Research; 2008-2012).
1. Introduction
Each speaker has his own speaking style, which constitutes his vocal
signature and a part of his identity. Nevertheless, a speaker
continuously adapts his speaking style according to specific
communication situations and to his emotional state. In particular,
each situational context determines a specific mode of production
associated with it - a genre - which is defined by a set of conventions
of form and content that are shared among all of its productions [1].
In particular, a specific discourse genre (DG) relates to a specific
speaking style. Consequently, a speaker adapts his speaking style to
each specific situation depending on the formal conventions that are
associated with the situation, his a priori knowledge about these
conventions, and his competence to adapt his speaking style. Thus, each
communication act instantiates a style which is composed of a style
that depends on the speaker identity, and a conventional speaking style
that is conditioned by a specific situation.
In speech synthesis, methods have been proposed to model and adapt the
symbolic [2, 3] and acoustic speech characteristics of a speaking
style, with application to emotional speech synthesis [4]. However, no
study exists on the joint modelling of the symbolic and acoustic
characteristics of speaking style, and acoustic modelling of speaking
style is generally limited to the modelling of emotion, with rare
extensions to other sources of speaking style variation [5].
This paper presents an average discrete/continuous HMM which is applied
to the speaking style modelling of various discourse genres in speech
synthesis, and assesses whether the model adequately captures the
speech prosody characteristics of a speaking style. Incidentally, the
robustness of HMM-based speech synthesis is evaluated in the conditions
of real-world applications. The paper is organized as follows: the
speaking style corpus design is described in section 2; the average
discrete/continuous HMM model is presented in section 3; the evaluation
is presented and discussed in sections 4 and 5.
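2. Speech & Text Material
2.1. Corpus Design
For the purpose of speaking style speech synthesis, a 4-hour
multi-speaker speech database was designed. The speech database
consists of four different DGs: Catholic mass ceremony, political,
journalistic, and sport commentary. In order to reduce the intra-DG
variability, the different DGs were restricted to specific situational
contexts (see list below) and to male speakers only.
The following is a description of the four selected DGs: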
mass: Christian church sermon (pilgrimage and Sunday high-mass sermons); single-speaker monologue; no interaction.
political: New Year's speech of the French president; single-speaker
monologue; no interaction.
journal: radio review (press review; political, economic, and
technological chronicles); almost single-speaker monologue
with a few interactions with a lead journalist.
sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences
and speech turn changes; almost no interaction.
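The speech database consists of natural speech from multimedia audio
content with strongly variable audio quality (background noise: crowd,
audience, recording noise, and reverberation). The speech prosody
characteristics of the speech database are illustrated in figure 1.
Figure 1: Prosodic description of the speaking styles depending on the
speaker. Mean and variance of f0 and speech rate (syllables per
second). Axes: log f0 against log syllable duration, i.e.
log(1/speech rate).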
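The per-speaker description summarised in figure 1 can be approximated with a few lines of code. The sketch below is illustrative only: the function name `prosodic_summary`, the input format (a list of f0 values in Hz with 0 for unvoiced frames, and a list of syllable durations in seconds), and the use of the population variance are assumptions, not details taken from the paper.

```python
import math
from statistics import mean, pvariance


def prosodic_summary(f0_hz, syllable_durations_s):
    """Mean and variance of log f0 and log syllable duration for one speaker.

    Hypothetical helper: unvoiced frames are assumed to be coded as f0 == 0
    and are skipped before taking the logarithm.
    """
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    # log syllable duration equals log(1 / speech rate) for a rate in syll/s
    log_dur = [math.log(d) for d in syllable_durations_s]
    return {
        "log_f0_mean": mean(log_f0),
        "log_f0_var": pvariance(log_f0),
        "log_dur_mean": mean(log_dur),
        "log_dur_var": pvariance(log_dur),
    }


# Toy data for one speaker (made-up values).
summary = prosodic_summary([0.0, 100.0, 200.0], [0.2, 0.2])
```

Plotting one point per speaker at (`log_dur_mean`, `log_f0_mean`) would give a layout comparable to the one described in the caption.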
3. Speaking Style Model
A speaking style model $\lambda^{(style)}$ is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:
$$\lambda^{(style)} = \left( \lambda^{(style)}_{symbolic},\ \lambda^{(style)}_{acoustic} \right) \qquad (1)$$
During training, the discrete and continuous context-dependent
HMMs are estimated separately. During synthesis, the symbolic and
acoustic parameters are generated in cascade, from the symbolic
representation to the acoustic variations. Additionally, a rich
linguistic description of the text characteristics is automatically
extracted using a linguistic processing chain [6] and used to refine
the context-dependent HMM modelling (see [7] and [8] for a detailed
description of the enriched linguistic contexts).
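The cascade described above (symbolic labels first, then acoustic parameters conditioned on the contexts augmented with those labels) can be sketched as a small container type. Everything here is hypothetical: the class name `SpeakingStyleModel`, the representation of a linguistic context as a tuple of strings, and the two toy stand-in functions are placeholders for the trained models, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Context = Tuple[str, ...]  # assumed: one linguistic context vector per unit


@dataclass
class SpeakingStyleModel:
    """Pair (symbolic model, acoustic model) as in equation (1)."""
    symbolic: Callable[[List[Context]], List[str]]
    acoustic: Callable[[List[Context]], List[List[float]]]

    def synthesize(self, contexts: List[Context]):
        # Step 1: predict the prosodic labels from the linguistic contexts.
        labels = self.symbolic(contexts)
        # Step 2: augment each context with its label, then predict acoustics.
        augmented = [ctx + (label,) for ctx, label in zip(contexts, labels)]
        return labels, self.acoustic(augmented)


def toy_symbolic(contexts):
    # Toy rule: content words receive a prominence (P).
    return ["P" if "content_word" in ctx else "none" for ctx in contexts]


def toy_acoustic(contexts):
    # Toy rule: higher f0 target (Hz) on prominent units.
    return [[140.0] if ctx[-1] == "P" else [110.0] for ctx in contexts]


model = SpeakingStyleModel(symbolic=toy_symbolic, acoustic=toy_acoustic)
labels, params = model.synthesize([("content_word",), ("function_word",)])
```

Keeping the two models behind separate callables mirrors the fact that they are estimated separately and only meet at synthesis time.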
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete
HMM $\lambda^{(style)}_{symbolic}$ is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic
representation that was tested as an alternative to ToBI
[9] for French prosody labelling [10]. The prosodic grammar
is composed of major prosodic boundaries ($F_M$, a boundary
which is bounded on the right by a pause), minor prosodic boundaries
($F_m$, an intermediate boundary), and prosodic prominences ($P$).
Let $R$ be the number of speakers from which an average model
$\lambda^{(style)}_{symbolic}$ is to be estimated. Let $l = (l^{(1)}, \ldots, l^{(R)})$ be
the total set of prosodic symbolic observations, and
$l^{(r)} = [l^{(r)}(1), \ldots, l^{(r)}(N_r)]$ the prosodic symbolic sequence associated with speaker $r$, where $l^{(r)}(n)$ is the
prosodic label associated with the $n$-th syllable.
Let $q = (q^{(1)}, \ldots, q^{(R)})$ be the total set of linguistic context
observations, and $q^{(r)} = [q^{(r)}(1), \ldots, q^{(r)}(N_r)]$ the linguistic context sequence associated with speaker $r$, where
$q^{(r)}(n) = [q^{(r)}_1(n), \ldots, q^{(r)}_L(n)]^\top$ is the $(L \times 1)$ linguistic
context vector which describes the linguistic characteristics
associated with the $n$-th syllable.
An average context-dependent discrete HMM $\lambda^{(style)}_{symbolic}$ is estimated from the pooled speakers' observations. Firstly, an average
context-dependent tree $T^{(style)}_{symbolic}$ is derived so as to minimize the information entropy of the prosodic symbolic labels
$l$ conditionally on the linguistic contexts $q$. Then, a context-dependent
HMM $\lambda^{(style)}_{symbolic}$ is estimated for each terminal
node of the context-dependent tree $T^{(style)}_{symbolic}$.
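The entropy criterion used to grow the symbolic tree can be illustrated by a single splitting step: among a set of yes/no questions on the linguistic context, keep the one that minimizes the conditional entropy of the prosodic labels. This is a minimal sketch under stated assumptions: contexts are represented as dictionaries, the question set and toy data are invented, and the actual tree-building procedure (stopping rule, question set, recursion) is not specified in this section.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, contexts, question):
    """Entropy of the labels after splitting the data with a yes/no question."""
    yes = [l for l, q in zip(labels, contexts) if question(q)]
    no = [l for l, q in zip(labels, contexts) if not question(q)]
    n = len(labels)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)


def best_question(labels, contexts, questions):
    """Name of the question giving the lowest conditional entropy."""
    return min(
        questions,
        key=lambda name: conditional_entropy(labels, contexts, questions[name]),
    )


# Toy pooled data: one prosodic label and one context per syllable.
toy_labels = ["FM", "FM", "P", "none"]
toy_contexts = [
    {"before_pause": True, "content_word": True},
    {"before_pause": True, "content_word": False},
    {"before_pause": False, "content_word": True},
    {"before_pause": False, "content_word": False},
]
toy_questions = {
    "before_pause": lambda q: q["before_pause"],
    "content_word": lambda q: q["content_word"],
}
chosen = best_question(toy_labels, toy_contexts, toy_questions)
```

Applying `best_question` recursively to the "yes" and "no" subsets would yield a tree whose terminal nodes each hold a label distribution.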
3.1.2. Continuous HMM
For each speaking style, an average acoustic model $\lambda^{(style)}_{acoustic}$
that includes source/filter variations, $f_0$ variations, and state durations is estimated from the pooled speakers associated
with the speaking style, based on the conventional HTS system
[11].
Let $R$ be the number of speakers from which an average
model is to be estimated. Let $o = (o^{(1)}, \ldots, o^{(R)})$ be the
total set of observations, and $o^{(r)} = [o^{(r)}(1), \ldots, o^{(r)}(T_r)]$
the observation sequence associated with speaker $r$, where
$o^{(r)}(t) = [o^{(r)}_t(1), \ldots, o^{(r)}_t(D)]^\top$ is the $(D \times 1)$ observation
vector which describes the acoustical properties at time $t$. Let
$q = (q^{(1)}, \ldots, q^{(R)})$ be the total set of linguistic context
observations, and $q^{(r)} = [q^{(r)}(1), \ldots, q^{(r)}(T_r)]$ the linguistic context sequence associated with speaker $r$, where
$q^{(r)}(t) = [q^{(r)}_1(t), \ldots, q^{(r)}_L(t)]^\top$ is the $(L \times 1)$ linguistic context
vector which describes the linguistic properties at time $t$.
An average context-dependent HMM acoustic model $\lambda^{(style)}_{acoustic}$
is estimated from the pooled speakers' observations. Firstly, a
context-dependent HMM is estimated for each of the
linguistic contexts. Then, an average context-dependent tree
$T^{(style)}_{acoustic}$ is derived so as to minimize the description length of
the context-dependent HMM $\lambda^{(style)}_{acoustic}$.
The acoustic module simultaneously models source/filter variations, $f_0$ variations, and the temporal structure associated with
a speaking style. Speakers' $f_0$ values were normalized with respect
to the speaking style prior to modelling. Source, filter, and
normalized $f_0$ observation vectors and their dynamic vectors
are used to estimate the context-dependent HMMs $\lambda^{(style)}_{acoustic}$.
Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [11]). Multi-Space probability Distributions (MSD) [12]
are used to model the continuous/discrete $f_0$ parameter sequence
so as to handle voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state-duration probability
density functions (PDFs) to account for the temporal structure
of speech [13]. Finally, speech dynamics are modelled according
to the trajectory model and the global variance (GV), which model
local and global speech variations over time [14].
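The "dynamic vectors" mentioned above are usually first- and second-order time differences appended to each static observation. The sketch below shows one common way to compute them for a single one-dimensional trajectory. The window coefficients (0.5 * (c[t+1] - c[t-1]) for the delta, c[t+1] - 2 c[t] + c[t-1] for the delta-delta) and the edge handling by repetition are widespread conventions assumed here for illustration; the paper does not state which windows were used.

```python
def add_dynamic_features(static):
    """Return [static, delta, delta-delta] for each frame of a 1-D trajectory.

    Assumed windows: delta = 0.5 * (c[t+1] - c[t-1]),
    delta-delta = c[t+1] - 2 * c[t] + c[t-1]; edges are padded by repetition.
    """
    padded = [static[0]] + list(static) + [static[-1]]
    frames = []
    for t in range(1, len(static) + 1):
        delta = 0.5 * (padded[t + 1] - padded[t - 1])
        delta2 = padded[t + 1] - 2.0 * padded[t] + padded[t - 1]
        frames.append([padded[t], delta, delta2])
    return frames


# Toy static trajectory (made-up values).
observations = add_dynamic_features([1.0, 2.0, 4.0])
```

Each output row plays the role of one observation vector o(t) restricted to a single static dimension.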
3.2. Generation of the Speech Parameters
During synthesis, the text is first converted into a concatenated sequence of context-dependent HMMs
$\lambda^{(style)}_{symbolic}$ associated with the linguistic context sequence
$q = [q_1, \ldots, q_N]$, where $q_n = [q_1(n), \ldots, q_L(n)]^\top$ denotes
the $(L \times 1)$ linguistic context vector associated with the $n$-th
phoneme.
Firstly, the prosodic symbolic sequence $\hat{l}$ is determined so as
to maximize the likelihood of the prosodic symbolic sequence $l$
conditionally on the linguistic context sequence $q$ and the model
$\lambda^{(style)}_{symbolic}$:
$$\hat{l} = \arg\max_{l}\ p\left(l \mid q, \lambda^{(style)}_{symbolic}\right) \qquad (2)$$
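Equation (2) is a maximization over whole label sequences. For an HMM with first-order dependencies between consecutive labels, such a maximization is classically carried out with the Viterbi algorithm; the sketch below shows that standard algorithm on toy numbers. This section does not state which decoding procedure the authors used, so the choice of Viterbi, the transition structure, and all probabilities are assumptions for illustration.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Most likely label index sequence under a first-order model.

    log_init[k]     : log prior of label k on the first unit
    log_trans[j][k] : log probability of moving from label j to label k
    log_obs[n][k]   : log score of label k on unit n given its context
    """
    n_labels = len(log_init)
    score = [log_init[k] + log_obs[0][k] for k in range(n_labels)]
    backpointers = []
    for n in range(1, len(log_obs)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_j = max(range(n_labels), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k] + log_obs[n][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    k = max(range(n_labels), key=lambda idx: score[idx])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return path[::-1]


def logs(values):
    return [math.log(v) for v in values]


# Toy example with two labels: 0 = "none", 1 = "P" (made-up probabilities).
toy_init = logs([0.5, 0.5])
toy_trans = [logs([0.5, 0.5]), logs([0.9, 0.1])]  # "P" rarely follows "P"
toy_obs = [logs([0.2, 0.8]), logs([0.4, 0.6]), logs([0.9, 0.1])]
best_path = viterbi(toy_init, toy_trans, toy_obs)
```

In this toy case the per-unit scores alone would favour "P" on the second unit, but the sequence-level maximization selects "none" there because two consecutive prominences are penalized by the transition model.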
Then, the linguistic context sequence $q$ augmented with the
prosodic symbolic sequence $\hat{l}$ is converted into a concatenated
sequence of context-dependent models $\lambda^{(style)}_{acoustic}$.
The acoustic sequence $\hat{o}$ is inferred so as to maximize the
likelihood of the acoustic sequence $o$ conditionally on the model
$\lambda^{(style)}_{acoustic}$:
$$\hat{o} = \arg\max_{o} \max_{q}\ p\left(o \mid q, \lambda^{(style)}_{acoustic}\right) p\left(q \mid \lambda^{(style)}_{acoustic}\right) \qquad (3)$$
where $q$ here denotes the HMM state sequence.
First, the state sequence $\hat{q}$ is determined so as to maximize
the likelihood of the state sequence conditionally on the model
$\lambda^{(style)}_{acoustic}$. Then, the observation sequence $\hat{c}$ is determined so
as to maximize the likelihood of the observation sequence
conditionally on the state sequence $\hat{q}$ and the model $\lambda^{(style)}_{acoustic}$, under the
dynamic constraint $o = Wc$:
$$\hat{c} = R_{\hat{q}}^{-1}\, r_{\hat{q}} \qquad (4)$$
where:
$$R_{\hat{q}} = W^\top \Sigma_{\hat{q}}^{-1} W \qquad (5)$$
$$r_{\hat{q}} = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}} \qquad (6)$$
and $\mu_{\hat{q}}$ and $\Sigma_{\hat{q}}$ are respectively the mean vector and the covariance matrix for the state sequence $\hat{q}$.
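Equations (4) to (6) amount to solving a linear system. The sketch below works through them for a one-dimensional static parameter with a single delta window and diagonal covariances, in pure Python. The window shape (0.5 * (c[t+1] - c[t-1]) with clamped edges), the interleaved layout of means and variances, and the helper names `build_window_matrix`, `solve` and `generate_parameters` are assumptions made for illustration, not details from the paper.

```python
def build_window_matrix(n_frames):
    """W such that o = W c, with o[2t] = c[t] and o[2t+1] = delta of c at t."""
    W = [[0.0] * n_frames for _ in range(2 * n_frames)]
    for t in range(n_frames):
        W[2 * t][t] = 1.0                                 # static row
        W[2 * t + 1][min(t + 1, n_frames - 1)] += 0.5     # delta row
        W[2 * t + 1][max(t - 1, 0)] -= 0.5
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def generate_parameters(mu, var):
    """c = (W' S^-1 W)^-1 W' S^-1 mu for diagonal S given as variances."""
    n_frames = len(mu) // 2
    W = build_window_matrix(n_frames)
    rows = range(2 * n_frames)
    R = [[sum(W[k][i] * W[k][j] / var[k] for k in rows) for j in range(n_frames)]
         for i in range(n_frames)]
    r = [sum(W[k][i] * mu[k] / var[k] for k in rows) for i in range(n_frames)]
    return solve(R, r)


# Toy means (static, delta interleaved) that are consistent with c = [1, 2, 4].
toy_mu = [1.0, 0.5, 2.0, 1.5, 4.0, 1.0]
toy_var = [1.0, 0.5, 1.0, 0.5, 1.0, 0.5]
trajectory = generate_parameters(toy_mu, toy_var)
```

When the static and delta means disagree, as they generally do across state boundaries, the same solve returns the smooth trajectory that best reconciles them, weighted by the inverse variances.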
associated
speaker
r,
where
o
(t)
=qu[o=propThe
prosodic
structure
model
toytosequence
infer
a speaking
of
prosodic
that
are
associated
The
speech
database
of
natural
speech
multi-media
audio
=
o
1
o
T
h
ob
v
on
nt (1), . . . , ot (D)]
nd
o
characteristics
of
a
style.
=
[o
(1),
.
.
.
,
o
(T
)]
is
the
observation
sequences
and
o
is
the
(Dx1)
observation
vector
which
describes
the
acoustical
A
p
k
ng
mod
λ
ompo
d
o
d
on
nuou
with
relevant
prosodic
events.
and
speech
turn
changes;
almost
no
interactions.
speaking
style
model
λ
is composed
of discrete/continuous
⇓
ng
s
y
es
a
e
poss
b
e
4.6
Abstract
Model:
Text
To
Prosodic
Structure
nd
p
h
u
n
h
ng
mo
no
n
on
⇓
M
with
relevant
prosodic
events.
with
relevant
prosodic
events.
speaking
an
average
acoustic
model
that
⇓=w⇓the
contents Longtemps
with
strongly
quality
(background
3.1.2.
Continuous
, variable
me audio
suis h couché
deh symbolic/acoustic
bonne
heurenoise:
.
sentence
erty
time
Let
q
=robservation
(q
,o.o. . (t)
,q
the
total
ofo linguistic
on
x d pconsists
HMM
mod dethe
ymbo
pFor
3
1 2heach
Con
HMM
oM at HMM
dwith
p(Dx1)
k r,
wh
oλ (1),
1 . .set
D the acoustical propcontext-dependent
HMMs
model
speech
associated
where
=
. , odescribes
(D)]
==)[ostate-durations,
characteristics
ofojejean speaking
style.
Longtemps
, nd
me
suis that
couché
bonne
heure ou
.
sentence
isstyle,
vector
which
mnuou
mt.hspeaker
The
speech
database
of
speech
multi-media
⇓observation
includes source/filter
variations,
fon
variations,
crowd,
noise,
reverberation).
speech
Th
p audience,
hcharacteristics
b recording
op natural
na
u astyle.
mThe
ud. o
“
”d audio
MM v on
hd strongly
k and
ng
y p h(background
oban
vand
owhich
hdescribes
d” [q
b the
hsacoustical
p)]opcomp
of
speaking
oDx1
aobservations,
ndec
svector
se
ndec
on”
e e y unsu e
=and
(1),
.that
. .ou
, q when
(T
is
qwhec
Longtemps
, on
je jeame
suis
couché
demu
bonne
heure
sentence
is contexts
theh (Dx1)
propFor
each
speaking
style,
average
acoustic
model
λ
contents
with
variable
audio
quality
noise:
(1)
(R)
Longtemps
,
me
suis
couché
de
bonne
heure
.
sentence
is
estimated
from
the
pooled
speakers
associated
with
the
speaking
M
prosody
characteristics
of
the
speech
databased
are
illustrated
in
figλ
=
λ
,
λ
(1)
Fo
h
p
k
ng
y
n
v
g
ou
mod
λ
h
on
n
w
h
ong
y
v
b
ud
o
qu
y
b
kg
ound
no
Longtemps
,
je me suis
couché
de bonne heure
.
sentence
⇓suis
merty
mqwe
Let
RyMMM
be
the
number
speakers
from
anon
model
m
L
q=
=of
q
q
o. . set
olinguistic
=)r,ngu
the
context
sequence
with
speaker
where
erty
atlinguistic
time
t.ec
Let
(q
, . . .q
,associated
qand
)(q
thehwhich
total
ofaverage
=h,du
myDsetisasto
at
time
t.
Let
=
.
,
q
the
total
ofa linguistic
couché
deonbonne
heure
.
sentence
includes
source/filter
variations,
fHTS
variations,
state-durations,
crowd,
recording
and
reverberation).
The
speech
Sub
s
e
asked
o
use
s
poss
b
y
on
ve y as
3. SPEAKING STYLE MODEL

A speaking style model $\lambda^{(style)}$ is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:

$$\lambda^{(style)} = \left( \lambda^{(style)}_{symbolic},\ \lambda^{(style)}_{acoustic} \right) \qquad (1)$$

During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
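The cascade between the symbolic and the acoustic models can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class `SpeakingStyleModel`, the two toy stand-in functions, and the label and f0 values are hypothetical and do not come from the paper, where both components are context-dependent HMMs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: one linguistic context (a bag of features) per syllable.
Context = Sequence[str]
Labels = List[str]
Frames = List[List[float]]


@dataclass
class SpeakingStyleModel:
    """Pair (symbolic model, acoustic model), as in equation (1)."""
    symbolic: Callable[[List[Context]], Labels]
    acoustic: Callable[[List[Context], Labels], Frames]

    def generate(self, contexts: List[Context]) -> Tuple[Labels, Frames]:
        # Cascade: prosodic labels first, then acoustics conditioned on them.
        labels = self.symbolic(contexts)
        frames = self.acoustic(contexts, labels)
        return labels, frames


def toy_symbolic(contexts: List[Context]) -> Labels:
    # Toy stand-in for the discrete HMM: mark stressed syllables as prominent.
    return ["P" if "stressed" in c else "*" for c in contexts]


def toy_acoustic(contexts: List[Context], labels: Labels) -> Frames:
    # Toy stand-in for the continuous HMM: one f0 value (Hz) per syllable.
    return [[120.0] if label == "P" else [100.0] for label in labels]


model = SpeakingStyleModel(symbolic=toy_symbolic, acoustic=toy_acoustic)
labels, frames = model.generate([["stressed"], ["unstressed"]])
```

The point of the design is that the acoustic component receives the output of the symbolic component, so the two can be trained separately and chained at synthesis time.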
3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM $\lambda^{(style)}_{symbolic}$ is estimated from the pooled speakers associated with the speaking style. Let $R$ be the number of speakers from which an average model $\lambda^{(style)}_{symbolic}$ is to be estimated. The total set of prosodic symbolic observations gathers, for each speaker $r$, a prosodic symbolic sequence of length $N_r$ whose $n$-th element is the prosodic label associated with syllable $n$. Let $q = (q^{(1)}, \ldots, q^{(R)})$ be the total set of linguistic contexts observations, where $q^{(r)} = [q^{(r)}(1), \ldots, q^{(r)}(N_r)]$ is the linguistic context sequence associated with speaker $r$, and $q^{(r)}(n) = [q^{(r)}_1(n), \ldots, q^{(r)}_L(n)]$ is the (Lx1) linguistic context vector which describes the linguistic characteristics associated with syllable $n$.

The prosodic grammar consists of a hierarchical prosodic representation that was experimented as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries ($F_M$, a boundary which is right bounded by a pause), minor prosodic boundaries ($F_m$, an intermediate boundary), and prosodic prominences (P).
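To make the idea of pooling speakers concrete, here is a deliberately simplified Python sketch. It assumes that each speaker contributes aligned per-syllable contexts and prosodic labels, and it estimates label frequencies per context from the pooled data. This is not the paper's context-dependent HMM training: the function name `pooled_label_distribution`, the context names, and the use of plain relative frequencies are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical data layout: (contexts, labels) per speaker, one item per syllable.
SpeakerData = Tuple[List[str], List[str]]


def pooled_label_distribution(speakers: List[SpeakerData]) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each prosodic label per context, over all speakers."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for contexts, labels in speakers:
        if len(contexts) != len(labels):
            raise ValueError("contexts and labels must be aligned per syllable")
        for context, label in zip(contexts, labels):
            counts[context][label] += 1
    return {
        context: {label: n / sum(counter.values()) for label, n in counter.items()}
        for context, counter in counts.items()
    }


# Two toy speakers sharing the same two contexts (labels: FM, P, "*" = none).
speaker_1 = (["phrase_initial", "phrase_final"], ["*", "FM"])
speaker_2 = (["phrase_initial", "phrase_final"], ["P", "FM"])
distribution = pooled_label_distribution([speaker_1, speaker_2])
```

Pooling before counting is what makes the estimate an average over speakers rather than a set of speaker-specific models.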
3.1.2. Continuous HMM

The acoustic module models simultaneously source/filter variations, $f_0$ variations, and the temporal structure associated with a speaking style. Speakers' $f_0$ were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized $f_0$ observation vectors and their dynamic vectors are used to estimate context-dependent HMM models $\lambda$.

For each speaking style, an average acoustic model $\lambda^{(style)}_{acoustic}$ that includes source/filter variations, $f_0$ variations, and state-durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system ([14]). Let $R$ be the number of speakers from which an average model $\lambda^{(style)}_{acoustic}$ is to be estimated. Let $o = (o^{(1)}, \ldots, o^{(R)})$ be the total set of observations, where $o^{(r)} = [o^{(r)}(1), \ldots, o^{(r)}(T_r)]$ is the observation sequence associated with speaker $r$, and $o^{(r)}(t) = [o^{(r)}_t(1), \ldots, o^{(r)}_t(D)]$ is the (Dx1) observation vector which describes the acoustical property at time $t$. Let $q = (q^{(1)}, \ldots, q^{(R)})$ be the total set of linguistic contexts observations, where $q^{(r)} = [q^{(r)}(1), \ldots, q^{(r)}(T_r)]$ is the linguistic context sequence associated with speaker $r$, and $q^{(r)}(t) = [q^{(r)}_1(t), \ldots, q^{(r)}_{L'}(t)]$ is the (L'x1) augmented linguistic context vector which describes the linguistic properties at time $t$.

An average context-dependent HMM acoustic model is estimated from the pooled speakers observations. Firstly, a context-dependent HMM model $\lambda$ is estimated for each of the linguistic contexts. Then, an average context-dependent tree $T$ is derived so as to minimize the description length of the context-dependent HMM model $\lambda$. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context-clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete parameter $f_0$ sequence and to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamic is modelled according to the trajectory model and the global variance (GV) model, which model local and global speech variations over time [?].
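The "dynamic vectors" appended to the static observations can be illustrated with a short Python sketch. The paper does not state which regression window is used; the centred window (-0.5, 0, 0.5) with edge replication below is a common choice and is an assumption here, as are the function names `delta` and `with_dynamics`.

```python
from typing import List, Sequence, Tuple


def delta(track: Sequence[float]) -> List[float]:
    """First-order dynamic feature: 0.5 * (x[t+1] - x[t-1]), edges replicated."""
    n = len(track)
    if n == 0:
        return []
    # Replicate the first and last values so every frame has two neighbours.
    padded = [track[0]] + list(track) + [track[-1]]
    return [0.5 * (padded[t + 2] - padded[t]) for t in range(n)]


def with_dynamics(track: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each static value with its dynamic value, frame by frame."""
    return list(zip(track, delta(track)))


# Toy one-dimensional trajectory (for instance a normalized f0 contour).
static = [1.0, 2.0, 4.0]
dynamic = delta(static)
observations = with_dynamics(static)
```

Stacking static and dynamic values is what lets a frame-wise model carry information about the local slope of the trajectory.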
3.2. Generation of the Speech Parameters

During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models $\lambda$ associated with the linguistic context sequence $q = [q_1, \ldots, q_N]$, where $q_n = [q_n(1), \ldots, q_n(L)]$ denotes the (Lx1) linguistic context vector associated with linguistic unit $n$. Firstly, the prosodic symbolic sequence $\bar{q}$ is inferred so as to maximize the log-likelihood of the [...]

[Figure 2: Generation of the discrete/continuous speech parameters for the sentence "Longtemps, je me suis couché de bonne heure" ("For a long time, I used to go to bed early"). Panels: text; syllable segmentation ("Long-temps ## je me suis cou-ché de bonne heure ##"); prosodic structure.]

4. EVALUATION

Experimental Setup

Additional information was gathered from the participants: speech expertise (expert, naive), language (native French speaker, non-native French speaker), age, and listening condition. Participants were encouraged to use headphones.

Results & Discussion

Identification performance was estimated using a measure based on Cohen's Kappa statistic. Cohen's Kappa statistic measures the proportion of agreement between two raters, with correction for random agreement. Our measure monitors the agreement between the ratings of the participants and the ground truth. The measure varies from -1 to 1: -1 is perfect disagreement; 0 is chance; 1 is perfect agreement. Indecision ratings were relatively rare (3% of the total ratings) and were removed.

Figure 3 presents the identification confusion matrix. The overall identification performance is κ = 0.38 ± 0.04, which is comparable to that observed for the identification from natural speech (κ = 0.45 ± 0.03). Identification performance significantly depends on the speaking style.
symbolic
sequenceaqy s subs
condi- an a y den fied
L
RThe
b (Constituent-Depth
hprosodic
numb
oprosodic
pn k. . and
om(n)]
vLx1
gφ-phrases
mod
ng s yove time
figu
eprosodic
4 spo
commen
tactic
boundaries
syntactic
syntactic
conthe R
total
set
of [qof
symbolic
andprosodic
global
grammar
ofhisobservations,
anh/onaverage
hierarchical
repreob hspeech
p hvariations
v m onwover
m [?].
?
4.6.1
Expert
mdsotheoasrespect
m
v(Constituent-Depth
o bewhich
whestimated.
h describes
d
b 2000],
hq
ngu = (q
h 1980],
dgwith
w
vector
linguistic
associated
λqand Grosjean,
is
tooModels
,Right-hand-side
. . . ,w
qq
)o
[Gee
1983,
Left-hand-side
3.2.
Generation
of
the
Parameters
M nq and
mMulti-Space
mqthe
Ftionally
yκ
hto
p0
oSpeech
od
ymbo
omaximize
m
x mtheκ
stituency
and
Paccia-Cooper,
φ-phrases
context-clustering
(ML-MDL
[15]).
probability
Firstly,
the
prosodic
symbolic
q̄q̄
is na
inferred
to
.speaking
the
linguistic
context
sequence
model
λfied
λ
m . .d.[Cooper
L the
q(N
=is characteristics
ing
style.
Speakers
f
were
normalized
with
to
=
[qb Delais-Roussarie,
(1),
, qLet
)]discrete
prosodic
sym0
=
68
±
0
05
ou
a
y
den
=m 0 50 ±Distri0 06
R
m
m
m
An
average
context-dependent
HMM
λ
is
estimated
y
b
n
mm
syllable
n.
the
total
set
of
prosodic
symbolic
observations,
and
m
[Watson
and
Gibson,
2004]),
syntactic
dependency
(Dependency-Grammar-based
sentation
that
was
experimented
as
an
alternative
to
T
O
BI
[12]
h
og
k
hood
o
h
p
o
od
ymbo
qu
n
q
ond
For Boundary
each
speaking
style,
an
average
context-dependent
discrete
[Gee
and
Grosjean,
1983,
Delais-Roussarie,
2000],
Left-hand-side
/
Right-hand-side
the
log-likelihood
of
the
prosodic
symbolic
sequence
q
condi4
Evaluation
h
o
o
p
o
od
ymbo
ob
v
on
nd
m
bolic sequence
associated
with speaker
r, where
qan average
(n)
m contextm hden
from
pooled
observations.
Firstly,
During
the asynthesis,
the
text
isPa
firstam
converted
into
a concatenated
3.2.
Generation
ofo the
Speech
Parameters
po
ca
d
scou
se
mode
a
e
y
fied
κ
=
0
28±0
07
and
butions
(MSD)
[16]
are
used
to
model
continuous/discrete
parame=
[q the
(1),
. . .speakers
,syntactic
q
(N
)] is (Dependency-Grammar-based
the
prosodic
symq
style
prior
to
modelling.
Source,
filter,
and
normalized
f
observa3
2
G
n
on
h
Sp
h
Boundary
[Watson
and
Gibson,
2004]),
dependency
0
on
y
o
h
ngu
on
x
qu
n
q
nd
mod
λ
.
tionally
to
the
linguistic
context
sequence
q
and
the
model
λ
=
1
N
h
p
o
od
ym
q
m
=
m
m
is (style)
the prosodic
syllable
Let
fi q̄ m = argmax
wHMM models
P(q λ |q, λ associated
)
(2)
v associated
g label
on Txassociated
d labelling
p nd
d [13].
HMM
λn.
d
isn with
derived
so
asThe
to
minimize
them
infordependent
tree
An
average
context-dependent
discrete
HMM
λ
sequence
ofiscontext-dependent
forqu(q
French
prosody
prosodic
grammar
mon
wisgestimated
bolic
with
speaker
r,
where
qon
MMκused
mn to
HMM
from
the
speakers
associated
The
p oposed
mode
eva
ed
av(n)speak
ng
mass
yis sfirst
ghconverted
yto manage
den
fied
=
0d 12
±
0 context06 properly.
n compa
qλsymbolic
=sequence
.estimated
. .pooled
,has
the
set
of
linguistic
contexts
During
the
synthesis,
thehon
textxsequence
into
a concatenated
bo
n is,the
oq d)been
dspeakers
wk htotal
p ua
kpooled
r based
wh
n
om
h
poo
p
ob
v
F
y
n
on
x
Du
ng
h
yn
h
fi
onv
d
n
o
on
ter
f
voiced/unvoiced
regions
Each
from
observations.
Firstly,
an
average
contexttion
vectors
and
their
dynamic
vectors
are
estimate
m
with
the
linguistic
context
sequence
q
=
[q
,
.
.
.
,
q
],
where
0
m
isobservations,
the
label
associated
Let
hcomposed
pdoon
bmajor
d(1),with
w
h.v, qsyllable
b )] n.nis
= o[qexpe
.boundaries
the
and
qcep
aHMM
gmax
P qon
of
models
λλ|q,q
associated
P(q
λλcontext
(2)
(Fmcontexts
alin-boundary
h==argmax
fica
om
u)o vector
a speech
s ywith
e den
ficaprosodic
and
ed
oqnquoa =n which
m 2 he den fica on s
M
d.men
d ysoo(N
n ,mL
h sequence
podnd
nof
nd
nden
HMM
mod
d
[qo context-dependent
, . on
. .son
, qxq̄q̄
]dwpdenotes
the
(Lx1)
linguistic
is set
derived
as too compa
minimize
the
infordependent
tree
TTua prosodic
the
speaking
qguistic
=
(q
. .pe
.sequence
,q
total
of
(style)
Gna
mstate
q
= context
q ,style.
q ) the
hassociated
o
o linguistic
ngu
x
with
speaker
r, onwhere
context-dependent
HMM
modelled
with
a
m
m
with
the
linguistic
context
sequence
qλ
==
[qmq, ... is
.Context-dependent
,q
], where
=
N
associated
with
linguistic
unit
n.
are probabildependent
HMM
models
w
h
h
ngu
on
x
qu
n
q
q
wh
compa
ab
e
n
he
case
o
he
spo
commen
aduration
y and
he ou
speak ng
s
y
e
den
fica
on
expe
men
w
h
na
u
a
speech
w
(1),
. . ,minor
q
)] is the
lin-n
observations,
q
mo m HMMs
acoustic
bounded
aq pause),
prosodic
boundaries
(F[qm ,, .an
=
1is .the
q (N
N
hcontext
ob
vis right
on
nd(n),
q . . . ,=
= [qand
qby [q
(n)]
(Lx1)
linguistic
q (n)
qq ==
. . , q ] denotes
theh (Lx1)
linguistic
context
vector
d
no
Lx1
ngu
on
x
v
guistic
context
sequence
associated
with
speaker
r,
where
ity
density
functions
(PDFs)
to
account
for
the
temporal
structure
na
speak
ng
s
y
es
κ
=
0
70
±
0
03
and
κ
15 Fo
hewhich
pu xdescribes
posequonthesuch
necessa
y
gu
on
o a compa
d w h son
pDGkassociated
rwas
whwith
vector
linguistic
characteristics
associated
with
linguistic
unit
n.
na
u
a
na
u a of=
fi
clustered
into
acoustically
similar
models
using
decision-tree-based
Firstly,
the
o Sub
d wprosodic
h nguu symbolic
m
nun nq̄ mis inferred som as to maximize
intermediate
prosodic
prominences
[q (n),
. .consists
. , q (n)]
isa the
(Lx1)
linguistic
context
qsyllable
n.
=ng
n boundary),
h
Lx1
ngu
onmen
x (P).
q
grammar
ofand
hierarchical
prosodic
reprem 0E 54
±mprosodic
0 [17].
05 symbolic
espec
ve
y qHoweve
he e s aaccording
d op n to
den
o The
p ovprosodic
de(n)
anwhich
s=
e eva
ua
on nscheme
o wbo
h expe
s thenRlog-likelihood
ofspeech
the
sequence
condiFinally,
speech
dynamic
is
modelled
the
w
fi
vector
describes
the
linguistic
characteristics
associated
with
m . MM m Districontext-clustering
[15]).
Firstly,
the
prosodic
symbolic
q̄(ML-MDL
is inferred
so oasMulti-Space
to omaximize
v o wh h d
b h ngu
h
o
dw h
Ftionally
p olinguistic
ymbo
m
m probability
context
and
model
λ xmass
n.
fica
on
o q̄sequence
he po nqsequence
cadtheand
speak ng s y es wh ch s
pasentation
cu syllable
aAn
was
no experimented
poss b e discrete
o con
oTalternative
he
ngumistosestimated
o y hto=the
asDGan
Tc Ocon
BI en
[12]
context-dependent
HMM
λ
mod
the
ofo the
symbolic
q q he
condiy average
bthat
n was
hTlog-likelihood
og k hood
h prosodic
p o= odm model
ymboand the
qu nglobal
ond (GV) that model local and
trajectory
variance
butions
(MSD)
used
totheomodel
from Let
the pooled
speakers
observations.
Firstly,
an
average
contextespec
a [16]
yxsequence
s are
gn
he continuous/discrete
mass
s .y e κna =u paramenafor
u aFrench
speech
uR ebeances
wh
chof
p ov
desλ ev
den
cues
DG
tionally
linguistic
context
qqand
model
λλ
a = 0 34±0 05
the
number
speakers
from
anod average
othe
qu fican
nN |q,
An
average
discrete
iswhich
estimated
labelling
grammar
isons ytomodel
mThe
mP(q
q̄h fi ngu
= argmax
λnd h mod
)
(2)
=mon tospeech
An
v prosody
g context-dependent
on Tx d
p nd isn derived
d [13]. HMM
HMM
λ prosodic
minforso
as to
minimize
the
dependent
tree
global
variations
over
time
[?].
and
κ
=
0
38
±
0
04
espec
veproperly.
y Th sEach
nd ca es ha
den fica
on
a
s
ng
e
keywo
d
wou
d
be
su
fic
en
o
den
y
a
=
ter
f
sequence
manage
voiced/unvoiced
regions
(style)
(1)
(R)
from
pooled
speakers
observations.
Firstly,
ann average
contextna u a
0
om the
h
poo
d
p
k
ob
v
on
F
y
v
g
on
x
G
m
Tgmax
w |q,qmλλ
N P(q
λofsymbolic
isprosodic
to beisson
estimated.
=hexinfor(q
, . . . , q=proso
q̄q̄ ) ==argmax
)
(2)2
composed
major
boundaries
a boundary
which
proso
Mmq,nproso
derived
as Let
to(F
T
P somehow
q
hea mode
a
ed
o
cap
u
e
he
e
evan
DG
Thus
a compa
d vequ
d soo ed
oominimize
memove
m the
nca
o access context-dependent
ddependent
p ndsuch
n tree
T
HMM is modelled with a state duration probabil-cues o he
theon total
set cofdminor
prosodic
and= co espond ng speak ng s y e Neve he ess a a ge con us on
is right
bounded
pause),
prosodic
(Fm , an
and
o ocus
hebypaosod
mens
on onsymbolic
yboundariesobservations,
ity density3.2.
functions
(PDFs)ofto the
account
for Parameters
the temporal structure of
(r)
(r)
(r)
Generation
Speech
intermediate
boundary),
prosodic
= and
[qproso
(1), . . prominences
. , qproso (Nr )](P).is the prosodic speech
sym- [17]. Finally,
qproso
mMM m according to the
m m
speech dynamic is modelled
=(r)
m
=
bolic sequence associated with speaker r, where qproso
(n) = model
trajectory
global variance
model
localaand
Duringand
theN the
synthesis,
the text is(GV)
firstthat
converted
into
concatenated
Let R be isthe the
number
of
speakers
from
which
an
average
model
=
m
(style)
prosodic label associated with syllable n. global
Let speech
variations
over
time
[?].
MM
m
sequence of context-dependent HMM models λsymbolic associated
(style)
(1)
(R)
(R) q
λsymbolic qis to=be estimated.
, qproso ) contexts
(q(1) , . . . , qLet
) proso
the =
total(qproso
set ,of. . .linguistic
=m
with the linguistic context sequence
= m q = [q1 , . . . , qN ], where
the totalobservations,
set of prosodic
observations,
= [q(r) (1),
. . . , q(r) (Nr )]andis the lin- q = [q , .m. .m, q ]� denotes the
and q(r) symbolic
m
m m context vector
(r)
(r)
(r)
n
1
L
3.2. Generation of the Speech Parameters (Lx1) linguistic
[qproso
(1), . . .sequence
, qproso (Nrassociated
)] is the with
prosodic
sym-r, where
qproso =guistic
context
speaker
m
associated
with
linguistic
unit
n.
(r)
(r)
(r)
bolic sequence
where
(n) context
= [q (n),with
. . . , speaker
q (n)]�r,is the
(Lx1)qproso
linguistic
q(r) (n)associated
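The pooled estimation of Section 3.1.1 and the inference of equation (2) can be illustrated with a minimal sketch. It rests on strong simplifying assumptions that are not taken from the paper: each linguistic context vector is treated as an already-clustered, hashable key (no decision tree is built), each context carries a single discrete distribution over prosodic labels, and labels are treated as independent given their context, so the argmax of equation (2) reduces to a per-syllable argmax. The function names `train_pooled` and `infer_labels` and the toy contexts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_pooled(speakers):
    """Estimate one discrete distribution per context from pooled speakers.

    speakers: list of (contexts, labels) pairs, one pair per speaker r,
    where contexts[n] is a hashable context key and labels[n] the
    prosodic label of syllable n. (Hypothetical data layout.)
    """
    counts = defaultdict(Counter)
    for contexts, labels in speakers:          # pool over the R speakers
        for ctx, lab in zip(contexts, labels):
            counts[ctx][lab] += 1
    model = {}
    for ctx, c in counts.items():
        total = sum(c.values())
        # Maximum-likelihood estimate of P(label | context).
        model[ctx] = {lab: n / total for lab, n in c.items()}
    return model


def infer_labels(model, contexts):
    """Return the most likely label sequence and its log-likelihood.

    Assumes labels are independent given their context, so the
    sequence-level argmax is a per-syllable argmax. An unseen context
    raises KeyError; a real system would back off through the tree.
    """
    sequence, log_lik = [], 0.0
    for ctx in contexts:
        pmf = model[ctx]
        best = max(pmf, key=pmf.get)
        sequence.append(best)
        log_lik += math.log(pmf[best])
    return sequence, log_lik


# Two toy speakers; contexts and labels are illustrative only.
speakers = [
    ([("NP", "end"), ("VP", "mid")], ["FM", "none"]),
    ([("NP", "end"), ("VP", "mid"), ("NP", "end"), ("VP", "mid")],
     ["FM", "none", "Fm", "P"]),
]
model = train_pooled(speakers)
labels, log_lik = infer_labels(model, [("NP", "end"), ("VP", "mid")])
```

In the paper the contexts are clustered by a context-dependent tree and the models are HMMs, so this sketch only shows the pooling and argmax principle, not the actual training procedure.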
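The role of the Multi-Space probability Distribution for f0 can be made concrete with a small sketch. It assumes the usual two-space setting: a zero-dimensional space for unvoiced frames and a one-dimensional Gaussian space for voiced frames, with a single Gaussian per state. The function name, the use of `None` for unvoiced frames, and the parameter values are assumptions for illustration, not details given in the paper.

```python
import math


def msd_log_likelihood(f0, w_voiced, mean, var):
    """Log-likelihood of one f0 observation under a two-space MSD.

    f0: a float (for instance log-f0) for a voiced frame, or None for
        an unvoiced frame (hypothetical encoding).
    w_voiced: weight of the voiced space; the unvoiced space has
        weight 1 - w_voiced.
    mean, var: parameters of the 1-D Gaussian of the voiced space.
    """
    if not 0.0 < w_voiced < 1.0:
        raise ValueError("w_voiced must lie strictly between 0 and 1")
    if var <= 0.0:
        raise ValueError("var must be positive")
    if f0 is None:
        # Zero-dimensional space: only the space weight contributes.
        return math.log(1.0 - w_voiced)
    gauss = -0.5 * (math.log(2.0 * math.pi * var) + (f0 - mean) ** 2 / var)
    return math.log(w_voiced) + gauss


unvoiced = msd_log_likelihood(None, w_voiced=0.8, mean=5.0, var=0.04)
voiced = msd_log_likelihood(5.0, w_voiced=0.8, mean=5.0, var=0.04)
```

Handling both frame types in one state likelihood is what lets a single HMM stream cover voiced and unvoiced regions without interpolating f0.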
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to […]

sentence             Longtemps , je me suis couché de bonne heure .
syllable             Long- temps ## je me suis cou- ché de bonne heure ##
prosodic structure   tiers F_M, F_m, P (asterisk positions not recoverable)

Table 4.2: Illustration of the text-to-prosodic-structure conversion.

[Figure: panels with axes "f0 [Hz]" and "amplitude"; caption not recoverable.]

The prosodic structure model is to infer a sequence of prosodic labels that are associated with relevant prosodic events.

Two main approaches can be distinguished in prosodic structure modelling. On the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflictual constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints.

The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based […]
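As a toy illustration of the text-to-prosodic-structure conversion, the sketch below takes the syllable tier of the example sentence, with "##" marking a pause, and labels as a major boundary ($F_M$) every syllable that is immediately followed by a pause. This follows the definition of $F_M$ as a boundary right-bounded by a pause; treating every pre-pause syllable as $F_M$, the `"-"` placeholder for unlabelled syllables, and the function name are assumptions made for illustration. Minor boundaries and prominences are not predicted.

```python
PAUSE = "##"  # pause marker, as in the syllable tier of the example


def mark_major_boundaries(tokens):
    """Pair each syllable with 'FM' if a pause follows it, else '-'.

    tokens: syllables interleaved with PAUSE markers.
    Returns a list of (syllable, label) pairs without the markers.
    """
    syllables, labels = [], []
    for tok in tokens:
        if tok == PAUSE:
            if labels:                 # a pause closes the previous syllable
                labels[-1] = "FM"
        else:
            syllables.append(tok)
            labels.append("-")
    return list(zip(syllables, labels))


tokens = ["Long-", "temps", "##", "je", "me", "suis",
          "cou-", "ché", "de", "bonne", "heure", "##"]
structure = mark_major_boundaries(tokens)
```

Expert and statistical models differ precisely in how the remaining labels would be assigned: from formal syntactic and rhythmic rules in the first case, from corpus regularities in the second.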
Ferreira,
1988,
Abney,
1992,
Watson
and
Gibson,
2004]
syllable
Longtemps
##prosodic
je
me frontiers
suis coude1993,
bonneLadd,
heure
##
0
Felling,
*
*ché and
* 1996,
0.2
m
and
in particular
([Cooper
Paccia-Cooper,
1980,
for
English;
[Dell,
1984,
Bailly,
1989,
Monnin
and
Grosjean,
stituency
(Constituent-Depth
[Cooper
and
Paccia-Cooper,
1980],
φ-phrases
R
of
various
and
potentially
conflictual
constraints,
in
particular
syntactic
and
rhythmic
(r)
(r)
(r)
(style)
(1)
(R)
syllable
temps
##
je 1989,
me suis
cou-andché
bonne
heure
##
for
English;Long[Dell,
1984,
Bailly,
Monnin
Grosjean,
1993,
Ladd,
1996,
P
* 2000,
*Mertens,
* de
* other
−0.2
Gee
Grosjean,
Ferreira,
1988,
Abney, [Barbosa,
1992,
Watson
and
Gibson,
2004]
Delais-Roussarie,
2004b]
for2000],
French,
2006]
some
r / for
proso
proso
prosoproso
proso
proso
0 and
[?,
1984]
constraints.
m1983,
0Dell,
[Gee
and
Grosjean,
1983,
Delais-Roussarie,
Right-hand-side
symbolic
Delais-Roussarie,
2000,
Mertens,
2004b]
for
French,Left-hand-side
[Barbosa,
2006]
for
some
other
Table
4.2:
of
text-to-prosodic-structure
conversion.
Table
4.2:Illustration
Illustration
ofthe
the
text-to-prosodic-structure
conversion.
for linguistic
English;
[Dell,
1984,
Bailly,
Monnin
and results
Grosjean,
1993,
Ladd, 1996,
languages).
Expert
models
assume
that
a1989,
prosodic
structure
from
the integration
(r)
The
module
mostly
concerns
the
extraction
of
prominent
synlanguages).
Expert
models
assume
that
a prosodic
structure
results
from
the integration
Boundary
[Watson
and
Gibson,
2004]),
syntactic
dependency
(Dependency-Grammar-based
syllable
Longtemps
##
je
me
suis
couché
de
bonne
heure
##
Table
4.2:
Illustration
of
the
text-to-prosodic-structure
conversion.
−0.2
0
2000,
Mertens,
for French,
[Barbosa,
2006]
other
of Delais-Roussarie,
variousboundaries
and potentially
conflictual
constraints,
in particular
syntactic
and some
rhythmic
proso
from
deep 2004b]
syntactic
parsing,
based
on for
syntactic
conoftactic
various
and
potentially
conflictual
constraints,
in particular
syntactic
and rhythmic
−0.4
(r) models
(r)
(r)structure
Two
main
approaches
can
distinguished
in
prosodic
modelling:
the
languages).
Expert
assume
that
a time
prosodic
structure
results
from the on
integration
[?,
Dell,
1984]
constraints.
Two
main
approaches
canbe
be
distinguished
inand
prosodic
structure
modelling:
on
the one
one
−0.2
[s]
stituency
(Constituent-Depth
[Cooper
Paccia-Cooper,
1980],
φ-phrases
−0.2
[?,
Dell,
1984]
constraints.
raccount
proso
proso
0
hand,
expert
approaches
attempt
atofproso
elaborating
formal
models
that
for
the
observed
Table
4.2:
Illustration
the
text-to-prosodic-structure
conversion.
Two
main
approaches
can
be
distinguished
in2000],
prosodic
structure
on
one
of
various
and
potentially
conflictual
constraints,
particular
syntactic
rhythmic
The
linguistic
module
mostly
concerns
the in
extraction
of
prominent
synhand,
expert
approaches
attempt
elaborating
models
thatmodelling:
account
forand
thethe
observed
[Gee
and
Grosjean,
1983,
Delais-Roussarie,
Left-hand-side
Right-hand-side
(1)at
(R)formal
M
(r)
The
linguistic
module
mostly
concerns
the
extraction
of /prominent
synprosodic
variations
with
respect
to
linguistic,
para-linguistic,
and
conhand,
expert
approaches
attempt
at
elaborating
formal
models
that
account
for the observed
−0.4
[?,
Dell,
1984]
constraints.
tactic
boundaries
from
deep
syntactic
parsing,
based
on extra-linguistic
syntactic
con−0.2
prosodic
variations
with
respect
to
linguistic,
para-linguistic,
and
extra-linguistic
con- proso
Boundary
[Watson
and
Gibson,
2004]),
syntactic
dependency
(Dependency-Grammar-based
time
[s]
tactic
boundaries
from
deep
parsing,
based
on
syntactic
constraints.
On
hand,
statistical
methods
attempt
at
elaborating
statistical
model
prosodic
variations
with
respect
to syntactic
linguistic,
para-linguistic,
and1980],
extra-linguistic
con−0.4
m
The
linguistic
module
mostly
concerns
the
extraction
of aaprominent
synstituency
(Constituent-Depth
[Cooper
and
Paccia-Cooper,
φ-phrases
straints.
Onthe
theother
other
hand,
statistical
methods
attempt
at
elaborating
statistical
model
Two
main
approaches
can
be
distinguished
in
prosodic
structure
modelling:
on
the
one
(r)
(r)
(r)
0
timeand
[s] Paccia-Cooper, 1980],
−0.4 accounts
stituency
(Constituent-Depth
[Cooper
φ-phrases
which
for
the
prosodic
variations
from
the
observation
of
statistical
regularities
on
r constraints.
On
the
other
hand,
statistical
methods
attempt
at elaborating
a statistical
model
tactic
boundaries
from
deep
syntactic
parsing,
based
syntactic
[Gee
and
Grosjean,
1983,
Delais-Roussarie,
2000],
Left-hand-side
/on
Right-hand-side
which
accounts
for
the
prosodic
variations
from
the
observation
ofaccount
statistical
regularities
on
time
[s]
hand,
expert
approaches
attempt
at
elaborating
formal
models
for
the observed
Mthat
[Gee
and
Grosjean,
1983,
Delais-Roussarie,
2000],
Left-hand-side
/ Right-hand-side
Rthe
Mof and
(1)
(R)
large
speech
corpora.
which
for
the
prosodic
variations
from
observation
statistical
regularities
on
stituency
(Constituent-Depth
[Cooper
and
Paccia-Cooper,
1980],
φ-phrases
Boundary
[Watson
and
Gibson,
2004]),
syntactic
dependency
(Dependency-Grammar-based
largeaccounts
speech
corpora.
prosodic
variations
with
respect
to linguistic,
para-linguistic,
extra-linguistic
con−0.4
Boundary
[Watson
and Gibson, 2004]), syntactic
dependency
(Dependency-Grammar-based
mmodel
time
[s]
large
speech
corpora.
[Gee
and
1983, Delais-Roussarie,
2000],
Right-hand-side
straints.
OnGrosjean,
the other
at elaborating
a/ statistical
m
(r) methods
(r) Left-hand-side
(r) hand, statistical
(r) attempt
(r)�
(r)
Boundary
[Watson
Gibson,
2004]),
syntactic
(Dependency-Grammar-based
which
accounts
for theand
prosodic
variations
from thedependency
observation
of statistical regularities
on
r
1
L
large speech corpora.
(1)
proso
r
R
(R)
(r)
(1)
(1)
Longtemps
, 4.
je input
me text
suis
couché
dephonemes.
bonne
heure
.
sentence
CHAPTER
PROSODY
MODELLING
& SYNTHESIS:
A phonetizer
to convert
the
into
of
A
syllabifier
Longtemps
, SPEECH
je me
suis a sequence
couché
de bonne
heure
.
sentence is used
(style)
symbolic
(r)
Expert
approaches
(R)
(r)
symbolic
m
pauses
are structure
identified model
to
The
prosodic
with
relevant
prosodic events.is to infer a sequence of prosodic labels that are associated
with relevant prosodic events.
symbolic
(style)
acoustic
proso
4. SPEECH
MODELLING
& SYNTHESIS:
A phonetizer is usedCHAPTER
to convert the
input textPROSODY
into a sequence
of phonemes.
A syllabifier
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier
CHAPTER
4. SPEECH
SPEECH
PROSODY
MODELLING
SYNTHESIS:
CHAPTER
4.
&&SYNTHESIS:
52
STATE-OF-THE-ART
is used
to to
merge
thethe
sequence
phonemes
into
sequence
syllables.
At
prosodic
is used
merge
sequence
phonemes
intoa aPROSODY
sequenceofofMODELLING
syllables.
At the
the
prosodic level,
level,
52
STATE-OF-THE-ART
STATE-OF-THE-ART
pauses
areare
identified
toto
pauses
identified
CHAPTER
4. SPEECH
PROSODY
MODELLING
SYNTHESIS:
A phonetizer is used to convert
the input
text into
a sequence
of phonemes.& A
syllabifier
A 52
phonetizer is used to
to convert
convert the
the input
input text
text into
into aa sequence
sequence of
ofphonemes.
phonemes.
syllabifier
AAsyllabifier
STATE-OF-THE-ART
is used
to to
merge
the sequence
phonemes
At
the
prosodic
level,
is used
sequence
phonemesinto
intoaaasequence
sequenceof
ofsyllables.
syllables. At
Atthe
theprosodic
prosodiclevel,
level,
merge the sequence
phonemes
into
sequence
of
syllables.
pauses
identified
toto
pauses
to to convert the input text into a sequence of phonemes. A syllabifier
A are
phonetizer
is used
are
identified
proso
proso
qproso
qproso
qp
proso
proso
proso
(style)
symbolic
proso
(style)
(style)
symbolic
m
symbolic
(style)
symbolic
m
Figure 3: Identification confusion matrices for (a) natural speech and (b) synthetic speech. Rows represent synthesized speaking style. Columns represent identified speaking style. [Matrix cells not recoverable from extraction; speaking styles: JOURNAL, POLITICAL, SPORT, MASS.]
Figure 4: Mean identification scores and 95% confidence interval obtained for natural and synthesized speech. [Plot not recoverable from extraction; y-axis: Cohen's Kappa; x-axis: JOURNAL, POLITICAL, SPORT, MASS; extracted values, whose pairing with styles and conditions is not recoverable: 0.68, 0.70, 0.51, 0.54, 0.28, 0.12, 0.38, 0.34.]
exists between the political and the mass speech, which stems from a similarity in the speaking style and in the formal situation in
which the speech occurs. Additionally, the conventional HMM-based speech synthesis system failed to adequately model
the breathiness and the creakiness that are specific to the political
speaking style, especially within unvoiced segments.
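As an illustration of how a chance-corrected agreement score can be derived from an identification confusion matrix such as those of Figure 3, the following minimal sketch computes Cohen's kappa [16]. The function name `cohens_kappa` and the example counts are hypothetical; they are not the data or the implementation of this study.

```python
from typing import Sequence


def cohens_kappa(confusion: Sequence[Sequence[float]]) -> float:
    """Cohen's kappa for a square confusion matrix (rows: presented, columns: identified)."""
    n = len(confusion)
    if any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: proportion of responses on the diagonal.
    p_observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement expected from the row and column marginals.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical 2x2 counts, for illustration only (not data from this study).
example = [[40, 10],
           [20, 30]]
kappa = cohens_kappa(example)
```

With these hypothetical counts, the observed agreement is 0.7 and the chance agreement is 0.5, so kappa is approximately 0.4.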
An ANOVA was conducted to assess whether the identification performance depends on the language of the participants. The analysis reveals a significant effect of the language
(F(2, 59) = 15, p < 0.001) (F(48, 2) = 5.9, p = 0.005), and
confirms the results obtained for natural speech. This supports the evidence that a speaking style varies depending
on the language and/or cultural background.
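To make the reported F statistic concrete, the sketch below computes a one-way ANOVA from scores grouped by participant language. It is a minimal illustration under assumptions: the function name `one_way_anova`, the group names, and the scores are all hypothetical, and the exact ANOVA design used in the study is not specified here.

```python
from typing import Dict, List, Tuple


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA over the given groups."""
    samples = [g for g in groups.values() if g]
    k = len(samples)
    n_total = sum(len(g) for g in samples)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in samples) / n_total
    means = [sum(g) / len(g) for g in samples]
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(samples, means))
    # Within-group sum of squares: spread of the observations around their group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(samples, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    if ss_within == 0:
        raise ValueError("F is undefined when the within-group variance is zero")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical identification scores per language group (illustration only).
scores = {
    "language_a": [1.0, 2.0, 3.0],
    "language_b": [2.0, 3.0, 4.0],
    "language_c": [5.0, 6.0, 7.0],
}
f_stat, df_between, df_within = one_way_anova(scores)
```

The p-value then follows from the upper tail of the F distribution with (df_between, df_within) degrees of freedom, which a statistics library can provide, for instance `scipy.stats.f.sf`.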
Finally, an informal evaluation of the quality of the synthesized
speech suggests that the speaking style modelling is robust to
large variations in audio quality.
6. Conclusion
In this study, the ability and the robustness of an HMM-based
speech synthesis system to model the speech characteristics of
various speaking styles were assessed. A discrete/continuous
HMM was presented to model the symbolic and acoustic speech
characteristics of a speaking style, and used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consisted of an identification experiment of 4 speaking styles based on delexicalized speech,
and was compared with a similar experiment on natural speech. The
evaluation showed that the discrete/continuous HMM consistently models the speech characteristics of a speaking style, and
is robust to differences in audio quality. This provides evidence that the discrete/continuous HMM speech synthesis system successfully models the speech characteristics of a speaking style in the conditions of real-world applications.
7. References
[1] A.-C. Simon, A. Auchlin, M. Avanzi, and J.-P. Goldman, Les voix des Français. Peter Lang, 2009, ch. Les phonostyles: une description prosodique des styles de parole en français.
[2] H. Schmid and M. Atterer, “New statistical methods for phrase break prediction,” in International Conference On Computational Linguistics, Geneva,
Switzerland, 2004, pp. 659–665.
[3] P. Bell, T. Burrows, and P. Taylor, “Adaptation of prosodic phrasing models,” in Speech Prosody, Dresden, Germany, 2006.
[4] J. Yamagishi, T. Masuko, and T. Kobayashi, “HMM-based expressive
speech synthesis - Towards TTS with arbitrary speaking styles and emotions,” in Special Workshop in Maui, Maui, Hawaii, 2004.
[5] S. Krstulović, A. Hunecke, and M. Schröder, “An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements,” in Interspeech, 2007.
[6] E. Villemonte de La Clergerie, “From metagrammars to factorized
TAG/TIG parsers,” in International Workshop On Parsing Technology, Vancouver, Canada, Oct. 2005, pp. 190–191.
[7] N. Obin, P. Lanchantin, A. Lacheret, and X. Rodet, “Towards improved
HMM-based speech synthesis using high-level syntactical features,” in
Speech Prosody, Chicago, U.S.A., 2010.
[8] A. Lacheret, N. Obin, and M. Avanzi, “Design and Evaluation of Shared
Prosodic Annotation for Spontaneous French Speech: From Expert Knowledge to Non-Expert Annotation,” in Linguistic Annotation Workshop, Uppsala, Sweden, 2010.
[9] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price,
J. Pierrehumbert, and J. Hirschberg, “ToBI: a standard for labeling English prosody,” in International Conference of Spoken Language Processing, Banff, Canada, 1992, pp. 867–870.
[10] N. Obin, V. Dellwo, A. Lacheret, and X. Rodet, “Expectations for Discourse
Genre Identification: a Prosodic Study,” in Interspeech, Makuhari, Japan,
2010, pp. 3070–3073.
[11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura,
“Simultaneous modeling of spectrum, pitch and duration in HMM-based
speech synthesis,” in European Conference on Speech Communication and
Technology, Budapest, Hungary, 1999, pp. 2347–2350.
[12] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Hidden Markov
models based on multi-space probability distribution for pitch pattern modeling,” in International Conference on Audio, Speech, and Signal Processing, Phoenix, Arizona, 1999, pp. 229–232.
[13] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Hidden
semi-Markov model based speech synthesis,” in International Conference
on Speech and Language Processing, Jeju Island, Korea, 2004, pp. 1397–1400.
[14] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Transactions
on Information and Systems, vol. 90, no. 5, pp. 816–824, 2007.
[15] N. Obin, A. Lacheret, and X. Rodet, “HMM-based prosodic structure model
using rich linguistic context,” in Interspeech, Makuhari, Japan, 2010, pp. 1133–1136.
[16] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and
Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.