End-to-End Discriminative Training for Machine Translation

Transcription

End-to-End Discriminative Training for Machine Translation
End-to-End Discriminative Training
for Machine Translation
Percy Liang
Dan Klein
Alex Bouchard-Côté
Ben Taskar
UC Berkeley
Computer Science Division
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
Features
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
Features
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
NN JJ → JJ NN
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
1
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
Features
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
NN JJ → JJ NN
Lang. model
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
1
-4.253
...
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
Features
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
NN JJ → JJ NN
Lang. model
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
1
-4.253
···
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
Features
w
NN JJ → JJ NN 1
2.3
Lang. model
-4.253 1.7
···
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
Features
w
NN JJ → JJ NN 1
2.3
Lang. model
-4.253 1.7
···
That’s it.
2
Discriminative machine translation
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
Training examples
x1: mr barn crespo wishes to make a comment .
y1: m. barn crespo veut intervenir .
x2: i consider it very important to continue this work .
y2: à mon avis , il est très important de continuer dans cette voie-là .
x3: this all augurs well , and represents real progress .
y3: tout cela est de bon augure et constitue un réel progrès .
...
Features
w
NN JJ → JJ NN 1
2.3
Lang. model
-4.253 1.7
···
That’s it. Well, actually. . .
2
Why is this hard?
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
3
Why is this hard?
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
• Correct output is ill-defined
3
Why is this hard?
x:
le parlement adopte la résolution législative
y:
parliament adopted the legislative resolution
• Correct output is ill-defined
• Correspondences are missing
3
Why is this hard?
x:
y:
le parlement adopte la résolution législative
DT NN
VBD DT
NN
JJ
• Correct output is ill-defined
• Correspondences are missing
3
Why is this hard?
x:
le parlement adopte la résolution législative
???
y:
parliament adopted the legislative resolution
• Correct output is ill-defined
• Correspondences are missing
3
Why is this hard?
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
• Correct output is ill-defined
• Correspondences are missing
3
Why is this hard?
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
• Correct output is ill-defined
• Correspondences are missing
• Hidden correspondence is abused
3
Why is this hard?
x:
le parlement adopte la résolution législative
h:
y:
parliament has adopted the resolution
• Correct output is ill-defined
• Correspondences are missing
• Hidden correspondence is abused
3
Why is this hard?
x:
le parlement adopte la résolution législative
h:
y:
parliament has adopted the resolution
• Correct output is ill-defined
• Correspondences are missing
• Hidden correspondence is abused
3
Why is this hard?
x:
le parlement adopte la résolution législative
h:
y:
parliament has adopted the resolution
• Correct output is ill-defined
• Correspondences are missing
• Hidden correspondence is abused
3
Discriminative approaches
Reranking [Shen, et al. ’04; Och, et al. ’04]
x
Base
y1
y2
y3
Discrim.
w
y1
y2
y3
4
Discriminative approaches
Reranking [Shen, et al. ’04; Och, et al. ’04]
x
Base
y1
y2
y3
Discrim.
w
y1
y2
y3
MERT [Och ’03]
Discrim.
x
w
···
y100
y101
y102
···
4
Discriminative approaches
Reranking [Shen, et al. ’04; Och, et al. ’04]
x
Base
y1
y2
y3
Discrim.
w
y1
y2
y3
MERT [Och ’03]
Discrim.
x
w
···
y100
y101
y102
···
Local [Tillmann, Zhang ’05]
Discrim.
x
w
···
(part of y)100
(part of y)101
(part of y)102
···
4
Discriminative approaches
Reranking [Shen, et al. ’04; Och, et al. ’04]
x
Base
y1
y2
y3
Discrim.
w
Local [Tillmann, Zhang ’05]
Discrim.
x
w
···
(part of y)100
(part of y)101
(part of y)102
···
MERT [Och ’03]
y1
y2
y3
Discrim.
x
w
···
y100
y101
y102
···
Our end-to-end approach
Discrim.
x
w
···
y100
y101
y102
···
4
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
5
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
GIZA++
SRI
Lang.
Model
alignments
Phrase
extractor
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
5
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
GIZA++
SRI
Lang.
Model
features
alignments
Phrase
extractor
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
features
Perceptron trainer
5
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
GIZA++
SRI
Lang.
Model
features
Restrict
to length
5–15
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
Perceptron trainer
5
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
GIZA++
SRI
Lang.
Model
features
Restrict
to length
5–15
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Perceptron trainer
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
feature weights
Evaluator
BLEU
5
Experimental setup
Train (’99–’01)
414 sentence pairs
13.3M words
French-English Europarl corpus
GIZA++
SRI
Lang.
Model
features
Restrict
to length
5–15
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Perceptron trainer
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
feature weights
Evaluator
BLEU
Decoder
5
Experimental setup
x:
le parlement adopte la résolution législative
h:
Train (’99–’01)
y: parliament
adopted the legislative resolution
414 sentence
pairs
13.3M words
Features: Φ(x, y, h) =
GIZA++
SRI
Lang.
Model
features
0
French-English
· · · 0 1 0.2 · · · Europarl corpus
Restrict
to length
5–15
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Perceptron trainer
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
feature weights
Evaluator
BLEU
Decoder
5
Experimental setup
x:
le parlement adopte la résolution législative
h:
Train (’99–’01)
y: parliament
adopted the legislative resolution
414 sentence
pairs
13.3M words
Features: Φ(x, y, h) =
Weights: Restrictw =
GIZA++
SRI
Lang.
Model
features
to length
5–15
0
French-English
· · · 0 1 0.2 · · · Europarl corpus
· · · 2 0.1 −5 · · ·
0
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Perceptron trainer
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
feature weights
Evaluator
BLEU
Decoder
5
Experimental setup
x:
le parlement adopte la résolution législative
h:
Train (’99–’01)
y: parliament
adopted the legislative resolution
414 sentence
pairs
13.3M words
Features: Φ(x, y, h) =
Weights: Restrictw =
GIZA++
SRI
Lang.
Model
features
0
French-English
· · · 0 1 0.2 · · · Europarl corpus
· · · 2 0.1 −5 · · ·
0
to length
5–15 = argmax Φ(x, y, h) · w
Decode(x)
y ,h
alignments
Phrase Train.5–15
67K sentence pairs
extractor
715K words
features
Perceptron trainer
Dev.5–15 (’02)
Test.5–15 (’03)
first 1K sentence pairs
10.4K words
first 1K sentence pairs
10.8K words
examples
feature weights
Evaluator
BLEU
Decoder
5
Experimental setup
x:
le parlement adopte la résolution législative
h:
Train (’99–’01)
y: parliament
adopted the legislative resolution
414 sentence
pairs
13.3M words
Features: Φ(x, y, h) =
Weights: Restrictw =
GIZA++
SRI
Lang.
Model
features
0
French-English
· · · 0 1 0.2 · · · Europarl corpus
· · · 2 0.1 −5 · · ·
0
to length
5–15 = argmax Φ(x, y, h) · w
Decode(x)
y ,h
alignments
Dev.5–15 (’02)
Levels of distortion:
Phrase Train.5–15
67K• sentence
pairs
first 1K sentence pairs
monotonic
extractor
715K words
10.4K words
features
• limited distortion
• examples
full distortion
Perceptron trainer
feature weights
Test.5–15 (’03)
first 1K sentence pairs
10.8K words
Evaluator
BLEU
Decoder
5
Perceptron training
For each training example (x, y): [Collins ’02]
w ← w +Φ(x, yt)
yt
= y
yp
−Φ(x, yp)
= Decode(x)
6
Perceptron training
For each training example (x, y): [Collins ’02]
w ← w +Φ(x, yt)
yt
= y
yp
−Φ(x, yp)
= Decode(x)
w ← w +Φ(x, yt, ht)
−Φ(x, yp, hp)
yt, ht = ???
yp, hp = Decode(x)
6
Perceptron training
For each training example (x, y): [Collins ’02]
w ← w +Φ(x, yt)
yt
= y
yp
−Φ(x, yp)
= Decode(x)
w ← w +Φ(x, yt, ht)
−Φ(x, yp, hp)
yt, ht = ???
yp, hp = Decode(x)
How to choose yt, ht?
There are several choices and the choice does matter
6
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
x:
voté sur demande d ’ urgence
hp:
yp :
vote on emergency request
Current prediction
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
x:
voté sur demande d ’ urgence
hp:
yp :
vote on emergency request
Current prediction
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
x:
voté sur demande d ’ urgence
hp:
yp :
x:
vote on emergency request
Current prediction
voté sur demande d ’ urgence
ht:
yt :
vote on a request for urgent procedure
Bold update
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
x:
voté sur demande d ’ urgence
hp:
yp :
x:
vote on emergency request
Current prediction
voté sur demande d ’ urgence
ht:
yt :
vote on a request for urgent procedure
Bold update
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
n-best
x:
voté sur demande d ’ urgence
hp:
yp :
x:
vote on emergency request
Current prediction
voté sur demande d ’ urgence
ht:
yt :
vote on a request for urgent procedure
Bold update
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
n-best
x:
voté sur demande d ’ urgence
hp:
yp :
x:
vote on emergency request
Current prediction
voté sur demande d ’ urgence
ht:
yt :
vote on a request for urgent procedure
Bold update
7
Update strategies
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
x:
voté sur demande d ’ urgence
ht:
Reachable translations
n-best
x:
vote on an urgent request
Local update
voté sur demande d ’ urgence
hp:
yp :
yt :
x:
vote on emergency request
Current prediction
voté sur demande d ’ urgence
ht:
yt :
vote on a request for urgent procedure
Bold update
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
n-best
x:
voté sur demande d ’ urgence
hp:
yp :
vote on emergency request
Current prediction
7
Update strategies
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
Reachable translations
n-best
x:
voté sur demande d ’ urgence
hp:
yp :
vote on emergency request
Current prediction
Bold update: skip example
7
Update strategies
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
x:
voté sur demande d ’ urgence
ht:
Reachable translations
n-best
x:
yt :
vote on an urgent request
Local update
voté sur demande d ’ urgence
hp:
yp :
vote on emergency request
Current prediction
Bold update: skip example
7
Update strategies
Training example (reference)
x: voté sur demande d ’ urgence
y: vote on a request for urgent procedure
w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp)
x:
ht:
Reachable translations
x:
Decoder
n-best
Monotonic
Limited distortion
yt :
Bold
34.3
33.5
Local
Local update
34.6
34.7
hp:
yp :
Current prediction
Bold update: skip example
7
Features
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
n
Φ(x, y, h) =
8
Features
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
n
Blanket TranslationLogProb
LangModelLogProb
-8.92
-2.462
Φ(x, y, h) =
8
Features
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
n
Φ(x, y, h) =
Blanket TranslationLogProb
LangModelLogProb
Lexical
-8.92
-2.462
le parlement
parliament 1
[parliament adopted the]
1
...
8
Features
x:
le parlement adopte la résolution législative
h:
y:
parliament adopted the legislative resolution
n
Φ(x, y, h) =
Blanket TranslationLogProb
LangModelLogProb
Lexical
POS
le parlement
parliament
[parliament adopted the]
...
DT NN
NN
[NN VBD DT]
...
NN JJ
swap
JJ NN
-8.92
-2.462
1
1
1
1
1
8
Effect of features
trente-cinq
langues
five
languages
Features
Blanket (untuned)
Dev BLEU
33.0
9
Effect of features
trente-cinq
langues
five
languages
Tune Blanket features...
• Resulting relative weights: w(TranslationLogProb) = 2
w(LangModelLogProb) = 1
Features
Blanket (untuned)
Dev BLEU
33.0
9
Effect of features
trente-cinq
langues
five
languages
thirty-five
languages
Tune Blanket features...
• Resulting relative weights: w(TranslationLogProb) = 2
w(LangModelLogProb) = 1
Features
Blanket (untuned)
Dev BLEU
33.0
9
Effect of features
trente-cinq
langues
five
languages
thirty-five
languages
Tune Blanket features...
• Resulting relative weights: w(TranslationLogProb) = 2
w(LangModelLogProb) = 1
Features
Blanket (untuned)
Blanket
Dev BLEU
33.0
33.4
9
Effect of features
pour cela que
j ’ ai
voté favorablement
.
for that
i have
voted in favour
.
Features
Blanket (untuned)
Blanket
Dev BLEU
33.0
33.4
9
Effect of features
pour cela que
j ’ ai
voté favorablement
.
for that
i have
voted in favour
.
Add Lexical features...
• j ’ ai
i have gets very negative weight
• Literal phrase translations downweighted
to allow context-sensitive translations
Features
Blanket (untuned)
Blanket
Dev BLEU
33.0
33.4
9
Effect of features
pour cela que
j ’ ai
voté favorablement
.
for that
i have
voted in favour
.
for that reason
i
voted in favour
.
Add Lexical features...
• j ’ ai
i have gets very negative weight
• Literal phrase translations downweighted
to allow context-sensitive translations
Features
Blanket (untuned)
Blanket
Dev BLEU
33.0
33.4
9
Effect of features
pour cela que
j ’ ai
voté favorablement
.
for that
i have
voted in favour
.
for that reason
i
voted in favour
.
Add Lexical features...
• j ’ ai
i have gets very negative weight
• Literal phrase translations downweighted
to allow context-sensitive translations
Features
Blanket (untuned)
Blanket
Blanket+Lexical
Dev BLEU
33.0
33.4
35.0
9
Effect of features
How can we generalize beyond lexical features?
Features
Blanket (untuned)
Blanket
Blanket+Lexical
Dev BLEU
33.0
33.4
35.0
9
Effect of features
How can we generalize beyond lexical features?
Add POS features...
• la réalisation du droit
the right has negative weight
generalizes to DT NN IN NN
DT NN
• Number of nonzero features drops (1.55M to 1.24M)
Features
Blanket (untuned)
Blanket
Blanket+Lexical
Dev BLEU
33.0
33.4
35.0
9
Effect of features
How can we generalize beyond lexical features?
Add POS features...
• la réalisation du droit
the right has negative weight
generalizes to DT NN IN NN
DT NN
• Number of nonzero features drops (1.55M to 1.24M)
Features
Blanket (untuned)
Blanket
Blanket+Lexical
Blanket+Lexical+POS
Dev BLEU
33.0
33.4
35.0
35.3
9
Alignment constellation features
Phrase-extraction heuristic is important [Koehn ’03]
,
ce
même
,
that
same
croissance
zéro
zero
growth
rate
secure
refuge
abri
sûr
10
Alignment constellation features
Phrase-extraction heuristic is important [Koehn ’03]
croissance
zéro
,
ce
même
,
that
same
zero
growth
rate
secure
refuge
abri
sûr
Put features on the phrase-extraction process itself
Φ(x, y, h) =
n
1
...
10
Alignment constellation features
Phrase-extraction heuristic is important [Koehn ’03]
croissance
zéro
,
ce
même
,
that
same
zero
growth
rate
secure
refuge
abri
sûr
+
+
−
Put features on the phrase-extraction process itself
Φ(x, y, h) =
n
1
...
10
Alignment constellation features
Phrase-extraction heuristic is important [Koehn ’03]
croissance
zéro
,
ce
même
,
that
same
zero
growth
rate
secure
refuge
abri
sûr
+
+
−
Put features on the phrase-extraction process itself
Φ(x, y, h) =
Features
Blanket
Blanket+Lexical
Blanket+Lexical+POS
n
1
...
-Const
31.8
32.2
32.3
+Const
32.2
32.5
32.5
10
Summary of results
Monotonic systems
Pharaoh (MERT)
Blanket
Blanket+Lexical+POS
Test BLEU
28.8
28.4
29.6
11
Summary of results
Monotonic systems
Pharaoh (MERT)
Blanket
Blanket+Lexical+POS
Test BLEU
28.8
28.4
29.6
Distortion systems
Pharaoh (MERT) [Full]
Blanket [Limited]
Blanket+Lexical+POS [Limited]
Test BLEU
29.5
30.0
30.9
11
Conclusions
• Perceptron training with hidden variables:
local updates are better
12
Conclusions
• Perceptron training with hidden variables:
local updates are better
• Allow many expressive features:
blanket, lexical, POS, alignment constellation
12
Conclusions
• Perceptron training with hidden variables:
local updates are better
• Allow many expressive features:
blanket, lexical, POS, alignment constellation
• Significant BLEU improvements over Pharaoh
12
Conclusions
• Perceptron training with hidden variables:
local updates are better
• Allow many expressive features:
blanket, lexical, POS, alignment constellation
• Significant BLEU improvements over Pharaoh
• Extension: beyond phrase-based models
12
Conclusions
• Perceptron training with hidden variables:
local updates are better
• Allow many expressive features:
blanket, lexical, POS, alignment constellation
• Significant BLEU improvements over Pharaoh
• Extension: beyond phrase-based models
That’s it.
12