End-to-End Discriminative Training for Machine Translation
Transcription
End-to-End Discriminative Training for Machine Translation
End-to-End Discriminative Training for Machine Translation Percy Liang Dan Klein Alex Bouchard-Côté Ben Taskar UC Berkeley Computer Science Division Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples Features x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples Features x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . NN JJ → JJ NN x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . 1 x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples Features x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . NN JJ → JJ NN Lang. model x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . 1 -4.253 ... 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples Features x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . NN JJ → JJ NN Lang. model x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... 1 -4.253 ··· 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... Features w NN JJ → JJ NN 1 2.3 Lang. model -4.253 1.7 ··· 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... Features w NN JJ → JJ NN 1 2.3 Lang. model -4.253 1.7 ··· That’s it. 2 Discriminative machine translation x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution Training examples x1: mr barn crespo wishes to make a comment . y1: m. barn crespo veut intervenir . x2: i consider it very important to continue this work . y2: à mon avis , il est très important de continuer dans cette voie-là . x3: this all augurs well , and represents real progress . y3: tout cela est de bon augure et constitue un réel progrès . ... Features w NN JJ → JJ NN 1 2.3 Lang. model -4.253 1.7 ··· That’s it. Well, actually. . . 2 Why is this hard? x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution 3 Why is this hard? x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution • Correct output is ill-defined 3 Why is this hard? x: le parlement adopte la résolution législative y: parliament adopted the legislative resolution • Correct output is ill-defined • Correspondences are missing 3 Why is this hard? x: y: le parlement adopte la résolution législative DT NN VBD DT NN JJ • Correct output is ill-defined • Correspondences are missing 3 Why is this hard? x: le parlement adopte la résolution législative ??? y: parliament adopted the legislative resolution • Correct output is ill-defined • Correspondences are missing 3 Why is this hard? x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution • Correct output is ill-defined • Correspondences are missing 3 Why is this hard? x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution • Correct output is ill-defined • Correspondences are missing • Hidden correspondence is abused 3 Why is this hard? x: le parlement adopte la résolution législative h: y: parliament has adopted the resolution • Correct output is ill-defined • Correspondences are missing • Hidden correspondence is abused 3 Why is this hard? x: le parlement adopte la résolution législative h: y: parliament has adopted the resolution • Correct output is ill-defined • Correspondences are missing • Hidden correspondence is abused 3 Why is this hard? x: le parlement adopte la résolution législative h: y: parliament has adopted the resolution • Correct output is ill-defined • Correspondences are missing • Hidden correspondence is abused 3 Discriminative approaches Reranking [Shen, et al. ’04; Och, et al. ’04] x Base y1 y2 y3 Discrim. w y1 y2 y3 4 Discriminative approaches Reranking [Shen, et al. ’04; Och, et al. ’04] x Base y1 y2 y3 Discrim. w y1 y2 y3 MERT [Och ’03] Discrim. x w ··· y100 y101 y102 ··· 4 Discriminative approaches Reranking [Shen, et al. ’04; Och, et al. ’04] x Base y1 y2 y3 Discrim. w y1 y2 y3 MERT [Och ’03] Discrim. x w ··· y100 y101 y102 ··· Local [Tillmann, Zhang ’05] Discrim. x w ··· (part of y)100 (part of y)101 (part of y)102 ··· 4 Discriminative approaches Reranking [Shen, et al. ’04; Och, et al. ’04] x Base y1 y2 y3 Discrim. w Local [Tillmann, Zhang ’05] Discrim. x w ··· (part of y)100 (part of y)101 (part of y)102 ··· MERT [Och ’03] y1 y2 y3 Discrim. x w ··· y100 y101 y102 ··· Our end-to-end approach Discrim. x w ··· y100 y101 y102 ··· 4 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words 5 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus GIZA++ SRI Lang. Model alignments Phrase extractor Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words 5 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus GIZA++ SRI Lang. Model features alignments Phrase extractor Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words features Perceptron trainer 5 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus GIZA++ SRI Lang. Model features Restrict to length 5–15 alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples Perceptron trainer 5 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus GIZA++ SRI Lang. Model features Restrict to length 5–15 alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Perceptron trainer Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples feature weights Evaluator BLEU 5 Experimental setup Train (’99–’01) 414 sentence pairs 13.3M words French-English Europarl corpus GIZA++ SRI Lang. Model features Restrict to length 5–15 alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Perceptron trainer Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples feature weights Evaluator BLEU Decoder 5 Experimental setup x: le parlement adopte la résolution législative h: Train (’99–’01) y: parliament adopted the legislative resolution 414 sentence pairs 13.3M words Features: Φ(x, y, h) = GIZA++ SRI Lang. Model features 0 French-English · · · 0 1 0.2 · · · Europarl corpus Restrict to length 5–15 alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Perceptron trainer Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples feature weights Evaluator BLEU Decoder 5 Experimental setup x: le parlement adopte la résolution législative h: Train (’99–’01) y: parliament adopted the legislative resolution 414 sentence pairs 13.3M words Features: Φ(x, y, h) = Weights: Restrictw = GIZA++ SRI Lang. Model features to length 5–15 0 French-English · · · 0 1 0.2 · · · Europarl corpus · · · 2 0.1 −5 · · · 0 alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Perceptron trainer Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples feature weights Evaluator BLEU Decoder 5 Experimental setup x: le parlement adopte la résolution législative h: Train (’99–’01) y: parliament adopted the legislative resolution 414 sentence pairs 13.3M words Features: Φ(x, y, h) = Weights: Restrictw = GIZA++ SRI Lang. Model features 0 French-English · · · 0 1 0.2 · · · Europarl corpus · · · 2 0.1 −5 · · · 0 to length 5–15 = argmax Φ(x, y, h) · w Decode(x) y ,h alignments Phrase Train.5–15 67K sentence pairs extractor 715K words features Perceptron trainer Dev.5–15 (’02) Test.5–15 (’03) first 1K sentence pairs 10.4K words first 1K sentence pairs 10.8K words examples feature weights Evaluator BLEU Decoder 5 Experimental setup x: le parlement adopte la résolution législative h: Train (’99–’01) y: parliament adopted the legislative resolution 414 sentence pairs 13.3M words Features: Φ(x, y, h) = Weights: Restrictw = GIZA++ SRI Lang. Model features 0 French-English · · · 0 1 0.2 · · · Europarl corpus · · · 2 0.1 −5 · · · 0 to length 5–15 = argmax Φ(x, y, h) · w Decode(x) y ,h alignments Dev.5–15 (’02) Levels of distortion: Phrase Train.5–15 67K• sentence pairs first 1K sentence pairs monotonic extractor 715K words 10.4K words features • limited distortion • examples full distortion Perceptron trainer feature weights Test.5–15 (’03) first 1K sentence pairs 10.8K words Evaluator BLEU Decoder 5 Perceptron training For each training example (x, y): [Collins ’02] w ← w +Φ(x, yt) yt = y yp −Φ(x, yp) = Decode(x) 6 Perceptron training For each training example (x, y): [Collins ’02] w ← w +Φ(x, yt) yt = y yp −Φ(x, yp) = Decode(x) w ← w +Φ(x, yt, ht) −Φ(x, yp, hp) yt, ht = ??? yp, hp = Decode(x) 6 Perceptron training For each training example (x, y): [Collins ’02] w ← w +Φ(x, yt) yt = y yp −Φ(x, yp) = Decode(x) w ← w +Φ(x, yt, ht) −Φ(x, yp, hp) yt, ht = ??? yp, hp = Decode(x) How to choose yt, ht? There are several choices and the choice does matter 6 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations x: voté sur demande d ’ urgence hp: yp : vote on emergency request Current prediction 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations x: voté sur demande d ’ urgence hp: yp : vote on emergency request Current prediction 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations x: voté sur demande d ’ urgence hp: yp : x: vote on emergency request Current prediction voté sur demande d ’ urgence ht: yt : vote on a request for urgent procedure Bold update 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations x: voté sur demande d ’ urgence hp: yp : x: vote on emergency request Current prediction voté sur demande d ’ urgence ht: yt : vote on a request for urgent procedure Bold update 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations n-best x: voté sur demande d ’ urgence hp: yp : x: vote on emergency request Current prediction voté sur demande d ’ urgence ht: yt : vote on a request for urgent procedure Bold update 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations n-best x: voté sur demande d ’ urgence hp: yp : x: vote on emergency request Current prediction voté sur demande d ’ urgence ht: yt : vote on a request for urgent procedure Bold update 7 Update strategies Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) x: voté sur demande d ’ urgence ht: Reachable translations n-best x: vote on an urgent request Local update voté sur demande d ’ urgence hp: yp : yt : x: vote on emergency request Current prediction voté sur demande d ’ urgence ht: yt : vote on a request for urgent procedure Bold update 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations n-best x: voté sur demande d ’ urgence hp: yp : vote on emergency request Current prediction 7 Update strategies w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure Reachable translations n-best x: voté sur demande d ’ urgence hp: yp : vote on emergency request Current prediction Bold update: skip example 7 Update strategies Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) x: voté sur demande d ’ urgence ht: Reachable translations n-best x: yt : vote on an urgent request Local update voté sur demande d ’ urgence hp: yp : vote on emergency request Current prediction Bold update: skip example 7 Update strategies Training example (reference) x: voté sur demande d ’ urgence y: vote on a request for urgent procedure w ← w + Φ(x, yt, ht ) − Φ(x, yp, hp) x: ht: Reachable translations x: Decoder n-best Monotonic Limited distortion yt : Bold 34.3 33.5 Local Local update 34.6 34.7 hp: yp : Current prediction Bold update: skip example 7 Features x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution n Φ(x, y, h) = 8 Features x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution n Blanket TranslationLogProb LangModelLogProb -8.92 -2.462 Φ(x, y, h) = 8 Features x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution n Φ(x, y, h) = Blanket TranslationLogProb LangModelLogProb Lexical -8.92 -2.462 le parlement parliament 1 [parliament adopted the] 1 ... 8 Features x: le parlement adopte la résolution législative h: y: parliament adopted the legislative resolution n Φ(x, y, h) = Blanket TranslationLogProb LangModelLogProb Lexical POS le parlement parliament [parliament adopted the] ... DT NN NN [NN VBD DT] ... NN JJ swap JJ NN -8.92 -2.462 1 1 1 1 1 8 Effect of features trente-cinq langues five languages Features Blanket (untuned) Dev BLEU 33.0 9 Effect of features trente-cinq langues five languages Tune Blanket features... • Resulting relative weights: w(TranslationLogProb) = 2 w(LangModelLogProb) = 1 Features Blanket (untuned) Dev BLEU 33.0 9 Effect of features trente-cinq langues five languages thirty-five languages Tune Blanket features... • Resulting relative weights: w(TranslationLogProb) = 2 w(LangModelLogProb) = 1 Features Blanket (untuned) Dev BLEU 33.0 9 Effect of features trente-cinq langues five languages thirty-five languages Tune Blanket features... • Resulting relative weights: w(TranslationLogProb) = 2 w(LangModelLogProb) = 1 Features Blanket (untuned) Blanket Dev BLEU 33.0 33.4 9 Effect of features pour cela que j ’ ai voté favorablement . for that i have voted in favour . Features Blanket (untuned) Blanket Dev BLEU 33.0 33.4 9 Effect of features pour cela que j ’ ai voté favorablement . for that i have voted in favour . Add Lexical features... • j ’ ai i have gets very negative weight • Literal phrase translations downweighted to allow context-sensitive translations Features Blanket (untuned) Blanket Dev BLEU 33.0 33.4 9 Effect of features pour cela que j ’ ai voté favorablement . for that i have voted in favour . for that reason i voted in favour . Add Lexical features... • j ’ ai i have gets very negative weight • Literal phrase translations downweighted to allow context-sensitive translations Features Blanket (untuned) Blanket Dev BLEU 33.0 33.4 9 Effect of features pour cela que j ’ ai voté favorablement . for that i have voted in favour . for that reason i voted in favour . Add Lexical features... • j ’ ai i have gets very negative weight • Literal phrase translations downweighted to allow context-sensitive translations Features Blanket (untuned) Blanket Blanket+Lexical Dev BLEU 33.0 33.4 35.0 9 Effect of features How can we generalize beyond lexical features? Features Blanket (untuned) Blanket Blanket+Lexical Dev BLEU 33.0 33.4 35.0 9 Effect of features How can we generalize beyond lexical features? Add POS features... • la réalisation du droit the right has negative weight generalizes to DT NN IN NN DT NN • Number of nonzero features drops (1.55M to 1.24M) Features Blanket (untuned) Blanket Blanket+Lexical Dev BLEU 33.0 33.4 35.0 9 Effect of features How can we generalize beyond lexical features? Add POS features... • la réalisation du droit the right has negative weight generalizes to DT NN IN NN DT NN • Number of nonzero features drops (1.55M to 1.24M) Features Blanket (untuned) Blanket Blanket+Lexical Blanket+Lexical+POS Dev BLEU 33.0 33.4 35.0 35.3 9 Alignment constellation features Phrase-extraction heuristic is important [Koehn ’03] , ce même , that same croissance zéro zero growth rate secure refuge abri sûr 10 Alignment constellation features Phrase-extraction heuristic is important [Koehn ’03] croissance zéro , ce même , that same zero growth rate secure refuge abri sûr Put features on the phrase-extraction process itself Φ(x, y, h) = n 1 ... 10 Alignment constellation features Phrase-extraction heuristic is important [Koehn ’03] croissance zéro , ce même , that same zero growth rate secure refuge abri sûr + + − Put features on the phrase-extraction process itself Φ(x, y, h) = n 1 ... 10 Alignment constellation features Phrase-extraction heuristic is important [Koehn ’03] croissance zéro , ce même , that same zero growth rate secure refuge abri sûr + + − Put features on the phrase-extraction process itself Φ(x, y, h) = Features Blanket Blanket+Lexical Blanket+Lexical+POS n 1 ... -Const 31.8 32.2 32.3 +Const 32.2 32.5 32.5 10 Summary of results Monotonic systems Pharaoh (MERT) Blanket Blanket+Lexical+POS Test BLEU 28.8 28.4 29.6 11 Summary of results Monotonic systems Pharaoh (MERT) Blanket Blanket+Lexical+POS Test BLEU 28.8 28.4 29.6 Distortion systems Pharaoh (MERT) [Full] Blanket [Limited] Blanket+Lexical+POS [Limited] Test BLEU 29.5 30.0 30.9 11 Conclusions • Perceptron training with hidden variables: local updates are better 12 Conclusions • Perceptron training with hidden variables: local updates are better • Allow many expressive features: blanket, lexical, POS, alignment constellation 12 Conclusions • Perceptron training with hidden variables: local updates are better • Allow many expressive features: blanket, lexical, POS, alignment constellation • Significant BLEU improvements over Pharaoh 12 Conclusions • Perceptron training with hidden variables: local updates are better • Allow many expressive features: blanket, lexical, POS, alignment constellation • Significant BLEU improvements over Pharaoh • Extension: beyond phrase-based models 12 Conclusions • Perceptron training with hidden variables: local updates are better • Allow many expressive features: blanket, lexical, POS, alignment constellation • Significant BLEU improvements over Pharaoh • Extension: beyond phrase-based models That’s it. 12