Machine Translation with LSTMs

Transcription

Machine Translation with LSTMs
Ilya Sutskever
Oriol Vinyals
Quoc Le
Google Inc.
Deep Neural Networks
1. Can perform an astonishingly wide range of computations
2. Can be learned from data
[Diagram: Venn diagram where deep neural networks sit at the intersection of powerful models and learnable models]
Powerful models are necessary
● A weak model will never get good performance
● Examples of weak models:
    ○ Single layer logistic regression
    ○ Linear SVM
    ○ CRFs
    ○ Small neural nets
    ○ Small conv nets
● A neural network needs to be large and deep to be
powerful
A trainable model is necessary
● What’s the use of a powerful model if we can’t train it?
● That’s why supervised backpropagation is so important
● 10-layer neural nets easily trainable with backprop
Why are deep nets powerful?
● A single neuron can implement boolean logic, and thus general computation and computers (see the sketch below)
[Diagram: single threshold neurons implementing OR (weights +1, +1, bias -0.5), AND (weights +1, +1, bias -1.5), and NOT (weight -1, bias +0.5)]
● Mid-sized 2-hidden-layer neural network can sort N N-bit numbers
    ○ Intuitively, sorting requires log N parallel steps
    ○ It's amazing, try it at home with backpropagation!
→ Neurons are more economical than boolean logic
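To make the boolean-logic claim concrete, here is a minimal Python sketch (not from the talk) of single threshold neurons; the weights and biases are read off the gate diagram above.

```python
import numpy as np

def neuron(weights, bias, inputs):
    """A single threshold neuron: fires (1) iff the weighted sum plus the bias is positive."""
    return int(np.dot(weights, inputs) + bias > 0)

# Weights and biases as shown in the slide's diagram.
OR  = lambda a, b: neuron([1, 1], -0.5, [a, b])
AND = lambda a, b: neuron([1, 1], -1.5, [a, b])
NOT = lambda a:    neuron([-1],    0.5, [a])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", OR(a, b), "AND:", AND(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```

Since NAND alone is universal, such neurons can be wired into any boolean circuit, which is the sense in which they support general computation.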
“The Deep Learning Hypothesis”
● Human perception is fast
    ○ Neurons fire at most 100 times a second
    ○ Humans solve perception in 0.1 seconds
    ○ → our neurons fire at most 10 times
Anything a human can do in 0.1 seconds, a big 10-layer neural network can do, too!
● 10-layer neural networks can be trained well in practice
How to solve any problem?
● Use a lot of good AND labelled training data
● Use a big deep neural network
● → Success is the only possible outcome, literally
    ○ Otherwise the neural network is too small
The deep learning hypothesis is true!
● Big deep nets get the best results ever on:
    ○ Speech recognition
    ○ Object recognition
● Deep learning really works!
    ○ But there are other problems, too, such as MT
Deep nets can’t solve all problems
● Inputs and outputs must be of fixed dimensionality
    ○ Great for images: the input is a big image of a fixed size; the output is a 1-of-N encoding of the category
● Bad news for machine translation and speech recognition
[Diagram: a feedforward net with unit-specific connections mapping a fixed-size input to a fixed-size output]
Goal: a general sequence to
sequence neural network
● The hope: a generic method that can be successfully
applied to any sequence-to-sequence problem, and
achieve excellent results
    ○ MT, Q&A, ASR, squiggle recognition, etc.
● Manage expectations: we don’t beat the state of the
art, but we are close to a strong MT baseline system on
a large publicly available dataset!
Recurrent Neural Networks (RNNs)
● RNNs can work with sequences
[Diagram: an RNN unrolled over six timesteps (t=1 to t=6), with input, hidden, and output units at each step. Key idea: each timestep is a different layer with the same weights.]
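To make the "same weights at every timestep" idea concrete, here is a minimal vanilla-RNN forward pass in Python; the matrix names (W_xh, W_hh, W_hy) and the toy dimensions are illustrative, not from the talk.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    """Vanilla RNN: the same three weight matrices are reused at every timestep,
    so the unrolled network is just a very deep net with tied weights."""
    h, outputs = h0, []
    for x in inputs:                      # one "layer" per timestep
        h = np.tanh(W_xh @ x + W_hh @ h)  # new hidden state from input and previous hidden state
        outputs.append(W_hy @ h)          # output at this timestep
    return outputs, h

# Toy sizes, just to show the shapes involved (6 timesteps, as in the diagram).
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
xs = [rng.normal(size=4) for _ in range(6)]
outputs, h_last = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(8))
```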
Recurrent Neural Networks (RNNs)
● Neural networks that can process sequences well
    ○ Very expressive models
● Use backpropagation
    ○ Fun fact: recurrent neural networks were trained in the original backpropagation paper in 1986
● Sadly, RNNs are hard to train with backpropagation
    ○ Unstable
    ○ Has trouble learning “long-term dependencies”
    ○ Vanishing gradient problems (Hochreiter 1991; Bengio et al., 1994)
● There are ways to learn RNNs, but they are hard to use
Long Short-Term Memory (LSTM)
● An RNN architecture that is good at long-term
dependencies
[Diagram: an LSTM cell, with multiplicative (X) input gates I1 and I2, a forget gate F, an output gate O, a memory cell M, and an additive (+) update feeding the hidden state H and the output]
The heart of the LSTM
RNNs overwrite the hidden state
LSTMs add to the hidden state
● Addition has nice gradients
    ○ All terms in a sum contribute equally
● LSTM is good at noticing long-range correlations
    ○ Because of the nice gradients of addition
● Main advantage (over HF, i.e. Hessian-free optimization): requires little tuning
    ○ Hugely important in new applications
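A small sketch of this contrast, written with the standard LSTM equations (which may differ in detail from the variant in the diagram above): the plain RNN overwrites its hidden state at every step, while the LSTM updates its memory cell additively, so gradients flow back through a sum rather than through repeated overwrites.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x, h, W, U):
    # A plain RNN *overwrites* its hidden state at every step.
    return np.tanh(W @ x + U @ h)

def lstm_step(x, h, c, Wg, Wi, Wf, Wo):
    """One step of a standard LSTM (a sketch, not necessarily the exact gating of the talk)."""
    z = np.concatenate([x, h])  # [current input, previous hidden state]
    g = np.tanh(Wg @ z)         # candidate update
    i = sigmoid(Wi @ z)         # input gate
    f = sigmoid(Wf @ z)         # forget gate
    o = sigmoid(Wo @ z)         # output gate
    c = f * c + i * g           # the key point: the memory cell is updated *additively*
    h = o * np.tanh(c)          # hidden state read out from the cell
    return h, c

# Tiny usage example with made-up sizes (each gate maps the 12-dim [x, h] to 8 dims).
rng = np.random.default_rng(1)
x, h, c = rng.normal(size=4), np.zeros(8), np.zeros(8)
Wg, Wi, Wf, Wo = (rng.normal(size=(8, 12)) for _ in range(4))
h, c = lstm_step(x, h, c, Wg, Wi, Wf, Wo)
```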
How to use an LSTM to map
sequences to sequences?
● Normal formulations of LSTMs and RNNs have issues:
    ○ Length of input sequence = length of output sequence
    ○ Not good for either ASR or MT
● Every strategy for mapping sequences to sequences has an HMM-like component
    ○ Normal ASR approaches have a big complicated transducer
    ○ Connectionist Temporal Classification (CTC) assumes monotonic alignments
● But we want something simpler and more generic
    ○ Should be applicable to any sequence-to-sequence problem
        ■ Including MT, where words can be reordered in many ways
Main idea
● Neural nets are excellent at learning very complicated
functions
● “Coerce” a neural network to read one sequence and
produce another
● Learning should take care of the rest
● All neural networks are equivalent anyway
    ○ So the important thing is to provide the neural network with all the information
    ○ And to make it trainable and big
Main idea
[Diagram: the LSTM reads the input sequence A B C D followed by a delimiter "__", then produces the target sequence X Y Z Q, with each emitted target word fed back in as the next input]
That’s it!
● The LSTM needs to read the entire input sequence, and
then produce the target sequence “from memory”
● The input sequence is stored by a single LSTM hidden
state
● Surely the LSTM’s state could store only a handful of
words and nothing else?
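A sketch of how one training example could be laid out under this scheme; the delimiter "__" follows the diagram above, while the helper name and the <EOS> token are illustrative assumptions.

```python
def make_training_pair(source_words, target_words, delimiter="__", eos="<EOS>"):
    """Lay out one sentence pair the way the slides describe: the LSTM reads the
    whole source, then the delimiter, then must produce the target from memory.
    The training signal at every position is simply "predict the next word"."""
    stream = source_words + [delimiter] + target_words + [eos]
    inputs = stream[:-1]    # what the LSTM reads, step by step
    targets = stream[1:]    # what it must predict at each step
    # Only the predictions after the delimiter correspond to translation; whether a
    # loss is also applied to the source positions is a modelling choice.
    return inputs, targets

inputs, targets = make_training_pair(["A", "B", "C", "D"], ["X", "Y", "Z", "Q"])
# inputs  : A B C D __ X Y Z Q
# targets : B C D __ X Y Z Q <EOS>
```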
Step 1: can the LSTM reconstruct its
input?
● Can this scheme learn the identity function?
[Diagram: the LSTM reads A B C D, then a delimiter, and reproduces A B C D as the target sequence]
● Answer: it can, and it does so very easily and effortlessly. Test perplexity of 1.03
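For reference, test perplexity is the exponential of the average negative log-probability assigned to the correct next word, so 1.0 would mean perfect prediction. A small sketch with made-up probabilities shows how a number like 1.03 arises:

```python
import math

def perplexity(per_word_probs):
    """exp of the average negative log-probability of the correct words."""
    nll = -sum(math.log(p) for p in per_word_probs) / len(per_word_probs)
    return math.exp(nll)

# If the model assigns about 0.97 probability to every correct word of the copied
# sequence, the perplexity comes out near 1.03 (illustrative numbers only).
print(perplexity([0.97] * 20))   # ~1.031
```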
Step 2: small dataset experiments:
EuroParl
● French to English
    ○ Low-entropy parliamentary language
    ○ 20M words in total
    ○ Small vocabulary
    ○ Sentence length no longer than 25 words
● Although: 25 words is not that small
● Early results were encouraging
Digression: decoding
● Formally, given an input sentence, the LSTM defines a
distribution over output sentences
● Therefore, we should produce the sentence with the
highest probability
● But there are exponentially many sentences; how do we find it?
● Search problem: use simple greedy beam search
Decoding in a nutshell
● Proceed left to right
● Maintain N partial translations
● Expand each translation with possible next words
● Discard all but the top N new partial translations
[Diagram: with a beam of 2, the partial hypotheses "I" and "My" are expanded into hypotheses such as "I decided", "My decision", "I thought", "I tried", "My thinking", "My direction"; the expansions are sorted, and all but the top 2 new partial hypotheses ("I decided", "My decision", ...) are pruned]
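A minimal Python sketch of the expand-sort-prune loop described above; next_word_logprobs is a stand-in for the LSTM's next-word distribution and is not part of the talk.

```python
def beam_search(next_word_logprobs, beam_size=2, max_len=20, eos="<EOS>"):
    """Simple left-to-right beam search: keep the N best partial translations,
    expand each with candidate next words, keep the N best expansions, repeat.
    next_word_logprobs(prefix) should return (word, log_prob) pairs given the
    words produced so far."""
    beams = [([], 0.0)]                      # (partial translation, total log-probability)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == eos:   # finished hypotheses are carried over unchanged
                candidates.append((words, score))
                continue
            for word, logp in next_word_logprobs(words):
                candidates.append((words + [word], score + logp))
        # Sort all expansions by score and prune down to the beam size.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(w and w[-1] == eos for w, _ in beams):
            break
    return beams
```

With beam_size=2 this is the expand, sort, and prune loop sketched in the diagram above.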
Why does simple beam-search
work?
● The LSTM is trained to predict the next word given
previous words
● If next step prediction is good, truth should be among
the most likely next words
    ○ Empirically, the small beam seems to work fairly well (so we think!)
● But: we have decoding failures
    ○ The net produces zero-length sentences
    ○ Fixable with a heuristic
Model for big experiments
● 160k input words
● 80k output words
● 4 layers of 1000D LSTM
● Different LSTMs for the input and output language
● 384M parameters
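A rough back-of-the-envelope count (ignoring biases, and assuming 1000-dimensional word embeddings on both the input and the output side, which is not stated on the slide) that adds up to the 384M figure:

```python
# Rough parameter count for the model above (biases ignored).
embed_dim, hidden, layers = 1000, 1000, 4
in_vocab, out_vocab = 160_000, 80_000

input_embeddings  = in_vocab  * embed_dim              # 160M
output_embeddings = out_vocab * embed_dim              #  80M (decoder inputs)
softmax           = out_vocab * hidden                 #  80M
# Each LSTM layer has 4 gates, each mapping [layer input, hidden] -> hidden.
# The first layer's input is the 1000-dim embedding; higher layers take the
# 1000-dim hidden state of the layer below, so every layer has the same count.
per_layer = 4 * (embed_dim + hidden) * hidden          #   8M
recurrent = 2 * layers * per_layer                     #  64M (encoder + decoder LSTMs)

total = input_embeddings + output_embeddings + softmax + recurrent
print(f"{total:,}")   # 384,000,000
```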
The model
[Diagram: the model reading "A B C D __" and emitting the target words, annotated: 160k vocab in the input language; an 80k softmax over 1000 dims at the output (this is very big!); 1000 LSTM cells per layer, i.e. 2000 dims per timestep and 2000 x 4 = 8k dims per sentence]
Parallelization
● Parallelization is important
● More parallelization is better -- ongoing work
● 8 GPUs
● More details in the upcoming paper
Results on a big dataset
● Corpus: WMT’14 English → French
● 680M words
● About 50K test words
● An average of 6 models gets a BLEU score of 30.73
● A strong SMT baseline gets 33.3
● State of the art is 35.8
● When we rescore the n-best lists of the baseline using an average of 6 models, we get 36.36
Our system suffers on rare words
It also suffers on long sentences
Break long sentences into pieces
Representations
Examples
● Due to a technicality, the following examples were generated from a model that did not converge
    ○ Actual translations are better
Examples
● FR: Les avionneurs se querellent au sujet de la largeur
des sièges alors que de grosses commandes sont en
jeu
● GT: Aircraft manufacturers are quarreling about the seat
width as large orders are at stake
● LSTM: Aircraft manufacturers are concerned about the
width of seats while large orders are at stake
● TRUTH: Jet makers feud over seat width with big orders
at stake
Example
●
FR: La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers , ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu ce mois-ci .
●
GT: The dispute rages between the major aircraft manufacturers about the
width of seats in the economy class on long-haul flights, paving the way for
a bitter confrontation at the Dubai Airshow which takes place month.
●
LSTM: The dispute is raging between large aircraft manufacturers on the
size of the tourist seats on the long-haul flights , leading to a bitter
confrontation at the Dubai Airshow in the month of October .
●
TRUTH: A row has flared up between leading plane makers over the width
of tourist-class seats on long-distance flights , setting the tone for a bitter
confrontation at this month 's Dubai Airshow
Example
● FR: Le conflit porte sur la largeur des sièges proposés sur les vols long-courriers aux passagers de la classe économique – qui ne sont pas toujours les plus courtisés par les compagnies aériennes , mais auxquels l' espace alloué est essentiel pour augmenter les gains d' efficacité dans les derniers appareils présentés par Airbus SAS et Boeing Co .
● GT: The conflict is about the width of the proposed long-haul passengers in economy class seats - which are not always the most sought by airlines, but that ' allocated space is essential to increase the gains ' efficiency in the latest devices made by Airbus SAS and Boeing Co.
● LSTM: The dispute is about the size of the seats on the long-haul passenger airlines , which are not always the same as the airlines that are in the air , but the most important thing is to increase the efficiency of the aircraft in the latest Airbus and Boeing SAS aircraft
● TRUTH: The dispute focuses on the width of seats provided on long-haul flights for economy passengers - not always the ones most courted by airlines , but whose allocated space holds the key to efficiency claims for the latest jets offered by Airbus SAS and Boeing Co .
Example (out of vocabulary word)
●
FR: Aujourd' hui , Airbus en appelle directement au public avant le
salon aéronautique de Dubaï , où le 777X devrait prendre le pas sur ses
concurrents avec plus de 100 commandes .
●
GT: Today, Airbus is appealing directly to the public before the Dubai
Airshow, where the 777X should take precedence over its competitors with
over 100 commands.
●
LSTM: Today , Airbus is calling directly to the public before the Dubai
Airshow , where it is expected to take over its competitors with more than
100 orders
●
TRUTH: Now , Airbus is appealing directly to the public ahead of the Dubai
Airshow , where the 777X is expected to dominate with more than 100
orders .
Example
●
FR: « Nous sommes heureux que la FAA reconnaisse qu' une
expérience passager agréable n' est pas incompatible avec la
sécurité » , a déclaré Roger Dow , Président et directeur de l' U.S.
Travel Association .
●
GT: "We are pleased that the FAA recognizes that a pleasant passenger
experience is not incompatible with safety," said Roger Dow, president and
director of the US Travel Association
●
LSTM: " We are pleased that the FAA recognizes that a pleasant
passenger experience is not incompatible with security , " said Roger
Dow , President and CEO of the U.S. Travel Association .
●
TRUTH: " We 're pleased the FAA recognizes that an
enjoyable passenger experience is not incompatible with safety and
security , " said Roger Dow , CEO of the U.S. Travel Association .
Example (meh)
●
FR: Un groupe de l' industrie du voyage s' est félicité des
changements , les appelant des arrangements pleins de bon sens pour
des voyageurs équipés de technologies .
●
GT: A group of travel industry welcomed the changes, calling for full
arrangements of sense for travelers equipped technologies.
●
LSTM: A travel industry group welcomed the changes , calling for common
arrangements for travellers with technology .
●
TRUTH: A travel industry group welcomed the changes , calling them
common-sense accommodations for a traveling public now bristling with
technology .
Example (nonsensical translation)
●
FR: Toutefois , le comité a recommandé que la FAA autorise les pilotes à
demander aux passagers d' éteindre leurs appareils pendant les
atterrissages aux instruments dans des conditions de faible visibilité .
●
GT: However, the committee recommended that the FAA allows pilots to
ask passengers to turn off their devices during instrument landings in low
visibility conditions.
●
LSTM: However , the Committee recommended that the FAA allow pilots
to ask passengers to stop their aircraft from landing in instruments under
low visibility conditions .
●
TRUTH: However , the committee recommended the FAA allow pilots to
order passengers to shut off devices during instrument landings in low
visibility .
Example
●
FR: En 2011 , les dirigeants de la société ont chargé un avion de ligne de
lecteurs de livres électroniques Kindle et l' ont fait voler pour voir
s' il y avait des problèmes , mais il n' y en a eu aucun .
●
GT: In 2011, company officials have charged an airliner of Kindle e-book
readers and they flew to see if there were problems, but there were none.
●
LSTM: In 2011 , the leaders of the company set up an electronic book
readers ' book , and flew it to see if there were problems , but there
were none .
●
TRUTH: In 2011 , company officials loaded an airliner full of their Kindle e-readers and flew it around to test for problems but found none .
Example (inversion of meaning)
●
FR: Amazon.com fait partie de ceux qui font pression pour un
assouplissement des restrictions concernant l' utilisation des
appareils électroniques par les passagers .
●
GT: Amazon.com is among those pushing for a relaxation of restrictions on
the use of electronic devices by passengers.
●
LSTM: Amazon.com is one of those who lobby for restrictions on the use
of electronic devices by passengers .
●
TRUTH: Among those pressing for a relaxation of restrictions on
passengers ' use of the devices has been Amazon.com.
Example
●
FR: Les compagnies aériennes permettent à leurs passagers d'
utiliser le Wi-Fi aux altitudes de croisière depuis plusieurs années .
●
GT: Airlines allow their passengers to use Wi-Fi at cruising altitudes for
several years.
●
LSTM: The airlines allow their passengers to use the Wi-Fi at cruising
altitudes for several years .
●
TRUTH: Airlines have been offering Wi-Fi use at cruising altitudes to
passengers for several years .
Example (forgetting a name)
●
FR: « Une fois la nouvelle politique mise en œuvre en toute sécurité – et
nous allons travailler en étroite collaboration avec les transporteurs pour
cela – ce sera profitable à tout le monde » , a expliqué Glading dans une
déclaration .
●
GT: "Once the new policy implemented safely - and we will work closely
with carriers for this - it will benefit everyone," said Glading in a statement.
●
LSTM: " Once the new security policy is implemented - and we will
work closely with the carriers in this way - it will be beneficial for everyone ,
" he said in a statement .
●
TRUTH: " Once the new policy is safely implemented - and we
're going to work closely with the carrier to do that - it will be a win-win , " Glading said in a statement
Example
●
FR: « En fonction du plan , nous pourrions approuver une utilisation élargie
des appareils électroniques très prochainement » , a indiqué la FAA dans
une déclaration .
●
GT: "Depending on the plan, we could approve wider use of electronic
devices in the near future," said the FAA in a statement.
●
LSTM: " Depending on the plan , we could approve an extended use
of electronic devices in the near future , " said FAA in a statement .
●
TRUTH: " Depending on the condition of the plan , we could approve
expanded use of electronic devices very soon , " the FAA said in a
statement .
Example
●
FR: Les compagnies aériennes devront prouver à la FAA que leurs avions
respectent les nouvelles directives et qu' elles ont mis à jour les
manuels de formation des membres d' équipage et les règles
concernant le rangement des appareils pour être en conformité .
●
GT: Airlines will have to prove to the FAA that the planes meet the new
guidelines and they have updated training manuals for crew members and
the rules for storage devices for compliance.
●
LSTM: The airlines will have to prove to the FAA that their aircraft comply
with the new guidelines and have updated the training manuals for crew
members and the rules for the storage of aircraft in order to comply .
●
TRUTH: Airlines will have to show the FAA how their airplanes meet the
new guidelines and that they 've updating their flight crew training
manuals and rules for stowing devices to reflect the new guidelines .
Example
●
FR: FAA : Les passagers aériens peuvent désormais utiliser des gadgets à
bord des avions ( mais pas passer un appel avec leur téléphone portable )
●
GT: FAA: Air passengers can now use gadgets on planes (but not make a
call on their cell phone)
●
LSTM: FAA : Air passengers can now use gadgets on board ( but do not
use a phone call ) .
●
TRUTH: FAA : Air passengers can now use gadgets on planes ( but not
make cell phone calls )
Remember, this model hasn’t
converged
Results will be better by the end of the week
Weakness of model
● Large vocabulary: we have a vector for each word in the vocabulary
    ○ So large vocabularies become expensive and require a lot of training
● Long sentences: the LSTM’s state has limited capacity
● Solution:
    ○ Train on chunks, translate chunks in order
    ○ Rare word problem: use a huge vocab and train the LSTM on a huge amount of data
        ■ And the rare word will become a frequent word
Conclusion
● We showed that regular LSTMs can translate short
sentences pretty well
● On short sentences and small vocabularies, our BLEU score is worse than the state of the art, but not by that much
● Our method applies to any sequence to sequence
problem
● We will succeed.
In closing ...
● Deep learning theory is confirmed yet again
● MT will probably be solved soon
● Can now map sequences to sequences, no need to limit
ourselves to vectors
● “If your deep net doesn’t work, train a bigger deeper
net”
THE END!