The translation of examples, citations, definitions and glosses in the Papillon project
Christian BOITET
GETA, CLIPS, IMAG, 385 rue de la Bibliothèque, BP 53
38041 Grenoble cedex 9, France
[email protected]
Abstract
The Papillon lexical data base comprises a set of detailed monolingual dictionaries of « lexies » (word
senses) interlinked through « axies » (interlingual links) which can also refer to external semantics-oriented
systems such as UNL « universal words », WordNet « synsets », Ontos « concepts » or NTT
ALT/JE system « semantic classes ». The basic idea is that bilingual or multitarget usage dictionaries
can be generated ad libitum from the data base.
This implies that examples, citations, definitions and glosses expressed in each language be translated
into all other languages and stored in the data base. Storing can be achieved in a simple and
« seamless » way by introducing « auxiliary » lexies and axies for these « free language elements ». But
translating all these elements into all languages is a major subproject of the Papillon project.
We propose to use the mutualization feature of the Papillon server and help voluntary contributors
perform or postedit translations using a shared, web-oriented translation workbench following the
« Montaigne » architecture. We also propose to use freely available UNL web sites to get first drafts of
translations, thereby attaching full UNL graphs to auxiliary axies. In the future, a « coedition »
technique, still at the research stage, could be used to improve the UNL graphs a posteriori and
transparently from any language, and get improved translations in all target languages.
Keywords : Papillon multilingual data base, N-N translation of dictionary information, Montaigne
architecture, interlingual representation, coedition of text & UNL graph
Résumé
La base de données lexicale multilingue Papillon comprend un ensemble de dictionnaires monolingues
détaillés de « lexies » (sens de mots) reliés par des « axies » (liens interlingues) qui peuvent aussi
renvoyer à des systèmes sémantiques externes comme les « mots universaux » (UWs) UNL, les
« synsets » de WordNet, les « concepts » d’Ontos, ou les « classes sémantiques » du système ALT/JE de
NTT. L’idée de base est que des dictionnaires d’usage bilingues ou multicibles puissent être générés ad
libitum à partir de la base de données.
Cela implique que les exemples, citations, définitions et gloses exprimés dans chaque langue soient
traduits dans toutes les autres langues et stockés dans la base. Le stockage peut être obtenu simplement
et « sans couture » en introduisant des lexies et des axies « auxiliaires » pour ces « éléments libres de
langue ». Mais la traduction de tous ces éléments dans toutes les langues est un sous-projet majeur du
projet Papillon.
Nous proposons d’utiliser la caractéristique de mutualisation du serveur Papillon en aidant les
contributeurs volontaires à effectuer ou à postéditer des traductions en utilisant un poste de traduction
partagé, utilisable par réseau, et suivant l’architecture « Montaigne ». Nous proposons aussi d’utiliser les
sites gratuits UNL pour obtenir de premiers jets de traductions, en attachant ce faisant des graphes UNL
complets aux axies auxiliaires correspondantes. Dans le futur, on pourrait utiliser une technique de
« coédition », actuellement au stade de la recherche, pour améliorer les graphes UNL a posteriori et de
façon transparente à partir de toute langue, et obtenir des traductions améliorées dans toutes les langues
cibles.
Mots-clés : Base de données lexicale multilingue Papillon, traduction N-N d’informations
dictionnairiques, architecture Montaigne, représentation interlingue, coédition de texte & graphe UNL
Introduction
The Papillon lexical data base comprises a set of detailed monolingual DiCo [26] dictionaries of « lexies » (word
senses) interlinked through « axies » (interlingual links) which in turn can also refer to external
semantics-oriented systems such as UNL « universal words », WordNet « synsets », Ontos « concepts » or NTT ALT/JE
system « semantic classes ». The basic idea is that bilingual or multitarget usage dictionaries can be generated ad
libitum from the data base. Figure 1 gives a simplified example.
Figure 1 : Axies link monolingual lexies and external semantic symbols such as UWs
This implies that all « language elements » in monolingual entries, that is, examples, citations, definitions,
glosses and labels, are translated into all other languages and stored in the data base. For example, in a
French-Thai dictionary for Thai readers, French definitions such as « carte à jouer » or « carte géographique » should
appear both in French and in Thai.
We first analyze this translation problem in more detail and show that, because the situation is asymmetric, the
number of binary translations to perform is not linear but quadratic in the number of languages, whatever the
translation technique. However, storing these translations can be achieved in a simple and « seamless » way by
introducing auxiliary lexies and axies, with a cost linear in the number of languages.
In the second part, we show how to use the « mutualization » spirit of the Papillon project and the associated
characteristics of the Papillon server to link Papillon with a similar web-oriented, cooperative environment for
human translation proposing free on-line translation aids in exchange for contributions to its translation memory
(« Montaigne » architecture).
In the third and last part, we show how to produce some (and perhaps ultimately all) translations of free language
elements using UNL-graphs as intermediate « pivot » representations linked to complex axies, and a mixture of
automatic processing, interactive disambiguation and coedition requiring only monolingual competences from
contributors to reach the desired quality level.
Although we concentrate on translation in this presentation, we should also address the problem of creating free
language elements, if possible in such a way that they are immediately available in all languages. Possible
solutions are (1) to produce examples and perhaps definitions in several languages by extracting them from
existing translation memories and (2) to try to use UNL again to generate parallel language elements from
UNL graphs. But the latter cannot work for citations.
Handling the free language elements in a rich multilingual lexical data base such as Papillon is not only
challenging from the scientific and technical points of view, but also from the organizational, sociological and
intercultural points of view, because of the variety of contributors and techniques.
1 The problem : translating free language elements & storing the results
1.1 Translating (and creating) language elements
1.1.1 Preliminary remarks
As labels are elements of finite lists such as domains (geogr., phys., chem. etc.) and appear in the Papillon
specific schema for each language, we may consider that their translations in all languages are stored once and
for all at creation time. The « free language parts » to be translated are then examples, citations, definitions and
glosses.
By « gloss », we understand any word or phrase attached to a vocable to let human readers guess the intended
lexie (word sense). For example, « ice (food) » = « ice (desert) » as opposed to « ice (cake crust) » and to « ice
(water) ». Glosses are not definitions, and certainly not DiCo semantic formulas, but serve as abbreviated
explanations or hints. To translate them seems trivial but is not. For example, « desert » has to be understood as
part of a meal and not as a geographical desert, or else Japanese readers of an English-Japanese dictionary would
be misled. Various techniques such as « conceptual vectors » or network activations (as done by MSR on the
Longman and American Heritage dictionaries) might be used to disambiguate glosses, and more generally words
appearing in language elements.
Examples, citations and definitions are obviously more difficult to translate than glosses. To make matters
worse, the whole translation situation is asymmetric. Suppose for example that the French DiCo contains « Il
utilise toujours des cartes IGN¹ » as an example for « carte.2 ». The English translation « He always uses IGN
maps » is good as a translation of this example, but certainly not as an example of use for « map » in English.
1.1.2 Quadratic size of the problem
As a consequence, supposing we have L languages, M lexies in each language, and an average of F free
language elements for each lexie, the number of translations needed is not F*M*(L-1), but F*M*L*(L-1),
because each of the F*M*L native elements has to be translated into the L-1 other languages. It is
also necessary to build the natural or « native » F*M*L free language elements. If L=7, M=100,000 and F=3 (1
or 2 examples, 1 gloss, 0 or 1 citation), there are 2.1M native elements to build, giving rise to 12.6M
translations.
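The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are ours and are not part of the Papillon software.

```python
def native_elements(F: int, M: int, L: int) -> int:
    """Native free language elements: F per lexie, M lexies per language, L languages."""
    return F * M * L


def translations_needed(F: int, M: int, L: int) -> int:
    """Each native element has to be translated into the L-1 other languages."""
    return native_elements(F, M, L) * (L - 1)


if __name__ == "__main__":
    F, M, L = 3, 100_000, 7
    print(native_elements(F, M, L))      # 2100000
    print(translations_needed(F, M, L))  # 12600000
```

Doubling the number of languages roughly quadruples the number of translations, while the number of native elements only doubles.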
Translation of free language elements may of course help in building original free language elements in other
languages. For example, « He always uses IGN maps » might induce a contributor to propose « He often uses
AA² maps » as a « native » example. But let us concentrate on the translation problem.
1.2 Storing the translation results
1.2.1 Problems of storing native and translated elements in lexies
Where to store the translation results ? Of course, in the data base itself, and it would seem natural to store them
in the corresponding lexies, alongside the original elements. But we cannot store translations of French
elements in English, Japanese, etc. in French lexies, because that would violate the principle of strict
monolinguality of the DiCo volumes, and give a very messy data structure.
We could envisage storing native and translated elements expressed in French in French lexies. For example, in
the current Papillon microstructure for the French DiCo, the original French example for « carte.1 » (« Il utilise
toujours des cartes IGN ») is stored in the entry for that lexie. Adding appropriate XML tags or attributes³, we
could store next to it the French translations of « native » examples attached to lexies in other languages, such as
« Il utilise souvent des cartes AA ».
This scheme has an important drawback. Translations of examples have to be linked to the corresponding
« native » examples. Implementing that linking would lead to important changes in the macrostructure of the
data base : either introduce special identification attributes or introduce new types of axies and links to them
from inside lexies.
¹ Institut Géographique National
² Automobile Association (for the sake of the example)
³ Recall that each monolingual DiCo volume may be considered as one large XML file, although it is broken
down in small pieces stored in a relational database such as Postgres at the physical level.
1.2.2 Introducing auxiliary lexies and axies
A better solution is to create a new type of entry to store all free language elements. We will call them
« auxiliary lexies » as opposed to the normal lexies. Auxiliary lexies will be linked by auxiliary axies. We may
abbreviate as « x-lexies » and « x-axies », and refine « x » when necessary as « def », « cit », « ex » and « glo »
(definition, citation, example, gloss). As it is, glo-lexies will be quite simple. Cit-lexies and ex-lexies would be
simpler than normal lexies, having no semantic definition, no logico-syntactic « regime », no examples and no
collocations, but perhaps attached morphosyntactic information and sense disambiguation information such as
sense number (1 in « carte.1 »).
The x-axies will only be slightly different from axies. Links will remain the same, the only difference being that
an x-axie will contain a list of x-lexies (instead of a list of lexies) for each language L, and that an x-axie with
x≠glo will contain a list of UNL-graphs (instead of a list of UWs).
This way, storing of free language elements and their translations can be achieved in a simple and « seamless »
way.
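As a rough illustration of this storage scheme, the following Python sketch models x-lexies and x-axies as plain records. All class names, field names, identifiers and language codes here are assumptions made for the example; they are not the actual Papillon XML schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class XLexie:
    """Auxiliary lexie: one free language element in one language."""
    ident: str
    lang: str
    kind: str      # "def", "cit", "ex" or "glo"
    text: str
    native: bool   # True for an original element, False for a translation


@dataclass
class XAxie:
    """Auxiliary axie: links the x-lexies that express the same element."""
    ident: str
    kind: str
    lexies: Dict[str, List[str]] = field(default_factory=dict)  # language -> x-lexie ids
    unl_graphs: List[str] = field(default_factory=list)         # used when kind != "glo"
    uws: List[str] = field(default_factory=list)                # used when kind == "glo"

    def link(self, lexie: XLexie) -> None:
        """Attach an x-lexie of the same kind under its language."""
        if lexie.kind != self.kind:
            raise ValueError(f"cannot link a {lexie.kind}-lexie to a {self.kind}-axie")
        self.lexies.setdefault(lexie.lang, []).append(lexie.ident)


if __name__ == "__main__":
    native = XLexie("ex.fra.1", "fra", "ex", "Il utilise toujours des cartes IGN", True)
    translated = XLexie("ex.eng.1", "eng", "ex", "He always uses IGN maps", False)
    axie = XAxie("ex-axie.1", "ex")
    axie.link(native)
    axie.link(translated)
    print(axie.lexies)
```

With this layout, each translation adds one x-lexie and one link, and the native example and its translations stay connected through the x-axie rather than through the monolingual lexies.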
2 A « Montaigne » environment for human translation
We propose to use the mutualization feature of the Papillon server and help voluntary contributors perform or
postedit translations using a shared, web-oriented translation workbench following the « Montaigne » architecture.
2.1 The Montaigne architecture
2.1.1 Rationale and evolution of the basic concept
The Montaigne⁴ architecture was first defined in 1995, when it was realized that the EuroLang Optimizer
(EO) TSS (translation support system) developed by Site/Eurolang could only be used by sizable groups of
professional translators. Measurements had shown that it was necessary to have at least 800 pages of the same kind in
the translation memory to get real improvements from its usage. But an isolated translator, and even more an
occasional translator, rarely translates more than 20-40 pages a year in a given format, domain, and grammatical
sublanguage. Also, the pricing scheme and the complexity were dissuasive : about 1500€ for a client licence and
the same for the corresponding server token, the need to buy Windows-NT and SQL-server for the server, difficult
installation…
The basic idea of Montaigne is to let users share a common translation memory and other support tools such as a
bilingual editor and online dictionaries, freely, through the network, in exchange for their agreement to share
their data « products » with others. These data are aligned sentences and dictionary entries produced by their
translation activity. The pricing model is that of IE or Netscape : free clients and paying server, with the idea that
servers should be funded by institutions wanting their members to publish both in their native tongue and in
English.
At the time, we tried to transform the EO software to meet these new requirements, but it proved too costly
because the client was tightly integrated into Word and Windows 95. Actually, the client software was far too
complex. Since then, the development of web-oriented applications and tools has made it possible to modify
this architecture so that what runs on the client is very light. At the limit, the server may send html pages
including javascript programs to implement all functions, including the bilingual editor. Progress has also
been made on translation memory matching algorithms, which now give better recall and precision [28-30].
A limited version has been prototyped by V. Berment for Lao-French translation and is available on
http://www.laosoftware.com. OKI Electric also supports a site built around similar ideas
(http://www.yakushite.com).
⁴ Model Of New Translation AIds Generalized to the NEt
2.1.2 Scenario for using a « Montaigne » TSS site
The scenario envisaged for a full version is as follows. Using any web browser, you enter the server, and
register if you have not already done so. At that point, you can access the common resources (translation memories,
dictionaries, bilingual editor and other tools) in read-only mode, and your private space on the server disks,
where you may keep private (or not yet sharable) lexicons and translated segments or documents.
From the interface, you upload a document you want to translate, indicating its source language and format, and
the desired target languages and formats. The server preprocesses the document : normalization into a common
XML format and in Unicode (UTF-8 encoding), segmentation in units of translation (normally sentences),
computation of several layers of representation (such as text only without formatting tags, lemmatized forms,
chunks…), and search in the translation memory. It then opens a page containing a 2-column table with one line
for each segment and a frame for suggestions :
…                   | …
source segment N-2  | translated segment (done)
source segment N-1  | translated segment (done)
source segment N    | translated segment (currently being created)
source segment N+1  |
source segment N+2  |
Suggestion frame: suggestion(s) from the TM, dictionary suggestions
Figure 2 : typical layout of a bilingual editor in a TSS
When you click in the first source segment, suggestions for translations extracted from the translation memory
appear to the right, and under them the lexical information relative to the segment, if any. Using normal editing
functions and some specific shortcuts, you build the translated segment. When you click in the next source
segment or quit, the server updates your private memory. At the end, you decide whether or not to share the
results of your work, and download the translated file in the requested format from the server.
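The segmentation and translation memory lookup steps of this scenario can be sketched as follows. This is a deliberately naive illustration, not the Montaigne implementation: the punctuation-based segmenter, the use of Python's standard `difflib.SequenceMatcher` as similarity measure, the 0.7 threshold and the function names are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def segment(text: str) -> List[str]:
    """Naive segmentation: cut after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def suggest(source: str, memory: Dict[str, str],
            threshold: float = 0.7) -> List[Tuple[float, str, str]]:
    """Return (score, stored source, stored translation), best matches first."""
    hits = []
    for stored_source, stored_target in memory.items():
        score = SequenceMatcher(None, source.lower(), stored_source.lower()).ratio()
        if score >= threshold:
            hits.append((score, stored_source, stored_target))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    memory = {"He always uses IGN maps.": "Il utilise toujours des cartes IGN."}
    for unit in segment("He always uses IGN maps. He often uses AA maps."):
        print(unit, suggest(unit, memory))
```

A real server would also work on the other layers of representation mentioned above (lemmatized forms, chunks) rather than on raw character strings.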
These are only the « bare bones » of the TSS. Some more functions are necessary, such as the possibility of
modifying the segmentation of the source text and of correcting the source text. Many other functions can be
envisaged, such as voice input, link with a spell checker, a grammar checker, etc.
2.2 Peer-to-peer architecture: Papillon↔Montaigne
The Papillon data base and server architectures are already quite complex, so that it does not seem a good idea to
try and integrate Papillon and Montaigne in a classical « client-server » architecture. Also, in the context of usual
translation, Papillon would be a server and Montaigne a client sending requests concerning words, while, in the
context of the Papillon translation subproject, Montaigne would appear as a server.
A peer-to-peer integration seems preferable. As the server organization is the same in both cases (common
shared resources and private spaces, freely settable user groups), it could and should be shared at the upper level,
so one would enter Montaigne without login procedure when consulting a Papillon dictionary and wanting to
contribute by translating some examples or citations, or revising existing translations.
3 Introducing automaticity through UNL
We also propose to use freely available UNL web sites to get translations at various quality levels. These
translations will then be available as suggestions in the Montaigne environment, exactly as suggestions coming
from the translation memory. For this, we attach full UNL graphs to def-axies, cit-axies, and ex-axies.
3.1 Brief introduction to UNL
UNL (Universal Networking Language) is the name of a project, of a meaning representation language, and of a
format for "perfectly aligned" multilingual documents (http://www.unl.ias.unu.edu, see also [11, 41]). The UNL
language is a good interlingua for automated translation, ranging from fully automatic MT to interactive MT of
several kinds through translation of non task-oriented spoken dialogues. It is also more than that, due to the
associated "knowledge base", and has a great potential in textual information processing applications.
The UNL representation is made of "semantic graphs" where a graph expresses the meaning of some natural
language utterance. Nodes contain lexical units and attributes, arcs bear semantic relations. Connected
subgraphs may be defined as "scopes", so that a UNL graph may be a hypergraph.
Figure 3 shows a graphic representation of a UNL graph. A linear UNL-xml writing appears in Figure 7.
The lexical units, called Universal Words (UW)⁵, represent word meanings, something less ambitious than
concepts. Their denotations are built to be intuitively understood by developers knowing English, that is, by
all developers in NLP. A UW is an English term or pseudo-term possibly completed by semantic restrictions.
[Figure 3 graph. Nodes: score(icl>event,agt>human,fld>sport).@entry.@past.@complete, Ronaldo, head(pof>body),
corner, goal(icl>thing), left. Arc labels: agt, obj, pos, ins, plt, obj, mod.]
Figure 3 : a possible UNL graph for “Ronaldo has headed the ball into the left corner of the goal”
A UW such as "process" represents all word meanings of that lemma, seen as citation form (verb or noun here).
The UW "process(icl>do, agt>person)" covers the verbal meanings of processing, working on, etc.
The attributes are the (semantic) number, gender, tense, aspect, modality, etc. The 40 or so semantic relations are
traditional "deep cases" such as agent, (deep) object, location, goal, time, etc.
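To make the notation concrete, here is a small Python sketch that splits a node label such as those quoted above into its headword, its semantic restrictions and its attributes. It is a simplified reading of the notation used in this paper, not an implementation of the full UNL specification; the names `Node` and `parse_node` are ours.

```python
from typing import NamedTuple, Tuple


class Node(NamedTuple):
    headword: str
    restrictions: Tuple[Tuple[str, str], ...]   # e.g. (("icl", "do"),)
    attributes: Tuple[str, ...]                 # e.g. ("entry", "past")


def parse_node(text: str) -> Node:
    """Parse 'headword(rel>value, ...).@attr.@attr' into its three parts."""
    head, *attrs = text.strip().split(".@")
    head = head.strip()
    restrictions = []
    if head.endswith(")") and "(" in head:
        headword, inner = head[:-1].split("(", 1)
        for item in inner.split(","):
            relation, _, value = item.strip().partition(">")
            restrictions.append((relation, value))
    else:
        headword = head
    return Node(headword.strip(), tuple(restrictions),
                tuple(attr.strip() for attr in attrs))


if __name__ == "__main__":
    print(parse_node("process(icl>do, agt>person)"))
    print(parse_node("score(icl>event,agt>human,fld>sport).@entry.@past.@complete"))
    print(parse_node("Ronaldo"))
```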
One way of looking at a UNL graph corresponding to an utterance U-L in language L is to say that it represents
the abstract structure of an equivalent English utterance U-E as "seen from L", meaning that semantic attributes
not necessarily expressed in L may be absent (e.g., aspect coming from French, determination or number coming
from Japanese, etc.).
The UNL language of semantic graphs may be called a "semantico-linguistic" interlingua. As a successor of
the technically and commercially successful ATLAS-II and PIVOT interlinguas, its potential to support various
kinds of text MT is certain, even if some improvements would be welcome, as always. It is also a strong
candidate to be used in spoken dialogue translation systems when the utterances to be handled are not only
task-oriented and of limited variety, but become more free and truly spontaneous. Finally, although it is not a true
representation language such as KRL and its frame-based and logic-based successors, and although its associated
"knowledge base" is not a true ontology, but rather a kind of immense thesaurus of (interlingual) sets of word
senses, it seems particularly well suited to the processing of multilingual information in natural language
(information retrieval, abstracting, gisting, etc.).
The UNL format of multilingual documents aligned at the level of utterances is currently embedded in html (call
it UNL-html), and used by various tools such as the UNL viewer. By using a simple transformation, one obtains
the UNL-xml format, and profits from all tools currently developed around XML (see Figure 7 below).
3.2 Interactive disambiguation at analysis time
A first scenario could be as follows. You consult Papillon and the interface tells you your help would be
welcome in translating some free language element, such as an example, from French into all other Papillon
languages. You agree, and the interface changes to include interactive disambiguation functionalities
« à la LIDIA ». A frame appears with the example in it, and soon a question mark next to it. That means that
ambiguities have been encountered during analysis. You click on the button to start the disambiguation
dialogue, which is a succession of simple questions with a few menu items from which to choose one.
Figure 4: questions on an example are waiting
⁵ In French, "mot universel" sounds strange but we may use "Unité de Vocabulaire Virtuel", again UW.
A first question appears (Figure 5). In the context of this story, the user should choose to attach ‘de Chine’ to
‘vase’ (Chinese vase). A second dialogue appears (Figure 6) to ask about the word sense of ‘capitaine’.
Le capitaine a rapporté un vase de chine.
• de Chine, le capitaine a rapporté un vase.
• Le capitaine a rapporté (un vase de chine).
Figure 5: attachment problem
capitaine
• Officier qui commande une compagnie d'infanterie, un escadron de cavalerie, une batterie d'artillerie
• Officier qui commande un navire de commerce
• Chef d'une équipe sportive
Figure 6: word sense disambiguation
Internally, a unique multilevel concrete tree (UMC-structure) is obtained. The normal automatic analysis
continues and produces a more abstract tree (UMA-structure). Using the French-UNL dictionary (deducible from
Papillon if UWs have been linked to axies) and a few transformation rules, a transfer phase produces a
« UNL-tree », and then a standard algorithm transforms it into a UNL-graph (where reentrancy, cycles, and recursion by
« scopes » are possible).
At that point, a UNL document containing only the example in French and its enconversion into a UNL-graph is
built and sent for deconversion to UNL servers for all desired target languages. When this is finished, the
translations obtained are put in the « translation space » of the Papillon server as usual contributions, to be
validated by the central group. This ends the scenario, and the user continues browsing Papillon. Other users
from other native tongues will of course annotate and improve the translations obtained in their languages.
3.3 Text-UNL graph coedition at reading time
In the future, the preceding scenario could be extended to include a « coedition » technique, still at the research
stage, to improve the UNL graphs a posteriori and transparently from any language, and get improved
translations in all target languages. Let us illustrate this by an example.
Suppose we have an example taken from the “FB2004” corpus, initially in Spanish, but enconverted from a
Chinese version produced by a Chinese contributor to Papillon, and then deconverted into English, Spanish,
French, and Italian.
As the UNL graph does not contain definiteness and aspectual information, the deconversion results have
many wrong articles, and some errors on aspects.
<unl:S num="1">
<unl:org lg="cn">'/20$*")& -1.#%+(, </unl:org>
<unl:unl>
<unl:arc> agt(retrieve(icl>do).@entry.@future, city) </unl:arc>
<unl:arc> tim(retrieve(icl>do).@entry.@future, after) </unl:arc>
<unl:arc> obj(after, Forum) </unl:arc>
<unl:arc> obj(retrieve(icl>do).@entry.@future, zone(icl>place).@indef) </unl:arc>
<unl:arc> mod(zone(icl>place).@indef, coastal) </unl:arc> </unl:unl>
<unl:cn> '/20$*")& -1.#%+(, </unl:cn>
<unl:el> After a Forum, a city will retrieve a coastal zone.</unl:el>
<unl:es> Ciudad recobrará una zona de costal después Foro. </unl:es>
<unl:fr> Une cité retrouvera une zone côtière après un forum. </unl:fr>
<unl:it> Città ricuperarà une zona costiera dopo Forum. </unl:it>
<unl:jp> ✔</unl:jp>
</unl:S>
Figure 7 : an example deconverted but needing revision
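Each arc in Figure 7 is written as relation(source, target), where either node may itself carry parentheses. The Python sketch below reads such arc strings into triples. It is a minimal illustration based only on the notation visible in Figure 7; the function names are ours, and the real UNL tools certainly handle more cases (scopes, node identifiers, etc.).

```python
import re
from typing import List, Tuple

ARC = re.compile(r"^\s*(\w+)\((.*)\)\s*$")

FIGURE_7_ARCS = [
    "agt(retrieve(icl>do).@entry.@future, city)",
    "tim(retrieve(icl>do).@entry.@future, after)",
    "obj(after, Forum)",
    "obj(retrieve(icl>do).@entry.@future, zone(icl>place).@indef)",
    "mod(zone(icl>place).@indef, coastal)",
]


def split_top_level(body: str) -> Tuple[str, str]:
    """Split 'source, target' at the first comma that is outside any parentheses."""
    depth = 0
    for index, char in enumerate(body):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 0:
            return body[:index].strip(), body[index + 1:].strip()
    raise ValueError(f"no top-level comma in {body!r}")


def parse_arc(text: str) -> Tuple[str, str, str]:
    """Return (relation, source node, target node) for one arc string."""
    match = ARC.match(text)
    if match is None:
        raise ValueError(f"not an arc: {text!r}")
    source, target = split_top_level(match.group(2))
    return match.group(1), source, target


def parse_graph(arcs: List[str]) -> List[Tuple[str, str, str]]:
    return [parse_arc(arc) for arc in arcs]


if __name__ == "__main__":
    for triple in parse_graph(FIGURE_7_ARCS):
        print(triple)
```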
The idea of "coedition" is to correct the UNL graph associated with a segment one wants to improve, and then to
send the improved graph to all deconverters and get better translations into all languages. Here, the modifications
on the graph might be :
• add ".@def" on the nodes containing "city", "Forum".
• replace "retrieve" by "recover" and add ".@complete" on the node containing it.
It is not possible in principle to deduce the modification on the graph from a modification on the text. For
example, replacing "un" ("a") by "le" ("the") does not entail that the following noun is determined (.@def),
because it can also be generic ("il aime la montagne" = "he likes mountains"). The technique envisaged is that:
• revision is not done by modifying directly the text, but by using a menu system,
• the menu items have a "language side" and a hidden "UNL side",
• when a menu item is chosen, only the graph is transformed, and the action to be done on the text is stored
and shown next to its focus,
• at any time, the new graph may be sent to the source language deconverter and the result shown. If it is
satisfactory, that shows that errors were due to the graph and not to the deconverter, and the graph may be
sent to deconverters in other languages. Versions in some other languages known by the user should be
displayable, so that improvement sharing is visible and encouraging.
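The menu-based principle listed above can be sketched in Python as follows. This is only a toy model under our own assumptions, since the technique is still at the research stage: the names `MenuItem`, `choose` and `add_attribute`, the representation of a graph as a list of triples, and the menu label are all hypothetical. The graph is the one of Figure 7.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Arc = Tuple[str, str, str]   # (relation, source node, target node)
Graph = List[Arc]


def headword(node: str) -> str:
    """Strip restrictions and attributes: 'zone(icl>place).@indef' -> 'zone'."""
    return node.split("(", 1)[0].split(".@", 1)[0].strip()


def add_attribute(graph: Graph, word: str, attribute: str) -> Graph:
    """Return a new graph where every node with headword `word` also bears `attribute`."""
    def fix(node: str) -> str:
        return f"{node}.{attribute}" if headword(node) == word else node
    return [(rel, fix(src), fix(tgt)) for rel, src, tgt in graph]


@dataclass(frozen=True)
class MenuItem:
    label: str                            # "language side", shown to the user
    text_action: str                      # action on the text, stored but not applied
    transform: Callable[[Graph], Graph]   # hidden "UNL side"


def choose(item: MenuItem, graph: Graph, todo: List[str]) -> Graph:
    """Only the graph is transformed; the text action is recorded in the to-do list."""
    todo.append(item.text_action)
    return item.transform(graph)


FIGURE_7_GRAPH: Graph = [
    ("agt", "retrieve(icl>do).@entry.@future", "city"),
    ("tim", "retrieve(icl>do).@entry.@future", "after"),
    ("obj", "after", "Forum"),
    ("obj", "retrieve(icl>do).@entry.@future", "zone(icl>place).@indef"),
    ("mod", "zone(icl>place).@indef", "coastal"),
]

make_city_definite = MenuItem(
    label="cité : definite",              # hypothetical menu label
    text_action="la",
    transform=lambda graph: add_attribute(graph, "city", "@def"),
)

if __name__ == "__main__":
    todo: List[str] = []
    new_graph = choose(make_city_definite, FIGURE_7_GRAPH, todo)
    print(new_graph[0], todo)
```

The revised graph would then be sent to the source language deconverter, and to the other deconverters once the result is judged satisfactory.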
In a scenario for more expert users, the UNL graph or the UNL tree is made visible and directly manipulatable,
as well as the results of segmentation and lemmatization used to establish the fine-grained correspondence
between the text and the graph necessary for coedition (modifications indicated on words have to be
« transported » and « translated » as modifications on « corresponding parts » of the UNL graph).
[Figure 8 screenshot: coedition window with the labels Show Graph, Deconversion, Find Lemma, Find Correspondence,
Save Graph, Manual Insertion, To Do, Save and Quit, and the views « Graph : correspondence », « Simple text view »
and « Multiple text view ». It shows the original text « Une cité retrouvera une zone côtière après un forum. »,
the second deconversion « La cité retrouvera une zone côtière après le Forum. », and the versions before and
after revision in English (« After a Forum, a city will retrieve a coastal zone. » / « After the Forum, the city
will have recovered a coastal zone. »), Spanish (« Ciudad recobrará una zona de costal después Foro. » / « La
ciudad habrá recobrado una zona de costal después el Foro. ») and Italian (« Città ricuperarà une zona costiera
dopo Forum. » / « La città ha ricuperato une zona costiera dopo il Forum. »).]
Figure 8 : example of coedition (expert mode)
Conclusion and perspectives
In any multilingual lexical data base from which really useful bilingual usage dictionaries have to be produced,
all language elements such as examples, citations, definitions, glosses and labels expressed in each language
have to be translated into all other languages and stored into the data base. Storing can be achieved in a simple
and « seamless » way by introducing complex lexies and axies. But translating all these elements into all
languages is a major subproject of the Papillon project.
We propose to use the mutualization feature of the Papillon server and help voluntary contributors perform or
postedit translations using a shared, web-oriented translation workbench following the « Montaigne » architecture.
We also propose to use freely available UNL web sites to get first drafts of translations, thereby attaching full
UNL graphs to complex axies. In the future, a « coedition » technique, still at the research stage, could be used
to improve the UNL graphs a posteriori and transparently from any language, and get improved translations in
all target languages.
Although we have concentrated on translation in this presentation, we should also address the problem of
creating free language elements, if possible in such a way that they are immediately available in all languages.
Possible solutions are (1) to produce examples and perhaps definitions in several languages by extracting them
from existing translation memories and (2) to try to use UNL again to generate parallel language elements from
UNL graphs. But solution (2) cannot work for citations, because a tentative (created) citation cannot be rejected
simply because it is not found in the available corpora, as it may exist elsewhere.
Handling the free language elements in a rich multilingual lexical data base such as Papillon is not only
challenging from the scientific and technical points of view, but also from the organizational, sociological and
intercultural points of view, because of the variety of contributors and techniques.
References
[1] Ampornaramveth V. (1998) Saikam: an online dictionary development project. Proc. 4th Intl. Workshop on
Academic Information Networks and Systems (WAINS'4), NACSIS Seminar House, Karuizawa, Japan, February
1998. http://thaigate.nacsis.ac.jp:8888
[2] Ampornaramveth V. (1998) Trilingual WWW interface to Saikam dictionary project. Proc. 5th Intl. Workshop on
Academic Information Networks and Systems (WAINS'5), Bangkok, December 1998, AIT.
[3] Ampornaramveth V., Aizawa A. & Oyama K. (2000) An Internet-based Collaborative Dictionary Development
Project: SAIKAM. Proc. 7th Intl. Workshop on Academic Information Networks and Systems (WAINS'7), Bangkok,
7-8 December 2000, Kasetsart University, H. Sakaki ed.
[4] Blanc E. (1993) Visite guidée de PARAX, une base lexicale pentalingue par acceptions sous HyperCard. GETA,
IMAG, 30 p.
[5] Blanc É., Sérasset G. & Tchéou F. (1994) Designing an Acception-Based Multilingual Lexical Data Base under
HyperCard: PARAX. Research Report, GETA, IMAG (UJF & CNRS), Aug. 1994, 10 p.
[6] Blanc É. (2000) From the UNL hypergraph to GETA's multilevel tree. Proc. MT'2000, Oxford, 18-21 Oct. 2000,
British Computer Society, 10 p.
[7] Boitet C. & Blanchon H. (1994) Multilingual Dialogue-Based MT for Monolingual Authors: the LIDIA Project and a
First Mockup. Machine Translation 9/2 (1994), pp. 99-132.
[8] Boitet C. (1996) (Human-Aided) Machine Translation: a better future? In "Survey of the State of the Art of Human
Language Technology", R. Cole (Editor-in-Chief), J. Mariani, H. Uszkoreit & al., ed., A. Z. G. Varile, Giardini, Pisa,
pp. 251-256. (also available since 1996 at http://www.cse.ogi.edu/CSLU/HLTsurvey/)
[9] Boitet C. (1996) Machine-Aided Human Translation. In "Survey of the State of the Art of Human Language
Technology", R. Cole (Editor-in-Chief), J. Mariani, H. Uszkoreit & al., ed., A. Z. G. Varile, Giardini, Pisa, pp. 257-260.
(also available since 1996 at http://www.cse.ogi.edu/CSLU/HLTsurvey/)
[10] Boitet C. (1997) GETA's MT methodology and its current development towards personal networking communication
and speech translation in the context of the UNL and C-STAR projects. Proc. PACLING-97, Ohme, 2-5 September
1997, Meisei University, H. Sakaki ed., pp. 23-57. (invited communication)
[11] Boitet C. & Tsai W.-J. (2002) Coedition to share text revision across languages. Proc. COLING-02 WS on MT,
Taipeh, 1/9/2002, 8 p. (accepted)
[12] Fafiotte G. & Boitet C. (2000) Rapport final de la phase 1 du projet "FeV" (Réalisation d'un dictionnaire d'usage et
d'une base terminologique par acceptions informatisés français-vietnamien via l'anglais). GETA, CLIPS, IMAG, 16 p.
[13] Gut Y., Yusoff Z., Samat S. A., Boitet C., Nedobejkine N., Lafourcade M. & al. (1996) Kamus Perancis Melayu
dewan - dictionnaire français-malais. Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaisie, 1 vol., pp. 667.
[14] Lafourcade M. (1996) Structured lexical data: how to make them widely available, useful and reasonably protected?
- a practical example with a trilingual dictionary. Proc. COLING-96, Copenhagen, 4-9 Aug. 1996, ICCL, B. Maegaard
ed., 4 p.
[15] Lafourcade M. & Sérasset G. (1996) Apple Technology Integration - A WEB dictionary server as a practical
example. MacTech magazine, 12/7 (1996), pp. 25-32.
[16] Lafourcade M. (1996) Serveurs de dictionnaires - Etude de cas avec l'outil Alex et le projet de dictionnaire
Français-Anglais-Malais. Proc. Séminaire Lexique - Représentation et Outils pour les Bases Lexicales - Morphologie Robuste,
Grenoble, France, 13 novembre 1996, GDR-PRC - Communication Homme-Machine, vol. 1/1, pp. 185-192.
[17] Lafourcade M. (1997) Construction et services dictionnaires n-lingues, exemple des projets Fe*. Proc. Quatrième
conférence annuelle sur Le traitement Automatique du Langage Naturel (TALN), Grenoble, France, 12-13 juin 1997,
CLIPS, IMAG, D. Genthial ed., pp. 162-168.
[18] Lafourcade M. (1997) Multilingual Dictionary Construction and Services - Case Study with the Fe* Projects. Proc.
PACLING'97, Meisei University, Ohme, Tokyo, Japan, 2-5 September 1997, PACL, H. Sakaki ed., vol. 1/1,
pp. 173-181.
[19] Lafourcade M. & Rivepiboon W. (1997) Issues in the French-English-Thai Dictionary Project. Proc. International
Workshop on Human and Computer Processing of Language and Speech, Chulalongkorn University, Bangkok,
Thailand, 8-12 December 1997, S. Luksaneeyanawin ed., vol. 1/1.
[20] Mangeot M. (1999) Accès Internet au dictionnaire FEM (français-anglais-malais). GETA, CLIPS, IMAG, Grenoble,
Dictionnaire trilingue d'usage. http://clips.imag.fr/geta/services/fem
[21] Mangeot M. (2001) Environnements centralisés et distribués pour lexicographes et lexicologues en contexte
multilingue. Thèse, UJF (thèse préparée au GETA, CLIPS), 293 p.
[22] Mangeot-Lerebours M. (1999) Visualisation et Navigation dans des bases de données hétérogènes. Proc. Journée de
l'audiovisuel ANRT/INA, Paris, 23 septembre 1999, INA.
[23] Mangeot-Lerebours M. (1999) Accès unique à des dictionnaires hétérogènes. Proc. Lexicologie, Terminologie,
Traduction (LTT'99), Beyrouth, Liban, 11-13 novembre 1999, AUPELF-UREF, A. Clas ed., 3 p.
[24] Mangeot-Lerebours M. (2000) Papillon Lexical Database Project: Monolingual Dictionaries & Interlingual Links.
Proc. 7th Workshop on Advanced Information Network and System (WAINS'7), Bangkok, Thailande,
7-8 décembre 2000, Kasetsart University, H. Sakaki ed., 6 p.
[25] Mel’tchuk I. (1981) Meaning-Text models: a recent trend in Soviet linguistics. Annual Review of Anthropology 10
(1981), pp. 27-62.
[26] Mel’tchuk I. & Polguère A. (1987) A Formal Lexicon in the Meaning-Text Theory: or How to do Lexica with Words.
Computational Linguistics 13/3-4 (1987), pp. 261-275.
[27] Mel’tchuk I., Clas A. & Polguère A. (1995) Introduction à la lexicologie explicative et combinatoire.
AUPELF-UREF/Duculot, Louvain-la-Neuve, 256 p.
[28] Planas E. (1998) TELA: Structures et algorithmes pour la Traduction Fondée sur la Mémoire. Thèse, UJF (Grenoble
1), 7 July 1998, 375 p.
[29] Planas E. & Furuse O. (1999) Considering Translation Memories as a Cross Language Information Retrieval system.
Proc. MT Summit VII, Singapore, 13-17 September 1999, Asia Pacific Ass. for MT, J.-I. Tsujii ed., 4 p.
[30] Planas E. & Furuse O. (1999) A Close Multilevel String Matching Algorithm for Shallow Translation. Proc. TMI-99,
4 p.
[31] Sérasset G. (1994) Recent Trends of Electronic Dictionary Research and Development in Europe. TM 038, EDR,
Japon, mars 1994, 88 p.
[32] Sérasset G. (1994) An Interlingual Lexical Organization Based on Acceptions. Proc. ICLA-94, 26-28 July 1994,
USM, 12 p.
[33] Sérasset G. (1994) Interlingual Lexical Organisation for Multilingual Lexical Databases. Proc. 15th International
Conference on Computational Linguistics, COLING-94, 5-9 Aug. 1994, 6 p.
[34] Sérasset G. (1994) SUBLIM, un système universel de bases lexicales multilingues; et NADIA, sa spécialisation aux
bases lexicales interlingues par acceptions. Nouvelle thèse, UJF (Grenoble 1).
[35] Sérasset G. (1996) Un éditeur pour le dictionnaire explicatif et combinatoire du français contemporain. Proc.
Journées lexique du PRC-CHM, Grenoble, D. Genthial ed.
[36] Sérasset G. (1996) Informatisation du Dictionnaire Explicatif et Combinatoire : le projet NADIA-DEC. Proc.
Lexicomatique et Dictionnairique, Lyon, 28-30 septembre 1996, AUPELF•UREF, A. Clas ed.
[37] Sérasset G. (1997) Informatisation du Dictionnaire Explicatif et Combinatoire. Proc. TALN-97, Grenoble, 12-13 juin
1997, CLIPS, IMAG, D. Genthial ed., pp. 194-198.
[38] Sérasset G. & Polguère A. (1997) Outils pour lexicographes : application à la lexicographie explicative et
combinatoire. Proc. RIAO'97, Montréal, 25-27 juin 1997, vol. 2/2, pp. 701-708.
[39] Sérasset G. (1997) Le projet NADIA-DEC : vers un dictionnaire explicatif et combinatoire informatisé ? Proc. La
mémoire des mots, 5ème journées scientifiques du réseau LTT, Tunis, 25-27 septembre 1997, AUPELF•UREF, A.
Clas ed., 7 p.
[40] Sérasset G. & Mangeot M. (1998) L'édition lexicographique dans un système générique de gestion de bases lexicales
multilingues. Proc. Natural Language Processing and Industrial Applications, Moncton, vol. 1/2, pp. 110-116.
[41] Sérasset G. & Boitet C. (2000) On UNL as the future "html of the linguistic content" & the reuse of existing NLP
components in UNL-related applications with the example of a UNL-French deconverter. Proc. COLING-2000,
Saarbrücken, 31/7—3/8/2000, ACL, H. Uszkoreit ed., 7 p. (submitted)
[42] Sérasset G. & Mangeot-Lerebours M. (2001) Papillon Lexical Database Project: Monolingual Dictionaries &
Interlingual Links. Proc. NLPRS-2001, NII, Tokyo, 27-30 November 2001, pp. 119-125.
[43] Tomokiyo M., Mangeot-Lerebours M. & Planas E. (2000) Papillon : a Project of Lexical Database for English,
French and Japanese, using Interlingual Links. Proc. Journées des Sciences et Techniques de l'Ambassade de France
au Japon, Tokyo, Japon, 13-14 novembre 2000, Ambassade de France au Japon, 3 p.