ONLINE TRANSLATION SERVICES FOR THE LAO LANGUAGE

First International Conference on Lao Studies, May 20-22, 2005
Vincent BERMENT
[email protected]
GETA-CLIPS (IMAG)
BP 53
38041 Grenoble Cedex 9, France
http://www-clips.imag.fr/geta/
INALCO
2, rue de Lille
75343 Paris Cedex 7, France
http://www.inalco.fr
ABSTRACT — Students learning Lao or French, as well as professional and occasional translators, all
have the same few resources to help them in their translation tasks. These resources are dictionaries
that are often static, dated and tedious to use, such as Marc Reinhorn’s Lao-French dictionary or
Pierre Somchine Nginn’s French-Lao dictionary. This paper presents LaoTrans, a set of online
translation support services between Lao and French (in both directions) that offers all the
flexibility and capacity to evolve that the new Information and Communication Technologies (ICT)
can bring.
RÉSUMÉ — Les étudiants laotiens apprenant le français ou français apprenant le laotien, les traducteurs
professionnels ou occasionnels disposent de peu de moyens pour les aider dans leur tâche. Il s’agit des
dictionnaires souvent anciens, figés et dont l’utilisation est fastidieuse comme le dictionnaire laotien-français de
Marc Reinhorn ou comme celui, français-laotien, de Pierre Somchine Nginn. Le présent article présente
LaoTrans, un ensemble de services d’aide à la traduction en ligne entre français et lao, qui dispose de toute la
souplesse et de l’évolutivité que permettent aujourd’hui les nouvelles Technologies de l’Information et des
Communications (TIC).
KEYWORDS — Machine translation, translation support tool, pi-language, π-language, Lao language,
language computerization, unsegmented writing system, syllabic segmentation, word segmentation,
lexical database, electronic dictionary, LaoTrans, LaoWord, LaoUniKey, Lao Software
INTRODUCTION
Thanks to a continuous effort over the last 30 years, the Lao writing system has now achieved a
satisfactory level of computerization: mature fonts, text input tools and word processors are
available, and Lao Unicode is used over the Internet. Many people are therefore now waiting for
what may be seen as the natural next step of this computerization effort: a high-quality machine
translation service that would allow Lao people to read English, Chinese, Japanese, French or
any other language, translated into their own.
In contrast with the pioneering age, when isolated individuals could offer complete solutions, the
much more complex machine translation step must rely on groups of linguists and computer scientists
to succeed. For example, the effort needed to produce a good-quality Lao-English / English-Lao
machine translation system is estimated at about 50 man-years (Lafourcade, 1994) for the lingware
alone, whereas an add-in such as LaoWord, which provides a full set of word processing functions,
requires only about 6 man-years. So even if the methods used are well suited to the computerization
of under-resourced languages or, as they are now called, π-languages (Berment, 2004), it may take
time before good translation software is available for the Lao language.
On the way to this highly desirable future, useful translation support can be offered to
professional and occasional translators by simpler tools based mostly on the technology developed for
word processors. In this paper, we present LaoTrans, a set of online translation services that offers,
for the first time, translation support on the Web (dictionary, word-for-word translation of texts).
In doing so, we will see how the reuse of previously developed technology dramatically eased the
implementation of this new software. Then, looking towards the future, we will show how the LaoTrans
experience can contribute to the design of a full machine translation service. We will also introduce
LaoUniKey, a text input tool, and LaoLex, an online collaborative service for building dictionaries.
I FUNCTIONAL DESCRIPTION OF LAOTRANS
LaoTrans is a set of two online services which are available, among other software dedicated to
the Lao language, at the following address: http://www.LaoSoftware.com.
I.1 TRANSLATION OF WORDS: LAOTRANS-WORD
The first of the two services is called LaoTrans-Word. It is a look-up dictionary that provides both
Lao-French and French-Lao translations for isolated words.
Figure 1: LaoTrans-Word translates “ແປ” into French (example)
When this service is used from Lao to French, the tool displays all the possible French
translations, as shown in the figure above for the word “ແປ”.
When the dictionary is used in the other direction, from French to Lao, the engine looks for all the
translations that include the French word. For example, the following translations are given for
“chaussure”.
Figure 2 : LaoTrans-Word translates “chaussure” into Lao (example)
This is very convenient for finding all the occurrences of a word, including, for example, plural or
conjugated forms. This choice was made because the original lexicon used in LaoTrans is a Lao-French
dictionary in which an entry like ກິ່ງ is translated by the periphrasis “l'autre partie d'une paire”
and another like ເກີບ is translated by the plural forms “souliers” and “chaussures”. It therefore
appeared natural to search for the French word not as an exact translation but as a substring of any
translation.
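The two lookup directions can be illustrated with a minimal Python sketch. It is not the LaoTrans code (which is written in PHP over MySQL): the in-memory list, the function names and the case-insensitive comparison are assumptions made for illustration, and the sample entries are only the examples quoted in this paper.

```python
# Toy lexicon: (Lao headword, French translation) pairs quoted in this paper.
LEXICON = [
    ("ກິ່ງ", "l'autre partie d'une paire"),
    ("ເກີບ", "souliers"),
    ("ເກີບ", "chaussures"),
    ("ແປ", "traduire"),
    ("ແປ", "plat"),
]


def lao_to_french(headword):
    """Lao to French: exact match on the headword, all translations returned."""
    return [french for lao, french in LEXICON if lao == headword]


def french_to_lao(query):
    """French to Lao: keep every pair whose translation contains the query."""
    needle = query.casefold()  # case-insensitive comparison is an assumption
    return [(lao, french) for lao, french in LEXICON if needle in french.casefold()]
```

With this toy data, searching for “chaussure” finds the entry translated by the plural “chaussures”, and searching for “paire” finds the periphrasis. In an SQL database, the same behaviour would typically be obtained with a `LIKE '%...%'` pattern.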
In both directions, Lao-French and French-Lao, the user can explore the translated words further,
as they are displayed as hyperlinks. For example, when the user clicks on the first
translation given for “chaussure” (ເກີບ), LaoTrans-Word returns the following three results.
Figure 3 : LaoTrans-Word translates “ເກີບ” into plural forms (example)
I.2 TRANSLATION OF TEXTS: LAOTRANS-TEXT
The second service offered by LaoTrans is more complex, as it aims at easing the reading and the
translation of a full piece of Lao text (an “active reading support” tool). With this service, the
user receives a word-for-word translation in response to his Lao text, as shown in the figure below.
As this figure shows, a word that has more than one translation in the dictionary is displayed
with the list of all its translations separated by ‘/’ characters.
Figure 4 : LaoTrans-Text translates a Lao text into a list of its words translated into French
Obviously, this is not yet a complete machine translation system that produces sentences. However,
the human user can produce his own translation more easily and quickly from the translations
obtained for the individual words.
I.3 TEXT INPUT IN UNICODE: LAOUNIKEY
Unicode is an initiative for standardizing the computer representation of all the writing systems
in the world. Though the Lao writing system has been part of the Unicode standard since June 1992
(Unicode version 1.01, see http://www.unicode.org/Public/UNIDATA/DerivedAge.txt), Unicode fonts that
work properly for Lao only appeared a few years ago¹. So, after a long period during which Lao
Unicode remained almost unused, we now observe that more and more Web sites are adopting it. We are
therefore confident that it will become the standard in the near future and that Lao texts will at
last be readable on any computer in the world.
In LaoTrans, all Lao texts are encoded in Unicode. As Lao text input is still not natively
available in Microsoft Windows (people in Laos mainly use Windows 98, Me and XP), users need to
rely on additional software to type their texts. LaoUniKey (ລາວຢຸນິກ້ີ) is one such program. It is
available from the same Web site as LaoTrans: http://www.LaoSoftware.com. This software is very
flexible (formats, hotkeys...) and its user interface is trilingual: Lao, French and English.
Figure 5 : LaoUniKey configuration window (Lao language selected)
Alternatively, the text can also be copied from an existing document, and in particular from Web
sites using the Unicode encoding.
1 In May 2001, we developed one of the first Unicode fonts that worked properly for Lao. It was
presented during the Papillon seminar in Tokyo in May 2002, where it was used to display the Lao
language in PapiLex, a mockup of a collaborative online Unicode dictionary. See (Berment 2002) and
(Berment 2004). The most popular fonts at the moment are those made by John Durdin (e.g. Saysettha OT).
I.4 DICTIONARY CONTENT: THE REUSE OF LINGWARE
To offer a translation service of good quality, the software itself has to be good, but this is not
enough. A good dictionary is also essential: it has to contain as many word-senses and as few
mistakes as possible. Fortunately, the dictionaries made by teachers or linguists with tools such as
word processors can often be recycled into lexical databases. This process can be assisted by
tools such as Recupdic, in which the textual structure of the dictionary is formally described in
order to automate its transformation into a lexical base (Doan-Nguyen, 1998).
In the case of LaoTrans, we were able to recycle a Lao-French lexicon of about 15,000 word-senses
created by Paul Jadin², who originally typed it in Excel. Its structure had two great advantages:
• The entries and their translations were already separated into two different columns,
• The entries corresponding to two different word-senses were in two different rows³.
This 15,000-entry dictionary happens to provide good coverage for the test texts.
The process of transforming Paul Jadin’s Excel file into a lexical database followed three steps:
• Save the Excel file as a “.csv” file,
• Import this “.csv” file into a MySQL table by using phpMyAdmin⁴,
• Transform the proprietary encoding of the Lao entries into Unicode⁵.
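The import and conversion steps above can be sketched in Python as follows. This is only an illustration under several assumptions: SQLite stands in for the MySQL table filled through phpMyAdmin, the table and function names are hypothetical, and the legacy-to-Unicode mapping is invented, since the proprietary encoding of the original file is not described here.

```python
import csv
import io
import sqlite3

# Hypothetical legacy-to-Unicode table: the real proprietary encoding is not
# described in this paper, so these two pairs are invented placeholders.
LEGACY_TO_UNICODE = {"\u00e1": "\u0e81", "\u00e2": "\u0e82"}


def to_unicode(legacy_text):
    """Replace each legacy character by its Unicode Lao counterpart."""
    return "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in legacy_text)


def import_lexicon(csv_text, connection):
    """Load 'headword,translation' rows from CSV text into a lexicon table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS lexicon (headword TEXT, translation TEXT)"
    )
    rows = [
        (to_unicode(headword), translation)
        for headword, translation in csv.reader(io.StringIO(csv_text))
    ]
    connection.executemany("INSERT INTO lexicon VALUES (?, ?)", rows)
    connection.commit()
    return len(rows)
```

Because each word-sense already occupies its own row with two columns, no restructuring is needed between the spreadsheet export and the table: one CSV row becomes one database row.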
Looking towards the future, we may wonder what is likely to be reusable from this “translation
support” work. We think that the dictionary is the most essential element to reuse. The current
step should lead to a correct and large dictionary over a period of several months. This will be
even more evident with the interactive use of LaoTrans and LaoLex described in chapter III.
Improving the entries by adding lexical content (examples...) contributes to the development of the
lingware. This enriched and improved dictionary can later be used on its own to perform the lexical
transfer of a machine translation system. Moreover, it can also be used together with LaoTrans and a
corpus to help analyze and understand the morphosyntactic structure of the Lao language.
2 Paul Jadin is an independent lexicographer who lives in Belgium. We got in contact thanks to the
Internet. This highlights the essential fact that the Internet has changed the way people work
together, especially people involved in π-language computerization, because they are very few and
scattered around the world. The collaboration between Paul Jadin and the author is a typical example,
as we have never physically met.
3 For example, the word ແປ with the meanings “Traduire” and “Plat” was in two different rows.
4 After the table is selected in the database, the “Structure” page is displayed. At its bottom, this
page offers to “Insert data from a text file into table”. This can be used in particular for
inserting a “.csv” file. For more details, see http://www.phpmyadmin.net/home_page/index.php and
http://f2o.org/help/phpmyadmin.php.
5 This last action is handled by a C++ program derived from LaoWord, which can transform a text from
any given encoding into any other, and in particular into Unicode. This C++ program is called from
PHP code.
II ARCHITECTURE AND ALGORITHMS: THE REUSE OF SOFTWARE
II.1 LAOTRANS-WORD SOFTWARE ARCHITECTURE AND ALGORITHMS
The LaoTrans-Word software runs on a Linux server. The languages used are PHP and C++,
and the dictionary is stored in a MySQL database. The architecture is shown in the figure below.
[Diagram labels: Web Client, Form, Request, Response, Web Server, LaoTrans Code (PHP), LaoProc Code
(C++), Characters Chain to Process, Lexical Data, Lexical Base (MySQL)]
Figure 6 : LaoTrans-Word: Architecture
The browser (Web client) sends the word to translate to the Web server in a form (see Figure 1).
The principle of the operation in the server is summarized in the following drawing.
[Diagram: Web Client → Form → Web Server: Canonic formatting (LaoProc) → Word retrieval
(LaoTrans/Lexical base) → Response construction (LaoTrans) → Response → Web Client]
Figure 7 : LaoTrans-Word: Processing sequence
First, the server transforms the received word into a canonical format (for example ໜ → ຫນ⁶),
which is the format of the Lao headwords in the database. Then, it simply searches for the canonical
word in the database and returns the response to the client.
6 For more details, see (Berment, 2004).
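The canonical formatting step can be pictured with a short Python sketch. It is not the LaoProc code (which is written in C++): the function name and table are hypothetical, the first replacement is the example given above, and the second (ໝ → ຫມ) is an assumed analogue that LaoProc may or may not apply.

```python
# Canonical replacements. The first pair is the example given in the text;
# the second is an assumed analogue and may not match what LaoProc does.
CANONICAL_FORMS = {
    "\u0edc": "\u0eab\u0e99",  # ໜ is rewritten as ຫ followed by ນ
    "\u0edd": "\u0eab\u0ea1",  # ໝ is rewritten as ຫ followed by ມ (assumption)
}
_TRANSLATION_TABLE = str.maketrans(CANONICAL_FORMS)


def canonicalize(word):
    """Rewrite a Lao word so that it matches the headword format in the database."""
    return word.translate(_TRANSLATION_TABLE)
```

Applying the same function to the headwords when the database is built and to each incoming query guarantees that both sides of the comparison use one spelling.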
II.2 LAOTRANS-TEXT SOFTWARE ARCHITECTURE AND ALGORITHMS
In terms of its main building blocks, the structure of LaoTrans-Text is the same as that of
LaoTrans-Word. The difference is that LaoProc also has to segment the text into syllables and
compute a phonological transcription for each of them. The PHP code of LaoTrans then groups the
syllables into words by using a longest-matching algorithm (see (Rarunrom, 1991)).
[Diagram: Web Client → Form → Web Server: Canonic formatting (LaoProc) → Syllabic segmentation
(LaoProc) → Phonological transcription (LaoProc) → Word retrieval (LaoTrans/Lexical base) →
Response construction (LaoTrans) → Response → Web Client]
Figure 8 : LaoTrans-Text: Processing sequence
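The word-grouping step can be sketched as a greedy longest-matching pass over the syllables. The Python below is a minimal illustration, not the LaoTrans PHP code: the lexicon is invented, Latin placeholders stand in for Lao syllables, and the three-syllable bound, the handling of unknown syllables and the function names are assumptions.

```python
# Invented toy lexicon keyed by words written as concatenated syllables.
# Latin placeholders stand in for Lao syllables; glosses are arbitrary labels.
LEXICON = {
    "ka": ["A1", "A2"],
    "kata": ["B"],
    "katami": ["C"],
    "lo": ["D"],
}
MAX_SYLLABLES = 3  # assumed upper bound on the number of syllables in a word


def longest_match(syllables):
    """Greedily group syllables into the longest words found in LEXICON."""
    words = []
    i = 0
    while i < len(syllables):
        for size in range(min(MAX_SYLLABLES, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            if candidate in LEXICON or size == 1:
                # An unknown single syllable is kept as a word of its own.
                words.append(candidate)
                i += size
                break
    return words


def word_for_word(syllables):
    """Render each word with its translations separated by '/' characters."""
    return [
        f"{word}: {'/'.join(LEXICON.get(word, ['?']))}"
        for word in longest_match(syllables)
    ]
```

For the syllables ka, ta, lo, the three-syllable candidate is not in the lexicon, so the two-syllable word “kata” is chosen, followed by “lo”. Greedy matching is simple and fast, but it can pick a long word where two shorter ones were intended, which is one reason the output remains a reading aid rather than a translation.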
II.3 GAIN REACHED BY REUSING AN EXISTING SOURCE CODE
Though the three “linguistic” algorithms (canonical formatting, syllabic segmentation and
phonological transcription) are not trivial, their part in the development of LaoProc took less
than 15 hours, thanks to the reuse of the source code of LaoWord (Berment, 2003a).
            Total        Three linguistic modules
LaoWord     > 4,000 h    ≈ 2,500 h
LaoProc     ≈ 300 h      < 15 h
Figure 9 : Compared development efforts⁷
The gain⁸ reached for the three linguistic modules is greater than 99 % (the rest of the two
software programs is not comparable, so talking of a gain there would not be relevant). Without
reuse, LaoProc would have required about (300 – 15) + 2,500 = 2,785 h, so the global gain for
the LaoProc program is (2,785 – 300) / 2,785 ≈ 89 %. The same order of magnitude was
already observed for LaoUniKey, which also derives from LaoWord (≈ 93 % for the linguistic
modules), as well as for other software reuse experiences (Berment, 2004).
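The arithmetic behind these percentages can be checked with a few lines of Python. The sketch treats the approximate figures of Figure 9 (≈ 2,500 h, ≈ 300 h, < 15 h) as exact values, which is an assumption, and the variable and function names are ours.

```python
def reuse_gain(time_without_reuse, time_with_reuse):
    """Fraction of development time saved thanks to reuse."""
    return (time_without_reuse - time_with_reuse) / time_without_reuse


# Figures from Figure 9, taken as exact values for the sake of the example.
LAOWORD_LINGUISTIC_H = 2500   # three linguistic modules, written from scratch
LAOPROC_TOTAL_H = 300         # whole LaoProc program, with reuse
LAOPROC_LINGUISTIC_H = 15     # three linguistic modules, with reuse

linguistic_gain = reuse_gain(LAOWORD_LINGUISTIC_H, LAOPROC_LINGUISTIC_H)

# Without reuse, LaoProc would cost its non-linguistic part plus the full
# price of the linguistic modules.
laoproc_without_reuse = LAOPROC_TOTAL_H - LAOPROC_LINGUISTIC_H + LAOWORD_LINGUISTIC_H
global_gain = reuse_gain(laoproc_without_reuse, LAOPROC_TOTAL_H)

print(f"linguistic modules: {linguistic_gain:.1%}, LaoProc: {global_gain:.1%}")
```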
This software reuse was possible because the programming language (C++) can be compiled on both
Windows and Linux platforms (LaoWord runs on Windows and LaoProc on a Linux server).
7 Development time, not including feasibility studies, mockups, design & test documentation, user’s
manuals, distribution, support and maintenance.
8 This gain is defined by the formula $(T_S - T_A) / T_S$, with $T_S$ = development time without
reuse and $T_A$ = development time with reuse.
III DICTIONARY ENRICHMENT: LAOLEX
Like any other dictionary, the dictionary used in LaoTrans may contain mistakes, and entries may
be missing. The LaoLex software is designed to add and modify lexical entries in online
dictionaries. Though this software was first presented in 2003, it has not yet been upgraded to
handle Unicode. However, as it will soon be available as a natural complement to LaoTrans, we
say a few words here about its capabilities. For more details, see (Berment and Thongvilu 2003),
(Berment et al. 2003) and (Berment 2003b).
Figure 10 : Creation of an entry in LaoLex
An efficient way to improve the existing dictionary is to couple LaoTrans dynamically with
LaoLex. This means that every time a registered user asks for a translation and the result
is not fully satisfactory, he may modify or add entries in the dictionary⁹.
The main dictionary will not be modified directly by LaoLex; only the personal dictionaries of the
registered users will. However, when the content of a personal dictionary becomes significant, it
can be checked by the linguist in charge of the main dictionary and then added to it. This
collaborative way of building the main dictionary can be very efficient, as the users themselves
contribute to its lexical quality and completeness.
9 This coupled mechanism is planned for a future implementation of LaoLex.
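The review workflow between personal dictionaries and the main dictionary could look like the following Python sketch. Since the coupling is still to be implemented, everything here is hypothetical: the `Dictionary` class, the `merge_reviewed` function and the `approve` callback (standing for the linguist's decision) are illustrative names, not LaoLex interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Dictionary:
    """Headword to list of translations; used for main and personal dictionaries."""
    entries: dict = field(default_factory=dict)

    def add(self, headword, translation):
        translations = self.entries.setdefault(headword, [])
        if translation not in translations:
            translations.append(translation)


def merge_reviewed(main, personal, approve):
    """Copy into `main` only the personal entries accepted by the reviewer.

    `approve(headword, translation)` stands for the linguist's decision.
    The personal dictionary itself is left untouched.
    """
    accepted = 0
    for headword, translations in personal.entries.items():
        for translation in translations:
            if approve(headword, translation):
                main.add(headword, translation)
                accepted += 1
    return accepted
```

Keeping the users' changes in separate dictionaries until they are reviewed means that a mistaken contribution never reaches the other users of the main dictionary.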
As Figure 10 shows, LaoLex can enrich not only the headwords and their translations
(currently the only data available in the dictionary) but also the lexical content of the existing
entries. The main additional lexical items are:
• Old spellings,
• Alternative spellings,
• Part of speech,
• Language level,
• Specifier,
• Definition (with its translation),
• Examples (with their translations),
• Idioms (with their translations),
• Comments.
Every Lao word in the dictionary can be associated with a category (ປະເພດ). This category, or
part of speech, is assigned according to syntactic criteria (semantic and morphological criteria
also contribute to the categorization). In LaoLex, we use a three-level description:
• A group level,
• A category level,
• A subcategory level.
The first level, called “group” because its elements group together several categories, coincides
with the “seven parts of speech” of the “traditional” grammar (Comité Littéraire, 1962):
• Nouns (ຄຳນາມ),
• Pronouns (ຄຳສັບພະນາມ / ຄຳສັພນາມ),
• Verbs (ຄຳກິຣິຍາ),
• Predicatives¹⁰ (ຄຳວິເສດ),
• Prepositions (ຄຳບຸບພະບົດ / ຄຳບຸພບົດ),
• Conjunctions (ຄຳສັນທານ),
• Interjections (ຄຳອຸທານ).
The second and third levels derive mostly from Marc Reinhorn's work, complemented by Lamvieng
Inthamone's. See (Berment et al. 2003).
Language levels are also important in the Lao language. We use the following ones:
• General use
• Respectful
• Colloquial
• Slangy
• Specialty
• Refined
• Monk
• Royal
• Literary
• Spoken
• Archaic
10 Term borrowed from Marc Reinhorn’s grammar (Reinhorn, 1975).
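The lexical items, part-of-speech groups and language levels listed in this chapter suggest a record structure such as the one sketched below in Python. This is only one possible reading: the class and field names are assumptions, the category and subcategory levels are left as free text because their values are not listed here, and the actual LaoLex data model may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Group(Enum):
    """First categorization level: the seven traditional parts of speech."""
    NOUN = "Nouns"
    PRONOUN = "Pronouns"
    VERB = "Verbs"
    PREDICATIVE = "Predicatives"
    PREPOSITION = "Prepositions"
    CONJUNCTION = "Conjunctions"
    INTERJECTION = "Interjections"


class LanguageLevel(Enum):
    GENERAL = "General use"
    RESPECTFUL = "Respectful"
    COLLOQUIAL = "Colloquial"
    SLANGY = "Slangy"
    SPECIALTY = "Specialty"
    REFINED = "Refined"
    MONK = "Monk"
    ROYAL = "Royal"
    LITERARY = "Literary"
    SPOKEN = "Spoken"
    ARCHAIC = "Archaic"


@dataclass
class Entry:
    """One word-sense; every field except the headword is optional."""
    headword: str
    translations: list = field(default_factory=list)
    old_spellings: list = field(default_factory=list)
    alternative_spellings: list = field(default_factory=list)
    group: Optional[Group] = None
    category: Optional[str] = None      # second level, values not listed here
    subcategory: Optional[str] = None   # third level, values not listed here
    language_level: Optional[LanguageLevel] = None
    specifier: Optional[str] = None
    definition: Optional[str] = None
    examples: list = field(default_factory=list)
    idioms: list = field(default_factory=list)
    comments: list = field(default_factory=list)
```

An entry that currently holds only a headword and its translations, as in the existing dictionary, remains valid in this structure and can be enriched field by field.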
REFERENCES LIST
Berment Vincent. 2002. Several Technical Issues for Building New Lexical Bases. Papillon Seminar,
July 15 – 18, 2002, NII, Hitotsubashi, Chiyoda-ku, Tokyo, Japan.
Berment Vincent, Thongvilu Houmphanh. 2003. Cooperative Lao ICT framework. Case study:
construction of Lao lexical resources. Regional Conference on Digital GMS, February 26 – 28, 2003,
Asian Institute of Technology, Bangkok, Thailand.
Berment Vincent, Jacqmin Thakkhinh, Dechanet Blandine. 2003. Parts of Speech for the LaoLex
Dictionary. Pan Asia Networking All Partners 2003, March 3 – 10, 2003, Vientiane Novotel,
Vientiane, Laos.
Berment Vincent. 2003a. LaoWord's Word Processing Functions. Pan Asia Networking All Partners
2003, March 3 – 10, 2003, Vientiane Novotel, Vientiane, Laos.
Berment Vincent. 2003b. Current status of the Papillon-Lao database: Tools (LaoLex), dictionary
(LaoDict), XML schema and export towards Papillon. Papillon Seminar, July 3 – 5, 2003, Sapporo
University, Sapporo, Japan.
Berment Vincent. 2004. Méthodes pour informatiser des langues et des groupes de langues « peu
dotées ». Ph.D. thesis, Grenoble University, France.
Comité Littéraire (ກົມວັນນະຄະດີ) 1962. Lao grammar (ໄວຍາກອນລາວ).
Doan-Nguyen Hai. 1998. Techniques génériques d’accumulation d’ensembles lexicaux structurés à
partir de ressources dictionnairiques informatisées multilingues hétérogènes. Ph.D. thesis, Grenoble
University, France.
Jadin Paul. 2005. Dictionnaire laotien-français. Non publié.
Lafourcade Mathieu. 1994. Génie logiciel pour le génie linguiciel. Ph.D. thesis, Grenoble University,
France.
Nginn Pierre Somchine. 1980. Dictionnaire français-lao. Idase, Paris.
Rarunrom Sampan. 1991. Dictionary-based Thai word separation. Senior project report, Thailand.
Reinhorn Marc. 1970. Dictionnaire laotien-français. CNRS.
Reinhorn Marc. 1975. Grammaire de la langue lao. INALCO.