ONLINE TRANSLATION SERVICES FOR THE LAO LANGUAGE
First International Conference on Lao Studies, May 20-22, 2005

Vincent BERMENT
[email protected]

GETA-CLIPS (IMAG), BP 53, 38041 Grenoble Cedex 9, France
http://www-clips.imag.fr/geta/

INALCO, 2, rue de Lille, 75343 Paris Cedex 7, France
http://www.inalco.fr

ABSTRACT: Students learning Lao or French, as well as professional and occasional translators, all have the same few resources to help them in their translation tasks. These are dictionaries that are often old, frozen and tiresome to use, such as Marc Reinhorn’s Lao-French dictionary or Pierre Somchine Nginn’s French-Lao dictionary. This paper presents LaoTrans, a set of online translation support services between Lao and French (in both directions) that offers all the flexibility and capacity to evolve that the new Information and Communication Technologies (ICT) can bring.

RÉSUMÉ : Les étudiants laotiens apprenant le français ou français apprenant le laotien, les traducteurs professionnels ou occasionnels disposent de peu de moyens pour les aider dans leur tâche. Il s’agit des dictionnaires souvent anciens, figés et dont l’utilisation est fastidieuse comme le dictionnaire laotien-français de Marc Reinhorn ou comme celui, français-laotien, de Pierre Somchine Nginn. Le présent article présente LaoTrans, un ensemble de services d’aide à la traduction en ligne entre français et lao, qui dispose de toute la souplesse et de l’évolutivité que permettent aujourd’hui les nouvelles Technologies de l’Information et des Communications (TIC).
KEYWORDS: Machine translation, translation support tool, pi-language, π-language, Lao language, language computerization, unsegmented writing system, syllabic segmentation, word segmentation, lexical database, electronic dictionary, LaoTrans, LaoWord, LaoUniKey, Lao Software

INTRODUCTION

Thanks to a continuous effort over the last 30 years, the Lao writing system has now reached a satisfactory level of computerization: mature fonts, text input tools and word processors are available, and Lao Unicode is used over the Internet. Many people are therefore now waiting for what may be seen as the natural next step of this computerization effort: a high-quality machine translation service that would allow Lao people to read English, Chinese, Japanese, French or any other language translated into their own.

In contrast with the pioneering age, when isolated individuals could offer complete solutions, the much more complex machine translation step must rely on groups of linguists and computer scientists. For example, producing a good-quality Lao-English / English-Lao machine translation system is estimated to require on the order of 50 man-years for the lingware alone (Lafourcade, 1994), whereas an add-in such as LaoWord, which provides a full set of word processing functions, required only about 6 man-years. So even if the methods used are well adapted to the computerization of under-resourced languages or, as they are now called, π-languages (Berment, 2004), it may take time before good translation software is available for the Lao language.

On the way to this highly desirable future, useful translation support can be offered to professional and occasional translators by simpler tools based mostly on the technology developed for word processors.
In this paper, we present LaoTrans, a set of online translation services that offers, for the first time, translation support on the Web (a dictionary and word-for-word translation of texts). In doing so, we will see how reusing previously developed technology dramatically eased the realization of this new software. Then, looking towards the future, we will show how the LaoTrans experience can contribute to the design of a full machine translation service. We will also introduce LaoUniKey, a text input tool, and LaoLex, an online collaborative service for building dictionaries.

I FUNCTIONAL DESCRIPTION OF LAOTRANS

LaoTrans is a set of two online services which are available, along with other software dedicated to the Lao language, at the following address: http://www.LaoSoftware.com.

I.1 TRANSLATION OF WORDS: LAOTRANS-WORD

The first of the two services is called LaoTrans-Word. It is a look-up dictionary that provides both Lao-French and French-Lao translations for isolated words.

Figure 1: LaoTrans-Word translates “ແປ” into French (example)

When this service is used from Lao to French, the tool displays all the possible French translations, as shown in the figure above for the word “ແປ”. When the dictionary is used in the other direction, from French to Lao, the engine looks for all the translations that include the French word. For example, the translations shown in Figure 2 are given for “chaussure”.

Figure 2: LaoTrans-Word translates “chaussure” into Lao (example)

This makes it easy to find all the occurrences of a word, including, for example, plural or conjugated forms. This choice was made because the original lexicon used in LaoTrans is a Lao-French dictionary in which an entry like ກິ່ງ is translated by the periphrasis “l'autre partie d'une paire” and another like ເກີບ is translated by the plural forms “souliers” and “chaussures”.
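As a rough illustration of the two look-up directions described in this section, the sketch below uses exact matching from Lao to French and substring matching from French to Lao. It is written in Python for brevity and is not the LaoTrans implementation (which relies on PHP, C++ and MySQL); the toy lexicon, the function names `lao_to_french` and `french_to_lao`, and the case-insensitive comparison are all assumptions made for the example.

```python
# Toy Lao-French lexicon as (Lao headword, French translation) pairs.
# The entries are the ones quoted in the paper; one row per word-sense.
LEXICON = [
    ("ກິ່ງ", "l'autre partie d'une paire"),
    ("ເກີບ", "souliers"),
    ("ເກີບ", "chaussures"),
    ("ແປ", "traduire"),
    ("ແປ", "plat"),
]


def lao_to_french(lao_word, lexicon=LEXICON):
    """Exact look-up: every French translation recorded for a Lao headword."""
    return [french for lao, french in lexicon if lao == lao_word]


def french_to_lao(french_word, lexicon=LEXICON):
    """Substring look-up: every entry whose translation contains the query.

    Case-insensitive matching is an assumption of this sketch.
    """
    needle = french_word.lower()
    return [(lao, french) for lao, french in lexicon if needle in french.lower()]


if __name__ == "__main__":
    print(lao_to_french("ແປ"))         # both senses of the headword
    print(french_to_lao("chaussure"))  # matches the plural "chaussures"
    print(french_to_lao("paire"))      # matches inside a periphrasis
```

Substring matching keeps the sketch free of any French morphology: a singular query still reaches an entry stored only in the plural, at the cost of occasionally returning entries where the query is merely part of a longer word.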
Given such entries, it appeared natural to search for the French word not as an exact translation but as a substring of any translation.

In both directions, Lao-French and French-Lao, the user can explore the translated words further, as they are displayed as hyperlinks. For example, when the user clicks on the first translation given for “chaussure” (ເກີບ), LaoTrans-Word returns the following three results.

Figure 3: LaoTrans-Word translates “ເກີບ” into plural forms (example)

I.2 TRANSLATION OF TEXTS: LAOTRANS-TEXT

The second service offered by LaoTrans is more complex, as it aims at easing the reading and the translation of a full piece of Lao text (an “active reading support” tool). With this service, the user receives a word-for-word translation in response to his Lao text, as shown in the figure below. As can be seen in this figure, a word which has more than one translation in the dictionary is displayed with the list of all its translations separated by ‘/’ characters.

Figure 4: LaoTrans-Text translates a Lao text into a list of its words translated into French

Obviously, this is not yet a complete machine translation system that produces sentences. However, a human user can easily and quickly produce his own translation from the translations obtained for the individual words.

I.3 TEXT INPUT IN UNICODE: LAOUNIKEY

Unicode is an initiative for standardizing the representation in computers of all the writing systems of the world. Though the Lao writing system has been part of the Unicode standard since June 1992 (Unicode version 1.0.1, see http://www.unicode.org/Public/UNIDATA/DerivedAge.txt), Unicode fonts that work properly for Lao only appeared a few years ago [1].
So, after a long period during which Lao Unicode remained almost unused, we now observe that more and more Web sites are adopting Unicode. We are therefore confident that it will become the standard in the near future and that Lao texts will at last be readable on any computer in the world. In LaoTrans, all the Lao texts are encoded in Unicode.

As Lao text input is still not natively available in Microsoft Windows (people in Laos mainly use Windows 98, Me and XP), users need to rely on additional software to type their texts. LaoUniKey (ລາວຢຸນິກ້ີ) is one such program. It is available from the same Web site as LaoTrans: http://www.LaoSoftware.com. This software is very flexible (formats, hotkeys...) and its human-machine interface is trilingual: Lao, French and English.

Figure 5: LaoUniKey configuration window (Lao language selected)

Alternatively, the text can also be copied from an existing document, and in particular from Web sites using the Unicode encoding.

[1] In May 2001, we developed one of the first Unicode fonts that worked properly for Lao. It was presented during the Papillon seminar in Tokyo in May 2002, where it was used to show the Lao language in PapiLex, a mockup of a collaborative online Unicode dictionary. See (Berment 2002) and (Berment 2004). The most popular fonts at the moment are those made by John Durdin (e.g. Saysettha OT).

I.4 DICTIONARY CONTENT: THE REUSE OF LINGWARE

In order to offer a translation service of good quality, the software itself has to be good, but that is not enough. A good dictionary is also a key issue, as it has to contain as many word-senses and as few mistakes as possible. Fortunately, the dictionaries made by teachers or by linguists with means such as word processors can often be recycled to become lexical databases.
This process can be helped by tools such as Recupdic, in which the textual structure of the dictionary is formally described in order to automate its transformation into a lexical base (Doan-Nguyen, 1998).

In the case of LaoTrans, we were able to recycle a Lao-French lexicon of about 15,000 word-senses created by Paul Jadin [2], who originally typed it in Excel. Its structure had two great advantages:
• The entries and their translations were already separated into two different columns,
• The entries corresponding to two different word-senses were in two different rows [3].

This 15,000-entry dictionary happens to provide good coverage for the test texts.

The transformation of Paul Jadin’s Excel file into a lexical database followed three steps:
• Save the Excel file as a “.csv” file,
• Import this “.csv” file into a MySQL table by using phpMyAdmin [4],
• Transform the proprietary encoding of the Lao entries into Unicode [5].

Looking towards the future, we may wonder what is likely to be reusable in this “translation support” work. We think that the dictionary is the most essential element to reuse. The current step should lead, over a period of several months, to a large and correct dictionary. This will be even more evident with the interactive use of LaoTrans and LaoLex described in section III. Improving the entries by adding lexical content (examples...) contributes to the development of the lingware. The enriched and improved dictionary can later be used on its own for the lexical transfer of a machine translation system. Moreover, it can also be used together with LaoTrans and a corpus to help analyze and understand the morphosyntactic structure of the Lao language.

[2] Paul Jadin is an independent lexicographer who lives in Belgium. We got in contact thanks to the Internet. This remark highlights the essential fact that the Internet has changed the way people work together, especially people involved in π-language computerization, because they are very few and scattered around the world. The collaboration between Paul Jadin and the author is a typical example, as we have never physically met so far.
[3] For example, the word ແປ with the meanings “Traduire” and “Plat” was in two different rows.
[4] After selecting the table in the database, the “Structure” page is displayed. At its bottom, this page proposes to “Insert data from a text file into table”. This can be used in particular for inserting a “.csv” file. For more details, see http://www.phpmyadmin.net/home_page/index.php and http://f2o.org/help/phpmyadmin.php.
[5] This last action is handled by a C++ program derived from LaoWord, which can transform a text from any given encoding into any other, and in particular into Unicode. This C++ program is called by PHP code.

II ARCHITECTURE AND ALGORITHMS: THE REUSE OF SOFTWARE

II.1 LAOTRANS-WORD SOFTWARE ARCHITECTURE AND ALGORITHMS

The LaoTrans-Word software runs on a Linux server. The languages used are PHP and C++, and the dictionary is stored in a MySQL database. The architecture is shown in the figure below.

Figure 6: LaoTrans-Word: Architecture
[Diagram: a Web client (form) exchanges a request and a response with the Web server, which hosts the LaoTrans code (PHP), the LaoProc code (C++), which receives the character string to process and returns a response, and the lexical base (MySQL), which supplies the lexical data.]

The browser (Web client) sends the word to translate to the Web server in a form (see Figure 1). The principle of the operation in the server is summarized in the following drawing.
Figure 7: LaoTrans-Word: Processing sequence
[Diagram: Web client -> form -> Web server: canonic formatting (LaoProc), word retrieval (LaoTrans / lexical base), response construction (LaoTrans) -> response -> Web client.]

First, the server transforms the received word into a canonical format (for example ໜ → ຫນ [6]), which is the format of the Lao headwords in the database. Then, it simply searches for the standardized word in the database and returns the response to the client.

[6] For more details, see (Berment, 2004).

II.2 LAOTRANS-TEXT SOFTWARE ARCHITECTURE AND ALGORITHMS

In terms of its main building blocks, the structure of LaoTrans-Text is the same as that of LaoTrans-Word. The differences are that LaoProc also has to segment the text into syllables and to compute a phonological transcription for each of them. Then the PHP code of LaoTrans groups the syllables into words by using a longest matching algorithm (see (Rarunrom, 1991)).

Figure 8: LaoTrans-Text: Processing sequence
[Diagram: Web client -> form -> Web server: canonic formatting (LaoProc), syllabic segmentation (LaoProc), phonological transcription (LaoProc), word retrieval (LaoTrans / lexical base), response construction (LaoTrans) -> response -> Web client.]

II.3 GAIN REACHED BY REUSING AN EXISTING SOURCE CODE

Though the three “linguistic” algorithms (canonic formatting, syllabic segmentation and phonological transcription) are not trivial, their part in the development of LaoProc took less than 15 hours, thanks to the reuse of the source code of LaoWord (Berment, 2003a).

            Total        Three linguistic modules
LaoWord     > 4,000 h    ≈ 2,500 h
LaoProc     ≈ 300 h      < 15 h

Figure 9: Compared development efforts [7]

The gain [8] reached for the three linguistic modules is greater than 99 % (the rest of the two programs is not comparable, so speaking of a gain there would not be relevant). Without reuse, LaoProc would have required about 300 – 15 + 2,500 = 2,785 h, so the global gain for the LaoProc program is (2,785 – 300) / 2,785 ≈ 89 %.
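The gain figures above can be re-derived with a few lines of arithmetic. The sketch below is only a worked example in Python: the function name `reuse_gain` and the constant names are ours, and the "without reuse" estimate for LaoProc is one reading of the figures (its total, minus the reused modules, plus what those modules cost in LaoWord), chosen because it reproduces the 2,785 h used in the text.

```python
def reuse_gain(t_without_reuse, t_with_reuse):
    """Gain = (T_S - T_A) / T_S, with T_S the development time without
    reuse and T_A the development time with reuse (both in hours)."""
    return (t_without_reuse - t_with_reuse) / t_without_reuse


# Figures from the comparison of development efforts (bounds taken as values).
MODULES_IN_LAOWORD_H = 2500  # three linguistic modules, written from scratch
MODULES_IN_LAOPROC_H = 15    # same modules, reused from LaoWord
LAOPROC_TOTAL_H = 300        # whole LaoProc program, with reuse

# Gain on the three linguistic modules alone.
module_gain = reuse_gain(MODULES_IN_LAOWORD_H, MODULES_IN_LAOPROC_H)

# Assumed estimate of LaoProc without reuse: 300 - 15 + 2500 = 2785 hours.
laoproc_without_reuse_h = (
    LAOPROC_TOTAL_H - MODULES_IN_LAOPROC_H + MODULES_IN_LAOWORD_H
)
global_gain = reuse_gain(laoproc_without_reuse_h, LAOPROC_TOTAL_H)

if __name__ == "__main__":
    print(f"linguistic modules: {module_gain:.1%}")
    print(f"LaoProc as a whole: {global_gain:.1%}")
```

Because the table only gives bounds (“< 15 h”, “≈ 2,500 h”), the computed values should be read as orders of magnitude rather than exact measurements.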
The same order of magnitude was already observed for LaoUniKey, which also derives from LaoWord (≈ 93 % for the linguistic modules), as well as in other software reuse experiences (Berment, 2004). Note that this software reuse was possible because the programming language (C++) can be compiled on both Windows and Linux platforms (LaoWord runs on Windows and LaoProc on a Linux server).

[7] Development time, not including feasibility studies, mockups, design and test documentation, user’s manuals, distribution, support and maintenance.
[8] This gain is defined by the formula (T_S – T_A) / T_S, with T_S = development time without reuse and T_A = development time with reuse.

III DICTIONARY ENRICHMENT: LAOLEX

Like any other dictionary, the dictionary used in LaoTrans may contain mistakes, and entries may be missing. The LaoLex software is designed to add and modify lexical entries in online dictionaries. Though this existing software was first presented in 2003, it has not yet been upgraded to handle Unicode. However, as it will soon be available as a natural complement to LaoTrans, we will say a few words here about its capabilities. For more details, see (Berment and Thongvilu 2003), (Berment et al. 2003) and (Berment 2003b).

Figure 10: Creation of an entry in LaoLex

An efficient way to improve the existing dictionary is to couple LaoTrans with LaoLex dynamically. This means that whenever a registered user asks for a translation and the result is not fully satisfactory, he may modify or add entries in the dictionary [9]. The main dictionary will not be modified directly by LaoLex; only the personal dictionaries of the registered users will. However, when the content of a personal dictionary becomes significant, it can be checked by the linguist in charge of the main dictionary and then added to it.
This collaborative way of building the main dictionary can be very efficient, as the users themselves guarantee its lexical quality and completeness.

[9] This coupled mechanism is planned for inclusion in the future implementation of LaoLex.

As can be seen in Figure 10, LaoLex can enrich not only the headwords and their translations (currently the only data available in the dictionary) but also the lexical content of the existing entries. The main additional lexical items are:
• Old spellings,
• Alternative spellings,
• Part of speech,
• Language level,
• Specifier,
• Definition (with its translation),
• Examples (with their translations),
• Idioms (with their translations),
• Comments.

Every Lao word in the dictionary can be associated with a category (ປະເພດ). This category, or part of speech, is assigned according to syntactic criteria (semantic and morphological criteria also contribute to the categorization). In LaoLex, we use a description in three levels:
• A group level,
• A category level,
• A subcategory level.

The first level, called “group” because its elements group together several categories, coincides with the “seven parts of speech” of the “traditional” grammar (Comité Littéraire, 1962):
• Nouns (ຄຳນາມ),
• Pronouns (ຄຳສັບພະນາມ / ຄຳສັພນາມ),
• Verbs (ຄຳກິຣິຍາ),
• Predicatives [10] (ຄຳວິເສດ),
• Prepositions (ຄຳບຸບພະບົດ / ຄຳບຸພບົດ),
• Conjunctions (ຄຳສັນທານ),
• Interjections (ຄຳອຸທານ).

The second and third levels mostly derive from Marc Reinhorn's works, completed by Lamvieng Inthamone's. See (Berment et al. 2003).

The levels of language are also important in the Lao language. We use the following ones:
• General use
• Respectful
• Colloquial
• Slangy
• Specialty
• Refined
• Monk
• Royal
• Literary
• Spoken
• Archaic

[10] Term borrowed from Marc Reinhorn’s grammar (Reinhorn, 1975).
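To make the lexical items listed in this section more concrete, the following Python sketch models one dictionary entry as a record. It is only an illustration under our own assumptions: the class and field names (`Entry`, `Translated`, `Group`, `Level`) are hypothetical and do not describe the actual LaoLex schema, and assigning the “traduire” sense of ແປ to the verb group is our choice for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Group(Enum):
    """First categorization level: the seven traditional parts of speech."""
    NOUN = "noun"
    PRONOUN = "pronoun"
    VERB = "verb"
    PREDICATIVE = "predicative"
    PREPOSITION = "preposition"
    CONJUNCTION = "conjunction"
    INTERJECTION = "interjection"


class Level(Enum):
    """Levels of language listed in the paper."""
    GENERAL = "general use"
    RESPECTFUL = "respectful"
    COLLOQUIAL = "colloquial"
    SLANGY = "slangy"
    SPECIALTY = "specialty"
    REFINED = "refined"
    MONK = "monk"
    ROYAL = "royal"
    LITERARY = "literary"
    SPOKEN = "spoken"
    ARCHAIC = "archaic"


@dataclass
class Translated:
    """A Lao text (definition, example or idiom) with its translation."""
    text: str
    translation: str = ""


@dataclass
class Entry:
    """One word-sense; only headword and translations are mandatory."""
    headword: str
    translations: List[str]
    old_spellings: List[str] = field(default_factory=list)
    alternative_spellings: List[str] = field(default_factory=list)
    group: Optional[Group] = None      # first level
    category: Optional[str] = None     # second level (free text in this sketch)
    subcategory: Optional[str] = None  # third level (free text in this sketch)
    level: Optional[Level] = None
    specifier: Optional[str] = None
    definition: Optional[Translated] = None
    examples: List[Translated] = field(default_factory=list)
    idioms: List[Translated] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


# A minimal entry, as currently available (headword and translation only),
# enriched with a group and a language level.
entry = Entry("ແປ", ["traduire"], group=Group.VERB, level=Level.GENERAL)
```

Keeping every enrichment field optional mirrors the situation described above, where the existing entries hold only headwords and translations and the other items are added progressively.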
REFERENCES LIST

Berment Vincent. 2002. Several Technical Issues for Building New Lexical Bases. Papillon Seminar, July 15 – 18, 2002, NII, Hitotsubashi, Chiyoda-ku, Tokyo, Japan.
Berment Vincent, Thongvilu Houmphanh. 2003. Cooperative Lao ICT framework. Case study: construction of Lao lexical resources. Regional Conference on Digital GMS, February 26 – 28, 2003, Asian Institute of Technology, Bangkok, Thailand.
Berment Vincent, Jacqmin Thakkhinh, Dechanet Blandine. 2003. Parts of Speech for the LaoLex Dictionary. Pan Asia Networking All Partners 2003, March 3 – 10, 2003, Vientiane Novotel, Vientiane, Laos.
Berment Vincent. 2003a. LaoWord's Word Processing Functions. Pan Asia Networking All Partners 2003, March 3 – 10, 2003, Vientiane Novotel, Vientiane, Laos.
Berment Vincent. 2003b. Current status of the Papillon-Lao database: Tools (LaoLex), dictionary (LaoDict), XML schema and export towards Papillon. Papillon Seminar, July 3 – 5, 2003, Sapporo University, Sapporo, Japan.
Berment Vincent. 2004. Méthodes pour informatiser des langues et des groupes de langues « peu dotées ». Ph.D. thesis, Grenoble University, France.
Comité Littéraire (ກົມວັນນະຄະດີ). 1962. Lao grammar (ໄວຍາກອນລາວ).
Doan-Nguyen Hai. 1998. Techniques génériques d’accumulation d’ensembles lexicaux structurés à partir de ressources dictionnairiques informatisées multilingues hétérogènes. Ph.D. thesis, Grenoble University, France.
Jadin Paul. 2005. Dictionnaire laotien-français. Unpublished.
Lafourcade Mathieu. 1994. Génie logiciel pour le génie linguiciel. Ph.D. thesis, Grenoble University, France.
Nginn Pierre Somchine. 1980. Dictionnaire français-lao. Idase, Paris.
Rarunrom Sampan. 1991. Dictionary-based Thai word separation. Senior project report, Thailand.
Reinhorn Marc. 1970. Dictionnaire laotien-français. CNRS.
Reinhorn Marc. 1975. Grammaire de la langue lao. INALCO.