Tdr1 – Electronic corpora for lexicographers: how we can optimise the output of KWIC lists consultations¹

¹ Special thanks to Hans Paulussen, Jean Véronis, Pierre Corbin, Katia Paykin and Leland Tracy for their suggestions and/or rereadings, their data, and all our discussions.

2001, paper presented at the Twelfth CLIN (Computational Linguistics In the Netherlands) Meeting (Twente, Netherlands, 30 November 2001). [Written in 2001; 20,878 characters; cf. C4]

Des usages en corpus aux descriptions dictionnairiques : HDR – N. Gasiglia

Abstract

For the past twenty years, lexicographers have acknowledged the central role of corpus exploration, although they have had only monolingual corpora at their disposal. The most recent research, however, is starting to take into account the possibility of using bilingual corpora. In my presentation, I would like to concentrate on two points:
– first, to evaluate how the availability of electronic corpora (principally aligned ones) can contribute to the richness of monolingual and bilingual dictionaries;
– secondly, to explain how we can optimise the output of KWIC list consultations (concordances) by adding automatic analyses and sorting procedures.

I will pay particular attention to the automatic syntactic and semantic analysis of each extraction (each line) of a KWIC list. By “semantic analysis” I mean a combination of synonymy information taken from an electronic synonym dictionary and information extracted from the object-class system, developed for nouns by the LLI using verb sub-categorisation frames. In the case of bilingual aligned corpora, syntactic and semantic analyses are facilitated by the presence of translations, which help to disambiguate a complex analysis, since the two languages (source and target) do not present the same ambiguities in the same position. As a result, I will be able to compare the analyses of the extractions and sort them according to their similarities, thus presenting the lexicographer with an improved KWIC list that can be examined more quickly and easily.

1. Introduction

Lexicographers acknowledge the central role of “corpus explorations”, but corpora have taken a relatively small place in the work of French professional lexicographers. I briefly explain this situation by four reasons.

– First of all, it is particularly difficult to write a relevant description script for a unit search. Lexicographers work on everyday language, which we can characterise as non-specialised language enriched by a specialised one, as observed in popular scientific or technical works. Sequences extracted from this type of language are more difficult to describe than terminological units, for example. We can efficiently search for sequences that fit Noun Noun or Adjective Noun patterns in scientific or technical texts, where they are frequent and appear in similar contexts, but it is more difficult to detect lexical units that have no regular form or particular marking and that appear with a high degree of frequency.

– Secondly, because lexicographers work on everyday language, they must have corpora that reflect real usage. As we will see later on, representative corpora, like the British National Corpus for example, are the most efficient resources. For French, we have Frantext, the INALF corpus, which comprises primary documents produced from the thirteenth to the nineteenth century, with a large proportion of literary texts. Other resources can be found on CD-ROMs containing various press materials, but we do not have a representative corpus for “standard” contemporary French. Consequently, because French lexicographers cannot use French-language corpora, they also rarely use corpora for other languages.

– Thirdly, the qualitative inadequacy of the computer programs used for extracting information from corpora presents a real problem. For example, while lexicographers at Oxford University Press use a program that provides lists of collocations and simple syntactic analyses for working with English corpora, lexicographers working with French must be satisfied with KWIC lists. Humans must therefore read these lists and detect the pertinent information themselves.

– Last but not least, in France, editorial offices are old-fashioned, and all of them have produced large dictionaries that become references for new creations. Furthermore, the relatively poor economic situation is not conducive to launching completely new and expensive projects (as COBUILD did when integrating corpus research into its lexicographical methods).

Given the situation just described, if we want to introduce corpus research into French lexicography offices in a systematic way, we must find programs that will make lexicographers more efficient: tools that will allow them to work more comfortably, at the same speed (or a better one) and with richer information.

2. What sorts of corpora are necessary for lexicographers?

Lexicographers’ corpora must reflect real usage. Ideally, therefore, corpora must contain a very large compilation of linguistic samples produced in many different circumstances, in speech or in writing, in different registers, etc., and represented in a qualitative and quantitative balance. As I mentioned earlier, we do not have representative corpora of this kind for “standard” contemporary French. However, lexicographers do have access to other electronic resources. They use them when they work on specialised dictionaries, such as computer language dictionaries or œnological ones. They simply take popular publications, transcriptions of conferences, etc., and build corpora that are smaller and more specialised than representative ones.
The properties of these corpora are conditioned by the specialised domain explored and by the characteristics of the given dictionary (what sort of audience? what degree of precision in the lexical descriptions? how many described items? etc.). Here, I will assume that all lexicographers can access and compile specialised electronic corpora and extract information from them, thus observing a wide range of linguistic usages. More precisely, I will assume that it is possible to select multilingual corpora: parallel or, preferably, aligned ones. With aligned corpora, like the excerpt in figure 1, lexicographers can build and read parallel KWIC lists that make any asymmetric linguistic behaviour² in the different languages represented in the corpora more apparent.

² Cf. Grundy (1996).

Figure 1. Bus in aligned computer language corpus (sentences extracted from a personal computer language corpus)

  EN: As of 2003, all PCs and servers will be equipped with the new data bus Arapaho.
  FR: A partir de 2003, tous les PC et serveurs seront équipés du nouveau bus de données Arapahoe.

  EN: INTEL will boost Celeron with a 100 MHz bus.
  FR: Intel va doper le Celeron avec un bus à 100 MHz.

  EN: As of next year Celeron could reach the speed of 800 MHz. At the same time, the data bus of the processor should increase from 66 to 100 MHz.
  FR: Dès l’année prochaine, la vitesse du Celeron pourrait atteindre 800 MHz. Parallèlement, le bus de données du processeur devrait passer de 66 à 100 MHz.

  EN: Graphic card Voodoo5 5500 available for PCI bus.
  FR: Carte graphique Voodoo5 5500 disponible sur bus PCI.

  EN: This new card should meet the expectations of the users who do not have an AGP bus but who still want to improve their 3D graphics, especially for games.
  FR: Cette nouvelle carte devrait répondre aux attentes des utilisateurs qui ne disposent pas de bus AGP dans leur micro et qui souhaitent tout de même améliorer leur affichage 3D, surtout quand il s’agit de jouer.

  EN: How to distinguish between Pentium II and III? Solution: by comparing their processor and bus speeds.
  FR: Comment différencier les Pentium II et III ? Solution : en comparant leur fréquence et la vitesse de bus.

  EN: The M20 model benefits from a 133 MHz bus system and an Ultra 160 SCSI control. It accepts up to two Pentium III processors.
  FR: Le modèle M20 bénéficie d’un bus système à 133 MHz et d’un contrôleur Ultra 160 SCSI. Il accepte jusqu’à deux processeurs Pentium III.

  EN: The DGE-500SXPCI card functions in the Bus Master (32 or 64 bits), i.e. without data forwarding by the processor, in integral duplex mode.
  FR: La carte DGE-500SXPCI fonctionne en Bus Master (32 ou 64 bits), c’est-à-dire sans que les données transitent par le processeur, en mode duplex intégral.

  EN: GeForce 2 GTS for PCI bus.
  FR: Une GeForce 2 GTS sur bus PCI.

  EN: This Compaq 1U server employs two Pentium III 800 MHz processors with a 133 MHz bus system.
  FR: Ce serveur 1U de Compaq exploite deux Pentium III à 800 MHz sur bus système 133 MHz.

In today’s presentation, I will use three electronic corpora:
1) the personal computer language corpus presented above;
2) the Namur trilingual aligned corpus, which brings together English, French and Dutch and was created by Hans Paulussen in 1996³;
3) the Le Monde’97 CD-ROM, French press material.

³ Cf. Paulussen (1999).

3. What sorts of pre-treatments?

Three operations can optimise the electronic extraction of information:
– lemmatisation, which associates each word in the document with its lemma (as in figure 2),
– morpho-syntactic tagging, which matches each word with its part-of-speech tag (as in figure 2),
– and indexing, which makes it possible to find all occurrences of a sequence quickly and to know its frequency.

Figure 2. Two lemmatised and tagged sentences with bus extracted in Namur trilingual aligned corpus

  ====début de phrase====
  forms:  The Hindu Kush !
  lemmas: the Hindu Kush !
  tags:   DETD NPMS NPMS PCTFORTE
  ====fin de phrase====
  ====début de phrase====
  forms:  She moans if she has to walk to the bus stop .
  lemmas: it moan if it have to walk to the bus stop .
  tags:   PPER3S VINP3S SUB PPER3S VINP3S VINF PREP DETD NCMS NCMS PCTFORTE
  ====fin de phrase====

  ====début de phrase====
  forms:  L’ HindouKoch !
  lemmas: le HindouKoch !
  tags:   DETDEF NPMS PCTFORTE
  ====fin de phrase====
  ====début de phrase====
  forms:  Elle en parle comme si c’ était la porte à côté .
  lemmas: il en parler comme si ce être le porte à côté .
  tags:   PPER3S PREP VINP3S CONJ PDS VINF DETDFS NCFS PREP NCMS PCTFORTE
  ====fin de phrase====

Thus, with KWIC lists built from a lemmatised, tagged and indexed corpus, a lexicographer can find each lexical unit and evaluate its exact place in communication. However, since humans must read and identify all the relevant information in real time, this automatic pre-treatment is insufficient: it does not supply, quickly enough, the information needed for an efficient description of the syntactic and semantic behaviour of these lexical units. It only provides lists of occurrences sorted according to the right-hand context. To get a good representation of a unit’s syntactic and semantic behaviour, we must carry out a more complete linguistic analysis.

I propose an automatic analysis procedure for the sentences extracted in KWIC lists, that is, those that contain the observed unit. This analysis takes as input sentences in which every unit is matched with a part-of-speech tag, and identifies the relations between the observed unit and verbs, nouns or adjectives. For example, if we take this unit to be the noun bus, we obtain the following information (figure 3) for each extracted sentence when using a bilingual (French / English) extraction from the Namur trilingual aligned corpus and from the computer language corpus mentioned earlier.
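As a rough illustration of the procedure just described, the following Python sketch looks up a lemma in an index and derives a relation between the observed noun and a neighbouring verb. It is a minimal sketch under strong assumptions, not the analysis actually used to produce figure 3: the coarse tags (VERB, NOUN, PREP, etc.), the hand-tagged sample sentences and the functions build_index and relation are all hypothetical, and the nearest-verb heuristic merely stands in for a real syntactic analysis.

```python
from collections import defaultdict

# Each token is (form, lemma, tag). The coarse tags are illustrative only
# and are not the tagset shown in figure 2.
SENTENCES = [
    [("je", "je", "PRON"), ("descendais", "descendre", "VERB"),
     ("du", "de", "PREP"), ("car", "car", "NOUN")],
    [("Mon", "mon", "DET"), ("car", "car", "NOUN"),
     ("part", "partir", "VERB")],
    [("J'", "je", "PRON"), ("ai", "avoir", "AUX"),
     ("pris", "prendre", "VERB"), ("le", "le", "DET"),
     ("bus", "bus", "NOUN")],
]


def build_index(sentences):
    """Map each lemma to the (sentence, token) positions where it occurs."""
    index = defaultdict(list)
    for s, sentence in enumerate(sentences):
        for t, (_form, lemma, _tag) in enumerate(sentence):
            index[lemma].append((s, t))
    return index


def relation(sentence, t):
    """Return (verb lemma, role, preposition) for the noun at position t.

    Hypothetical heuristic: the nearest verb to the left governs the noun
    as a complement (possibly through a preposition); failing that, the
    first verb to the right takes the noun as its subject.
    """
    prep = None
    for _form, lemma, tag in reversed(sentence[:t]):
        if tag == "PREP":
            prep = lemma
        elif tag == "VERB":
            return (lemma, "complement", prep)
    for _form, lemma, tag in sentence[t + 1:]:
        if tag == "VERB":
            return (lemma, "subject", None)
    return None


index = build_index(SENTENCES)
# Frequency of a lemma is simply the number of indexed positions.
frequency = len(index["bus"])
# One relation per occurrence of the observed unit.
relations = [relation(SENTENCES[s], t) for s, t in index["car"]]
print(frequency, relations)
```

The tuples returned here are deliberately poorer than the patterns of figure 3: they carry no semantic class such as NHUM, which would have to come from an additional lexical resource.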
For the analysis, I lemmatise and tag the sentences extracted from these corpora but, for better readability, I do not reproduce the tags in figure 3, and I write the identified relations in different colours.

Figure 3. Relations between the observed unit (the noun bus) and verbs, nouns or adjectives (bilingual (French / English) extractions from the Namur trilingual aligned corpus and from the computer language corpus mentioned earlier)

  EN: ‘This time we really do have to be off,’ I said, to break the spell. ‘The bus leaves in ten minutes.’
      → to leave (NHUM, N)
  FR: « Allez, il faut vraiment qu’on y aille, cette fois », dis-je pour briser l’enchantement. Mon car part dans dix minutes.
      → partir (NHUM, N)

  EN: where I dutifully attended lectures before dutifully taking the bus home to my parents every evening.
      → to take (NHUM, N)
  FR: quand je fréquentais les noirs amphithéâtres de la Sorbonne, prenant sagement l’autobus «S» chaque soir pour rentrer dans ma famille.
      → prendre (NHUM, N)

  EN: I was getting off the bus in Hampden with two suitcases and fifty dollars in my pocket.
      → to get (NHUM, [N]off)
  FR: je descendais du car à Hampden avec deux valises et cinquante dollars en poche.
      → descendre (NHUM, [N]de)

  EN: A person was “caught by the clock”, or there was a last-minute hitch, or missed the train or the bus. The car wouldn’t start.
      → to miss (NHUM, N)
  FR: On a été pris par le temps, on a eu un empêchement de dernière minute, on a raté le train ou l’autobus, la voiture n’a pas “voulu” démarrer,
      → rater (NHUM, N)

  EN: My father took Bert to visit Queenie, so I went to Sainsbury’s on the bus.
      → to go (NHUM, NLIEU, [N]on)
  FR: Mon père a emmené Bert voir Queenie à l’hôpital. J’ai donc pris le bus pour aller chez Sainsbury.
      → prendre (NHUM, N)

  EN: The DGE-500SXPCI card functions in the Bus Master (32 or 64 bits), i.e. […]
      → to function (NCOMPUTER-COMPONENT, [N]in)
  FR: La carte DGE-500SXPCI fonctionne en Bus Master (32 ou 64 bits), c’est-à-dire […]
      → fonctionner (NCOMPOSANT-INFORMATIQUE, [N]en)

  EN: As of 2003, all PCs and servers will be equipped with the new data bus Arapaho.
      → to be equipped (NCOMPUTER-COMPONENT, [N]with)
  FR: A partir de 2003, tous les PC et serveurs seront équipés du nouveau bus de données Arapahoe.
      → être équipé (NCOMPOSANT-INFORMATIQUE, [N]de)

An automatic analysis can do better than this. It can bring together similar occurrences: all the sentences where the target unit is the subject of the verb “passer (bus, [NLIEU]par)”, or the object of the verb “monter (NHUM, [bus]dans)”, or where it appears in a prepositional phrase that modifies a noun, “(desserte + liaison) [bus]par”, as we can see for the French sentences extracted from the Le Monde’97 CD-ROM (figure 4).

4. Introducing synonymous information

This looks easy to do, and I argue here that it can be done even better by introducing synonymy information taken from an electronic synonym dictionary (the ELSAP one⁴ for French, for example). The preceding groupings can be merged if the program knows that several verbs that take a given unit as their object or subject are synonyms in constructions such as “(prendre + emprunter + (filer + aller)) [bus]en”.

⁴ Cf. http://elsap.unicaen.fr/dicosyn.html.

Figure 4. Relations between the noun bus and verbs or nouns, grouped by similar syntactic constructions and synonymous contexts (French sentences extracted from the Le Monde’97 CD-ROM)

  même s’il leur faut prendre un bus à la porte Maillot, […], pour se rendre
      → prendre (NHUM, N)
  Seuls les clients de ces derniers peuvent emprunter le bus en se procurant un ticket
      → emprunter (NHUM, N)
  Les étudiantes partaient en bus vers Amsterdam pour se faire avorter
      → partir (NHUM, (N)en, [NLIEU]vers)
  ou bien il filait en bus jusqu’à la gare et descendait à Nanterre-la-Folie
      → filer (NHUM, [N]en)
  “Sais-tu que maintenant tu peux rentrer chez toi en bus ?”
      → rentrer (NHUM, [N]en)
  “[Il] préférait se rendre sur place en train, en bus ou en taxi, le trophée dans
      → se rendre (NHUM, [N]en)
  les dizaines de milliers de personnes qui, venant en bus, par le train ou en voiture,
      → venir (NHUM, [N]en)
  Les autres achetaient une carte Greyhound pour sillonner les États-Unis en bus, ↵
      → sillonner (NHUM, NLIEU, [N]en)
  allaient en Grèce ou en Turquie en car.
      → aller (NHUM, (NLIEU)en, [N]en)
  10 000 kilomètres parcourus à pied, en train et en bus”, explique Damien Seguy
      → parcourir (NHUM, NDISTANCE, [N]en)
  le troisième âge […] visite en bus et en paquebots les vues imprenables d’Europe
      → visiter (NHUM, N, [N]en)
  une dizaine de jeunes montent dans le bus à l’arrêt du cours de Verdun
      → monter (NHUM, [N]dans)
  nous devons ensuite nous trimbaler pendant parfois 80 km en bus ou en voiture
      → trimbaler (NHUM, NHUM, [N]en)
  Ils seront transportés en bus - dont les rideaux seront fermés - qui suivront
      → transporter (N, NHUM, [N]en)
  Tous les habitants furent emmenés en bus, 1 500 véhicules furent mobilisés.
      → emmener (N, NHUM, [N]en)
  les cinq lignes de bus qui passent par le Mirail ne fonctionnent plus.
      → passer (N, [NLIEU]par)
  des conférences de presse matinales quotidiennes et des promenades en bus
      → promenade [N]en
  s’inscrire pour des excursions en bus, à bicyclette, sur l’eau ou en hélicoptère
      → excursion [N]en
  Pendant les longs voyages en bus lors des tournées théâtrales, il n’hésitait pas
      → voyage [N]en
  L’amélioration […] ainsi que des dessertes locales par bus est aussi une priorité
      → desserte [N]par
  entre l’ancienne et la nouvelle gare : barreau ferroviaire ou liaison par bus ?
      → liaison [N]par

However, while groupings give us important indications, the observation of ungrouped sentences can help to point out homonymous or polysemous units. In this case, using synonymy information about the observed unit itself can help the lexicographer to identify the semantic variations that appear in different constructions.

5. Using information extracted from the object-class system

The other resources that I use for the syntactic and semantic analyses are data extracted from the object-class system, developed for nouns by Gaston Gross (and members of the LLI⁵) using verb sub-categorisation frames. We saw above that, for the noun bus, I identified groups of synonymous verbs that take it as their argument. With a subset of these verbs, bus refers to a means of transport. More precisely, the noun bus is a member of the object class <moyens de transport terrestre – à moteur – en commun>. Gaston Gross’s studies⁶ of nouns denoting means of transport give us the list of operator verbs, that is, the verbs that delimit the set of possible uses for each unit of the object class. This list, drawn up with the help of linguistic tests, is meant to contain all possible operator verbs for each object class. This rich information should be able to take the place of the synonymy information used above for building groups of sentences.
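One way to picture how operator-verb lists and synonym sets could drive the sorting of KWIC lines is the following Python sketch. It is only an illustration under stated assumptions: the contents of OPERATOR_VERBS and SYNONYM_SETS are invented placeholders (they are not taken from Gross’s lists or from the ELSAP dictionary), group_key is a hypothetical helper, and the relations are a few of those shown in figure 4. Operator-verb lists are consulted first, and synonym sets serve as a fallback for verbs that the lists do not cover.

```python
from collections import defaultdict

# Placeholder resources, invented for illustration only.
OPERATOR_VERBS = {
    "moyens de transport": {"prendre", "emprunter", "monter"},
}
SYNONYM_SETS = [{"filer", "aller"}]

# (verb lemma, relation as printed in figure 4) for a few KWIC lines.
RELATIONS = [
    ("prendre", "prendre (NHUM, N)"),
    ("emprunter", "emprunter (NHUM, N)"),
    ("filer", "filer (NHUM, [N]en)"),
    ("aller", "aller (NHUM, (NLIEU)en, [N]en)"),
    ("monter", "monter (NHUM, [N]dans)"),
    ("passer", "passer (N, [NLIEU]par)"),
]


def group_key(verb):
    """Choose the group of a verb: object class first, then synonym set."""
    for object_class, verbs in OPERATOR_VERBS.items():
        if verb in verbs:
            return ("object-class", object_class)
    for number, synonyms in enumerate(SYNONYM_SETS):
        if verb in synonyms:
            return ("synonyms", number)
    # Verbs covered by neither resource stay on their own.
    return ("ungrouped", verb)


groups = defaultdict(list)
for verb, relation in RELATIONS:
    groups[group_key(verb)].append(relation)

# Print the KWIC relations group by group.
for key in sorted(groups, key=str):
    print(key, groups[key])
```

Lines that end up in an “ungrouped” bucket are the ones a lexicographer would inspect for possible homonymy or polysemy, as noted in section 4.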
However, since these lists have not been compiled for all French nouns, we can only use them to complement the synonym-based ones.

⁵ “Laboratoire de Linguistique Informatique”, Paris 13 University.
⁶ These studies are reported in Gross (1994).

6. Conclusion

Thus, I hope to have shown that, in order to maximise the efficiency of lexicographers’ work, we need to provide them with data sorted according to semantic information derived from synonymy relations and from the operator-verb lists of each object class. The necessary data are extracted from bilingual corpora and are subjected to syntactic analyses, which bring out the relationships between a given linguistic unit and different grammatical categories.

References

ATKINS B.T.S. (1990), « Corpus lexicography: the bilingual dimension », in L. Cignoni & C. Peters eds, Computational Lexicology and Lexicography. Special issue dedicated to Bernard Quemada, Linguistica computazionale VI, Pisa, Giardini Editori e Stampatori, pp. 43-64.

BLANK I. (1995), « Sentence alignment: methods and implementation », Traitement automatique des langues 36.1-2, pp. 81-99.

COBUILD = Collins COBUILD English Language Dictionary, London and Glasgow, Collins, 1987.

GROSS G. (1994), « Classes d’objets et description des verbes », Langages 115, pp. 15-30.

GRUNDY V. (1996), « L’utilisation d’un corpus dans la rédaction du dictionnaire bilingue », in H. Béjoint & P. Thoiron dir., Les dictionnaires bilingues, Aupelf-Uref / Louvain-la-Neuve, Duculot, pp. 127-149.

HABERT B., FABRE C. & ISSAC F. (1998), De l’écrit au numérique. Constituer, normaliser et exploiter les corpus électroniques, Paris, InterEditions.

HABERT B., NAZARENKO A. & SALEM A. (1997), Les linguistiques de corpus, coll. U, Paris, Armand Colin.

IDE N. & VÉRONIS J. (1994), « MULTEXT: Multilingual Text Tools and Corpora », in Proceedings of the 14th International Conference on Computational Linguistics (COLING’94), Kyoto (Japan), pp. 588-592.

LE PESANT D. (1994), « Les compléments nominaux du verbe lire. Une illustration de la notion de “classe d’objets” », Langages 115, pp. 31-46.

MELAMED I.D. (1999), « Bitext maps and alignments via pattern recognition », Computational Linguistics 25.1, pp. 107-130.

PAULUSSEN H. (1995), « Compiling a trilingual parallel corpus », Contragram (Quarterly Newsletter of the Contrastive Grammar Research Group of the University of Gent) 3, pp. 10-13.

PAULUSSEN H. (1999), A Corpus-based Contrastive Analysis of English “on”/“up”, Dutch “op” and French “sur” within a Cognitive Framework, PhD, University of Gent.

PLOUX S. & VICTORRI B. (1998), « Construction d’espaces sémantiques à l’aide de dictionnaires de synonymes », Traitement automatique des langues 39.1, pp. 161-182.

ROBERTS R.P. & MONTGOMERY C. (1996), « The use of corpora in bilingual lexicography », in M. Gellerstam, J. Järborg, S.-G. Malmgren, K. Norén, L. Rogström & C. Röder Papmehl eds, Euralex’96 Proceedings, Göteborg, Göteborg University, pp. 457-464.

VÉRONIS J. & KHOURI L. (1995), « Étiquetage grammatical multilingue : le projet MULTEXT », Traitement automatique des langues 36.1-2, pp. 233-248.