Tdr1
Electronic corpora for lexicographers:
how we can optimise the output of KWIC lists consultations 1
Abstract
Over the past twenty years, lexicographers have come to acknowledge the central role of corpus exploration, although the corpora at their disposal have been almost exclusively monolingual. However, the most recent research is starting to take into account the possibility of using bilingual corpora. In my presentation, I would like to concentrate on two points:
– first, to evaluate how the availability of electronic corpora (principally aligned ones) can contribute to the richness of monolingual and bilingual dictionaries,
– secondly, to explain how we can optimise the output of KWIC list consultations (concordances) by adding automatic analyses and sorting procedures.
I will pay particular attention to the automatic syntactic and semantic analysis of each extraction (each line) of a KWIC list. By “semantic analysis” I mean a combination of synonymy information taken from an electronic synonym dictionary and information extracted from the object-class system, developed for nouns by the LLI on the basis of verb sub-categorisation frames.
In the case of bilingual aligned corpora, syntactic and semantic analyses are facilitated by the presence of translations, which help to disambiguate a complex analysis, since the two languages (source and target) do not present the same ambiguities in the same positions. As a result, I will be able to compare the analyses of each extraction and sort the extractions according to their similarities, thus presenting the lexicographer with an improved KWIC list that can be examined more quickly and easily.
1 Special thanks to Hans Paulussen, Jean Véronis, Pierre Corbin, Katia Paykin and Leland Tracy for their suggestions and/or rereadings, their data, and all our discussions.
Paper presented in 2001 at the Twelfth CLIN (Computational Linguistics in the Netherlands) Meeting, Twente (Netherlands), 30 November 2001.
[Written in 2001; 20,878 characters; cf. C4]
Des usages en corpus aux descriptions dictionnairiques : HDR – N. Gasiglia
1. Introduction
Lexicographers acknowledge the central role of “corpus explorations” but, in the work of French professional lexicographers, corpora have so far played a relatively small part. I will briefly give four reasons for this situation.
– First of all, it is particularly difficult to provide a relevant search description for a given unit. Lexicographers work on everyday language, which we can characterise as non-specialised language enriched by the specialised language found in popular scientific or technical works. Sequences extracted from this type of language are more difficult to describe than terminological units, for example. We can efficiently search for sequences that fit Noun Noun or Adjective Noun patterns in scientific or technical texts, where they are frequent and appear in similar contexts, but it is more difficult to detect lexical units that have no regular form or particular marking and that appear with a high degree of frequency.
– Secondly, because lexicographers work on everyday language, they must have corpora that reflect real usage. As we will see later on, representative corpora, such as the British National Corpus, are the most efficient resources. For French, we have Frantext, the INALF corpus, which brings together primary documents produced from the thirteenth to the nineteenth century, a large part of them literary texts. Other resources can be found on CD-ROMs containing various press materials, but we do not have a representative corpus of “standard” contemporary French. Consequently, because French lexicographers have no suitable French-language corpus to work with, they also rarely use corpora for other languages.
– Thirdly, the qualitative inadequacy of the computer programs used for extracting information from corpora is a real problem. For example, while lexicographers at Oxford University Press use a program that provides lists of collocations and simple syntactic analyses when working with English corpora, lexicographers working on French must be satisfied with KWIC lists. Humans must therefore read the lists and detect the pertinent information themselves.
– Last but not least, editorial offices in France are old-fashioned, and all of them have produced large dictionaries that serve as references for new ones. Furthermore, the relatively poor economic situation is not conducive to launching completely new and expensive projects (as COBUILD did when integrating corpus research into its lexicographical methods).
Given the situation just described, if we want to introduce corpus research into French lexicography offices in a systematic way, we must find programs that make lexicographers more efficient: tools that allow them to work more comfortably, at the same speed (or faster) and with richer information.
2. What sorts of corpora are necessary for lexicographers?
Lexicographers’ corpora must reflect real usage. Ideally, therefore, a corpus should contain a very large collection of linguistic samples produced in many different circumstances, spoken or written, in different registers, etc., and represented in a qualitatively and quantitatively balanced way.
As I mentioned earlier, we do not have a representative corpus of this kind for “standard” contemporary French. However, lexicographers do have access to other electronic resources. They use them when they work on specialised dictionaries, such as dictionaries of computer language or of oenology. They simply take popular publications, transcriptions of conferences, etc. and build corpora that are smaller and more specialised than representative ones.
The properties of these corpora are conditioned by the specialised domain explored and by the characteristics of the dictionary in question (what sort of audience? what degree of precision in the lexical descriptions? how many items described? etc.). Here, I will assume that all lexicographers can access and compile specialised electronic corpora and extract information from them, thus observing a wide range of linguistic usages.
More precisely, I will assume that it is possible to select multilingual corpora: parallel or, preferably, aligned ones. With aligned corpora, like the excerpt in figure 1, lexicographers can build and read parallel KWIC lists that make any asymmetric linguistic behavior 2 in the different languages represented in the corpora more apparent.
Figure 1. Bus in aligned computer language corpus
(sentences extracted from a personal computer language corpus)
As of 2003, all PCs and servers will be equipped with the new data bus Arapaho.
A partir de 2003, tous les PC et serveurs seront équipés du nouveau bus de données Arapahoe.

INTEL will boost Celeron with a 100 MHz bus.
Intel va doper le Celeron avec un bus à 100 MHz.

As of next year Celeron could reach the speed of 800 MHz. At the same time, the data bus of the processor should increase from 66 to 100 MHz.
Dès l’année prochaine, la vitesse du Celeron pourrait atteindre 800 MHz. Parallèlement, le bus de données du processeur devrait passer de 66 à 100 MHz.

Graphic card Voodoo5 5500 available for PCI bus.
Carte graphique Voodoo5 5500 disponible sur bus PCI.

This new card should meet the expectations of the users who do not have an AGP bus but who still want to improve their 3D graphics, especially for games.
Cette nouvelle carte devrait répondre aux attentes des utilisateurs qui ne disposent pas de bus AGP dans leur micro et qui souhaitent tout de même améliorer leur affichage 3D, surtout quand il s’agit de jouer.

How to distinguish between Pentium II and III? Solution: by comparing their processor and bus speeds.
Comment différencier les Pentium II et III ? Solution : en comparant leur fréquence et la vitesse de bus.

The M20 model benefits from a 133 MHz bus system and an Ultra 160 SCSI control. It accepts up to two Pentium III processors.
Le modèle M20 bénéficie d’un bus système à 133 MHz et d’un contrôleur Ultra 160 SCSI. Il accepte jusqu’à deux processeurs Pentium III.

The DGE-500SXPCI card functions in the Bus Master (32 or 64 bits), i.e. without data forwarding by the processor, in integral duplex mode.
La carte DGE-500SXPCI fonctionne en Bus Master (32 ou 64 bits), c’est-à-dire sans que les données transitent par le processeur, en mode duplex intégral.

GeForce 2 GTS for PCI bus.
Une GeForce 2 GTS sur bus PCI.

This Compaq 1U server employs two Pentium III 800 MHz processors with a 133 MHz bus system.
Ce serveur 1U de Compaq exploite deux Pentium III à 800 MHz sur bus système 133 MHz.
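As an illustration of what a parallel KWIC list over aligned sentences might look like, here is a minimal Python sketch. It assumes the alignment is already available as (English, French) sentence pairs; the names `kwic` and `parallel_kwic`, the context width and the data layout are assumptions made for this example and do not describe the tools actually used. The two sample pairs are taken from figure 1.

```python
import re

# Two aligned (English, French) sentence pairs copied from figure 1.
ALIGNED = [
    ("INTEL will boost Celeron with a 100 MHz bus.",
     "Intel va doper le Celeron avec un bus à 100 MHz."),
    ("GeForce 2 GTS for PCI bus.",
     "Une GeForce 2 GTS sur bus PCI."),
]


def kwic(sentence, keyword, width=25):
    """Return one (left, key, right) triple per whole-word match of keyword."""
    lines = []
    pattern = r"\b%s\b" % re.escape(keyword)
    for match in re.finditer(pattern, sentence, re.IGNORECASE):
        left = sentence[max(0, match.start() - width):match.start()]
        right = sentence[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


def parallel_kwic(pairs, source_key, target_key, width=25):
    """Pair each source-side KWIC line with the KWIC lines of its aligned sentence."""
    rows = []
    for source, target in pairs:
        for source_line in kwic(source, source_key, width):
            rows.append((source_line, kwic(target, target_key, width)))
    return rows


if __name__ == "__main__":
    for source_line, target_lines in parallel_kwic(ALIGNED, "bus", "bus"):
        print("%25s [%s] %-25s" % source_line)
        for target_line in target_lines:
            print("%25s [%s] %-25s" % target_line)
```

Read side by side, the two halves of each row show where the keyword sits in each language (sentence-final in English, followed by "à 100 MHz" or "PCI" in French), which is the kind of asymmetry a parallel list is meant to expose.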
In today’s presentation, I will use three electronic corpora:
1) the personal computer language one, presented earlier;
2) the Namur trilingual aligned corpus, which brings together English, French and Dutch and was created by Hans Paulussen in 1996 3;
3) and the Le Monde’97 CD-ROM, French press material.
2 Cf. Grundy (1996).
3. What sorts of pre-treatments?
Three operations can optimise the electronic extraction of information:
– lemmatisation, which associates each word in the document with its lemma (as in figure 2),
– morpho-syntactic tagging, which matches each word with its part-of-speech tag (as in figure 2),
– and indexing, which makes it possible to find all occurrences of a sequence quickly and to know its frequency.
Figure 2. Two lemmatised and tagged sentences with bus, extracted from the Namur trilingual aligned corpus
English
Form        Lemma       Tag
====début de phrase====
The         the         DETD
Hindu       Hindu       NPMS
Kush        Kush        NPMS
!           !           PCTFORTE
====fin de phrase====
====début de phrase====
She         it          PPER3S
moans       moan        VINP3S
if          if          SUB
she         it          PPER3S
has         have        VINP3S
to walk     to walk     VINF
to          to          PREP
the         the         DETD
bus         bus         NCMS
stop        stop        NCMS
.           .           PCTFORTE
====fin de phrase====
French
Form        Lemma       Tag
====début de phrase====
L’          le          DETDEF
HindouKoch  HindouKoch  NPMS
!           !           PCTFORTE
====fin de phrase====
====début de phrase====
Elle        il          PPER3S
en          en          PREP
parle       parler      VINP3S
comme si    comme si    CONJ
c’          ce          PDS
était       être        VINF
la          le          DETDFS
porte       porte       NCFS
à           à           PREP
côté        côté        NCMS
.           .           PCTFORTE
====fin de phrase====
Thus, with KWIC lists built from a lemmatised, tagged and indexed corpus, a lexicographer can find each lexical unit and evaluate its exact place in communication. However, since humans must read and identify all the relevant information in real time, this automatic pre-treatment is insufficient: it does not deliver pertinent information for an efficient description of the syntactic and semantic behavior of these lexical units quickly enough. It only provides lists of occurrences sorted according to the right-hand context. To get a good representation of a unit’s syntactic and semantic behavior, we must carry out a more complete linguistic analysis.
3 Cf. Paulussen (1999).
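To make the three pre-treatments concrete, the following Python sketch indexes a lemmatised and tagged corpus and produces a KWIC list sorted on the right-hand context. It is only an illustration under stated assumptions: tokens are stored as (form, lemma, tag) triples, and the function names `build_index` and `kwic_lines` are invented for this example. The sample sentence is the English one from figure 2.

```python
from collections import defaultdict

# One sentence as (form, lemma, tag) triples, copied from figure 2.
SENTENCES = [
    [("She", "it", "PPER3S"), ("moans", "moan", "VINP3S"), ("if", "if", "SUB"),
     ("she", "it", "PPER3S"), ("has", "have", "VINP3S"),
     ("to walk", "to walk", "VINF"), ("to", "to", "PREP"),
     ("the", "the", "DETD"), ("bus", "bus", "NCMS"), ("stop", "stop", "NCMS"),
     (".", ".", "PCTFORTE")],
]


def build_index(sentences):
    """Map each lemma to the (sentence number, token number) pairs where it occurs."""
    index = defaultdict(list)
    for s_no, sentence in enumerate(sentences):
        for t_no, (_form, lemma, _tag) in enumerate(sentence):
            index[lemma].append((s_no, t_no))
    return index


def kwic_lines(sentences, index, lemma, span=4):
    """Return (left, key, right) word lists for a lemma, sorted on the right context."""
    lines = []
    for s_no, t_no in index.get(lemma, []):
        forms = [form for form, _lemma, _tag in sentences[s_no]]
        left = forms[max(0, t_no - span):t_no]
        right = forms[t_no + 1:t_no + 1 + span]
        lines.append((left, forms[t_no], right))
    return sorted(lines, key=lambda line: [word.lower() for word in line[2]])


if __name__ == "__main__":
    index = build_index(SENTENCES)
    print("frequency of lemma 'it':", len(index["it"]))
    for left, key, right in kwic_lines(SENTENCES, index, "bus"):
        print(" ".join(left), "[%s]" % key, " ".join(right))
```

Searching by lemma rather than by form is what lets one query retrieve "She" and "she" together; the frequency of a unit is simply the length of its entry in the index.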
I propose an automatic analysis procedure for the sentences extracted in KWIC lists, i.e. those that contain the observed unit. This analysis takes as input sentences in which every unit is matched with a part-of-speech tag, and identifies the relations between the observed unit and verbs, nouns or adjectives. For example, if we take this unit to be the noun bus, we obtain the following information (figure 3) for each sentence of a bilingual (French / English) extraction from the Namur trilingual aligned corpus and from the computer language corpus mentioned earlier.
For the analysis, I lemmatise and tag the sentences extracted from these corpora but, for better readability, I do not reproduce the tags in figure 3, and I mark the identified relations in different colours.
Figure 3. Relations between the observed unit (the noun bus) and verbs, nouns or adjectives (bilingual (French / English) extractions from the Namur trilingual aligned corpus and from the earlier-mentioned computer language one)

‘This time we really do have to be off,’ I said, to break the spell. ‘The bus leaves in ten minutes.’
→ to leave (NHUM, N)
« Allez, il faut vraiment qu’on y aille, cette fois », dis-je pour briser l’enchantement. Mon car part dans dix minutes.
→ partir (NHUM, N)

where I dutifully attended lectures before dutifully taking the bus home to my parents every evening.
→ to take (NHUM, N)
quand je fréquentais les noirs amphithéâtres de la Sorbonne, prenant sagement l’autobus «S» chaque soir pour rentrer dans ma famille.
→ prendre (NHUM, N)

I was getting off the bus in Hampden with two suitcases and fifty dollars in my pocket.
→ to get (NHUM, [N]off)
je descendais du car à Hampden avec deux valises et cinquante dollars en poche.
→ descendre (NHUM, [N]de)

A person was “caught by the clock”, or there was a last-minute hitch, or missed the train or the bus. The car wouldn’t start.
→ to miss (NHUM, N)
On a été pris par le temps, on a eu un empêchement de dernière minute, on a raté le train ou l’autobus, la voiture n’a pas “voulu” démarrer,
→ rater (NHUM, N)

My father took Bert to visit Queenie, so I went to Sainsbury’s on the bus.
→ to go (NHUM, NLIEU, [N]on)
Mon père a emmené Bert voir Queenie à l’hôpital. J’ai donc pris le bus pour aller chez Sainsbury.
→ prendre (NHUM, N)

The DGE-500SXPCI card functions in the Bus Master (32 or 64 bits), i.e. […]
→ to function (NCOMPUTER-COMPONENT, [N]in)
La carte DGE-500SXPCI fonctionne en Bus Master (32 ou 64 bits), c’est-à-dire […]
→ fonctionner (NCOMPOSANT-INFORMATIQUE, [N]en)

As of 2003, all PCs and servers will be equipped with the new data bus Arapaho.
→ to be equipped (NCOMPUTER-COMPONENT, [N]with)
A partir de 2003, tous les PC et serveurs seront équipés du nouveau bus de données Arapahoe.
→ être équipé (NCOMPOSANT-INFORMATIQUE, [N]de)
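The relations of figure 3 could be approximated on tagged input along the following lines. This Python sketch is a deliberately crude heuristic and not the analysis procedure itself: it assumes that the governing verb is the nearest verb to the left of the observed noun and that a preposition met on the way introduces that noun, so it misses subjects and cannot assign semantic classes such as NHUM (printed here as "..."). The function names are invented for the example; the first sentence and its tags come from figure 2, while the second fragment is hand-made for illustration using the same tag labels.

```python
# Sentence from figure 2, as (form, lemma, tag) triples.
FIGURE_2_SENTENCE = [
    ("She", "it", "PPER3S"), ("moans", "moan", "VINP3S"), ("if", "if", "SUB"),
    ("she", "it", "PPER3S"), ("has", "have", "VINP3S"),
    ("to walk", "to walk", "VINF"), ("to", "to", "PREP"),
    ("the", "the", "DETD"), ("bus", "bus", "NCMS"), ("stop", "stop", "NCMS"),
    (".", ".", "PCTFORTE"),
]

# Hypothetical hand-tagged fragment (not corpus data), direct-object case.
TAKE_FRAGMENT = [("takes", "take", "VINP3S"), ("the", "the", "DETD"),
                 ("bus", "bus", "NCMS")]


def relation_frame(sentence, target_lemma):
    """Return (verb lemma, preposition lemma or None) for the first target occurrence.

    Walks leftwards from the target: the first preposition met is kept,
    and the first verb met (any tag starting with "V") ends the search.
    Returns None if the target is absent or no verb precedes it.
    """
    lemmas = [lemma for _form, lemma, _tag in sentence]
    if target_lemma not in lemmas:
        return None
    position = lemmas.index(target_lemma)
    preposition = None
    for _form, lemma, tag in reversed(sentence[:position]):
        if tag == "PREP" and preposition is None:
            preposition = lemma
        elif tag.startswith("V"):
            return (lemma, preposition)
    return None


def format_frame(frame):
    """Render a frame in a notation close to figure 3, with unknown classes elided."""
    verb, preposition = frame
    slot = "[N]%s" % preposition if preposition else "N"
    return "%s (..., %s)" % (verb, slot)


if __name__ == "__main__":
    for sentence in (FIGURE_2_SENTENCE, TAKE_FRAGMENT):
        print(format_frame(relation_frame(sentence, "bus")))
```

A fuller analysis would need at least subject detection and a lexicon of semantic classes; the sketch only shows how part-of-speech tags alone already yield a sortable verb-plus-preposition key.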
An automatic analysis can do better than this. It can bring together similar occurrences: all the sentences in which the target unit is the subject of a verb, as in “passer (bus, [NLIEU]par)”, or the object of a verb, as in “monter (NHUM, [bus]dans)”, or in which it appears in a prepositional phrase that modifies a noun, as in “(desserte + liaison) [bus]par”, as we can see here for the French sentences extracted from the Le Monde’97 CD-ROM (figure 4).
4. Introducing synonymous information
This looks easy to do, and I argue here that it can be done even better by introducing synonymy information taken from an electronic synonym dictionary (the ELSAP one 4 for French, for example). The preceding groupings can be combined if the program knows that several verbs that take a given unit as their object or subject are synonyms in constructions such as “(prendre + emprunter + (filer + aller)) [bus]en”.
Figure 4. Relations between the noun bus and verbs or nouns, reassembled by similar syntactic constructions and synonymous contexts (French sentences extracted from Le Monde’97 CD-ROM)

même s’il leur faut prendre un bus à la porte Maillot, […], pour se rendre
→ prendre (NHUM, N)
Seuls les clients de ces derniers peuvent emprunter le bus en se procurant un ticket
→ emprunter (NHUM, N)
Les étudiantes partaient en bus vers Amsterdam pour se faire avorter
→ partir (NHUM, (N)en, [NLIEU]vers)
ou bien il filait en bus jusqu’à la gare et descendait à Nanterre-la-Folie
→ filer (NHUM, [N]en)
“Sais-tu que maintenant tu peux rentrer chez toi en bus ?”
→ rentrer (NHUM, [N]en)
“[Il] préférait se rendre sur place en train, en bus ou en taxi, le trophée dans
→ se rendre (NHUM, [N]en)
les dizaines de milliers de personnes qui, venant en bus, par le train ou en voiture,
→ venir (NHUM, [N]en)
Les autres achetaient une carte Greyhound pour sillonner les États-Unis en bus, ↵
→ sillonner (NHUM, NLIEU, [N]en)
allaient en Grèce ou en Turquie en car.
→ aller (NHUM, (NLIEU)en, [N]en)
10 000 kilomètres parcourus à pied, en train et en bus”, explique Damien Seguy
→ parcourir (NHUM, NDISTANCE, [N]en)
le troisième âge […] visite en bus et en paquebots les vues imprenables d’Europe
→ visiter (NHUM, N, [N]en)
une dizaine de jeunes montent dans le bus à l’arrêt du cours de Verdun
→ monter (NHUM, [N]dans)
nous devons ensuite nous trimbaler pendant parfois 80 km en bus ou en voiture
→ trimbaler (NHUM, NHUM, [N]en)
Ils seront transportés en bus - dont les rideaux seront fermés - qui suivront
→ transporter (N, NHUM, [N]en)
Tous les habitants furent emmenés en bus, 1 500 véhicules furent mobilisés.
→ emmener (N, NHUM, [N]en)
les cinq lignes de bus qui passent par le Mirail ne fonctionnent plus.
→ passer (N, [NLIEU]par)
des conférences de presse matinales quotidiennes et des promenades en bus
→ promenade [N]en
s’inscrire pour des excursions en bus, à bicyclette, sur l’eau ou en hélicoptère
→ excursion [N]en
Pendant les longs voyages en bus lors des tournées théâtrales, il n’hésitait pas
→ voyage [N]en
L’amélioration […] ainsi que des dessertes locales par bus est aussi une priorité
→ desserte [N]par
entre l’ancienne et la nouvelle gare : barreau ferroviaire ou liaison par bus ?
→ liaison [N]par

4 Cf. http://elsap.unicaen.fr/dicosyn.html.
However, while the groupings give us important indications, observing the non-grouped sentences can help to point out homonymous or polysemous units. In this case, using synonymy information about the observed unit itself can help the lexicographer to identify the semantic variations that appear in different constructions.
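The grouping described in this section can be sketched in Python as follows. The sketch assumes that each occurrence has already been reduced to a (head, slot) pair of the kind listed in figure 4, and that synonymy is available as plain sets of words. The two synonym sets below simply restate the pairs given in the example “(prendre + emprunter + (filer + aller))” and stand in for a real dictionary lookup; all function names are invented for this illustration.

```python
from collections import defaultdict

# Stand-in for a synonym dictionary: the pairs named in the text's example.
SYNONYM_SETS = [{"prendre", "emprunter"}, {"filer", "aller"}]

# (verb, slot filled by the observed unit), a subset of the entries of figure 4.
OCCURRENCES = [
    ("prendre", "N"), ("emprunter", "N"), ("filer", "[N]en"),
    ("aller", "[N]en"), ("monter", "[N]dans"), ("passer", "[NLIEU]par"),
]


def canonical(word, synonym_sets):
    """Return a stable representative: the alphabetically first member of the word's set."""
    for group in synonym_sets:
        if word in group:
            return min(group)
    return word


def group_occurrences(occurrences, synonym_sets):
    """Group occurrences that share a slot and whose heads are synonyms."""
    groups = defaultdict(list)
    for head, slot in occurrences:
        groups[(canonical(head, synonym_sets), slot)].append(head)
    return dict(groups)


def singletons(groups):
    """Keep only the groups containing a single occurrence."""
    return {key: heads for key, heads in groups.items() if len(heads) == 1}


if __name__ == "__main__":
    groups = group_occurrences(OCCURRENCES, SYNONYM_SETS)
    for (head, slot), heads in groups.items():
        print(head, slot, heads)
    print("left ungrouped:", sorted(singletons(groups)))
```

The singleton groups are the non-grouped sentences mentioned above: they are the ones a lexicographer would inspect first when looking for homonymous or polysemous uses.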
5. Using information extracted from the object-class system
The other resources that I use for the syntactic and semantic analyses are data extracted from the object-class system, developed for nouns by Gaston Gross (and members of the LLI 5) on the basis of verb sub-categorisation frames.
We have seen above that, for the noun bus, I have identified groups of synonymous verbs that take it as their argument. With a subset of these verbs, bus seems to refer to a means of transport. More precisely, the noun bus is a member of the <moyens de transport terrestre – à moteur – en commun> object class (motorised public means of land transport).
Gaston Gross’s studies 6 of nouns denoting means of transport give us the list of all possible verbs, the operator verbs that delimit the set of possible uses of each unit of the object class.
This list, drawn up with the help of linguistic tests, is meant to contain all possible operator verbs for each object class. Such rich information should be able to replace the synonymy information used earlier to build groups of sentences. However, since these lists have not been compiled for all French nouns, we can only use them to complement the synonymy-based ones.
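One way to read "complementing" the synonymy-based groups with operator verb lists is a simple fallback: use the object class when the verb is a known operator for a class of the observed noun, and fall back on synonymy otherwise. The Python sketch below illustrates that reading only. The class name is the one given in the text, but the operator verb list is an assumption assembled from verbs that appear in figures 3 and 4, not Gross's actual list, and the function name `semantic_key` is invented.

```python
BUS_CLASS = "<moyens de transport terrestre – à moteur – en commun>"

# Assumed membership and operator verbs, for illustration only.
NOUN_CLASSES = {"bus": [BUS_CLASS]}
OPERATOR_VERBS = {BUS_CLASS: {"prendre", "emprunter", "monter", "descendre", "rater"}}

# Stand-in for a synonym dictionary (pairs named in section 4).
SYNONYM_SETS = [{"prendre", "emprunter"}, {"filer", "aller"}]


def semantic_key(noun, verb, noun_classes, operator_verbs, synonym_sets):
    """Return a (source, label) sorting key for a verb used with the observed noun.

    Object-class information is preferred; synonymy is the fallback;
    anything else is left ungrouped under the verb itself.
    """
    for object_class in noun_classes.get(noun, []):
        if verb in operator_verbs.get(object_class, set()):
            return ("object-class", object_class)
    for group in synonym_sets:
        if verb in group:
            return ("synonyms", min(group))
    return ("ungrouped", verb)


if __name__ == "__main__":
    for verb in ("monter", "filer", "fonctionner"):
        print(verb, semantic_key("bus", verb, NOUN_CLASSES, OPERATOR_VERBS, SYNONYM_SETS))
```

With these assumed data, "monter" is grouped through the object class, "filer" through synonymy, and "fonctionner" stays ungrouped, which is where the computing sense of bus would surface.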
5 “Laboratoire Linguistique Informatique”, Paris 13 University.
6 These studies are reported in Gross (1994).
6. Conclusion
Thus, I hope to have shown that, in order to maximise the efficiency of lexicographers’ work, we need to provide them with data sorted according to semantic information derived from synonymy relations and from the operator verb lists of each object class. The necessary data are extracted from bilingual corpora and submitted to syntactic analyses, which bring out the relationships between a given linguistic unit and units of different grammatical categories.
References
ATKINS B.T.S. (1990), « Corpus lexicography: the bilingual dimension », in L. Cignoni & C. Peters eds, Computational Lexicology and Lexicography. Special issue dedicated to Bernard Quemada, Linguistica computazionale VI, Pisa, Giardini Editori e Stampatori, pp. 43-64.
BLANK I. (1995), « Sentence alignment: methods and implementation », Traitement automatique des langues 36.1-2, pp. 81-99.
COBUILD = Collins COBUILD English Language Dictionary, London and Glasgow, Collins,
1987.
GROSS G. (1994), « Classes d’objets et description des verbes », Langages 115, pp. 15-30.
GRUNDY V. (1996), « L’utilisation d’un corpus dans la rédaction du dictionnaire bilingue », in H. Béjoint & P. Thoiron dir., Les dictionnaires bilingues, Aupelf-Uref / Louvain-la-Neuve, Duculot, pp. 127-149.
HABERT B., FABRE C. & ISSAC F. (1998), De l’écrit au numérique. Constituer, normaliser et
exploiter les corpus électroniques, Paris, InterEditions.
HABERT B., NAZARENKO A. & SALEM A. (1997), Les linguistiques de corpus, coll. U, Paris,
Armand Colin.
IDE N. & VÉRONIS J. (1994), « MULTEXT: Multilingual Text Tools and Corpora », in Proceedings of the 14th International Conference on Computational Linguistics (COLING’94), Kyoto (Japan), pp. 588-592.
LE PESANT D. (1994), « Les compléments nominaux du verbe lire. Une illustration de la notion de “classe d’objets” », Langages 115, pp. 31-46.
MELAMED I.D. (1999), « Bitext maps and alignments via pattern recognition », Computational Linguistics 25.1, pp. 107-130.
PAULUSSEN H. (1995), « Compiling a trilingual parallel corpus », Contragram (Quarterly Newsletter of the Contrastive Grammar Research Group of the University of Gent) 3, pp. 10-13.
PAULUSSEN H. (1999), A Corpus-based Contrastive Analysis of English “on”/“up”, Dutch
“op” and French “sur” within a Cognitive Framework, PhD, University of Gent.
PLOUX S. & VICTORRI B. (1998), « Construction d’espaces sémantiques à l’aide de dictionnaires de synonymes », Traitement automatique des langues 39.1, pp. 161-182.
ROBERTS R.P. & MONTGOMERY C. (1996), « The use of corpora in bilingual lexicography », in
M. Gellerstam, J. Järborg, S.-G. Malmgren, K. Norén, L. Rogström & C. Röder Papmehl
eds, Euralex’96 Proceedings, Göteborg, Göteborg University, pp. 457-464.
VÉRONIS J. & KHOURI L. (1995), « Étiquetage grammatical multilingue : le projet MULTEXT », Traitement automatique des langues 36.1-2, pp. 233-248.