ATOLL - Software Tools for Natural Language Processing

Éric de la Clergerie
http://atoll.inria.fr
Evaluation Seminar
SYM C : Management and processing of language and data
Dourdan, November 15-16th 2005
INRIA
INRIA
É. de la Clergerie
ATOLL
SymC 2005/11/15
1 / 39
Outline
1 Generalities
2 Thematics & Contributions
3 Applications
4 Actions
5 Collaborations
6 Conclusions
ATelier d’Outils Logiciels pour le Langage naturel
Creation : 1997 – Computer Science
NLP
ATOLL objectives :
to develop tools and techniques, theoretical or applied, in order to
help to access, process and use documents in natural language.
INRIA scientific challenges :
To design new applications using the Web and multimedia databases
Keywords : Computational Linguistics ; Natural Language Processing (NLP) ;
Linguistic Engineering ; Parsing ; Syntactic Formalisms ; Linguistic resources ;
ATOLL’s composition (2002 – 2005)
Scientific leader
Éric de la Clergerie (CR)
Permanents
Bernard Lang (DR)
Pierre Boullier (DR)
Philippe Deschamp (CR)
François Thomasset (DR)
Exteriors & Temporaries
Areski Nait Abdallah (Pr, Univ. Brest)
Alexis Nasr (Prof., Del. Paris 7)
François Barthélemy (MdC, CNAM)
Lionel Clément (PostDoc RLT, Ing. RNIL)
Guillaume Rousse (Ing. Biotim)
Stéphane Laurière (Ing. e-COTS)
PhD
Benoît Sagot
Parsing ?
Parsing : Identifying the relationships between words (and groups of words)
Grammar : packed sets of relationships + combination rules
Tree Adjoining Grammars [TAG]
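[Figure: TAG derivation. Substitution (subst) of the tree NP(John) at the NP↓ node of S(NP↓ VP(V(sleeps))) ⇒ S(NP(John) VP(V(sleeps))) ; adjunction (adj) of the auxiliary tree V(⋆V Adv(a lot)) at the V node ⇒ S(NP(John) VP(V(V(sleeps) Adv(a lot)))), i.e. “John sleeps a lot”]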
Problems :
No consensus on the best linguistic formalism
Capturing all syntactic constructions
Capturing word usages (lexicon & statistics)
Handling ambiguities
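The two TAG operations named on this slide, substitution and adjunction, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Node` class and the function names are invented here, there are no features or adjunction constraints, and it is not how DyALog or any ATOLL tool represents trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None   # lexical anchor, e.g. "sleeps"
    subst: bool = False          # substitution site, written NP↓ on the slide
    foot: bool = False           # foot node of an auxiliary tree, written ⋆V


def substitute(tree: Node, initial: Node) -> Node:
    """Replace every substitution site labelled like `initial` by `initial`."""
    if tree.subst and tree.label == initial.label:
        return initial
    kids = [substitute(c, initial) for c in tree.children]
    return Node(tree.label, kids, tree.word, tree.subst, tree.foot)


def plug_foot(aux: Node, subtree: Node) -> Node:
    """Copy the auxiliary tree, hanging `subtree` where its foot node was."""
    if aux.foot:
        return subtree
    kids = [plug_foot(c, subtree) for c in aux.children]
    return Node(aux.label, kids, aux.word, aux.subst, aux.foot)


def adjoin(tree: Node, aux: Node) -> Node:
    """Adjoin `aux` at the top-most nodes carrying the label of its root."""
    if tree.label == aux.label and not tree.subst:
        return plug_foot(aux, tree)
    kids = [adjoin(c, aux) for c in tree.children]
    return Node(tree.label, kids, tree.word, tree.subst, tree.foot)


def words(tree: Node) -> List[str]:
    """Read the yield of a tree from left to right."""
    if tree.word is not None:
        return [tree.word]
    return [w for c in tree.children for w in words(c)]


# Elementary trees of the slide's example.
john = Node("NP", word="John")
sleeps = Node("S", [Node("NP", subst=True),
                    Node("VP", [Node("V", word="sleeps")])])
a_lot = Node("V", [Node("V", foot=True), Node("Adv", word="a lot")])

john_sleeps = substitute(sleeps, john)
john_sleeps_a_lot = adjoin(john_sleeps, a_lot)
print(" ".join(words(john_sleeps_a_lot)))
```

Both operations return new trees instead of modifying the elementary trees, so the same elementary tree can be reused in several derivations.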
ATOLL : Thematics
Formal Language Theory
  Formalisms, Automata & Tabulation
(Open) Ling. Resources
  Lexicon : LEFFF
  Grammar : SXLFG
  MetaGrammar : FRMG
(Open) Tools
  Parser Compilers : DyALog, SYNTAX, RCG
  Pre Parsing : SXPIPE
  Ling. Dev. Environment : MGCOMP, MGTOOLS, TAG_UTILS, FOREST_UTILS, MRCG2RCG . . .
Applications
  Evaluation : EASy
  Info. Extraction : Biotim
  Ling. Knowledge Acquisition
Corpora
Grid techniques
Normalization
  Normalangue
  ISO TC37 SC4
Free Software
  Bernard Lang
ATOLL’s positioning
Balance between theory, development and experimentation
NLP requires many tools and resources
with difficulties to access and exploit linguistic resources (for French)
◮ ⇒ dev. effort + investigation of methodologies to speed up dev.
◮ favor reuse and distribution ⇒ normalization & open source
Software Eng. practices : Versioning + Packaging + Catalog on line
◮ favor the emergence of resources ⇒ LexSynt action, collaborations
Search for comprehension of mechanisms of language
but also see language as cultural artifact :
◮ collaboration with linguists & use of linguistic theories (and formalisms)
◮ exploitation of corpora to capture language usage
NLP is an experimental field
◮ need to play at real scale
large coverage grammars, large lexica, real documents, large corpora
◮ need feedback : evaluation, statistics
◮ real scale applications
Syntactic formalisms
Exploration of a wide range of syntactic formalisms
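[Figure: map of formalisms along two axes, derivation complexity (combining structures) and vocabulary complexity (unification). Formalisms shown : CFG, Datalog, DCG, LFG, HPSG, λ-Prolog, LIG, TAG, Feature TAG, MCS, LCFRS, RCG [Boullier], Meta-RCG [Sagot] ; example symbols : N, NP, V(sing), S(gap(np)), A ↓ ⋆N]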
Meta-Grammars : Abstract level of grammar description based on hierarchies of
classes grouping constraints and requiring/providing functionalities.
⇒ [MG compilation] Generation of target grammars (TAGs, LFGs)
Linguistic Resources : (Meta-)Grammars
No easily available wide coverage French grammars
⇒ development of grammars
SXLFG : wide coverage French LFG grammar [Clément, Sagot, Boullier]
very efficient exploitation with S YNTAX
FRMG : wide coverage French Meta-Grammar [Clergerie]
MG + factorization operators ⇒ generates a very compact TAG grammar
126 trees with only 27 verb-anchored trees (compared with the usual 2-6K trees)
Development environment for grammars
Edition
Visualization
Statistics (coverage, time, ambiguity) on test suites (EUROTRA & TSNLP)
Linguistic Resources : Lexicon
Development of French lexicon LEFFF [Clément,Sagot]
Over 400 000 forms and following lemma distribution :
verbs    common nouns    proper nouns    adj      adv
6788     37183           52938           10024    2127
Verb morphology automatically learned on corpus (+ manual checking)
Syntactic information on verbs (subcategorization, control, . . . )
promet (promises)
v [pred=’promettre_1<subj|ssubj|vsubj,(obj|scomp),(à-obj)>’,cat=v,@SCompInd,@P3s]
v [...]
Multiple Inheritance-based architecture (∼ MG)
@promettre {
< @verbe_ditransitif_à_svp
< objet_phrastique_possible
< complétive_indicatif
| ... }
still incomplete and with errors,
but using corpus parsing to track errors (error mining)
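The `pred` value of the promet entry above packs a lemma and a subcategorization frame into one string. As a rough sketch of how such a value could be read, the Python below splits it into slots. The reading of `|` as "alternative realizations" and of parentheses as "optional slot" is an assumption made for this example, and `parse_pred` and `Slot` are hypothetical names, not part of the lexicon's tooling.

```python
import re
from typing import List, NamedTuple, Tuple


class Slot(NamedTuple):
    functions: List[str]   # alternative realizations, e.g. ["obj", "scomp"]
    optional: bool         # True when the slot is written in parentheses


def parse_pred(pred: str) -> Tuple[str, List[Slot]]:
    """Split 'lemma<slot,slot,...>' into the lemma and its list of slots."""
    m = re.fullmatch(r"(?P<lemma>[^<]+)<(?P<frame>.*)>", pred)
    if m is None:
        raise ValueError(f"not a pred value: {pred!r}")
    slots = []
    for raw in m.group("frame").split(","):
        raw = raw.strip()
        optional = raw.startswith("(") and raw.endswith(")")
        if optional:
            raw = raw[1:-1]          # drop the surrounding parentheses
        slots.append(Slot(raw.split("|"), optional))
    return m.group("lemma"), slots


lemma, slots = parse_pred("promettre_1<subj|ssubj|vsubj,(obj|scomp),(à-obj)>")
for slot in slots:
    print(lemma, slot.functions, "optional" if slot.optional else "required")
```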
Parsing techniques
Parsing remains an algorithmic challenge :
Ambiguity handling & representation (Shared forests)
Push-Down Automata ; Dynamic Programming techniques
Formalisms                 Automata            Tabulation      Notes
CFG                        PDA                 O(n^3)          Lang
DCG / Logic Programming    Logical PDA         completeness    Lang & Clergerie
TAG / LIG                  2-Stack Automata    O(n^6)          Clergerie & Pardo
MC-TAG/osRCG/MCS           Thread Automata     O(n^k)          Clergerie
Guiding techniques (e.g., supertagging [Boullier] & chunking [Sagot])
multi-pass parsing where the shared forest of a level guides the next level
Algorithms on shared forests (e.g., disambiguation)
Robustness (e.g., partial parsing, error correction techniques)
handling “ill formed” sentences, unknown words and constructions
Scaling issues (wide coverage grammars, large lexica)
grammar factorization (FRMG) & many algorithmic issues
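The cubic bound given for CFGs can be illustrated with a small tabular recognizer. The sketch below uses CKY over a toy grammar in Chomsky normal form as a stand-in. It is not the automata-based tabulation used in DyALog or SYNTAX, and the grammar, the function name and the token "a_lot" are invented for the example.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def cky_recognize(words: List[str],
                  lexical: Dict[str, Set[str]],
                  binary: Dict[Tuple[str, str], Set[str]],
                  start: str = "S") -> bool:
    """Tabular recognition: chart[i][k] holds the symbols spanning words[i:k]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, ()))
    # Three nested loops over positions give the O(n^3) behaviour.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for left, right in product(chart[i][j], chart[j][k]):
                    chart[i][k] |= binary.get((left, right), set())
    return start in chart[0][n]


# Toy grammar: S -> NP VP ; VP -> V Adv ; "sleeps" is both a V and a full VP.
LEXICAL = {"John": {"NP"}, "sleeps": {"V", "VP"}, "a_lot": {"Adv"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "Adv"): {"VP"}}

print(cky_recognize(["John", "sleeps", "a_lot"], LEXICAL, BINARY))
```

Each chart cell is computed once and reused by all larger spans, which is the computation sharing that tabulation provides.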
DyALog - Exploring unification-based grammars
An environment (compiler dyacc + abstract machine) :
For compiling tabular parsers
based on : stack automata & dynamic programming
⇒ computation sharing & loop detection
⇒ extraction of shared forests
Also a logic programming environment
⇒ power of logic + possibility of escaping within grammars
Strongly NLP oriented
◮ Ease grammar design : TFS, finite domains, . . .
◮ Multiple grammatical formalisms : DCG, BMG, TAG & TIG, RCG
◮ Functionalities and customization of parsers :
multiple parsing strategies, forests, word lattice, lexicalization, robustness
Used for
◮ a robust Portuguese grammar (bidir. head-driven DCG+BMG) [GLINT]
⇒ dev. of a similar Spanish Grammar [COLE]
◮ MG compiler MGCOMP
◮ French wide coverage TIG/TAG grammar FRMG [ATOLL]
◮ French TAG grammar with semantic interface [LORIA]
Syntax - From CFGs to LFGs and RCGs
SYNTAX [Boullier]
Originally developed to parse Programming Languages with CFG
(+attributes)
Extended with tabular techniques to cover all CFGs
Recently extended for Lexical Functional Grammars
2 passes : CFG + computation of decorations on shared parse forests
+ disambiguation phase
⇒ used for large coverage LFG grammar SXLFG [Clément,Sagot,Boullier]
Extremely efficient on CFGs and on LFGs
+ 300K sent. journalistic corpus (Monde Diplomatique) : 3/4 sent. < 0.1s
RCG [Boullier]
Derived from technology behind SYNTAX
Very powerful and also efficient
Linguistic infrastructure : SxPipe
French Morpho-Syntactic chain SXPIPE [Sagot, Boullier, . . . ] :
Word and sentence segmentations, including multi-words
Named entities (Proper Nouns, Dates, Addresses, URL, :-), . . . )
Spelling corrections
Returns a word lattice (DAG) as input for parsing
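[Figure: word lattice (DAG, states 0 to 7) for the input “Jean abite en outre au 1 , rue de la Pompe”, with edge labels Jean, {Jean} jean, {abite} habite, en, outre, {en outre} en_outre, {au} à, {au} le, {1 , rue de la Pompe} _ADRESSE]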
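A word lattice of this kind can be stored as a list of labelled edges between numbered states, each path from the first to the last state being one possible tokenization. In the sketch below, the edge labels come from the slide, but the assignment of edges to states is a guess made for illustration, and the function name is hypothetical. This is not the SXPIPE output format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source state, target state, token). State numbers are assumed.
EDGES: List[Tuple[int, int, str]] = [
    (0, 1, "Jean"), (0, 1, "jean"),     # proper noun vs. common noun reading
    (1, 2, "habite"),                   # spelling correction of "abite"
    (2, 3, "en"), (3, 4, "outre"),      # two separate words ...
    (2, 4, "en_outre"),                 # ... or one multi-word unit
    (4, 5, "à"), (5, 6, "le"),          # "au" split into "à" + "le"
    (6, 7, "_ADRESSE"),                 # named entity for the address
]


def all_paths(edges: List[Tuple[int, int, str]],
              start: int, end: int) -> List[List[str]]:
    """Enumerate every token sequence leading from `start` to `end`."""
    outgoing: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for src, dst, token in edges:
        outgoing[src].append((dst, token))

    def walk(state: int):
        if state == end:
            yield []
            return
        for dst, token in outgoing[state]:
            for rest in walk(dst):
                yield [token] + rest

    return list(walk(start))


paths = all_paths(EDGES, 0, 7)
for path in paths:
    print(" ".join(path))
```

A parser that accepts the lattice directly does not need this enumeration. It is shown only to make the ambiguity encoded in the DAG visible.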
Normalization
Motivation :
Reusability of resources
Interoperability of tools in Processing Chains
Participation (with Laurent Romary and Langue & Dialogue [LORIA]) in
INRIA consortium SYNTAX
French Normalangue/RNIL action
ISO TC 37 SC4 on “Linguistic Resource Management”
⇒ head of French delegation to ISO meetings
Partially involved in new European project LIRICS
Participation in emerging standards :
MAF Morphosyntactic Annotation Framework (Project leader)
FSR Feature Structure Representation (+ future companion FSD)
DCR Data Category Registry (registering ling. terminology)
LMF Lexical Markup Framework (lexicons)
SynAF Syntactic Annotation Framework
Environment of development
Principles :
Coordinating small tools
LINGPIPE, TAG_UTILS, FOREST_UTILS, PARSERD, . . .
Use of XML intermediary formats
multiple views (XSL)
Grammars (TAGML), MetaGrammars, Morpho-Syntax (MAF),
Shared Forests (Derivations or Dependencies)
Visualization tools (web services)
◮ MAF (MAFD) : http://atoll.inria.fr/mafdemo
◮ Parsing (PARSERD) : http://atoll.inria.fr/parserdemo
◮ Grammar (FRMG) : http://atoll.inria.fr/perl/frmg/tree.pl
il a voulu en promettre une à Paul.
He has wanted to promise one of them to Paul
[Figure: dependency view of the parse of this sentence, with one node per word (il, a, voulu, en, promettre, une, à, Paul, .) and edges labelled subject, object, preparg, vmod, clg, cll, Infl, N2, PP, V, S, void]
French Parsing Evaluation Campaign EASy
Evaluation protocol : complex
(formalisms, deep vs shallow, ambiguous or not, . . . )
⇒ evaluation on shallow constituents and small set of dependencies
[Dec. 2004] Participation of FRMG & SXLFG (out of 14 parsers)
∼ 40K sentences, of which ∼ 4200 manually annotated, in several styles
Use of our deep parsers to produce non-ambiguous shallow information
⇒ robustness (full and partial parsing) ⇒ disambiguation heuristics +
conversion
EASy campaign tests not just parsing but a full processing chain
LEFFF, SXPIPE + parser + post-parsing
[Figure: EASy annotation of “Jean abite en outre au 1 , rue de la Pompe” (forms F1 to F11), grouped into chunks GN1, NV2, GR3 and GP4 and linked by subject, verb, compl. and modifier relations]
Note : Still waiting for definitive and complete results
EASy as a starting point
Using EASy expertise and resources to continue evaluations, get feedback,
compare FRMG & SXLFG (and acquire statistics)
Corpus distribution (#sentences distribution)
kind of corpus    share    #sentences
questions         5%       203
general           18%      755
oral_elda         12%      502
oral_delic        12%      522
litteraire        21%      881
medical           13%      554
mail              20%      852
[Figure: bar chart of scores (%) per kind of corpus ; legend : %precision, %recall, %fmeasure. Values listed per corpus : general 84.0 / 67.4 / 68.1 ; litteraire 81.9 / 76.5 / 76.0 ; mail 73.7 / 63.5 / 64.0 ; medical 80.5 / 70.9 / 71.1 ; oral_delic 72.2 / 63.4 / 62.7 ; oral_elda 77.0 / 70.8 / 71.5 ; questions 86.1 / 83.5 / 83.4 ; total 77.5 / 68.7 / 68.7]
⇒ NEW : results already used for a very accurate CFG-based chunker [Sagot]
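Precision, recall and f-measure over annotated relations can be computed as below. The sketch uses the textbook definitions (f-measure as the harmonic mean of precision and recall). The exact EASy scoring protocol may differ, and the relation triples are invented for the example.

```python
from typing import Iterable, Tuple

Relation = Tuple[str, str, str]   # (relation name, source, target), hypothetical


def prf(gold: Iterable[Relation],
        predicted: Iterable[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) of predicted against gold."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented reference and parser output for one sentence.
GOLD = {("subject", "Jean", "abite"),
        ("modifier", "en outre", "abite"),
        ("compl.", "au 1 , rue de la Pompe", "abite")}
PREDICTED = {("subject", "Jean", "abite"),
             ("modifier", "en outre", "abite")}

p, r, f = prf(GOLD, PREDICTED)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```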
BIOTIM : from books to knowledge bases
ACI “Datawarehouse” BIOTIM : Processing botanical descriptions (flora)
Corpus “Flore du Cameroun” (1963 – 2001)
Volumes    Pages    Av. Pages    Words    Taxons
31         9466     305          1.5M     ∼ 2400
Tasks :
Corpus preparation : spelling correction (OCR), logical structuring
Preliminary Linguistic Processing : morpho-syntactic processing
Terminology extraction & first experiments with “governor-governee” relationship
“Ontology” extraction : use of parsing to extract syntactic dependencies
+ Harris hypothesis : similar syntactic contexts hint at semantic similarities
⇒ lancéolé (adj) : leaf shape
Text Mining : getting the properties of each taxon
parsing + disambiguation through ontology
knowledge bases : Description Logics (RDF / OWL)
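The Harris hypothesis used in the “Ontology” extraction task can be made concrete with a small similarity computation. In this sketch each word is described by counts of the syntactic contexts in which it occurs, and two words are compared with the cosine of their count vectors. The counts are invented toy data, not figures from the Biotim corpus, and cosine is only one possible similarity measure.

```python
import math
from collections import Counter

# Toy context counts: (dependency relation, governor) -> frequency. Invented.
CONTEXTS = {
    "lancéolé": Counter({("modifies", "feuille"): 3, ("modifies", "limbe"): 2}),
    "ovale": Counter({("modifies", "feuille"): 2, ("modifies", "limbe"): 1,
                      ("modifies", "fruit"): 1}),
    "long": Counter({("modifies", "pétiole"): 4}),
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


sim_shape = cosine(CONTEXTS["lancéolé"], CONTEXTS["ovale"])
sim_other = cosine(CONTEXTS["lancéolé"], CONTEXTS["long"])
print(f"lancéolé/ovale: {sim_shape:.2f}  lancéolé/long: {sim_other:.2f}")
```

With these toy counts the two adjectives that share contexts come out as similar, while the one with disjoint contexts does not.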
Linguistic Resource Acquisition
Motivation : Bootstrap
Using resources in tools (parsers, taggers, . . . ) ⇒ Validation
error mining techniques (Van Noord, Clergerie)
Using tools to get resources from (raw) corpora
◮ Learning morphology and lemma (done for French & Slovak [Sagot])
◮ Learning syntactic information (sub-categorization, support verbs . . . )
◮ Learning semantic classes and selection restrictions
◮ Learning probabilities (for disambiguation)
◮ Generic idea : learning from contexts coming from dependencies
Reducing human cost ⇒ Free Linguistic Resources
Adapting resources to needs (evolution)
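The idea behind error mining, using parse failures to point at suspicious lexical entries, can be sketched with a simple ratio: for each form, the share of the sentences containing it that failed to parse. This is a deliberately simplified illustration with an invented corpus and a hypothetical function name. It is not the algorithm of the cited work.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def suspicion(sentences: Iterable[Tuple[List[str], bool]]) -> Dict[str, float]:
    """Map each form to failed_sentences_with_form / sentences_with_form."""
    seen: Counter = Counter()
    failed: Counter = Counter()
    for tokens, parsed_ok in sentences:
        for form in set(tokens):        # count each form once per sentence
            seen[form] += 1
            if not parsed_ok:
                failed[form] += 1
    return {form: failed[form] / seen[form] for form in seen}


# Invented corpus: (tokens, did the parser return a full parse?)
CORPUS = [
    (["Jean", "abite", "ici"], False),
    (["Jean", "habite", "ici"], True),
    (["il", "abite", "là"], False),
]

rates = suspicion(CORPUS)
for form, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{form}\t{rate:.2f}")
```

Forms that only ever appear in failing sentences rise to the top and are candidates for manual checking in the lexicon or the grammar.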
Actions : ARC RLT (2001 – 2002)
ARC “Linguistic resource acquisition and representation for TAGs”
Participants ATOLL (coordinator), Langue et Dialogue,
Calligramme, TALaNa (Univ. Paris 7)
Objectives
Semi-Automatic acquisition of a French TAG lexicon
XML representation for TAGs
Emergence of the notion of Meta-Grammars
[Figure: processing loop linking Corpus, Pre Parsing, Annotated corpus, Acquisition, Lexicon, Meta grammar, compilation, Grammar, Parser compilation, Parser, Tree bank, Supervised validation and Preferences]
Actions (cont’d)
Biotim ACI “Masse de Donnée” (2003 – 2006)
◮ Participants : IRD, LIFO (Orléans), IMEDIA, Vertigo (CNAM), INRA
◮ Objectives (ATOLL) : text mining of botanical corpora
EASy/EVALDA French Technolangue action for the evaluation of parsing systems (2003 – 2005)
◮ Participants : ELDA, LIMSI, LLF, ATILF, DIAM-STIM, DELIC, GREYC, L&D, LPL, Synapse Développement, Systal-Pertimm, Xerox, LIC2M, LATL, EPFL, FT R&D, Tagmatica, VALORIA, ERSS
◮ Objectives (ATOLL) : participation with FRMG & SXLFG
RNIL/Normalangue Technolangue action for the normalization of linguistic resources (2003 – 2005)
◮ Participants : L&D, AFNOR, ATILF, LLF Jussieu, IRIN, LIMSI, CLIPS, RESO, CEA, XRCE, EDF R&D, SYSTRAN, France Telecom R&D, Systems & Defense Electronics, SOFTISSIMO, SINEQUA, LUCID-ID, J-WAY
◮ Objectives (ATOLL) : project leader on MAF + head of French delegation for ISO TC37 SC4
Actions (cont’d)
LexSynt ILF-funded action (2005 – ? ? ?)
◮ Coordination : Sylvain Kahane (Modyco), Susanne Salmon-Alt (ATILF), Éric de la Clergerie (ATOLL project, INRIA)
◮ Participants : ATILF, ERSS, IGM, LPL, Lattice, MoDyCo, ATOLL, Calligramme, L&D, Signes, ATV (K.U. Leuven), OLST (Univ. of Montreal), Normalangue/RNIL
◮ Objectives : to design & exploit a reference syntactic-semantic lexicon for French.
GENI ARC on Generation and Inference (2002 – 2003)
◮ Participants : L&D (coord.), Orpailleur, TALaNa, IRIT
◮ Objectives (ATOLL) : TAG, generation & tabulation, lexical semantics
e-COTS : RNTL action (2001 – 2002, extended 2003) Bernard Lang
◮ Participants : Thomson-CSF, EDF and Bull
◮ Objectives : set up an open and cooperative Web portal to manage information about software components.
MOPROSCO ARC (2005 – 2006) Participation Areski Nait-Abdallah
INRIA Collaborations
Langue et Dialogue (LORIA) : TAG ; Meta-Grammars ; Linguistic
Infrastructure ; Normalization ; Lexica
ARC RLT and GENI ; Normalangue ; EASy ; LexSynt
Calligramme (LORIA) : MG ; ARC RLT ; LexSynt
Signes (Futurs, Bordeaux) : LexSynt
increased collaboration with the arrival at Bordeaux of Lionel Clément
IMEDIA (INRIA Rocquencourt) : Biotim
Orpailleur (LORIA) : text mining & knowledge extraction (ARC GENI)
French Collaborations
Lattice/TALaNa (University Paris 7) :
◮ TAG, MG (RLT & GENI), lexica (LexSynt), . . .
◮ people (Lionel Clément & Alexis Nasr)
◮ co-supervision of Sagot’s PhD with Laurence Danlos
◮ discussions towards a common structure
MoDyCo (University Paris 10 - Nanterre & CNRS) with Sylvain Kahane,
linguistic formalisms ; LexSynt
LIFO (Orléans) : NLP and Machine learning ; Biotim
BIODIVAL (IRD, Orléans) : Biotim
+ Potential collaborations with
◮ IGM (Marne-La-Vallée) [LADL tables, LexSynt],
◮ ERSS (Toulouse) [Acquisition on corpus],
◮ LIS (Paris 6) [Knowledge rep. and use ; Biotim+],
◮ ...
International Collaborations
COLE (La Coruña & Vigo, Spain) – Manuel Vilares Ferro
◮ TAGs ; Parsing techniques ; grammars ; use of DyALog ; information retrieval and extraction applications.
◮ French-Spanish “Programme d’Actions Intégrées” [PAI] PICASSO
⇒ visits (several-months student visits)
⇒ organization of 2 TAPD editions (Paris 1998, Vigo 2000)
◮ Co-supervised PhD of Miguel Alonso Pardo (on 2SA and TAGs)
⇒ many common papers
GLINT* (New Univ. of Lisbon) – Gabriel Pereira Lopes
◮ Use of DyALog and of machine learning techniques for NLP
◮ French-Portuguese programs RELING, ICTII and PAI PESSOA ⇒ visits
XTAG (Univ. of Pennsylvania) – Aravind Joshi
◮ TAG parsing and Meta-Grammars ; PhD student A. Kinyon
◮ Potential NSF–INRIA collaboration
RIADI (Univ. of La Manouba, Tunisia) – Mohamed Ben Ahmed
Preliminary contacts to set up a cooperation on the processing of Arabic
Visibility
(New) Invitation to deliver a course at the ESSLLI’06 summer school ;
invitation of Bernard Lang to many events.
Editorial board of French journal “Traitement Automatique des Langues”.
Guest Editor for the T.A.L. issue on “Evolutions in Parsing” (2003)
Program Committees of 11 national and int. conferences and workshops
9 national and int. PhD Juries, including 3 reviews
Standardization committee of ISO TC37 SC4 (head of French delegation)
Consulting and project reviews for actions Technolangue and ACI ;
Bernard Lang consulted by companies, administrations, and government.
Paper reviews for journals (T.A.L, JoLLI, TCS) and conferences/workshops
(TALN, EACL, IWPT, ACL, ICLP, PPDP, COLING, ESSLLI, IJCNLP, ICALP,
POPL, Formal Grammars, MOL)
Publications
ATOLL’s publications are available on line at
http://atoll.inria.fr/biblio
Journals : T.A.L., TCS, Document numérique
Conferences & Workshops : ACL, NACL, COLING, TALN, IWPT, TAG+, FG,
MOL, LACL, LREC, . . .
[Table: publication counts per year (02, 03, 04, 05) by type : PhD Thesis [Sagot], H.D.R [Clergerie], Journal, Conference proceedings, Book chapter, Book (edited), Technical report, Total. Counts listed : 1TBF, 1TBF, 1BL, 4, 2+1BL, 6, 1BL, 1, 5, 11, 1BL, 6, 2, 2+3S+4BL, 11+4P+2S+2BL, 1, 1, 10, 1, 29]
About the last 4 years (2002 – 2005)
Difficulties in recruitment (no CR since 1994)
but nevertheless, ATOLL has welcomed (temporary) brilliant members :
Lionel Clément, Benoît Sagot, Guillaume Rousse
Fulfilled most of the planned objectives
syntactic formalisms ; parsing techniques ; robustness ; lexica ; linguistic
infrastructure ; evaluation ; applications
Got involved in unexpected but natural actions : EASy and Normalangue
⇒ (EASy) fast development pace for a real scale processing chain
Large opening of our thematics
morpho-syntax, lexica, corpora, grammar design, . . .
now working at a much larger scale.
wide coverage grammars, large lexicon, processing of large corpora
Scientific evolutions (2006 – 2009)
Better syntactic formalisms and syntactic descriptions
MGs, MC-TAGs, Meta-RCGs, dependencies, constraints
⇒ ARC proposal MOSAÏQUE (coordinator Signes)
Increased parsing robustness and efficiency
◮ acquisition and exploitation of stochastic information (Nasr)
◮ algorithmics of n-best beam disambiguators on shared derivation forests
◮ guiding techniques & cascade-based parsing
◮ error correction techniques
◮ integration of emerging techniques (e.g., HPSG’s quick check filtering)
Guiding + Tabular techniques + Stochastic methods + Evaluation
⇒ [Ambition] marry deep and shallow parsing, ambiguous or not,
to get efficient and accurate parsing systems
Scientific evolutions (cont’d)
Linguistic knowledge acquisition & lexica
◮ use of EASy references and Paris 7 treebank
◮ acquisition on raw corpora
◮ acquisition and exploitation of lexical semantic info.
Information extraction applications (Biotim follow-ups)
possibly with question answering and (multilingual) generation
Opening to multilingualism (Arabic ?)
application of our tools and acquisition methodologies for new languages
Organizational aspects
Real need to renew the composition of ATOLL
planned departures ⇒ impossible to continue without new member(s)
◮ ? maintenance & development of tools & resources
◮ ? conduct large scale experimentations & exploit results
◮ ? covering of enough computational linguistic sub-fields
◮ ? following collaborations and actions on ATOLL’s side
Discussions with TALaNa towards a common structure
Reinforcing collaborations through INRIA Action d’Envergure for NLP ?
◮ exchanging resources, tools & expertise ;
◮ sharing dev. effort ;
◮ better covering of NLP sub-fields
Thank you !