ATOLL - Software Tools for Natural Language Processing
ATOLL
Software Tools for Natural Language Processing

Éric de la Clergerie
[email protected]
http://atoll.inria.fr

Evaluation Seminar SYM C: Management and processing of language and data
Dourdan, November 15-16th 2005
INRIA

INRIA, É. de la Clergerie, ATOLL, SymC 2005/11/15

Outline
  1 Generalities
  2 Thematics & Contributions
  3 Applications
  4 Actions
  5 Collaborations
  6 Conclusions

ATelier d’Outils Logiciels pour le Langage naturel (workshop of software tools for natural language)
  Creation: 1997 – Computer Science NLP
  ATOLL objectives: to develop tools and techniques, theoretical or applied, that help to access, process and use documents in natural language.
  INRIA scientific challenges: to design new applications using the Web and multimedia databases
  Keywords: Computational Linguistics; Natural Language Processing (NLP); Linguistic Engineering; Parsing; Syntactic Formalisms; Linguistic resources

ATOLL’s composition (2002 – 2005)
  Scientific leader
    Éric de la Clergerie (CR)
  Permanents
    Bernard Lang (DR)
    Pierre Boullier (DR)
    Philippe Deschamp (CR)
    François Thomasset (DR)
  Exteriors & Temporaries
    Areski Nait Abdallah (Pr, Univ. Brest)
    Alexis Nasr (Prof., Del. Paris 7)
    François Barthélemy (MdC, CNAM)
    Lionel Clément (PostDoc RLT, Ing. RNIL)
    Guillaume Rousse (Ing. Biotim)
    Stéphane Laurière (Ing. e-COTS)
  PhD
    Benoît Sagot

Parsing ?
  Parsing: identifying the relationships between words (and groups of words)
  Grammar: packed sets of relationships + combination rules

  Tree Adjoining Grammars [TAG] (tree diagrams, given here in bracket notation)
    Substitution: the tree NP(John) fills the site NP↓ of S(NP↓, VP(V(sleeps)))
      ⇒ S(NP(John), VP(V(sleeps)))
    Adjunction: the auxiliary tree V(⋆V, Adv(a lot)) is adjoined at the V node
      ⇒ S(NP(John), VP(V(V(sleeps), Adv(a lot))))

  Problems:
    No consensus on the best linguistic formalism
    Capturing all syntactic constructions
    Capturing word usages (lexicon & statistics)
    Handling ambiguities

ATOLL: Thematics
  Formal Language Theory
    Formalisms, Automata & Tabulation
  (Open) Tools
    Parser compilers: DyALog, SYNTAX, RCG
    Pre-parsing: SxPipe
    Ling. dev. environment: MGCOMP, MGTOOLS, TAG_UTILS, FOREST_UTILS, MRCG2RCG, ...
  (Open) Ling. Resources
    Lexicon: Lefff
    Grammar: SXLFG
    MetaGrammar: FRMG
  Applications
    Evaluation: EASy
    Info. extraction: Biotim
    Ling. knowledge acquisition
  Normalization
    Normalangue
    ISO TC37 SC4
  Corpora
  Grid techniques
  Free Software: Bernard Lang
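The two TAG operations shown on the "Parsing ?" slide, substitution and adjunction, can be sketched in a few lines of Python. This is only an illustration of the tree operations on the John / sleeps / a lot example; the `Node` class and the helper names are assumptions made for this sketch and are not taken from DyALog or any other ATOLL tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # substitution site, written NP↓ on the slide
    foot: bool = False   # foot node of an auxiliary tree, written ⋆V


def substitute(tree: Node, initial: Node) -> bool:
    """Plug `initial` into the first substitution site with the same label."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(aux: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, or None if there is none."""
    if aux.foot:
        return aux
    for child in aux.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin `aux` at the first non-root inner node labelled like its root.

    The subtree found at the adjunction site is moved under the foot node.
    """
    for i, child in enumerate(tree.children):
        if child.children and child.label == aux.label:
            foot = find_foot(aux)
            foot.children, foot.foot = child.children, False
            tree.children[i] = aux
            return True
        if adjoin(child, aux):
            return True
    return False


def words(tree: Node) -> List[str]:
    """Read the words off the leaves, skipping unfilled sites."""
    if not tree.children:
        return [] if (tree.subst or tree.foot) else [tree.label]
    return [w for child in tree.children for w in words(child)]


# Elementary trees of the slide example.
sleeps = Node("S", [Node("NP", subst=True),
                    Node("VP", [Node("V", [Node("sleeps")])])])
john = Node("NP", [Node("John")])
a_lot = Node("V", [Node("V", foot=True), Node("Adv", [Node("a lot")])])

substitute(sleeps, john)
print(" ".join(words(sleeps)))  # John sleeps
adjoin(sleeps, a_lot)
print(" ".join(words(sleeps)))  # John sleeps a lot
```

The sketch mutates trees in place and does not handle adjunction at the root or adjunction constraints, which a real TAG implementation would need.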
ATOLL’s positioning
  Balance between theory, development and experimentation
  NLP requires many tools and resources, but linguistic resources (for French) are difficult to access and exploit
    - ⇒ dev. effort + investigation of methodologies to speed up dev.
    - favor reuse and distribution ⇒ normalization & open source
      Software Eng. practices: versioning + packaging + on-line catalog
    - favor the emergence of resources ⇒ LexSynt action, collaborations
  Search for comprehension of the mechanisms of language, but also see language as a cultural artifact:
    - collaboration with linguists & use of linguistic theories (and formalisms)
    - exploitation of corpora to capture language usage
  NLP is an experimental field
    - need to play at real scale: large coverage grammars, large lexica, real documents, large corpora
    - need feedback: evaluation, statistics
    - real scale applications

Syntactic formalisms
  Exploration of a wide range of syntactic formalisms
  [Figure: formalisms placed along two axes, derivation complexity (combining structures) and vocabulary complexity (unification): CFG, DCG, LFG, HPSG, λ-Prolog, Datalog, LIG, TAG, Feature TAG, MCS, LCFRS, RCG [Boullier], Meta-RCG [Sagot]]
  Meta-Grammars: abstract level of grammar description based on hierarchies of classes grouping constraints and requiring/providing functionalities.
    ⇒ [MG compilation] Generation of target grammars (TAGs, LFGs)

Linguistic Resources: (Meta-)Grammars
  No easily available wide coverage French grammars ⇒ development of grammars
  SXLFG: wide coverage French LFG grammar [Clément, Sagot, Boullier]
    very efficient exploitation with SYNTAX
  FRMG: wide coverage French Meta-Grammar [Clergerie]
    MG + factorization operators ⇒ generates a very compact TAG grammar
    126 trees with only 27 verb-anchored trees (compared with the usual 2K to 6K trees)
  Development environment for grammars
    Editing
    Visualization
    Statistics (coverage, time, ambiguity) on test suites (EUROTRA & TSNLP)

Linguistic Resources: Lexicon
  Development of the French lexicon Lefff [Clément, Sagot]
  Over 400 000 forms, with the following lemma distribution:

    verbs   common nouns   proper nouns   adj     adv
    6788    37183          52938          10024   2127

  Verb morphology automatically learned on corpus (+ manual checking)
  Syntactic information on verbs (subcategorization, control, ...)
    promet (promises)
      v [pred='promettre_1<subj|ssubj|vsubj,(obj|scomp),(à-obj)>',cat=v,@SCompInd,@P3s]
      v [...]
  Multiple inheritance-based architecture (∼ MG)
    @promettre { < @verbe_ditransitif_à_svp < objet_phrastique_possible < complétive_indicatif | ... }
  Still incomplete and with errors, but corpus parsing is used to track errors (error mining)

Parsing techniques
  Parsing remains an algorithmic challenge:
  Ambiguity handling & representation (shared forests)
    Push-Down Automata; Dynamic Programming techniques

    Formalisms                 Automata           Tabulation     Notes
    CFG                        PDA                O(n^3)         Lang
    DCG / Logic Programming    Logical PDA        completeness   Lang & Clergerie
    TAG / LIG                  2-Stack Automata   O(n^6)         Clergerie & Pardo
    MC-TAG/osRCG/MCS           Thread Automata    O(n^k)         Clergerie

  Guiding techniques (e.g., supertagging [Boullier] & chunking [Sagot])
    multi-pass parsing where the shared forest of a level guides the next level
  Algorithms on shared forests (e.g., disambiguation)
  Robustness (e.g., partial parsing, error correction techniques)
    handling “ill formed” sentences, unknown words and constructions
  Scaling issues (wide coverage grammars, large lexica)
    grammar factorization (FRMG) & many algorithmic issues
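To make the O(n^3) tabulation bound quoted for CFGs in the table above more concrete, here is a minimal CKY recognizer. Note the swap: CKY is a textbook tabular algorithm, not the push-down automata tabulation developed at ATOLL; it is used here only because it shows the same cubic behaviour (three nested loops over positions). The toy grammar, in Chomsky normal form, and the token `a_lot` are assumptions made for the example.

```python
from itertools import product


def cky_recognize(words, lexicon, binary, start="S"):
    """Return True if `words` is derivable from `start`.

    lexicon: word -> set of nonterminals (rules A -> word)
    binary:  (B, C) -> set of nonterminals A (rules A -> B C)
    """
    n = len(words)
    # chart[i][j] holds every nonterminal deriving words[i:j]; each cell is
    # computed once and reused, which is the point of tabulation.
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for width in range(2, n + 1):          # span length
        for i in range(n - width + 1):     # span start
            j = i + width
            for k in range(i + 1, j):      # split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]


# Toy grammar (hypothetical): S -> NP VP, VP -> V Adv, VP -> sleeps, ...
lexicon = {"John": {"NP"}, "sleeps": {"V", "VP"}, "a_lot": {"Adv"}}
binary = {("NP", "VP"): {"S"}, ("V", "Adv"): {"VP"}}

print(cky_recognize(["John", "sleeps"], lexicon, binary))
print(cky_recognize(["John", "sleeps", "a_lot"], lexicon, binary))
print(cky_recognize(["sleeps", "John"], lexicon, binary))
```

A recognizer only answers yes or no; keeping back-pointers in each cell instead of bare nonterminals would give a packed representation close in spirit to the shared forests mentioned on the slide.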
DyALog - Exploring unification-based grammars
  An environment (compiler dyacc + abstract machine):
  For compiling tabular parsers based on stack automata & dynamic programming
    ⇒ computation sharing & loop detection
    ⇒ extraction of shared forests
  Also a logic programming environment
    ⇒ power of logic + possibility of escaping within grammars
  Strongly NLP oriented
    - Ease grammar design: TFS, finite domains, ...
    - Multiple grammatical formalisms: DCG, BMG, TAG & TIG, RCG
    - Functionalities and customization of parsers: multiple parsing strategies, forests, word lattice, lexicalization, robustness
  Used for
    - a robust Portuguese grammar (bidir. head-driven DCG+BMG) [GLINT]
      ⇒ dev. of a similar Spanish grammar [COLE]
    - MG compiler MGCOMP
    - French wide coverage TIG/TAG grammar FRMG [ATOLL]
    - French TAG grammar with semantic interface [LORIA]

Syntax - From CFGs to LFGs and RCGs
  SYNTAX [Boullier]
    Originally developed to parse programming languages with CFG (+ attributes)
    Extended with tabular techniques to cover all CFGs
    Recently extended for Lexical Functional Grammars
      2 passes: CFG + computation of decorations on shared parse forests + disambiguation phase
      ⇒ used for the large coverage LFG grammar SXLFG [Clément, Sagot, Boullier]
    Extremely efficient on CFGs and on LFGs
      + 300K-sentence journalistic corpus (Monde Diplomatique): 3/4 of sentences in < 0.1 s
  RCG [Boullier]
    Derived from the technology behind SYNTAX
    Very powerful and also efficient

Linguistic infrastructure: SxPipe
  French morpho-syntactic chain SXPIPE [Sagot, Boullier, ...]:
    Word and sentence segmentations, including multi-words
    Named entities (Proper Nouns, Dates, Addresses, URL, :-), ...)
    Spelling corrections
    Returns a word lattice (DAG) as input for parsing
  [Figure: word lattice over states 0 to 7; edges are labelled with plain forms (Jean, en, outre) or with {source tokens} followed by a normalized form: {Jean} jean, {abite} habite, {au} à, {au} le, {en outre} en_outre, {1 , rue de la Pompe} _ADRESSE]

Normalization
  Motivation:
    Reusability of resources
    Interoperability of tools in processing chains
  Participation (with Laurent Romary and Langue & Dialogue [LORIA]) in
    INRIA consortium SYNTAX
    French Normalangue/RNIL action
    ISO TC 37 SC4 on “Linguistic Resource Management”
      ⇒ head of French delegation to ISO meetings
    Partially involved in the new European project LIRICS
  Participation in emerging standards:
    MAF    Morphosyntactic Annotation Framework (project leader)
    FSR    Feature Structure Representation (+ future companion FSD)
    DCR    Data Category Registry (registering ling. terminology)
    LMF    Lexical Markup Framework (lexicons)
    SynAF  Syntactic Annotation Framework

Environment of development
  Principles:
    Coordinating small tools: LINGPIPE, TAG_UTILS, FOREST_UTILS, PARSERD, ...
    Use of XML intermediary formats, with multiple views (XSL):
      Grammars (TAGML), MetaGrammars, Morpho-Syntax (MAF), Shared Forests (Derivations or Dependencies)
    Visualization tools (web services)
      - MAF (MAFD): http://atoll.inria.fr/mafdemo
      - Parsing (PARSERD): http://atoll.inria.fr/parserdemo
      - Grammar (FRMG): http://atoll.inria.fr/perl/frmg/tree.pl
  Example: “il a voulu en promettre une à Paul.” (He has wanted to promise one of them to Paul)
  [Figure: dependency graph produced for this sentence, with edges such as subject, object, preparg, clg, cll and vmod, and nodes such as vouloir:v, promettre:v, Paul:np, à:prep]
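The SxPipe slide above says that the chain returns a word lattice (a DAG) instead of a single token sequence. The sketch below shows what that means for a downstream parser: each path through the lattice is one possible reading of the input. The edge list is a simplified, hypothetical version of the lattice in the figure (the exact state numbering of the original cannot be recovered from the transcript), and `lattice_paths` is a name invented for this example, not part of SxPipe.

```python
from collections import defaultdict


def lattice_paths(edges, start, end):
    """Enumerate every token sequence leading from `start` to `end`.

    edges: iterable of (source_state, target_state, token) triples forming
    a directed acyclic graph.
    """
    outgoing = defaultdict(list)
    for src, dst, token in edges:
        outgoing[src].append((dst, token))

    def walk(state):
        if state == end:
            yield []
            return
        for dst, token in outgoing[state]:
            for rest in walk(dst):
                yield [token] + rest

    return list(walk(start))


# Hypothetical lattice for "Jean abite en outre au 1, rue de la Pompe":
# the spelling error is corrected, "en outre" is kept both as two words
# and as one multi-word unit, "au" is split, the address is one entity.
edges = [
    (0, 1, "Jean"), (0, 1, "jean"),
    (1, 2, "habite"),
    (2, 3, "en"), (3, 4, "outre"),
    (2, 4, "en_outre"),
    (4, 5, "à"), (5, 6, "le"),
    (6, 7, "_ADRESSE"),
]

for path in lattice_paths(edges, 0, 7):
    print(" ".join(path))
```

Enumerating paths is exponential in general, which is why a tabular parser works on the lattice directly rather than on the list of paths; the enumeration here is only meant to show the ambiguity the lattice encodes.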
French Parsing Evaluation Campaign EASy
  Evaluation protocol: complex (formalisms, deep vs shallow, ambiguous or not, ...)
    ⇒ evaluation on shallow constituents and a small set of dependencies [Dec. 2004]
  Participation of FRMG & SXLFG (out of 14 parsers)
    ∼ 40K sentences, with ∼ 4200 manually annotated, several styles
  Use of our deep parsers to produce non-ambiguous shallow information
    ⇒ robustness (full and partial parsing)
    ⇒ disambiguation heuristics + conversion
  The EASy campaign tests not just parsing but a full processing chain:
    Lefff, SxPipe + parser + post-parsing
  [Figure: EASy annotation of “Jean abite en outre au 1, rue de la Pompe” (forms F1 to F11): chunks GN1 “Jean”, NV2 “abite”, GR3 “en outre”, GP4 “au 1, rue de la Pompe”; dependencies subject (GN1, verb NV2), modifier (GR3, verb NV2), compl. (GP4, verb NV2)]
  Note: still waiting for definitive and complete results

EASy as a starting point
  Using EASy expertise and resources to continue evaluations, get feedback, compare FRMG & SXLFG (and acquire statistics)
  [Charts: corpus distribution (#sentences per kind of corpus) and %precision / %recall / %f-measure per kind of corpus; values as read from the charts]

    Kind of corpus   #sentences (share)   %precision   %recall   %f-measure
    general          755 (18%)            84.0         67.4      68.1
    litteraire       881 (21%)            81.9         76.5      76.0
    mail             852 (20%)            73.7         63.5      64.0
    medical          554 (13%)            80.5         70.9      71.1
    oral_delic       522 (12%)            72.2         63.4      62.7
    oral_elda        502 (12%)            77.0         70.8      71.5
    questions        203 (5%)             86.1         83.5      83.4
    total                                 77.5         68.7      68.7

  ⇒ NEW: results already used for a very accurate CFG-based chunker [Sagot]

BIOTIM: from books to knowledge bases
  ACI “Datawarehouse” BIOTIM: processing botanical descriptions (flora)
  Corpus “Flore du Cameroun” (1963 – 2001)

    Volumes   Pages   Av. Pages   Words   Taxons
    31        9466    305         1.5M    ∼ 2400

  Tasks:
    Corpus preparation: spelling correction (OCR), logical structuring
    Preliminary linguistic processing: morpho-syntactic processing
    Terminology extraction & first experiments with the “governor-governee” relationship
    “Ontology” extraction: use of parsing to extract syntactic dependencies
      + Harris hypothesis: similar syntactic contexts hint at semantic similarities
      ⇒ lancéolé (adj): leaf shape
    Text mining: getting the properties of each taxon
      parsing + disambiguation through ontology
      knowledge bases: Description Logics (RDF / OWL)

Linguistic Resource Acquisition
  Motivation: bootstrap
    Using resources in tools (parsers, taggers, ...)
      ⇒ validation: error mining techniques (Van Noord, Clergerie)
    Using tools to get resources from (raw) corpora
      - Learning morphology and lemmas (done for French & Slovak [Sagot])
      - Learning syntactic information (sub-categorization, support verbs, ...)
      - Learning semantic classes and selection restrictions
      - Learning probabilities (for disambiguation)
      - Generic idea: learning from contexts coming from dependencies
  Reducing human cost ⇒ free linguistic resources
  Adapting resources to needs (evolution)

Actions: ARC RLT (2001 – 2002)
  ARC “Linguistic resource acquisition and representation for TAGs”
  Participants: ATOLL (coordinator), Langue et Dialogue, Calligramme, TALaNa (Univ. Paris 7)
  Objectives:
    Semi-automatic acquisition of a French TAG lexicon
    XML representation for TAGs
    Emergence of the notion of Meta-Grammars
  [Figure: processing diagram with the boxes Corpus, Annotated corpus, Pre-parsing, Meta-grammar, Lexicon, compilation, Supervised validation, Preferences, Grammar, Acquisition, Parser compilation, Tree bank]
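The Harris hypothesis used on the BIOTIM slide ("similar syntactic contexts hint at semantic similarities") can be illustrated with a small distributional-similarity sketch: each word is described by the dependency contexts in which it occurs, and two words are compared with the cosine of their context vectors. The triples below are invented for the example (they are not taken from the "Flore du Cameroun" corpus), and cosine similarity is one common choice of measure, not necessarily the one used in BIOTIM.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(triples):
    """Map each dependent word to a bag of (relation, governor) contexts."""
    vectors = defaultdict(Counter)
    for governor, relation, dependent in triples:
        vectors[dependent][(relation, governor)] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical (governor, relation, dependent) triples, as a parser might
# extract them from botanical descriptions.
triples = [
    ("feuille", "mod", "lancéolé"),
    ("limbe", "mod", "lancéolé"),
    ("feuille", "mod", "ovale"),
    ("limbe", "mod", "ovale"),
    ("fleur", "mod", "jaune"),
    ("pétale", "mod", "jaune"),
]

vectors = context_vectors(triples)
print(cosine(vectors["lancéolé"], vectors["ovale"]))
print(cosine(vectors["lancéolé"], vectors["jaune"]))
```

In this toy data "lancéolé" and "ovale" share all their contexts and come out as near neighbours, which is the kind of evidence that would group them under a class such as leaf shape, while "jaune" shares none.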
Actions (cont’d)
  Biotim: ACI “Masse de Donnée” (2003 – 2006)
    - Participants: IRD, LIFO (Orléans), IMEDIA, Vertigo (CNAM), INRA
    - Objectives (ATOLL): text mining of botanical corpora
  EASy/EVALDA: French Technolangue action for the evaluation of parsing systems (2003 – 2005)
    - Participants: ELDA, LIMSI, LLF, ATILF, DIAM-STIM, DELIC, GREYC, L&D, LPL, Synapse Développement, Systal-Pertimm, Xerox, LIC2M, LATL, EPFL, FT R&D, Tagmatica, VALORIA, ERSS
    - Objectives (ATOLL): participation with FRMG & SXLFG
  RNIL/Normalangue: Technolangue action for the normalization of linguistic resources (2003 – 2005)
    - Participants: L&D, AFNOR, ATILF, LLF Jussieu, IRIN, LIMSI, CLIPS, RESO, CEA, XRCE, EDF R&D, SYSTRAN, France Telecom R&D, Systems & Defense Electronics, SOFTISSIMO, SINEQUA, LUCID-ID, J-WAY
    - Objectives (ATOLL): project leader on MAF + head of French delegation for ISO TC37 SC4

Actions (cont’d)
  LexSynt: ILF-funded action (2005 – ???)
    - Coordination: Sylvain Kahane (Modyco), Susanne Salmon-Alt (ATILF), Éric de la Clergerie (Projet ATOLL, INRIA)
    - Participants: ATILF, ERSS, IGM, LPL, Lattice, MoDyCo, ATOLL, Calligramme, L&D, Signes, ATV (K.U. Leuven), OLST (Univ. of Montreal), Normalangue/RNIL
    - Objectives: to design & exploit a reference syntactic-semantic lexicon for French.
  GENI: ARC on Generation and Inference (2002 – 2003)
    - Participants: L&D (coord.), Orpailleur, TALaNa, IRIT
    - Objectives (ATOLL): TAG, generation & tabulation, lexical semantics
  e-COTS: RNTL action (2001 – 2002, extended 2003), Bernard Lang
    - Participants: Thomson-CSF, EDF and Bull
    - Objectives: set up an open and cooperative Web portal to manage information about software components.
  MOPROSCO: ARC (2005 – 2006)
    Participation: Areski Nait-Abdallah
INRIA Collaborations
  Langue et Dialogue (LORIA): TAG; Meta-Grammars; Linguistic Infrastructure; Normalization; Lexica
    ARC RLT and GENI; Normalangue; EASy; LexSynt
  Calligramme (LORIA): MG; ARC RLT; LexSynt
  Signes (Futurs, Bordeaux): LexSynt
    increased collaboration with the arrival of Lionel Clément in Bordeaux
  IMEDIA (INRIA Rocquencourt): Biotim
  Orpailleur (LORIA): text mining & knowledge extraction (ARC GENI)

French Collaborations
  Lattice/TALaNa (University Paris 7):
    - TAG, MG (RLT & GENI), lexica (LexSynt), ...
    - people (Lionel Clément & Alexis Nasr)
    - co-supervision of Sagot’s PhD with Laurence Danlos
    - discussions towards a common structure
  MoDyCo (University Paris 10 - Nanterre & CNRS), with Sylvain Kahane: linguistic formalisms; LexSynt
  LIFO (Orléans): NLP and machine learning; Biotim
  BIODIVAL (IRD, Orléans): Biotim
  + Potential collaborations with
    - IGM (Marne-La-Vallée) [LADL tables, LexSynt],
    - ERSS (Toulouse) [Acquisition on corpus],
    - LIS (Paris 6) [Knowledge rep. and use; Biotim+],
    - ...

International Collaborations
  COLE (La Coruña & Vigo, Spain) – Manuel Vilares Ferro
    - TAGs; parsing techniques; grammars; use of DyALog; information retrieval and extraction applications.
    - French-Spanish “Programme d’Actions Intégrées” [PAI] PICASSO
      ⇒ visits (several-months student visits)
      ⇒ organization of 2 TAPD editions (Paris 1998, Vigo 2000)
    - Co-supervised PhD of Miguel Alonso Pardo (on 2SA and TAGs) ⇒ many common papers
  GLINT* (New Univ. of Lisbon) – Gabriel Pereira Lopes
    - Use of DyALog and machine learning techniques for NLP
    - French-Portuguese programs RELING, ICTII and PAI PESSOA ⇒ visits
  XTAG (Univ. of Pennsylvania) – Aravind Joshi
    - TAG parsing and Meta-Grammars; PhD student A. Kinyon
    - Potential NSF–INRIA collaboration
  RIADI (Univ. of La Manouba, Tunisia) – Mohamed Ben Ahmed
    Preliminary contacts to set up a cooperation on the processing of Arabic

Visibility
  (New) Invitation to deliver a course at the ESSLLI’06 summer school; invitation of Bernard Lang to many events.
  Editorial board of the French journal “Traitement Automatique des Langues”.
  Guest editor for the T.A.L. issue on “Evolutions in Parsing” (2003)
  Program committees of 11 national and int. conferences and workshops
  9 national and int. PhD juries, including 3 reviews
  Standardization committee of ISO TC37 SC4 (head of French delegation)
  Consulting and project reviews for the Technolangue and ACI actions; Bernard Lang consulted by companies, administrations, and government.
  Paper reviews for journals (T.A.L, JoLLI, TCS) and conferences/workshops (TALN, EACL, IWPT, ACL, ICLP, PPDP, COLING, ESSLLI, IJCNLP, ICALP, POPL, Formal Grammars, MOL)

Publications
  ATOLL’s publications are available on line at http://atoll.inria.fr/biblio
  Journals: T.A.L., TCS, Document numérique
  Conferences & Workshops: ACL, NACL, COLING, TALN, IWPT, TAG+, FG, MOL, LACL, LREC, ...
  [Table: publication counts for 02, 03, 04 and 05, with rows PhD Thesis [Sagot], H.D.R [Clergerie], Journal, Conference proceedings, Book chapter, Book (edited), Technical report, Total. The cell layout was lost in extraction; values in extracted order: 1TBF 1TBF 1BL 4 2+1BL 6 1BL 1 5 11 1BL 6 2 2+3S+4BL 11+4P+2S+2BL 1 1 10 1 29]
About the last 4 years (2002 – 2005)
  Difficulties in recruitment (no CR since 1994)
    but nevertheless, ATOLL has welcomed brilliant (temporary) members: Lionel Clément, Benoît Sagot, Guillaume Rousse
  Fulfilled most of the planned objectives
    syntactic formalisms; parsing techniques; robustness; lexica; linguistic infrastructure; evaluation; applications
  Got involved in unexpected but natural actions: EASy and Normalangue
    ⇒ (EASy) fast development pace for a real scale processing chain
  Large opening of our thematics
    morpho-syntax, lexica, corpora, grammar design, ...
  Now working at a much larger scale
    wide coverage grammars, large lexicon, processing of large corpora

Scientific evolutions (2006 – 2009)
  Better syntactic formalisms and syntactic descriptions
    MGs, MC-TAGs, Meta-RCGs, dependencies, constraints
    ⇒ ARC proposal MOSAÏQUE (coordinator Signes)
  Increased parsing robustness and efficiency
    - acquisition and exploitation of stochastic information (Nasr)
    - algorithmics of n-best beam disambiguators on shared derivation forests
    - guiding techniques & cascade-based parsing
    - error correction techniques
    - integration of emerging techniques (e.g., HPSG’s quick check filtering)
  Guiding + Tabular techniques + Stochastic methods + Evaluation
    ⇒ [Ambition] marry deep and shallow parsing, ambiguous or not, to get efficient and accurate parsing systems

Scientific evolutions (cont’d)
  Linguistic knowledge acquisition & lexica
    - use of EASy references and the Paris 7 treebank
    - acquisition on raw corpora
    - acquisition and exploitation of lexical semantic info.
  Information extraction applications (Biotim follow-ups)
    possibly with question answering and (multilingual) generation
  Opening to multilingualism (Arabic?)
    application of our tools and acquisition methodologies to new languages
Organizational aspects
  Real need to renew the composition of ATOLL
    planned departures ⇒ impossible to continue without new member(s)
    - maintenance & development of tools & resources?
    - conducting large scale experiments & exploiting results?
    - covering enough computational linguistics sub-fields?
    - following collaborations and actions on ATOLL’s side?
  Discussions with TALaNa towards a common structure
  Reinforcing collaborations through an INRIA Action d’Envergure for NLP?
    - exchanging resources, tools & expertise;
    - sharing dev. effort;
    - better covering of NLP sub-fields

Thank you!