The CHAOS Project: Theory and Practice


1 The CHAOS Project: Theory and Practice
Fabio Massimo Zanzotto, Department of Computer Science, Systems and Production, University of Roma “Tor Vergata”

2 People
Investigators: Roberto Basili, Fabio Massimo Zanzotto, Maria Teresa Pazienza
Former contributors: Daniele Pighin, Daniele Previtali, Alessandro Bahgat, Marco Pennacchiotti, Massimo Di Nanni, Michele Vindigni, Luigi Mazzucchelli, Paola Velardi, Paolo Zirilli, Alessandro Cucchiarelli, Alessandro Marziali, Fabrizio Grisoli, Gianluca De Rossi

3 Outline
Theory: Customizable parsing architectures
- XDG: eXtended Dependency Graph
- Task-oriented parsing design
Practice: System Implementation and Use
- A component-based approach
- An object-oriented platform
- Linguistic data
- Processing modules
- How to use the parser in an application
Demo!!!

4 Theory: Customizable parsing architectures

5 Motivation
The Chaos Project unofficially began in ’96
… on the long tradition of ARIOSTO (Basili, Pazienza, the University of Rome “Tor Vergata” (RTV)
Aim: building robust parsers for Italian and for English
- that use verb sub-categorization (syntactic) lexicons induced from corpora
- that can be used in applications
Constraints: use the long RTV “Social” background
- Microtheories for microphenomena
- Language analysis can be reduced to a cascade of modules (e.g., FSA)
- Application-oriented language analysis (e.g., IE)
- Robust (formerly, shallow) parsing approaches

6 Motivation
contribute-NP-PP(to)   value-NP-PP(at)   Inf(S1)   Inf(S2)
[Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American]

7 Motivation
Different NLP applications have different performance constraints in terms of:
- Accuracy
- Throughput
Customizable parsing architectures are reusable in different application scenarios if:
- the architectural design supports performance control

8 Customizable parsing architectures
Modularization
- clarifies the interdependency between different kinds of syntactic information (grammatical/lexicalized)
- makes it possible to control
  - throughput, via eliciting modules
  - quality, via a clear relation between modules (prerequisites/contributions)

9 Modular approach
Syntactic parser: $SP(S, K) = I \rightarrow SP(S) = I$
Syntactic parsing module: $P_i(S_i, K_i) = S_{i+1} \rightarrow P_i(S_i) = S_{i+1}$
Modular syntactic parser: $SP = P_n \circ \dots \circ P_2 \circ P_1$
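
The composition on this slide can be illustrated with a minimal sketch in which every module is a function from one sentence representation to the next. Everything named here (`ModularParserSketch`, `compose`, `layer`, and the use of a list of strings as the representation S_i) is an assumption made for illustration and is not taken from the CHAOS code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the CHAOS API. A sentence representation S_i is
// modelled as a list of annotation strings; a module P_i maps S_i to S_{i+1}.
class ModularParserSketch {

    // Builds SP as the composition of P_1 ... P_n. Modules are applied in
    // list order, so the first element of the list plays the role of P_1.
    static Function<List<String>, List<String>> compose(
            List<Function<List<String>, List<String>>> modules) {
        Function<List<String>, List<String>> parser = Function.identity();
        for (Function<List<String>, List<String>> module : modules) {
            parser = parser.andThen(module);
        }
        return parser;
    }

    // Toy module factory: the returned module copies its input and appends
    // one annotation layer, leaving the input list untouched.
    static Function<List<String>, List<String>> layer(String name) {
        return input -> {
            List<String> output = new ArrayList<>(input);
            output.add(name);
            return output;
        };
    }

    public static void main(String[] args) {
        Function<List<String>, List<String>> sp = compose(Arrays.asList(
                layer("tokens"), layer("pos"), layer("chunks")));
        System.out.println(sp.apply(new ArrayList<String>()));
    }
}
```

Because the knowledge K_i is hidden inside each module in this sketch, dropping or reordering entries of the list is enough to obtain a different parser, which is the kind of customization the slide argues for.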

10 Modular approach
To push a modular approach we need:
- a suitable annotation scheme
- a classification of the processing modules

11 A suitable annotation scheme
Requirements:
- Modularization: a stable representation of partially analyzed structures
- Lexicalization: a clear representation of the (semantic) head of a given structure, able to activate the lexicalized rules

12 XDG: Extended Dependency Graph
XDG combines constituency-based and dependency-based formalisms
$XDG_{\Gamma\Delta} = (C, D)$
$C = \{(c, t, h) \mid c \subseteq S,\ t \in \Gamma,\ h \in c\}$
$D = \{(c_1, c_2, t) \mid c_1, c_2 \in C,\ t \in \Delta\}$
Nice property: it can store persistent ambiguity (for interpretations projected by the same nodes)

13 XDG: Extended Dependency Graph
C are constituents, each with
- a syntactic head
- a potential semantic governor
D are dependencies among constituents
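
As a rough illustration of the pair (C, D), the sketch below stores constituents (c, t, h) and typed dependencies (c1, c2, t) in two lists. All class and field names are hypothetical, spans are encoded as token offsets, and treating c1 as the governor of c2 is an assumption the slides do not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of XDG = (C, D); names are assumptions, not CHAOS classes.
class XdgSketch {

    // A constituent (c, t, h): a token span c, a type t, a head position h in c.
    static final class Constituent {
        final int start, end;   // token span [start, end) within the sentence
        final String type;
        final int head;         // position of the head token, start <= head < end

        Constituent(int start, int end, String type, int head) {
            if (head < start || head >= end) {
                throw new IllegalArgumentException("head must lie inside the span");
            }
            this.start = start; this.end = end; this.type = type; this.head = head;
        }
    }

    // A dependency (c1, c2, t); c1 is assumed to be the governor.
    static final class Dependency {
        final Constituent governor, dependent;
        final String type;

        Dependency(Constituent governor, Constituent dependent, String type) {
            this.governor = governor; this.dependent = dependent; this.type = type;
        }
    }

    final List<Constituent> constituents = new ArrayList<>();
    final List<Dependency> dependencies = new ArrayList<>();

    Constituent addConstituent(int start, int end, String type, int head) {
        Constituent c = new Constituent(start, end, type, head);
        constituents.add(c);
        return c;
    }

    // Alternative attachments of the same dependent may coexist, which is how
    // this sketch keeps ambiguity instead of resolving it early.
    void addDependency(Constituent governor, Constituent dependent, String type) {
        if (!constituents.contains(governor) || !constituents.contains(dependent)) {
            throw new IllegalArgumentException("both ends must belong to C");
        }
        dependencies.add(new Dependency(governor, dependent, type));
    }

    // All governors currently proposed for a dependent (more than one means
    // the attachment is still ambiguous).
    List<Constituent> governorsOf(Constituent dependent) {
        List<Constituent> result = new ArrayList<>();
        for (Dependency d : dependencies) {
            if (d.dependent == dependent) {
                result.add(d.governor);
            }
        }
        return result;
    }
}
```

A prepositional chunk attached both to a verb chunk and to a noun chunk would simply appear in two dependencies, and a later module could discard one of them.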

14 Classification of parsing modules
$P_i(XDG_i, K_i) = P_i(XDG_i) = XDG_{i+1}$
The classification is performed according to:
- the type of information K used
- how they manipulate the sentence representation

15 Task-oriented parsing design
Given:
- the NLP application requirements R
- the test-bed T
- a pool of parsing modules PM
The design activity is:
- the search for a combination of the parsing modules PM that fits R on T

16 NLP application requirements
Target phenomena: e.g. VP_PP, NP_PP, etc.
Metrics:
- Recall R per sentence
- Precision P per sentence
- F-measure per sentence
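
The three metrics can be computed per sentence as in the sketch below, which uses the standard definitions of precision and recall over sets of dependencies. The slides do not say how the F-measure is weighted, so the balanced form (harmonic mean) is an assumption, as are the class name and the string encoding of dependencies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: each dependency of a sentence is encoded as a string
// (for instance "VP_PP:1>4"); the encoding itself is an assumption.
class SentenceMetricsSketch {

    // Number of predicted dependencies that also occur in the gold standard.
    static int correct(Set<String> gold, Set<String> predicted) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    // Precision P = correct / predicted (taken as 0 when nothing is predicted).
    static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0
                : (double) correct(gold, predicted) / predicted.size();
    }

    // Recall R = correct / gold (taken as 0 when the gold set is empty).
    static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0
                : (double) correct(gold, predicted) / gold.size();
    }

    // Balanced F-measure: harmonic mean of precision and recall.
    static double fMeasure(Set<String> gold, Set<String> predicted) {
        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }
}
```

Restricting both sets to one target phenomenon, such as only the VP_PP links, gives the per-phenomenon figures that the requirements refer to.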

17 CHAOS: Levels of Analysis
[Figure: levels of analysis (POS, Chunks, Clauses, Dependencies); visible labels: NPK, VPK, PPK, NNS, TO, VB, IN, PRP, MD]

18 Verb dependencies and Clause Boundaries
contribute-NP-PP(to)   value-NP-PP(at)   Inf(S1)   Inf(S2)
[Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American]

21 Verb dependencies and Clause Boundaries
The algorithm:
- Initial hypotheses:
  - Minimal boundaries of the clauses in the sentence
  - Derived hierarchy
- Until all verbs have been analyzed:
  - Take the rightmost verb v not yet analyzed
  - Take the lexicalized rules R(v) for the verb v
  - Find the dependencies of
  - Augment the clause boundaries
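
The loop can be sketched in a heavily simplified form, under several assumptions that are not in the slides: chunks are reduced to labels such as "NP", "PP(to)" or "V:contribute"; the lexicalized rules R(v) are a map from a verb lemma to the slot labels it can govern; dependents are searched only to the right of the verb; the truncated step is read as finding the dependencies of v; and a clause is simply extended to its rightmost attached dependent. The names `ClauseBoundarySketch` and `analyze` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified sketch; not the CHAOS implementation.
class ClauseBoundarySketch {

    // Returns one "verb->dependent" line per attachment (chunk positions) and
    // one "clause(verb)=[start,end]" line per verb, in processing order.
    static List<String> analyze(List<String> chunks, Map<String, List<String>> frames) {
        boolean[] claimed = new boolean[chunks.size()];
        List<String> result = new ArrayList<>();
        // Visit the verbs from right to left: rightmost unanalyzed verb first.
        for (int v = chunks.size() - 1; v >= 0; v--) {
            if (!chunks.get(v).startsWith("V:")) {
                continue;
            }
            String lemma = chunks.get(v).substring(2);
            // Slots of R(v) that are still unfilled.
            List<String> open = new ArrayList<>(
                    frames.getOrDefault(lemma, Collections.emptyList()));
            int end = v; // minimal clause boundary: the verb chunk alone
            for (int j = v + 1; j < chunks.size() && !open.isEmpty(); j++) {
                if (claimed[j]) {
                    continue; // already attached to a verb further right
                }
                if (open.remove(chunks.get(j))) {
                    claimed[j] = true;
                    result.add(v + "->" + j);
                    end = j; // augment the clause boundary
                }
            }
            result.add("clause(" + v + ")=[" + v + "," + end + "]");
        }
        return result;
    }
}
```

On the example sentence, encoded as NP, V:contribute, NP, V:value, PP(at), PP(to), PP(of), the verb "valued" is handled first and takes the PP(at) chunk; "contributed" then takes the NP chunk and reaches past the embedded clause to the PP(to) chunk.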

22 Practice: System Implementation and Use

23 A Computational Framework
Object-oriented backbone
- Objects for the different data
- Objects for the different sub-processes
Linguistic sub-processors as libraries
Coexisting languages: Java, C++, C, Prolog

24 System implementation
A component-based approach
An object-oriented platform
- Linguistic data
  - Textual entities: Text, Paragraphs
  - XDG
- Linguistic processors

25 A Component-based Approach
Advantages:
- Computational efficiency
- Rapid prototyping
- Integration of different technologies
- Easy reuse

27 Linguistic processors
- Tokenizer, Complex Tokenizer
- Dictionary look-up modules
- Yellow page look-up
- Morphology analyzer
- Named Entity Recognition
- Part-of-speech tagging
- Chunker
- Verb shallow analyzer
- Shallow analyzer

28 Linguistic modules
Each process is encapsulated in an object:
- initialize(): load lexicons and rules (general or domain-specific)
- finalize(): dismiss the process rules and lexicons
- run(): enrich the input with the contributions of the process
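
A minimal sketch of this lifecycle is given below. The interface, the toy module and the `runAll` helper are hypothetical and do not reproduce the CHAOS classes; the teardown method is called `dismiss()` here only to avoid a clash with `java.lang.Object.finalize()`, and a list of strings stands in for the real input representation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the module lifecycle; not the CHAOS classes.
class ModuleLifecycleSketch {

    // The slide names initialize(), finalize() and run(); finalize() is
    // renamed dismiss() in this sketch.
    interface LinguisticModule {
        void initialize();                      // load lexicons and rules
        List<String> run(List<String> input);   // enrich the input
        void dismiss();                         // release lexicons and rules
    }

    // Toy module: marks tokens found in its (tiny) lexicon as names.
    static final class ToyNameTagger implements LinguisticModule {
        private List<String> lexicon;

        public void initialize() {
            lexicon = new ArrayList<>();
            lexicon.add("gaubert");
        }

        public List<String> run(List<String> input) {
            if (lexicon == null) {
                throw new IllegalStateException("initialize() was not called");
            }
            List<String> out = new ArrayList<>();
            for (String token : input) {
                boolean known = lexicon.contains(token.toLowerCase());
                out.add(known ? token + "/NAME" : token);
            }
            return out;
        }

        public void dismiss() {
            lexicon = null;
        }
    }

    // Runs a cascade of modules; each module is dismissed even if run() fails.
    static List<String> runAll(List<LinguisticModule> modules, List<String> input) {
        List<String> current = input;
        for (LinguisticModule m : modules) {
            m.initialize();
            try {
                current = m.run(current);
            } finally {
                m.dismiss();
            }
        }
        return current;
    }
}
```

A long-running application would more plausibly initialize every module once and call `run()` per sentence; the per-call setup in `runAll` only keeps the sketch short.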

29 Linguistic processors
Microtheories for microphenomena
Each processor implements its own theory:
- It has its own language for describing rules
- It is written in its own programming language

30 Processor: Yellow page look-up, Morphology analyzer
Dictionary
compra     comprare  d(a)  v.tran.sempl  2.sing.imper.pres    ~:u:~
compra     comprare  d(a)  v.tran.sempl  3.sing.ind.pres      ~:u:~
comprai    comprare  d(a)  v.tran.sempl  1.sing.ind.pass_rem  ~:u:~
comprammo  comprare  d(a)  v.tran.sempl  1.plur.ind.pass_rem  ~:u:~
compran    comprare  d(a)  v.tran.sempl  3.plur.ind.pres      ~:u:~
comprando  comprare  d(a)  v.tran.sempl  geru.pres            ~:u:~
comprano   comprare  d(a)  v.tran.sempl  3.plur.ind.pres      ~:u:~

31 Processor: Chunker
Rules
…
constituent_class([_cst1, _cst2, _cst3], 'VerFin', _mor, 1, 3) :-
    verb_finite(_cst1),
    verb_to_have(_cst1),
    verb_past_particle(_cst2),
    verb_to_be(_cst2),
    verb_past_particle(_cst3),
    common_morfology(_cst1, _mor).

32 Processor: Verb Shallow Analyser
Sub-categorization lexicon
pattern(comprare, [[(oggetto,Post),(per,Post)], [(oggetto,Post),(da,Post),(per,Post)], [(oggetto,Post),(a,Post),(per,Post)], [(oggetto,Post)]]).
pattern(comprendere, [[(oggetto,Post)], [], [(oggetto,Post)]]).
pattern(comprimere, [[(oggetto,Post)], [(oggetto,Post)]]).
pattern(compromettere, [[(con,Post)], [(oggetto,Post)]]).
pattern(comunicare, [[], [(con,Post)], [(a,Post)], [(oggetto,Post),(a,Post)], [(oggetto,Post)]]).

33 Implemented Italian Shallow Grammar
- Constituent Categories
- Part-of-Speech Tags
- Chunk Types
- Dependency Categories
- Dependency Categories over Chunk Types

34 A survival user guide
Stand-alone version:
    chaosparser -h
Client-server version:
    chaosserver -h
    chaosclient -h
XDG editor and actual GUI:
    choasgui

35 Using CHAOS in applications
In Java applications:
    ConfigurationHandler.initialize();
    ConfigurationHandler.parseKBPropFile("LANGUAGE", "KB");
    Parser ms = new Parser();
    ms.initialize();
In non-Java applications, use one of the possible output forms:
- XDG in XML
- XDG in Prolog
- XDG in QLF (in Prolog)

36 Perspective
Building a statistical Italian parser
- Increasing the Italian annotated corpora
- Reusing existing corpora: TUT, SITAL, VIT

37 Tools
- XDG editor
- DEMO!!!!
- Syntactic annotation transformer


