Robust Semantic Processing for Information Extraction Ann Copestake Computer Laboratory, University of Cambridge

Robust Semantic Processing for Information Extraction Ann Copestake Computer Laboratory, University of Cambridge aac@cl.cam.ac.uk

Outline Information Extraction Combining deep and shallow processing RMRS MRS basic ideas of RMRS RASP-RMRS RMRS and IE in Deep Thought SciBorg project

Acknowledgements Deep Thought (EU funded, 2002-2004) Computer Lab: Ann Copestake, Anna Ritchie, Ben Waldron Sussex, Saarland, DFKI, Xtramind, CELI, NTNU SciBorg (EPSRC, 2005-2009) Computer Lab: Ann Copestake, Simone Teufel, CJ Rupp, Advaith Siddharthan Chemistry: Peter Murray-Rust, Peter Corbett CeSC: Mark Hayes, Andy Parker DELPH-IN (informal ongoing collaboration) Boeing funding to Computer Lab: Ben Waldron especially Dan Flickinger, Alex Lascarides, Stephan Oepen, John Carroll, Anette Frank

Information extraction Classic IE: MUC-style template filling, gene/protein interactions IE in general: acquiring specific types of knowledge from text via language processing: e.g., organic chemistry syntheses ontological relationships relationships between texts (for search) IR, QA, I2E

IE from Chemistry texts recipe expressed in CML To a solution of aldimine 1 (1.5mmol) in THF (5mL) was added LDA (1mL, 1.6 M in THF) at 0 °C under argon, the resulting mixture was stirred for 2h, then was cooled to -78 °C...... alkaloids and other complex polycyclic azacycles... Enamines have been used widely... (citation Y), however,... did not provide the desired products. X cites Y (contrast)

Standard IE architecture
1. Preprocessing of markup etc (specific to text type)
2. Tokenisation (not domain-specific)
3. Named Entity Recognition (domain-specific ontologies, domain-specific patterns)
4. Chunking: detection of noun and verb groups (not domain-specific)
5. Anaphora resolution (domain-specific ontologies)
6. Relationship detection via patterns over chunks (domain- and task-specific)
7. DB instantiation (task-specific)

State of the art in IE Several options for whole IE systems and individual components, especially for English Increasing integration of ontologies Commercial systems for some applications But, many IE-style tasks still done manually: IE performance (especially when high precision required) IE robustness to different text types IE porting requirement (especially NER and relation patterns) Performance of standard architecture may be reaching a plateau More advanced IE tasks are not generally attempted e.g., organic synthesis example could be done with adaptation of standard architecture, but would take substantial effort by highly trained people. Skill set: substantial domain skills plus substantial NLP

Objectives Integrate and adapt tools for language processing in general Eventual use by non-NLP people: black box for language processing Incorporate deeper processing (DELPH-IN technology): aim to get above plateau Integration with XML, semantic web Methodology: Combine statistical and symbolic processing, machine learning and hand-crafting Open Source where possible, collaborative development No toy systems, no artificial evaluations Multilingual via collaboration

Deep processing in IE Some early IE systems attempted to use deep processing: SRI (and also NYU) FASTUS was originally shallow preprocessor for TACITUS but TACITUS was dropped: much too slow, not sufficiently robust Often claimed: deep processing failed for IE, but: only two serious attempts(?), both under time pressure, limited types of IE task deep processing has improved since early 1990s: speed empirical coverage (note that hand-built deep grammars do scale, unlike traditional AI knowledge bases) integration of statistical techniques into deep processing if existing IE architecture is approaching a plateau, we have to try something else – i.e., combined deep and shallow processing (DFKI Whiteboard project)

Integrating processing No single system can do everything: deep and shallow processing have inherent strengths and weaknesses shallow: speed and robustness: e.g., POS tagging, chunking deep: detail, precision, potential for bidirectional processing: e.g., HPSG-based parsers and generators (DELPH-IN technology) also intermediate: RASP (Robust accurate statistical parser): relatively detailed but no lexicon. Domain-dependent and domain-independent processing must be linked Desirable to have a common representation language for processing above sentence level (e.g., anaphora) Long-term solutions...

Compositional semantics for component integration Need a common representation language for systems: pairwise compatibility between systems is too limiting Syntax is theory-specific and unnecessarily language-specific Eventual goal of sentence analysis should be semantics Core idea: shallow processing gives underspecified semantic representation, so deep and shallow systems can be integrated Full interlingua / common lexical semantics is too difficult (certainly currently), but can link predicates to ontologies, etc.

Integration via underspecified semantics Integrated parsing: shallow parsed phrases incorporated into deep parsed structures deep parsing invoked incrementally in response to information needs Knowledge sources expressed via semantics can be used by multiple components: e.g., NER, IE templates, anaphora resolution Advantages over ad-hoc representation approaches: Ability to link with detailed lexical semantics as it becomes available Language generation from semantic representation Explicit logic: formal properties clearer, representations more generally usable Deep semantics taken as normative: extensibility

Robust Minimal Recursion Semantics Minimal Recursion Semantics: MRS. Compositional semantics for deep processing: Copestake, Flickinger, Sag and Pollard (1999, in press) adopted for DELPH-IN and other HPSG work also compatible with LFG etc logically well-defined flat semantics (easier to process, allows information to be ignored) underspecification of quantifier scope (avoid ambiguity) novel approach to composition (monostratal) Robust MRS: adaptation of MRS allowing processing without a subcategorization lexicon

RMRS: Extreme underspecification Goal is to split up semantic representation into minimal components (cf Verbmobil VITs) Scope underspecification (MRS) Splitting up predicate argument structure Explicit equalities Hierarchies for predicates and sorts Compatibility with deep grammars: Sorts and (some) closed class word information in SEM-I (API for grammar, more later) No lexicon for shallow processing (apart from POS tags and possibly closed class words)

Semantics from POS tagging
Tagged input: every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1
Output: _every_q(x1), _cat_n(x2 sg), _chase_v(e past), _some_q(x3), _dog_n(x4 sg)
Tag lexicon:
AT1: _lemma_q(x)
NN1: _lemma_n(x sg)
VVD: _lemma_v(e past)
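The tag lexicon on this slide can be read as a lookup table from POS tags to predicate templates. A minimal Python sketch, assuming the input is already lemmatised and tagged in the `lemma_TAG` form shown above; the names `TAG_LEXICON` and `pos_to_rmrs` are illustrative and are not taken from any RMRS tool:

```python
# Hypothetical tag lexicon: tag -> (predicate suffix, variable type, sort or None).
TAG_LEXICON = {
    "AT1": ("q", "x", None),
    "NN1": ("n", "x", "sg"),
    "VVD": ("v", "e", "past"),
}

def pos_to_rmrs(tagged):
    """Map a 'lemma_TAG lemma_TAG ...' string to a list of predications."""
    predications = []
    x_count = 0
    for item in tagged.split():
        lemma, tag = item.rsplit("_", 1)
        suffix, var, sort = TAG_LEXICON[tag]
        if var == "x":
            # Each nominal or quantifier gets a fresh variable: nothing
            # in the tagger output says that x1 and x2 are the same.
            x_count += 1
            name = f"x{x_count}"
        else:
            name = var
        arg = f"{name} {sort}" if sort else name
        predications.append(f"_{lemma}_{suffix}({arg})")
    return predications

print(pos_to_rmrs("every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1"))
```

Note that no equalities between variables and no ARG relations are produced: the tagger simply does not know them.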

Deep parser output
Conventional semantic representation for: Every cat chased some dog
every(x sg, cat(x sg), some(y sg, dog1(y sg), chase(e sp, x sg, y sg)))
some(y sg, dog1(y sg), every(x sg, cat(x sg), chase(e sp, x sg, y sg)))
Compositional: reflects morphology and syntax
Scope ambiguity is explicit
May be awkward to process if you don't care about quantifier scope

Modifying syntax of deep grammar semantics: overview
1. Underspecification of quantifier scope: Minimal Recursion Semantics (MRS) – next 6 slides...
2. Robust MRS: separating arguments; explicit equalities; conventions for predicate names and sense distinctions; hierarchy of sorts on variables

PC trees
Every cat chased some dog
[Figure: two predicate calculus trees, one per scoping: every(x, cat(x), some(y, dog1(y), chase(e,x,y))) and some(y, dog1(y), every(x, cat(x), chase(e,x,y)))]

PC trees share structure
[Figure: the same two trees; the fragments every x / cat x, some y / dog1 y and chase(e,x,y) occur in both]

Bits of trees
[Figure: the tree fragments every x / cat x, some y / dog1 y and chase(e,x,y), shown detached]
Reconstruction conditions: tree-ness, variable binding

Label nodes and holes
[Figure: tree fragments with node labels lb1:every x, lb2:cat x, lb4:some y, lb5:dog1 y, lb3:chase(e,x,y) and holes h6, h7, h0]
h0 – hole corresponding to the top of the tree
Valid solutions: equate holes and labels

Maximize splitting
[Figure: tree fragments with node labels lb1:every x, lb2:cat x, lb4:some y, lb5:dog1 y, lb3:chase(e,x,y) and holes h0, h6, h7, h8, h9]
Constraints: h8=lb5, h9=lb2

MRS: flat representation
elementary predications: lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y)
scope constraints: h9=lb2, h8=lb5 (actually qeqs)
easy to ignore quantification when not relevant for application: cat(x), dog1(y), chase(e,x,y)
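To make "valid solutions: equate holes and labels" concrete, the sketch below enumerates the scoped readings of this MRS by brute force. It is a simplification under stated assumptions: the scope constraints are treated as plain equalities (as on the slide, not as qeqs), the event variable e is treated as implicitly bound, and all names (`EPS`, `FIXED`, `readings`) are illustrative rather than part of any DELPH-IN tool.

```python
from itertools import permutations

# label -> (predicate, variable it binds, variables it uses, holes)
EPS = {
    "lb1": ("every", "x", [], ["h9", "h6"]),
    "lb2": ("cat", None, ["x"], []),
    "lb3": ("chase", None, ["e", "x", "y"], []),
    "lb4": ("some", "y", [], ["h8", "h7"]),
    "lb5": ("dog1", None, ["y"], []),
}
FIXED = {"h9": "lb2", "h8": "lb5"}  # scope constraints, read as equalities
TOP = "h0"

def valid(plug):
    """Tree-ness (every EP reached exactly once) plus variable binding."""
    seen = set()

    def walk(hole, bound):
        label = plug[hole]
        if label in seen:
            return False
        seen.add(label)
        _, binder, used, holes = EPS[label]
        if binder:
            bound = bound | {binder}
        if any(v not in bound for v in used):
            return False  # variable used outside its quantifier's scope
        return all(walk(h, bound) for h in holes)

    return walk(TOP, frozenset({"e"})) and seen == set(EPS)

def render(plug, hole=TOP):
    pred, binder, used, holes = EPS[plug[hole]]
    args = ([binder] if binder else used) + [render(plug, h) for h in holes]
    return f"{pred}({', '.join(args)})"

def readings():
    holes = [TOP] + [h for ep in EPS.values() for h in ep[3] if h not in FIXED]
    labels = [l for l in EPS if l not in FIXED.values()]
    for perm in permutations(labels):
        plug = {**FIXED, **dict(zip(holes, perm))}
        if valid(plug):
            yield render(plug)

for reading in readings():
    print(reading)
```

Of the six ways to plug h0, h6 and h7 with lb1, lb3 and lb4, only the two scopings from the conventional representation survive the tree-ness and binding checks.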

RMRS: Separating arguments
lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y), h9=lb2, h8=lb5
goes to:
lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7), lb3:chase(e), ARG1(lb3,x), ARG2(lb3,y), h9=lb2, h8=lb5
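The rewriting step on this slide is mechanical, so it can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual MRS to RMRS converter; the function name `split_ep` and its `roles` parameter are assumptions made for the example.

```python
def split_ep(label, pred, args, roles=None):
    """Split one MRS elementary predication into RMRS pieces.

    The first argument stays with the predicate; every further argument
    becomes a separate relation anchored on the predication's label.
    """
    first, rest = args[0], args[1:]
    if roles is None:
        roles = [f"ARG{i}" for i in range(1, len(rest) + 1)]
    base = f"{label}:{pred}({first})"
    return [base] + [f"{role}({label},{arg})" for role, arg in zip(roles, rest)]

print(split_ep("lb3", "chase", ["e", "x", "y"]))
print(split_ep("lb1", "every", ["x", "h9", "h6"], roles=["RSTR", "BODY"]))
```

Because each ARG relation is a separate item, a shallow processor that cannot identify an argument can simply leave that relation out.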

Naming conventions: predicate names without a lexicon
lb1:_every_q(x1 sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2 sg), lb5:_dog_n_1(x4 sg), lb4:_some_q(x3 sg), RSTR(lb4,h8), BODY(lb4,h7), lb3:_chase_v(e sp), ARG1(lb3,x2 sg), ARG2(lb3,x4 sg)
h9=lb2, h8=lb5, x1 sg = x2 sg, x3 sg = x4 sg
note also explicit equalities

POS output as underspecification
DEEP – lb1:_every_q(x1 sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2 sg), lb5:_dog_n_1(x4 sg), lb4:_some_q(x3 sg), RSTR(lb4,h8), BODY(lb4,h7), lb3:_chase_v(e sp), ARG1(lb3,x2 sg), ARG2(lb3,x4 sg), h9=lb2, h8=lb5, x1 sg = x2 sg, x3 sg = x4 sg
POS – lb1:_every_q(x1), lb2:_cat_n(x2 sg), lb3:_chase_v(e past), lb4:_some_q(x3), lb5:_dog_n(x4 sg)
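One way to read "underspecification" here is that every POS predicate is the same as, or more general than, the deep predicate at the same position. The Python sketch below checks only that predicate-name part. It assumes that a more specific predicate just extends the name with a sense suffix (as `_dog_n_1` extends `_dog_n`) and that the two analyses already share labels; neither the helper names nor this prefix test come from the RMRS tools, and sorts such as `past` versus `sp` are not checked.

```python
def pred_subsumes(shallow, deep):
    """True if `deep` equals `shallow` or only adds further name components."""
    s = shallow.strip("_").split("_")
    d = deep.strip("_").split("_")
    return d[:len(s)] == s

def underspecifies(shallow, deep):
    """Every shallow predication must have a compatible deep counterpart."""
    return all(
        label in deep and pred_subsumes(pred, deep[label])
        for label, pred in shallow.items()
    )

DEEP = {"lb1": "_every_q", "lb2": "_cat_n", "lb3": "_chase_v",
        "lb4": "_some_q", "lb5": "_dog_n_1"}
POS = {"lb1": "_every_q", "lb2": "_cat_n", "lb3": "_chase_v",
       "lb4": "_some_q", "lb5": "_dog_n"}

print(underspecifies(POS, DEEP))
print(underspecifies(DEEP, POS))
```

The relation is not symmetric: the POS output is compatible with the deep output, but the deep output commits to a sense of "dog" that the POS output does not.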

RMRS principles
Split up information content as much as possible
Accumulate information monotonically by simple operations
Don't represent what you don't know, but preserve everything you do know
Use a flat representation to allow pieces to be accessed individually

Semantics from RASP
RASP: robust, domain-independent, statistical parsing (Briscoe and Carroll)
can't produce conventional semantics because no subcategorization
can often identify arguments: S -> NP VP: NP supplies ARG1 for V
potential for partial identification: VP -> V NP, S -> NP S: NP might be ARG2 or ARG3

RMRS construction
deep grammars: MRS to RMRS converter.
POS-RMRS: tag lexicon.
RASP-RMRS: tag lexicon plus semantic rules associated with RASP rules.
no lexical subcategorization, so rely on grammar rules to provide the ARGs
output aims to match deep grammar (ERG)
developed on basis of ERG semantic test suite
default composition principles when no rule RMRS specified
Composition algebra: MRS composition assumes a lexicalized approach: algebra defined in Copestake, Lascarides and Flickinger (2001)
RMRS with non-lexicalised grammars has similar basic algebra
All approaches have common composition principles, so there is compatibility at a phrasal level.

Some cat sleeps (in RASP)
sleeps: [h3,e],, {h3:_sleep(e)}
some cat: [h,x],, {h1:_some(x),RSTR(h1,h2),h2:_cat(x)}
Rule S->NP VP: Head=VP, ARG1(, )
some cat sleeps: [h3,e],, {h3:_sleep(e), ARG1(h3,x), h1:_some(x),RSTR(h1,h2),h2:_cat(x)}
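The rule-driven composition on this slide can be sketched as follows. This is a simplified illustration, not the RMRS composition algebra: each phrase carries only a hook (label and index) and a tuple of relations, and the rule for S -> NP VP is assumed, from the result shown above, to add ARG1(label of VP, index of NP). The class `Sem` and the function `s_np_vp` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sem:
    label: str    # hook label
    index: str    # hook index variable
    rels: tuple   # relations accumulated so far

def s_np_vp(np, vp):
    """S -> NP VP: the VP is the head, and the NP index fills its ARG1."""
    arg1 = f"ARG1({vp.label},{np.index})"
    # Information is only ever added: the daughters' relations are kept intact.
    return Sem(vp.label, vp.index, vp.rels + (arg1,) + np.rels)

sleeps = Sem("h3", "e", ("h3:_sleep(e)",))
some_cat = Sem("h", "x", ("h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"))

sentence = s_np_vp(some_cat, sleeps)
print(sentence.label, sentence.index, sentence.rels)
```

The ARG1 comes from the grammar rule rather than from a lexical entry for "sleep", which is what lets RASP-RMRS work without a subcategorization lexicon.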

ERG-RMRS / RASP-RMRS

Inchoative

Infinitival subject (unbound in RASP-RMRS)

Mismatch: Expletive it

SEM-I: semantic interface
Meta-level: manually specified 'grammar' relations (constructions and closed-class)
Object-level: linked to lexical database for deep grammars
Object-level SEM-I auto-generated from expanded lexical entries in deep grammars (because type can contribute relations)
Validation of other lexicons
Need closed class items for RMRS construction from shallow processing

Alignment and XML
Comparing RMRSs for same text efficiently requires 'characterization': labels RMRSs according to their source in the text
currently characters, but also XPath plus characters
RMRS-XML
RMRS seen as levels of mark-up: standoff annotation

RMRS approach: current and planned applications Question answering: Cambridge CSTIT: deep parse questions, shallow parse answers QA from structured knowledge: Frank et al (QUETAL project) Information extraction: emails (Deep Thought) Chemistry texts (SciBorg) Dictionary definition parsing for Japanese and English (Bond and Flickinger) Rhetorical structure, multi-document summarization... also LOGON: semantic transfer. MRSs from LFG used in HPSG generator.

RMRS in Deep Thought Different systems integrated via the HoG: Invoke shallow or deep parsing, full or partial results, all expressed in RMRS. Also shallow parsing as precursor to deep parsing: NER, unknown words. Preliminary test on email response application (Xtramind Mailminder): email categorized, then category-specific templates built from RMRS increase in precision of automatically instantiated templates (up to 29%) with the addition of deep parser to the system

IE architecture using deeper processing and RMRS
1. Preprocessing of markup etc
2. Tokenisation
3. Named Entity Recognition: delivers RMRS
4. Shallow processing (including chunking): delivers RMRS
5. Deep parsing: uses shallow processing and NER, delivers RMRS
6. Word sense disambiguation: uses RMRS from best available source, further instantiates RMRS according to ontology
7. Anaphora resolution: uses RMRS from best available source, further instantiates RMRS
8. Relationship detection via patterns over deepest possible RMRSs
9. DB instantiation
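Steps 6 to 8 all rely on taking the RMRS "from best available source". A minimal Python sketch of that fallback, assuming each sentence comes with a dictionary of analyses keyed by source; the source names, their ordering and the function `best_rmrs` are illustrative assumptions, not the Deep Thought or SciBorg interfaces:

```python
# Hypothetical preference order, deepest analysis first.
DEPTH_ORDER = ["deep", "rasp", "pos"]

def best_rmrs(analyses):
    """Return (source, rmrs) for the deepest source that produced output."""
    for source in DEPTH_ORDER:
        rmrs = analyses.get(source)
        if rmrs:  # missing or empty means that processor failed on this sentence
            return source, rmrs
    return None, None

# The deep parser failed here, so later components fall back to the RASP output.
analyses = {
    "deep": None,
    "rasp": ["h3:_sleep(e)", "ARG1(h3,x)", "h2:_cat(x)"],
    "pos": ["h3:_sleep(e)", "h2:_cat(x)"],
}
print(best_rmrs(analyses))
```

Because every source delivers the same representation language, the components downstream do not need to know which processor produced the analysis they receive.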

SciBorg: Chemistry texts
eScience project started in October at Cambridge: Computer Laboratory, Chemistry, CeSC
Partners: Nature Publishing, Royal Society of Chemistry, International Union of Crystallography (supplying papers and publishing expertise)
Aims:
1. Develop an NL markup language which will act as a platform for extraction of information. Link to semantic web languages.
2. Develop IE technology and core ontologies for use by publishers, researchers, readers, vendors and regulatory organisations.
3. Model scientific argumentation and citation purpose in order to support novel modes of information access.
4. Demonstrate the applicability of this infrastructure in a real-world eScience environment.

Outline architecture
[Figure: architecture diagram. Inputs: RSC papers, Nature papers, IUCr papers, Biology and CL (pdf); base XML. Components: sentence splitting, POS tagging, NER, RASP, ERG/PET, WSD, anaphora, rhetorical analysis, tasks. RMRS merge; standoff annotation.]

Research markup Chemistry: The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins /via/ standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment. Computational Linguistics: The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.

RMRS and research markup Specify cues in RMRS: e.g., l1:objective(x), ARG1(l1,y), l2:research(y) The concept objective generalises the predicates for aim, goal etc and research generalises study, work etc. Ontology for rhetorical structure. Deep process possible cue phrases to get RMRSs: feasible because domain-independent more general and reliable than shallow techniques allows for complex interrelationships e.g., our goal is not to... but to... Use zones for advanced citation maps (e.g., X cites Y (contrast)) and other enhancements to repositories
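The cue l1:objective(x), ARG1(l1,y), l2:research(y) can be matched against an RMRS once predicates are mapped to ontology concepts. The Python sketch below is one way this could look. The predicate names (`_aim_n`, `_study_n` and so on), the `ONTOLOGY` table and the function name are assumptions made for illustration, and the toy RMRS does not claim to be the analysis a deep grammar would actually produce for the example sentences.

```python
# Hypothetical mapping from predicates to rhetorical-structure concepts.
ONTOLOGY = {
    "_aim_n": "objective", "_goal_n": "objective",
    "_study_n": "research", "_work_n": "research",
}

def matches_objective_cue(eps, args):
    """Check for objective(x), ARG1(l, y), research(y) in a flat RMRS.

    eps:  list of (label, predicate, variable)
    args: list of (role, label, variable)
    """
    objective_labels = {l for l, p, _ in eps if ONTOLOGY.get(p) == "objective"}
    research_vars = {v for _, p, v in eps if ONTOLOGY.get(p) == "research"}
    return any(
        role == "ARG1" and label in objective_labels and var in research_vars
        for role, label, var in args
    )

# Toy RMRS fragment for "the aims of the present study".
eps = [("l1", "_aim_n", "x"), ("l2", "_study_n", "y")]
args = [("ARG1", "l1", "y")]
print(matches_objective_cue(eps, args))
```

The same pattern would match "the goal of the work", since both predicates generalise to the same concepts, which is the point of stating cues over the ontology rather than over word strings.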

Conclusions
Information Extraction is more than company mergers or gene-protein interactions!
Combined deep-shallow processing techniques have potential for IE
RMRS is a representation language that allows for deep-shallow compatibility via extreme underspecification
various systems adapted to output RMRS and further work ongoing
RMRS offers detailed compatibility at a phrasal level
RMRS processing can be integrated with ontologies in various ways
RMRS tools are distributed as Open Source via DELPH-IN
SciBorg will further develop this approach for eScience applications using a generic standoff architecture

Further work on RASP-RMRS Fast enough (time not significant compared to RASP processing time because no ambiguity) Too many RASP rules! Need to generalise over classes. Requires SEM-I: i.e., API for MRS/RMRS from deep grammar RASP and ERG may change: compatible test suites – semi-automatic rule update? alternative technique for composition? Parse selection – need to generalise over RMRSs weighted intersections of RMRSs (cf RASP grammatical relations)
