Supporting Annotation Layers for Natural Language Processing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS.

1 Supporting Annotation Layers for Natural Language Processing Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst Computer Science Division and SIMS University of California, Berkeley http://biotext.berkeley.edu Supported by NSF DBI-0317510 and a gift from Genentech

2 Project overview A system for flexible querying of text that has been annotated with the results of NLP processing. Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. Designed to scale to very large corpora. Demo of LQL (Layered Query Language) on examples taken from the NLP literature.

3 Key Contributions Multiple overlapping layers (cannot be expressed in a single XML file). Self-overlapping and parallel layers, allowing multiple syntactic parses of the same text. Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet). A specialized query language. A flexible results format. A focus on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations: 1.4 million MEDLINE abstracts, 10 million annotated sentences, 320 million multi-layered annotations, and a 70 GB database.

4 Layers of Annotations Each annotation represents an interval spanning a sequence of characters, given by absolute start and end positions. Each layer corresponds to a conceptually different kind of annotation. Layers can be sequential, overlapping (e.g., two multiple-word concepts sharing a word), or hierarchical: either spanning, when the intervals are nested as in a parse tree, or ontological, when the token itself is derived from a hierarchical ontology.
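The three layer relationships can be illustrated with a minimal Python sketch. The `Annotation` class, the `relation` helper, and the sample intervals are hypothetical, and the sketch assumes that the end position is exclusive, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str      # name of the layer this annotation belongs to
    start: int      # absolute start character position (inclusive)
    end: int        # absolute end character position (assumed exclusive)
    tag: str = ""


def relation(a: Annotation, b: Annotation) -> str:
    """Classify how two character intervals relate to each other."""
    if a.end <= b.start or b.end <= a.start:
        return "sequential"       # no shared characters
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    if a_contains_b or b_contains_a:
        return "nested"           # hierarchical spanning, as in a parse tree
    return "overlapping"          # partial overlap, e.g. two concepts sharing a word


# Hypothetical annotations over one sentence.
noun_phrase = Annotation("shallow_parse", 0, 24, "NP")
second_word = Annotation("word", 15, 24)
concept_a = Annotation("mesh", 0, 24)
concept_b = Annotation("mesh", 15, 34)
verb_phrase = Annotation("shallow_parse", 25, 27, "VP")

print(relation(noun_phrase, second_word))   # nested
print(relation(concept_a, concept_b))       # overlapping
print(relation(noun_phrase, verb_phrase))   # sequential
```

Note that `concept_a` and `concept_b` sit in the same layer, which is the self-overlapping case that a single XML tree cannot express.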

5 Annotation Layers Example

6 System Architecture (Main table)

ANNOTATION_ID  PMID   SECTION  LAYER_ID  SENTENCE  FIRST_WORD_POS  LAST_WORD_POS  TAG_TYPE  WORD_ID  START_CHAR_POS  END_CHAR_POS
364            45656  t        1         0         0               1              21        58480    0               14
365            45656  t        1         0         1               2              27        55638    15              24
366            45656  t        1         0         2               3              19        74774    25              27
368            45656  t        1         0         3               4              21        13028    28              34
370            45656  t        1         0         4               5              30        36906    35              40
371            45656  t        1         0         5               6              87        99       40              41
367            45656  t        3         0         0               2              31        None     0               24
369            45656  t        3         0         2               3              34        None     25              27
372            45656  t        3         0         3               5              31        None     28              40
373            45656  t        4         0         0               5                        None     0               40
363785398      45656  t        6         0         0               2              7269      None     0               24
363785399      45656  t        6         0         4               5              4364      None     35              40

7 System Architecture (Indexes)
(Forward) +doc_id+section+layer_id+sentence+first_word_pos+last_word_pos+tag_type
(Inverted) +layer_id+tag_type+doc_id+section+sentence+first_word_pos+last_word_pos
(Inverted) +word_id+layer_id+tag_type+doc_id+section+sentence+first_word_pos
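As a rough illustration of the table and its three composite indexes, the sketch below builds them in an in-memory SQLite database. SQLite is only a stand-in (the slides do not name the database engine), the table and index names are hypothetical, the column types are guesses, and `doc_id` is assumed to be the same column the main table labels PMID. The three inserted rows are the layer 3 rows from the main-table slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    annotation_id  INTEGER PRIMARY KEY,
    doc_id         INTEGER NOT NULL,   -- assumed to correspond to PMID
    section        TEXT    NOT NULL,
    layer_id       INTEGER NOT NULL,
    sentence       INTEGER NOT NULL,
    first_word_pos INTEGER NOT NULL,
    last_word_pos  INTEGER NOT NULL,
    tag_type       INTEGER,
    word_id        INTEGER,
    start_char_pos INTEGER NOT NULL,
    end_char_pos   INTEGER NOT NULL
);
-- Forward index: walk the annotations of one document in position order.
CREATE INDEX idx_forward ON annotation
    (doc_id, section, layer_id, sentence, first_word_pos, last_word_pos, tag_type);
-- Inverted index: find every occurrence of a given tag in a given layer.
CREATE INDEX idx_inv_tag ON annotation
    (layer_id, tag_type, doc_id, section, sentence, first_word_pos, last_word_pos);
-- Inverted index: find every occurrence of a given word.
CREATE INDEX idx_inv_word ON annotation
    (word_id, layer_id, tag_type, doc_id, section, sentence, first_word_pos);
""")

rows = [
    (367, 45656, "t", 3, 0, 0, 2, 31, None, 0, 24),
    (369, 45656, "t", 3, 0, 2, 3, 34, None, 25, 27),
    (372, 45656, "t", 3, 0, 3, 5, 31, None, 28, 40),
]
conn.executemany("INSERT INTO annotation VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# A lookup whose leading columns match the (layer_id, tag_type, ...) index.
matches = [r[0] for r in conn.execute(
    "SELECT annotation_id FROM annotation "
    "WHERE layer_id = ? AND tag_type = ? ORDER BY annotation_id", (3, 31))]
index_names = sorted(r[1] for r in conn.execute("PRAGMA index_list('annotation')"))

print(matches)
print(index_names)
```

The column order in each index mirrors the slide: the forward index leads with the document, while the inverted indexes lead with the tag or word being searched for.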

8 Example query I Protein-Protein Interactions Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

9 Example query I - LQL
SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
  END_LQL
) lql
GROUP BY p1_text, verb_content, p2_text
ORDER BY count(*) DESC
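To make the shape of this query concrete, here is a small in-memory Python sketch of the same pattern; it is not how LQL is evaluated. The `Ann` class, the helper functions, and the toy corpus are hypothetical. The sketch assumes that `$` anchors the gene at the end of its noun phrase, that `ALLOW GAPS` lets other words sit between the three matched elements, and that `%` is a trailing wildcard as in SQL `LIKE`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    layer: str
    first: int      # first word position (inclusive)
    last: int       # last word position (exclusive)
    tag: str = ""
    text: str = ""


VERB_PREFIXES = ("activate", "inhibit", "bind")   # mirrors "activate%" etc.


def gene_final_nps(anns):
    """NPs containing a gene annotation that ends where the NP ends."""
    genes = [a for a in anns if a.layer == "gene"]
    return [np for np in anns
            if np.layer == "shallow_parse" and np.tag == "NP"
            and any(np.first <= g.first and g.last == np.last for g in genes)]


def match_sentence(anns):
    """Return (p1, verb, p2) text triples in left-to-right order, gaps allowed."""
    nps = gene_final_nps(anns)
    verbs = [a for a in anns if a.layer == "pos" and a.tag == "verb"
             and a.text.lower().startswith(VERB_PREFIXES)]
    return [(p1.text, v.text, p2.text)
            for p1 in nps for v in verbs for p2 in nps
            if p1.last <= v.first and v.last <= p2.first]


# Toy sentence: "TNF strongly activates protein kinase"
s1 = [Ann("shallow_parse", 0, 1, "NP", "TNF"), Ann("gene", 0, 1, text="TNF"),
      Ann("pos", 2, 3, "verb", "activates"),
      Ann("shallow_parse", 3, 5, "NP", "protein kinase"),
      Ann("gene", 3, 5, text="protein kinase")]
# Toy sentence: "PRL activated transcription factor"
s2 = [Ann("shallow_parse", 0, 1, "NP", "PRL"), Ann("gene", 0, 1, text="PRL"),
      Ann("pos", 1, 2, "verb", "activated"),
      Ann("shallow_parse", 2, 4, "NP", "transcription factor"),
      Ann("gene", 2, 4, text="transcription factor")]

# Equivalent of GROUP BY ... ORDER BY count(*) DESC over the toy corpus.
counts = Counter(t for sent in (s1, s2, s1) for t in match_sentence(sent))
for triple, cnt in counts.most_common():
    print(triple, cnt)
```

The outer `Counter` plays the role of the SQL wrapper around the `BEGIN_LQL ... END_LQL` block: the LQL part produces one row per match, and ordinary SQL aggregates them.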

10 Example query I – Sample output

PROTEIN 1                 INTERACTION VERB  PROTEIN 2                FREQUENCY
Ca2                       activates         protein kinase           312
Cln3                      activate          protein kinase           234
TAP                       binds             transcription factor     192
TNF                       activates         protein tyrosine kinase  133
serine/threonine kinase   binding           RhoA GTPase              132
Phospholamban             inhibits          ATPase                   114
PRL                       activated         transcription factor     108
Interleukin 2             activates         transcription factor     84
Prolactin                 activates         transcription factor     84
AMPA                      activated         protein kinase           78
Nerve growth factor       activates         protein kinase           78
LPS                       inhibited         MHC class II             75
Heat shock protein        Binding           p59                      72
EPO                       activated         STAT5                    63
EGF                       activated         PP2A                     60
cis                       binds             Sp1                      50

11 Example query II Chemical–Disease Interactions. Example sentence: “Adherence to statin prevents one coronary heart disease event for every 429 patients.” Goal: extract the relation that statin (potentially) prevents coronary heart disease. The MeSH C subtree contains diseases; MeSH supplementary concepts represent chemicals.

12 Example query II - LQL
[layer='sentence' { NO ORDER, ALLOW GAPS }
  [layer='shallow_parse' && tag_name='NP'
    [layer='chemicals'] AS chemical $
  ]
  [layer='shallow_parse' && tag_name='NP'
    [layer='mesh' && tree_number ~ 'C%'] AS disease $
  ]
] AS sent
SELECT sent.pmid, chemical.text, disease.text, sent.text
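The two features that distinguish this query, `NO ORDER` and the `tree_number ~ 'C%'` prefix test, can be sketched in Python as follows. Everything here is hypothetical: the `Ann` class, the helpers, the word positions, and the tree numbers, which are illustrative values rather than looked-up MeSH entries. The `$` anchor is left out for brevity, and since the slides do not say whether the two NPs must be distinct, the sketch does not require it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    layer: str
    first: int              # first word position (inclusive)
    last: int               # last word position (exclusive)
    text: str
    tag: str = ""
    tree_number: str = ""


def inside(inner: Ann, outer: Ann) -> bool:
    """True when inner lies within the word span of outer."""
    return outer.first <= inner.first and inner.last <= outer.last


def chemical_disease_pairs(anns):
    """Pair every NP-internal chemical with every NP-internal disease, in any order."""
    nps = [a for a in anns if a.layer == "shallow_parse" and a.tag == "NP"]
    chemicals = [a for a in anns if a.layer == "chemicals"
                 and any(inside(a, np) for np in nps)]
    diseases = [a for a in anns if a.layer == "mesh"
                and a.tree_number.startswith("C")      # MeSH C subtree: diseases
                and any(inside(a, np) for np in nps)]
    return [(c.text, d.text) for c in chemicals for d in diseases]


# "Adherence to statin prevents one coronary heart disease event
#  for every 429 patients."
chemical_first = [
    Ann("shallow_parse", 2, 3, "statin", tag="NP"),
    Ann("chemicals", 2, 3, "statin"),
    Ann("shallow_parse", 4, 9, "one coronary heart disease event", tag="NP"),
    Ann("mesh", 5, 8, "coronary heart disease", tree_number="C14"),
    Ann("shallow_parse", 10, 13, "every 429 patients", tag="NP"),
    Ann("mesh", 12, 13, "patients", tree_number="M01"),   # not a disease
]
# Invented sentence with the disease first:
# "Coronary heart disease is prevented by statin"
disease_first = [
    Ann("shallow_parse", 0, 3, "Coronary heart disease", tag="NP"),
    Ann("mesh", 0, 3, "Coronary heart disease", tree_number="C14"),
    Ann("shallow_parse", 6, 7, "statin", tag="NP"),
    Ann("chemicals", 6, 7, "statin"),
]

print(chemical_disease_pairs(chemical_first))
print(chemical_disease_pairs(disease_first))
```

Because the pairing never compares word positions between the chemical and the disease, both sentences yield a pair, which is the effect `NO ORDER` is meant to have.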

