Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand.

Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand Balagurunathan

Outline Information extraction – motivation. Current Trends in Information Extraction. Sentence structures. Complex Sentence Processing –A Rule Engineering Approach. –An Automated Approach Experimental Results. Conclusion.

Information Extraction - Motivation Enormous amount of knowledge stored in unstructured text. Lack of easy access to information for processing. Biomedical domain - Piping hot results that curated databases don’t offer.

What can we get from text Literature –Protein functional information. –Interactions between genes, proteins, chemicals and drugs. –Signal transduction pathways. Clinical records –Diagnosis patterns. –Drugs to disease association.

The information extraction process

Text Retrieval Keyword retrieval (PubMed) Vector-Space Model Probabilistic models Latent Semantic Indexing

Text Classification Knowledge Engineering –Rules specified by a domain expert. –‘knowledge acquisition bottleneck’. Machine Learning –‘Hard’ and ‘soft’ classification. –Traditional classifiers with automatic feature selection algorithms. –Frequent word co-occurrences and/or Natural Language patterns. –Ontologies to identify word synonyms.

Named Entity Recognition Dictionary look up. –LocusLink, Gene Ontology, HUGO, SwissProt and UMLS provide extensive lists of biomedical terms. –Constrained to identify names present in the vocabularies alone. Regular Expressions. –Patterns defined on words and phrases. –E.g.. Alphanumeric words, hyphenated mixed cases.

Named Entity Recognition Boosted Wrapper Induction –Name boundaries to identify entities. Supervised Classifiers –Hidden Markov Models. –Support Vector Machines. Supervised Rule Learners to tag entity names – the RAPIER system –Rules based on word patterns. Probabilistic Rule Learner – ABGene –Rules based on co-occurrence of POS tags and neighboring words.

Named Entity Recognition Results from Shared task in JNLPBA, 2004 (Trained and tested on GENIA corpus) Technique usedPrecision / Recall / F-measure Combination of Hidden Markov Models and Support Vector Machines 76.0 / 69.4 / 72.6 Classifier trained on Local and syntactic features of entity names71.6 / 68.6 / 70.1 Conditional Random Fields (CRF) and Novel Feature Sets70.3 / 69.3 / 69.8 SVM + CRF67.8 / 64.8 / 66.3 HMM model69.1 / 61.0 / 64.8 Training SVM classifiers on linguistic knowledge67.4 / 61.0 / 64.0

Extracting Relationships Word co-occurrences –Presence of a functional word and entity names in same sentence. Regular Expressions –Regular expressions defined on part of speech tags. –Processing complex sentence structures using regular expressions.

Extracting Relationships Grammar rules based on sentence structures – shallow parses. –Context Free grammars. –Probabilistic Context Free Grammars. –Automatic learning of grammar rules using supervised rule learners – FOIL, PROGOL. –Combination of Naïve Bayes Classifier and FOIL.

Extracting Relationships Hidden Markov Models –Learning templates based on words and sentence structures. Preposition based templates – the Arizona Relation parser –Slot filler approach. –Shallow parsing of sentence structure. Decision trees - GIS –Wording and term distributions represented as a decision tree to identify protein interactions.

Extracting relationships Manually built templates –SUISEKI - Frame based system. –BioRAT - Templates based on sentence structure and semantics. –GENIES, GeneWays – Extension of the MedLEE frame based system. Full sentence parses –The structure of the entire sentence is considered. –Potentially more accurate. –Rule learners based on Link grammar analysis of sentences. –Dependency sub-graph from the link grammar word-dependency output.

Evaluation of Extraction Systems Precision –Correctness of system True Positives True Positives + False Positives Recall –Sensitivity of system True Positives True Positive + False Negative F-measure –Combination of Precision and Recall (harmonic mean) 2 * P * R (P + R)

Common problems Complex sentence structures. c-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain. Contextual information. c-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM).

Common problems Negation. DNA-PK does not bind detectably to Ku in the absence of DNA. Modality and hypothesis. Cdk4 may be inhibited by tyrosine phosphorylation. Gadd45 binds to Cdk1 and Gadd45 inhibits Cdk1 activity, probably by displacing Cyclin B1.

Complex Sentence Processing - Motivation Sentences structures in text are seldom simple. Simple, Complex, compound and Complex-compound structures are addressed. Dependent and independent clauses. Multiple clauses used to specify interactions. Multiple interactions specified in clauses.

Sentence Structures - Examples Simple Phosphorylation by ATM activates c-Abl. Complex c-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain. Compound c-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM) ; the c-Abl SH2 domain is a specificity determinant for this reaction. Complex-compound Because the same signals that induce somitic expression of Pax-3 also induce myogenesis and because expression of Pax-3 precedes that of MyoD and Myf-5 in vivo and that of MyoD in vitro, we investigated whether Pax-3 might play a role in activating the myogenic program in somitic cell.

Complex Sentence Processing Manual Rule Engineering approach –Based on Part of Speech tags. –Domain expert engineers rules for complex sentence structures. Automated approach –Based on syntactic relationships between the words in a sentence. –Link grammar parser.

Evaluation of Our Systems Precision No. of Correct interactions Total no. of interactions extracted Recall No of correct interactions Total no of interactions identified in text F-measure (2*P*R)/P+R

Rule Engineering Approach Patterns learnt by interaction with domain expert. Sentences tagged by a natural Language processor. Extraction rules written based on Part of speech tags. False positives reduced by use of Ontologies.

Rule Engineering Approach Toufeeq

Rule Engineering Approach System developed in XSB prolog. A Java GUI integrated to the back-end processing system. Dictionaries coded as knowledge bases in prolog. Input – Sentences in a text file. Ouput – An interaction triplet. –interact (upstream-gene, interaction word, downstream-gene).

Pos patterns for simple sentences extract([word([tag = NNP],_h13160),word([tag = VBZ],_h13161), word([tag=JJ],_h13162)], interact(_h13160,_h13161,_h13162),true]). extract([ng(_h99513),vg(_h99514),ng(_h99515)], interact(_h99513,_h99514,_h99515),true). extract([ng(_h108321),vg(_h108322),word([tag=NNP],_h108323)], interact(_h108321,_h108322,_h108323),true). Extract an interaction from a (Noun, Verb, Adjective) pattern.

Knowledge Bases Used LocusLink (NCBI) –compilation of the genes and related information such as alias names, protein products, PubMed IDs of articles referring to the gene etc. –Gene, protein names and their aliases. UMLS (NLM) –Electronic "Knowledge Sources" and associated lexical programs for biological domain. –Specialist Lexicon used to extract interaction words. WordNet –English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. –Used in conjunction with UMLS to extract interaction words. –Interaction words are stemmed using the Porter Stemming algorithm to match across numbers and tense.

Representation of Knowledge Bases %gene and protein names isa('3''-nucleotidase',gene). isa('3'' repair exonuclease 1',protein). isa('E2F4',gene). isa('5''-nucleotidase, cytosolic IA',protein). isa('M1',gene). isa('PP2A, subunit B''',protein) % interaction words interaction('accumulat'). interaction('activat'). interaction('elevat'). interaction('hasten'). interaction('incite'). interaction('increas'). interaction('induc').

Complex Sentence Processing Extraction Methodology –Identify the Complex Sentence patterns. –Split the complex and compound sentences to simple sentences based on extraction rules for the pattern. –Process the sub-sentences using pattern matching against simple sentence patterns. An iterative process was used to build the rules and add simple sentence patterns manually. Training and Testing – curated text from a 1999 Molecular Biology of the Cell article by Kurtis Kohn.

Rules Written to Address Complex Sentence Structures Case - List of arguments - a comma separated list or a list separated by conjunctions. Rule - Sentence is rewritten with each of the arguments. Each rewrite is processed as a single sentence. Sample Input- Dyhydrofolate reductase (DHFR) is activated via the E2F transactivation domain, whereas B-myb, Cyclin E, E2F-1, E2F-2, and Cdc2 are regulated via the repression domain of pRb family proteins. Sample output - interact([Dyhydrofolate,reductase],[is,activated], [the,E2F,transactivation,domain],1.0). interact(B-myb,[are,regulated],[the,repression,domain],0.6667). interact([Cyclin,E],[are,regulated],[the,repression,domain],0.6667). interact(E2F-1,[are,regulated],[the,repression,domain],0.6667). interact(E2F-2,[are,regulated],[the,repression,domain],0.6667). interact(Cdc2,[are,regulated],[the,repression,domain],0.6667).

Rules – Continued extractListStart([],_). extractListStart([X|Y],Part) :- conc([X],Part,Whole), extractComplex(Whole), extractListStart(Y,Part). extractComplex(S):- contains([W1,word([_],','),_,wor d([_],','),_],S), splitList([W1],S,S1,S2), S1=[], getList(S2,List), length(List,Len), ith(Len,List,Ele), sublist([Ele],S,Part1,Part2), extractListStart(List,Part2),!.

Snapshot of the system GUI

Sample Run of the system

Results Training dataTesting dataPrecision (%) Recall (%) Parts A, B 9888 Parts A, BPart C5743 Parts A, B, CPart E81.2571

Discussion Pronoun resolution done by hand. Dependent on skills of domain expert. Extension to other sentence structures will involve re-engineering rules. Recall tends to be higher as rules are manually created.

Automated Extraction System Automatic Integrated system for –Pronoun resolution. –Entity Tagging. –Complex sentence processing. –Interaction Extraction. Uses dependencies generated by the Link grammar parser Scalable Less error prone

Automated Approach to CSP

Anaphora Resolution –Identification of pairs of noun phrases that refer to each other. Need for Anaphora resolution –Entities are often referred to by pronouns and referent noun phrases. –Identifying the noun phrases referred to improves recall of the system.

Types of Anaphora in Literature

Pronoun Resolution Module Identifies and resolves pronominal anaphora. System written in XSB prolog. Interprolog Java API is used to integrate the module to the extraction system. Addresses third person pronouns such as it, they. Handles both singular and plural pronouns as well as reflexives (itself, themselves) and possessives (its, their).

Pronoun Resolution - Algorithm 1.Process sentences to get POS tags for the words. 2.Perform chunking on the words to form noun phrases. 3.Identify pronouns in the sentence 4.Remove redundant reflexives ( Gene A, by itself remains activated.) 5.Replace identified pronouns with noun phrases matching number and voice.

Pronoun Resolution: Walkthrough Ku loads onto dsDNA ends and it can diffuse along the DNA in an energy-independent manner. Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an energy-independent manner. When breast cancers were examined for NGAL mRNA and protein levels, they were found to exhibit heterogeneous expression. When breast cancers were examined for NGAL mRNA and protein levels, breast cancers were found to exhibit heterogeneous expression.

Pronoun Resolution: Example 2

Entity Tagging Combination of dictionary approach and heuristic rules. Dictionaries built from LocusLink, UMLS, GO and WordNet. Entities addressed by dictionaries - genes, proteins, chemicals, location and interaction words. Regular expression developed manually using the information from the dictionaries. Heuristic rules combine regular expression and POS tags to mark entities.

Gene and Protein Name Dictionary Source - NCBI’s LocusLink. Fields in LocusLink contain information about gene loci. –Official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers, UniGene clusters, homology, map locations, related web sites, etc. Perl parser written to identify gene and protein names and their aliases. Names and aliases maintained in a list loaded by the system during processing.

Gene and Protein Name Dictionary (a) LocusLink fields considered for extraction (b) Partial LocusLink record (c) Regular expression used for gene name matching

Gene and Protein Name Tagging Exact matches with dictionary Process of relaxation –Mark the entire noun group as a gene if any of the word has been identified as gene / protein name. ‘The c-abl kinase activity’ is a gene type Words that match a manually built regular expression. –POS tags are used to exclude false positives such as adverbs, modals, prepositions, etc. 508,477 entries in dictionary

Chemical and Location Names Used to tag location of interaction and agents in interaction for modifier analysis. Source - UMLS Metathesaurus and Semantic Network, GO. Location names dictionary contains the GO cellular components ontology. Semantic types identified from Semantic Network using the ‘isa’ relationship.

Chemical and Location Tagging Concepts categorized under the identified semantic types are added to the corresponding dictionaries. A perfect match with an entry in the dictionary is tagged as chemical name / cellular location. 790,000 chemical names and 118, 018 location names were identified.

Interaction Words Source – Verbs from UMLS Specialist Lexicon and WordNet. Interaction words are identified as –(Specialist verbs ∩ WordNet verbs) Missing interaction words are added manually. 1400 verbs in their stemmed form are stored in the list. Porter Stemming Algorithm allows matches across tense and number.

Complex Sentence Processor (CSP) Tool used – Link Grammar Parser (LGP) from CMU. Verb based extraction. –John played the pipes. Linkages returned by the link grammar used to identify syntactic constructs. Syntactic constructs associated with a verb form simple sentences. Internal clause format representation. Java system with a GUI.

Link Grammar Parser Based on the Link Grammar for English Syntax. Dependency based grammar system written in C. Every word in a sentence is connected to every other word directly or indirectly through links. Dictionary specifies linking requirements for each word. Linkages in a sentence should satisfy –Connectivity. –Planarity.

Linking Requirement Link Grammar Parser (LGP) trained on sentences from conversational English (70 telephone conversations). Linking requirements specified using ‘&’ and ‘or’ operators. the: ({AL-} & {@L+} & (D+ or DD+)) or DG+ or (TR- & U+); Each word should follow its linking requirement.

Linkage Words in sentence matched against dictionary –Linkage formed using the satisfied linking requirements. Null linkages – ambiguous / unknown words –LG ignores some words if an acceptable guess of the requirement is not found. Linkages returned in increasing order of cost.

Using Link Grammar Parser in CSP LGP considers each sentence as a having one verb. Provides sub-linkages for each clause in complex sentence. Sub linkages are combined to a single linkage using !union option. Combined linkage analyzed and split into clauses.

Combined sub-linkages - !union operator

Pre-processing Text for Link Grammar Gene names are not in the vocabulary of LG. –Force LGP to identify entities as nouns. –Capitalized and hyphenated phrases with entity tags. –E.g. ‘The C-abl protein kinase activity’ is converted to ‘Gene-The-C-abl-protein-kinase-activity’. Phrases / words in parentheses cause incorrect linkages. –Remove words within parentheses.

Preprocessing – removing parentheses Rapid degradation of Cyclin D1 requires phosphorylation at threonine-286 (kinase unknown, but not Cdk2 or Cdk4); degradation is by way of the ubiquitin-proteasome pathway Sentence crashes link grammar if processed as is, with parentheses.

Pre-processing Text for Link Grammar Windows version of Link Grammar Parser –Does not handle sentences of more than 70 words Mark the sentences of more than 70 words and discard them for further processing –Words of more than 50 characters also crash the LGP Mark sentences and discard them from further processing Sentences without two participants and one interaction word do not contain interactions –Discard sentences to improve processing time –Discarding is done after pronoun resolution so referrals are not disturbed.

Complex Sentence Processing Input –Tagged and pre-processed text. Output –Clauses from the sentence representing the simple sentences. Internal clause format representation Subject + Verb + Object (s) + Verb Modifying Phrases (s) Simple Sentences processed by Interaction Extractor module to get interactions from text.

Complex Sentence Processor- Goal Kohn-partC3|2|Another DMP1-regulated gene is CD13/aminopeptidase N, which is activated cooperatively by DMP1 and c-Myb; CD13/aminopeptidase N activation by DMP1 is inhibited by cyclin D, independent of Cdk4/6. Complex Sentence Processing Kohn-partC3|2|Gene-Another-DMP1-regulated-gene|is|Gene-CD13/aminopeptidase- N, which#|| Kohn-partC3|2|Gene-CD13/aminopeptidase-N |is activated||cooperatively#by Gene- DMP1 and Gene-c-Myb#| Kohn-partC3|2|Gene-N-activation by Gene-DMP1|is inhibited||by Gene-cyclin-D, independent of Gene-Cdk4/6#| Abstract ID | Sentence ID | Subject | Verb | Objects(s) | Verb Modifier(s)

Our Contribution: CSP Algorithm 1.Processing complex sentences - Set ranges in sentence 1.Identify the S links in the sentence. Each S link forms the beginning of a simple sentence. 2.Identify the Components from each S link and expand the range to include prepositional, adverbial and adjectival phrases. 2.Extract the components of the simple sentence from the original sentence using the ranges as the substring markers.

CSP - Illustration Upon growth factor stimulation of quiescent cells, p130 declines late in G1 and p130 is replaced by p107, which is absent in quiescent cells. Kohn-partE2|8|Upon growth factor stimulation of quiescent cells, Gene- p130 declines late in Gene-G1 and Gene-p130 is replaced by Gene- p107, which is absent in quiescent cells. | Entity Tagging and Pre-processing

Illustration of the CSP algorithm Kohn-partE2|8| upon growth factor stimulation of quiescent cells, Gene-p130 |declines || late # in Gene-G1 #| Kohn-partE2|8| Gene-p130 | is replaced || by Gene-p107 #| Kohn-partE2|8| Gene-p107 | is absent || in quiescent cells#|

CSP Example – Pronoun resolution, preprocessing

Experimental results System compared with two other systems –BioRAT [Corney et al, 2004] Uses manually generated sentence structure templates. Partial parses of sentences. –GeneWays [Rzhetsky et al, 2004] An improvement on GENIES system. Extraction rules are semantic grammars defined on the sentence structures Partial parses of the sentences.

Evaluation - BioRAT DIP entries with both participants present in SwissProt. 229 abstracts, 660 entries in DIP. Gene name dictionary restricted to SwissProt names to emulate the BioRAT system. 452 interactions obtained. Precision of 54.39% Recall of 19.98%. F-measure of 29.18%

Comparison with BioRAT Comparison with BioRAT and DIP was undertaken as a combined work with Toufeeq

Comparison with DIP Difference in Recall - the interaction is not mentioned in the abstract. (Manual and semi-automatic curation of entries in DIP)

Comparison with GeneWays 2500 interactions obtained from the authors corresponding to more than 10 citations in PubMed. 30 interactions with most citations were selected and PubMed abstracts (495) obtained through NCBI’s eutilities facility corresponding to the entities. Extraction system obtained 1485 interactions. 156 interactions occurred more than twice in the abstracts with Precision of 76.13 %. Recall performed on a single full text article quoted in GENIES. 42 interactions extracted from the article with a recall of 63.64%.

Comparison with GeneWays

Analysis of errors in comparison with GeneWays

Conclusion Extraction system is able to extract multiple interactions mentioned through complex sentence structure. Limitations and bottlenecks –Protein name identification. –Link Grammar Parsing problems. –Identification of relationships between clauses.

Thank you My advisors – Dr. Davulcu, Dr. Baral for their guidance and encouragement Yoga, for the periodic inputs and pointers Dr. David Corney, for sharing the data from BioRAT Dr. Andrew Rzhetsky for providing data from the GeneWays system Toufeeq, for sharing his code and for his contribution in evaluation.

Questions

Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand.

Similar presentations

Presentation on theme: "Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand.

Similar presentations

Presentation on theme: "Processing Complex Sentences for Information Extraction Deepthi Chidambaram December 22, 2004 BY 510 Committee Dr. Hasan Davulcu Dr. Chitta Baral Dr. Yoganand."— Presentation transcript:

Similar presentations

About project

Feedback