Disambiguating proteins, genes, and RNA in text: a machine learning approach
Hatzivassiloglou V., Duboue P., Rzhetsky A. (2001). Bioinformatics 17:S97-S106
CIS 630 Advanced Topics in NLP, Sharon Diskin
INTRODUCTION
Introduction: Context
GeneWays: under development at Columbia University and Queens College
- extraction and simulation of signal transduction pathways
NLP component of GeneWays:
- preprocessing
- tagging of genes, proteins, and RNA
- disambiguation of entities
- pattern recognition based on entities and a lexical grammar
- relationship extraction
(Friedman et al., 2001, Bioinformatics)
Introduction: Motivation
Genes, proteins, and RNAs often share the same name in biological databases and the literature.
- "By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo." (Protein)
- "The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3' UTR truncated at the polyadenylation site)." (Gene)
To reason about relationships between genes, proteins, and RNAs, we must be able to disambiguate them, e.g. proteins activate genes (genes do not activate proteins).
Introduction: Approach
Other methods used for word sense disambiguation:
- Model the context of each ambiguous word as a vector of neighboring words (Brown et al., 1991; Gale et al., 1992)
  - accuracy of 65-92%, depending on the word being disambiguated and its alternate senses
  - drawback: requires labeled data
- Others have suggested ways of avoiding annotation:
  - bootstrapping (Hearst, 1991)
  - use of parallel texts (Dagan and Itai, 1994)
  - constructing pseudo-words for training (Gale et al., 1992)
  - use of contextual evidence indicating that some unlabeled terms belong to a particular class (Yarowsky, 1995)
Approach taken: follow Yarowsky in observing that many genes, proteins, and RNAs are disambiguated by the word "gene", "protein", or "mRNA" appearing immediately after them, and assume that other words in the vicinity are also indicative of the true class (see the sketch below).
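To make the local-disambiguation idea concrete, here is a minimal sketch, assuming tokenized sentences and pre-identified term positions; the function name, label strings, and token format are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the "locally disambiguated" labeling idea: if a recognized term is
# immediately followed by a disambiguating word, assign that class and use the
# occurrence as (masked) training data. Names and formats are illustrative.

DISAMBIGUATORS = {"gene": "GENE", "protein": "PROTEIN", "mrna": "MRNA"}

def label_occurrences(tokens, term_positions):
    """tokens: list of word strings; term_positions: indices of recognized terms.
    Returns (position, class) pairs for occurrences disambiguated by the
    word that follows them."""
    labeled = []
    for i in term_positions:
        if i + 1 < len(tokens):
            nxt = tokens[i + 1].lower()
            if nxt in DISAMBIGUATORS:
                labeled.append((i, DISAMBIGUATORS[nxt]))
    return labeled

tokens = "the SBP2 gene encodes a 3173 nt transcript".split()
print(label_occurrences(tokens, term_positions=[1]))   # [(1, 'GENE')]
```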
METHODS
Methods: Data Collection & Annotation
- Automatic download of HTML, converted to XML with HTML::Parse and the LT XML tools (Brew et al., 2000)
- Section boundaries detected from HTML formatting
- Sentence detection: MXTerminator (Reynar and Ratnaparkhi, 1997)
- Tokenizer: pattern-matching finite-state automata
- POS tagger: statistical method (Brill, 1992)
- Term tagging of genes, proteins, and mRNAs (see the sketch after this list):
  - simple lookup method using GenBank (>200k names)
  - single-word names that conflict with English words are excluded (0.9%)
  - single-word names embedded in multi-word tokens are identified, e.g. "gp41-mediated"
- Two categories of tagged occurrences:
  - disambiguated: used for training and also for evaluation (with the disambiguating word masked)
  - not disambiguated: used for testing/evaluation only
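As a rough illustration of the lookup-based term tagging, here is a sketch assuming a tiny stand-in lexicon and whitespace tokenization; the hyphen handling for cases like "gp41-mediated" is a guess at the general idea, not the paper's exact procedure.

```python
import re

# Dictionary-based term tagging against a name list (stand-in for >200k
# GenBank names). Lexicon, tokenization, and hyphen handling are assumptions.
LEXICON = {"sbp2", "gp41", "p53"}

def tag_terms(sentence):
    tagged = []
    for token in sentence.split():
        word = token.strip(".,;()").lower()
        if word in LEXICON:
            tagged.append((token, "TERM"))
        else:
            # catch names embedded in multi-word tokens, e.g. "gp41-mediated"
            head = re.split(r"[-/]", word)[0]
            tagged.append((token, "TERM" if head in LEXICON else "O"))
    return tagged

print(tag_terms("gp41-mediated fusion requires SBP2 binding"))
```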
Methods: Example Annotation
Methods: Feature Definitions
Three general ways of defining features, differing in how much positional information they include (see the sketch below):
- Bag of words (no positional information): collect all words near the term into a single vector of counts
- Two bags of words (some positional information): separate the nearby words into two bags, those before the term and those after
- Words annotated with distance from the term (complete positional information): separate counts for each position
Morphological, distributional, and shallow syntactic information is also used:
- Capitalization
- Part of speech
- Stopwords and similarly distributed words: words not in the stop list but equally distributed among the classes are identified, and those for which a chi-squared test contrasting their distributions across the alternative classes does not yield a statistically significant difference at the 0.05 level are removed
- Stemming: e.g. "phosphorylate" and "phosphorylation" are treated as the same feature
Neighborhood of a term: N words to the left and N words to the right.
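A minimal sketch of the three feature definitions, assuming a tokenized sentence and a known term position; stemming, stopword and chi-squared filtering, and POS tags are left out, and the feature-name prefixes are just one way to encode the positional variants.

```python
from collections import Counter

def context_features(tokens, term_index, n, mode="bag"):
    # context = the N words to the left and right of the term occurrence
    left = tokens[max(0, term_index - n):term_index]
    right = tokens[term_index + 1:term_index + 1 + n]
    if mode == "bag":                       # no positional information
        return Counter(w.lower() for w in left + right)
    if mode == "two_bags":                  # before/after the term
        return Counter([f"L:{w.lower()}" for w in left] +
                       [f"R:{w.lower()}" for w in right])
    if mode == "positional":                # full distance annotation
        feats = [f"-{len(left) - i}:{w.lower()}" for i, w in enumerate(left)]
        feats += [f"+{i + 1}:{w.lower()}" for i, w in enumerate(right)]
        return Counter(feats)
    raise ValueError(mode)

tokens = "we show that SBP2 specifically binds selenoprotein mRNAs".split()
print(context_features(tokens, term_index=3, n=2, mode="two_bags"))
```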
Methods: Naïve Bayes
Goal: assign to a term the class c that maximizes P(c|ξ), where ξ is the evidence available to the learning algorithm for that occurrence (i.e., the features f1..fk in the term's context).
By Bayes' rule:
P(c|ξ) = P(ξ|c) P(c) / P(ξ) = P(f1, f2, ..., fk|c) P(c) / P(f1, f2, ..., fk)
where P(c) is the prior probability of class c and P(ξ) is the prior probability of the evidence (constant for all classes).
Since P(ξ) is constant for all c, it suffices to maximize P(f1, f2, ..., fk|c) P(c).
Naïve Bayes makes a strong independence assumption, so we maximize P(c) Πi P(fi|c), i = 1..k.
(The actual calculations are carried out on a log scale.)
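A minimal sketch of the Naïve Bayes decision rule on a log scale; the add-one smoothing and the bag-of-counts input format are assumptions for the sketch, not details given on the slide.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, feature_bags, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)   # per-class feature counts
        self.vocab = set()
        for bag, c in zip(feature_bags, labels):
            self.feat_counts[c].update(bag)
            self.vocab.update(bag)
        return self

    def predict(self, bag):
        total = sum(self.class_counts.values())
        best_c, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / total)                  # log P(c)
            denom = sum(self.feat_counts[c].values()) + len(self.vocab)
            for f, cnt in bag.items():
                p = (self.feat_counts[c][f] + 1) / denom   # smoothed P(f|c)
                score += cnt * math.log(p)
            if score > best_score:
                best_c, best_score = c, score
        return best_c

nb = NaiveBayes().fit([Counter({"binds": 1}), Counter({"encodes": 1})],
                      ["PROTEIN", "GENE"])
print(nb.predict(Counter({"binds": 1})))                   # PROTEIN
```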
Methods: Decision Trees
Goal: build a decision tree using information gain, such that:
- each node corresponds to a feature and each arc to a possible value of that feature
- a leaf of the tree specifies the expected class for the feature vectors described by the path from the root to that leaf
- each node represents the feature that is most informative among the attributes not yet considered on the path from the root; nodes are selected by information gain
Classifying test instances: follow the path of the decision tree from the root to a leaf according to the feature vector of the test instance.
In general, given a probability distribution P = (p1, p2, ..., pn), the information conveyed by this distribution is:
I(P) = -(p1*log(p1) + p2*log(p2) + ... + pn*log(pn))
For example, if P is (0.5, 0.5) then I(P) is 1; if P is (0.67, 0.33) then I(P) is 0.92; if P is (1, 0) then I(P) is 0. The more uniform the probability distribution, the greater its information/entropy. (A sketch of the calculations follows.)
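A minimal sketch of the entropy and information-gain calculations used to pick the most informative feature at a node; the (features, label) row representation is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    # information conveyed by the class distribution of `labels`
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature):
    # rows: list of (feature_dict, label); gain = entropy before - after split
    before = entropy([label for _, label in rows])
    splits = Counter(feats[feature] for feats, _ in rows)
    after = 0.0
    for value, count in splits.items():
        subset = [label for feats, label in rows if feats[feature] == value]
        after += (count / len(rows)) * entropy(subset)
    return before - after

print(entropy(["yes", "no"]))          # 1.0
print(entropy(["yes", "yes", "no"]))   # ~0.918
```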
Methods: Decision Tree Example
Weather data example: given the weather, will you go out and play?
Instances a-n are described by Outlook {Sunny, Overcast, Rainy}, Temperature {Hot, Mild, Cool}, Humidity {High, Normal}, and Windy {False, True}, with the class Play? {No, Yes}.
Methods: Decision Tree Example
[Decision tree for the weather data: Outlook at the root; Overcast leads to Yes; Sunny leads to a Humidity test (High: No, Normal: Yes); Rainy leads to a Windy test (False: Yes, True: No).]
Methods: Inductive Rule Learning
- Experimented with the RIPPER implementation (Cohen, 1996)
- Rules involving tests on features are iteratively constructed; these rules:
  - map a particular combination of features to a class label
  - are applied sequentially during prediction (rules in which the system has the highest confidence are applied first; see the sketch below)
- Negative information (i.e., that a feature does NOT appear near a term) is made explicit in the rules
- Features are treated as sets
- The decision of which feature to use when building rules appears to be based on entropy
- A decision tree can be converted to a similar rule list by tracing the path from the root to each leaf (Fig. 2)
- Rule 408: references to genes are often followed by a citation to the work of other researchers
- Rule 530: captures the fact that genes encode information
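To show how an ordered rule list is applied at prediction time, here is a toy sketch; the two rules only paraphrase the flavor of rules 408 and 530 above and are not the rules RIPPER actually learned.

```python
# First rule whose conditions hold determines the class; a default class is
# used when no rule fires. Rules and default are illustrative assumptions.
RULES = [
    (lambda feats: "encodes" in feats, "GENE"),   # cf. "genes encode information"
    (lambda feats: "binds" in feats, "PROTEIN"),
]
DEFAULT = "PROTEIN"

def classify(feats, rules=RULES, default=DEFAULT):
    for condition, label in rules:
        if condition(feats):
            return label
    return default

print(classify({"encodes", "information"}))   # GENE
print(classify({"phosphorylation"}))          # PROTEIN (default)
```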
Methods: Experimental Design
- The optimal number of context words N is chosen by holding out 1/10 of the data and performing 10-fold cross-validation for N = 2..35
- Training and evaluation are performed using 10-fold cross-validation on the remaining 9/10 of the data (see the sketch below)
- Performance measures:
  - overall: accuracy rate
  - specific classes: precision, recall, specificity, sensitivity, F-measure
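A sketch of this two-stage design, assuming examples are (features, label) pairs; the majority-class scorer is only a stand-in for training the actual classifier with a given context width N.

```python
import random
from collections import Counter

def majority_class_accuracy(train, test, n_context):
    # stand-in for "train the classifier on features from a +/- n_context
    # window and return its accuracy on the held-out fold"
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

def cross_val_accuracy(examples, n_context, score_fn, folds=10, seed=0):
    data = examples[:]
    random.Random(seed).shuffle(data)
    scores = []
    for k in range(folds):
        test = data[k::folds]
        train = [x for i, x in enumerate(data) if i % folds != k]
        scores.append(score_fn(train, test, n_context))
    return sum(scores) / folds

def choose_context_width(examples, score_fn):
    # hold out 1/10 to pick N, then evaluate the chosen N on the rest
    cut = len(examples) // 10
    tuning, rest = examples[:cut], examples[cut:]
    best_n = max(range(2, 36),
                 key=lambda n: cross_val_accuracy(tuning, n, score_fn))
    return best_n, cross_val_accuracy(rest, best_n, score_fn)

# usage: best_n, acc = choose_context_width(examples, majority_class_accuracy)
```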
RESULTS AND EVALUATION
Results: Algorithm Assessment
Performance is measured using the accuracy rate on:
- close to 10,000 locally disambiguated cases (automatically labeled)
- 550 manually labeled ambiguous occurrences
  - the distributional characteristics of words in the context of ambiguous occurrences may differ from those found in locally disambiguated occurrences
Two target data sets:
- two-way: classification of genes and proteins (ignoring mRNA)
- three-way: classification of genes, proteins, and mRNA
The learning methods achieved comparable accuracy; Naïve Bayes was chosen for the remaining experiments for efficiency reasons (faster training and faster predictions).
Results: Feature Definition Assessment
Positional information:
- full positional information universally lowered accuracy (by up to 6%)
- before/after positional information slightly decreased accuracy (1-1.5% on average)
- possibly due to sparse data when the same word maps to different features; conditional use of positional information according to word frequency is worth considering
Capitalization:
- mapping all words to lower case did not alter performance
- included in the final system (reduces the feature count, enhances speed, lowers memory use)
Part of speech:
- generally helped overall accuracy, but by less than ~1% on average
- likely due to the technical nature of the domain (less ambiguity in non-terms)
- included in the final system
Results: Feature Definition Assessment
Similarly distributed words:
- eliminating these words had a slight negative effect ( %)
- not included in the final version of the feature definition
- however, it significantly reduces the number of features that must be considered (from 25,000 to 5,000)
- a promising approach if a 5-fold gain in computational efficiency is worth a slight hit in performance (a sketch of the chi-squared filter follows)
Stopwords:
- useful both for increasing performance ( %) and for reducing the feature space
- included in the final feature definition
Stemming:
- increased accuracy by 0.4% on average
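A dependency-free sketch of the chi-squared filtering of similarly distributed words, restricted to the two-class case (one degree of freedom, 0.05 critical value 3.841); the count bookkeeping is an assumption for illustration.

```python
CHI2_CRIT_1DF_05 = 3.841   # 0.05 critical value, 1 degree of freedom

def chi_squared_2class(n_a, n_b, total_a, total_b):
    """Chi-squared statistic for one word's counts (n_a, n_b) against the
    overall class totals (total_a, total_b), i.e. a 2x2 contingency table."""
    table = [[n_a, n_b], [total_a - n_a, total_b - n_b]]
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    grand = sum(row_sums)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def keep_word(n_a, n_b, total_a, total_b):
    # keep only words whose class distribution differs significantly
    return chi_squared_2class(n_a, n_b, total_a, total_b) > CHI2_CRIT_1DF_05

print(keep_word(40, 5, 1000, 1000))    # True: clearly skewed toward one class
print(keep_word(22, 20, 1000, 1000))   # False: similarly distributed
```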
Results: Performance with final features
Overall accuracy (on non-ambiguous terms in the test set):
- 84.48% for the two-way task
- 78.11% for the three-way task
Detailed evaluation scores for particular classes (defined in the sketch below):
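For reference, the per-class scores mentioned above, computed for one class against the rest; treating the F-measure as the balanced F1 is an assumption, and the tp/fp/fn/tn values in the example are made up.

```python
def per_class_scores(tp, fp, fn, tn):
    # one-vs-rest counts for a single class
    precision = tp / (tp + fp)
    recall = sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "sensitivity": sensitivity, "specificity": specificity,
            "F-measure": f_measure}

print(per_class_scores(tp=80, fp=20, fn=10, tn=90))
```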
Results: Performance with final features
Manually reviewed (by three experts): 15 articles, 75 extracted paragraphs, 550 tagged terms (using GenBank)
Average pairwise agreement among the experts: 77.58%
Baseline: always assign "protein" (since protein is the most common class)
Results: Performance with final features
Performance is significantly lower (~15%) on the manually labeled data than on the automatically disambiguated data; two possible explanations the authors offer:
- the humans are correct: perhaps the selected cases are harder to classify, or involve inconsistencies not handled by the system (e.g. tRNA tagged as mRNA)
- the system is more consistent and is being penalized when it actually makes correct decisions
  - support: the system does better against non-ambiguous terms than the humans do against each other
Future plan: mask the disambiguating terms (e.g. "gene"), have the experts classify the terms, and compare disambiguation performance.
Baseline: always assign "protein" (since protein is the most common class).
Presenter's note: there seems to be a third possible explanation. Training used 95% locally disambiguated cases and only 5% manually labeled ones; perhaps the 5% of manually disambiguated cases is not enough to overcome the different distribution of words surrounding these two categories of terms.
CONCLUSION
Conclusion
- Automatically annotated a training set based on contextual information
- Performed a large-scale evaluation (9-million-word corpus)
- Performed a manual evaluation
- Optimized accuracy and computational efficiency
- Achieved high levels of accuracy
  - within the range of statistical sense disambiguation applications
  - near the human agreement rate when evaluated against humans
Presenter's note: this seems extremely useful, but I have to wonder whether the accuracy measures are inflated by the use of "locally disambiguated" terms, since 95% of the data were locally disambiguated (automatic) and only 5% were manually labeled/disambiguated. It could still be that the distributional context of words differs between locally disambiguated terms and ambiguous terms, which is another explanation for the discrepancy between the non-ambiguous and manually labeled results.
Some Additional Thoughts
The approach appears very successful. Some open questions:
- Curious about the exact reason for the performance decrease when compared against manual annotation
  - what about the fact that the training data is 95% locally disambiguated? the word distribution could still be different for these cases
  - it would be interesting to know the results of their future plan to have experts annotate the locally disambiguated terms
- Wonder why they didn't consider using maximum entropy (as opposed to including both decision trees and inductive rule learning)
References
Friedman C., Kra P., Yu H., Krauthammer M., Rzhetsky A. (2001). GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17:S74-S82.
Cohen W. (1996). Learning trees and rules with set-valued features. In Proc. 14th AAAI.
Dagan I., Itai A. (1994). Word sense disambiguation using a second language monolingual corpus. Comput. Linguist. 20(4).
Hatzivassiloglou V., Duboue P., Rzhetsky A. (2001). Disambiguating proteins, genes and RNA in text: a machine learning approach. Bioinformatics 17:S97-S106.
Hearst M.A. (1991). Noun homograph disambiguation using local context in large text corpora. In Using Corpora, University of Waterloo.
Yarowsky D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proc. 33rd ACL.