Information extraction from bioinformatics related documents

Introduction
Information extraction (IE) is the task of extracting structured information from unstructured and/or semi-structured machine-readable documents.
It processes human-language texts by means of NLP methods.
Input sources include text, images, audio and video.

Goals
Enable computation on previously unstructured data, e.g. extracting a structured acquisition event from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
Logical reasoning to draw inferences.
Text simplification.

Subtasks
Named entity extraction
Co-reference resolution
Relationship extraction
Language and vocabulary analysis
Audio extraction

Approaches
Hand-written regular expressions
Classifiers
  Generative: naïve Bayes
  Discriminative: maximum entropy models
Sequence models
  Hidden Markov model
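The hand-written-pattern approach can be illustrated on the acquisition sentence from the Goals slide. The regular expression and the company-suffix list (Inc/Corp/Ltd) below are illustrative assumptions, not a production pattern:

```python
import re

# Toy hand-written pattern for acquisition events of the form
# "<LOCATION> based <COMPANY> announced their acquisition of <COMPANY>".
# The pattern and the suffix list are illustrative assumptions.
ACQUISITION = re.compile(
    r"(?P<location>[A-Z][\w ]+?) based "
    r"(?P<buyer>[A-Z][\w ]*?(?:Inc|Corp|Ltd)\.?) announced "
    r"(?:their|its) acquisition of "
    r"(?P<target>[A-Z][\w ]*?(?:Inc|Corp|Ltd)\.?)"
)

def extract_acquisition(sentence):
    """Return a structured record (dict) or None when the pattern misses."""
    m = ACQUISITION.search(sentence)
    return m.groupdict() if m else None

record = extract_acquisition(
    "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
)
print(record)
```

Real systems layer many such patterns and fall back to statistical classifiers when the patterns miss, which is exactly the trade-off this slide's list of approaches reflects.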

Natural Language Processing (NLP)
Introduction: a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and natural languages.
Major focus: human-computer interaction (HCI), natural language understanding (NLU), natural language generation (NLG).

NLP Methods
Hand-written rules.
Statistical inference algorithms that produce models robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted).
The methods used are stochastic, probabilistic and statistical.
Methods for disambiguation often involve the use of corpora and Markov models.

Major tasks in NLP
Automatic summarization
Discourse analysis
Machine translation
Morphological segmentation
Named entity recognition (NER)
Natural language generation
Natural language understanding

Applications of NLP
Native language identification
Stemming
Text simplification
Text-to-speech
Text-proofing
Natural language search
Query expansion
Automated essay scoring
Truecasing

NLP Techniques for Bioinformatics
Biomedical text mining (BioNLP) refers to text mining applied to the texts and literature of the biomedical and molecular biology domain. It is a fairly recent research field at the intersection of NLP, bioinformatics, medical informatics and computational linguistics.

Motivation
There is growing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature, due to the increasing number of electronically available publications stored in databases such as PubMed.

General Framework of NLP
Worked example: "John runs."
Morphological and Lexical Processing: John run+s, with candidate tags John/P-N (proper noun), run/V, and +s as 3rd-person-present verb ending or plural-noun ending.
Syntactic Analysis: S → NP VP, with NP → P-N (John) and VP → V (run).
Semantic Analysis: Pred: RUN, Agent: John.
Context processing / Interpretation: resolve references across sentences, e.g. "John is a student. He runs."

General Framework of NLP (pipeline components, after Appelt 1999)
Morphological and Lexical Processing: tokenization, part-of-speech tagging, inflection/derivation, compounding, term recognition
Syntactic Analysis
Semantic Analysis
Context processing / Interpretation: domain analysis

Difficulties of NLP
(1) Robustness: incomplete knowledge at every stage of the framework
  Incomplete lexicons (Morphological and Lexical Processing): open-class words; terms (term recognition); named entities such as company names, locations and numerical expressions
  Incomplete grammar (Syntactic Analysis): limited syntactic coverage; domain-specific constructions; ungrammatical input
  Incomplete domain knowledge (Semantic Analysis / Interpretation): predefined aspects of information; interpretation rules
(2) Ambiguities: combinatorial explosion
  Most words in English are ambiguous in terms of their part of speech: runs (verb, 3rd-person present, or plural noun), clubs (likewise, and with two meanings)
  Structural ambiguities and predicate-argument ambiguities

Noun and verb extraction from textual documents
A number of methods exist for determining context, e.g. automatic topic detection / theme extraction: "what" is being discussed.
Nouns and noun phrases are used to define context.
Named entity recognition and extraction.
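A minimal sketch of spotting candidate noun phrases / named entities without a trained tagger, using the crude heuristic that mid-sentence runs of capitalized tokens are entity candidates. The function name and the heuristic itself are illustrative assumptions; real systems use a POS tagger or an NER model:

```python
import re

def candidate_entities(text):
    """Toy named-entity spotter: collect runs of capitalized tokens that do
    not start a sentence. A crude stand-in for real NER, for illustration."""
    entities = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = sentence.split()
        run = []
        for i, tok in enumerate(tokens):
            word = tok.strip(".,;:!?")
            if i > 0 and word[:1].isupper():
                run.append(word)            # extend the capitalized run
            else:
                if run:
                    entities.append(" ".join(run))
                run = []
        if run:
            entities.append(" ".join(run))
    return entities

print(candidate_entities("Yesterday Foo Inc. acquired Bar Corp. in New York."))
```

Note the deliberate limitations: sentence-initial capitals are skipped, and abbreviations like "Inc." confuse the naive sentence splitter; both are reasons real pipelines rely on trained taggers.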

WordNet for synonym finding
WordNet is a large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and lexical relations.
The resulting network of meaningfully related words and concepts can be navigated with the browser.
WordNet is freely and publicly available for download.
WordNet's structure makes it a useful tool for computational linguistics and NLP work.

WordNet's similarity to a thesaurus (words and meanings)
WordNet interlinks not just word forms (strings of letters) but specific senses of words.
As a result, words found in close proximity to one another in the network are semantically disambiguated.
WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.

CATEGORIZATION / CLASSIFICATION
Given: a description of an instance x ∈ X, where X is the instance language or instance space (e.g. a choice of how to represent text documents), and a fixed set of categories C = {c1, c2, …, cn}.
Determine: the category of x, c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

A GRAPHICAL VIEW OF TEXT CLASSIFICATION
[Diagram: documents mapped into labeled topic regions such as NLP, Graphics, AI, Theory, Arch.]

TEXT CLASSIFICATION
An example spam message:
"This concerns you as a patient. Our medical records indicate you have had a history of illness. We are now encouraging all our patients to use this highly effective and safe solution. Proven worldwide, feel free to read the many reports on our site from the BBC & ABC News. We highly recommend you try this Anti-Microbial Peptide as soon as possible since its world supply is limited. The results will show quickly. Regards, http://www.superbiograde.us/bkhog/"
Spam makes up roughly 85% of all email!

EXAMPLES OF TEXT CATEGORIZATION
LABELS = BINARY: "spam" / "not spam"
LABELS = TOPICS: "finance" / "sports" / "asia"
LABELS = OPINION: "like" / "hate" / "neutral"
LABELS = AUTHOR: "Shakespeare" / "Marlowe" / "Ben Jonson" (the Federalist papers)

Methods (1)
Manual classification
  Used by Yahoo!, Looksmart, about.com, ODP, Medline
  Very accurate when the job is done by experts
  Consistent when the problem size and team are small
  Difficult and expensive to scale
Automatic document classification
  Hand-coded rule-based systems: Reuters, CIA, Verity, …
  Commercial systems have complex query languages (everything in IR query languages plus accumulators)

Methods (2)
Supervised learning of a document-label assignment function: Autonomy, Kana, MSN, Verity, …
  Naive Bayes (simple, common method)
  k-Nearest Neighbors (simple, powerful)
  Support-vector machines (newer, more powerful)
  … plus many other methods
No free lunch: requires hand-classified training data.
But the training data can be built (and refined) by amateurs.

Bayesian Methods
Learning and classification methods based on probability theory (see spelling / POS).
Bayes' theorem plays a critical role.
Build a generative model that approximates how the data is produced.
Use the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes’ Rule
P(h | D) = P(D | h) P(h) / P(D)
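A worked numeric application of Bayes' rule to a single spam-indicative word; all the probabilities below are made-up assumptions for illustration:

```python
# Assumed prior P(spam) = 0.2 and likelihoods of the word "free":
# P("free" | spam) = 0.6, P("free" | ham) = 0.1. These numbers are invented.
p_spam, p_ham = 0.2, 0.8
p_free_given_spam, p_free_given_ham = 0.6, 0.1

# P(spam | "free") = P("free" | spam) P(spam) / P("free"),
# where P("free") is obtained by summing over both classes.
evidence = p_free_given_spam * p_spam + p_free_given_ham * p_ham
posterior = p_free_given_spam * p_spam / evidence
print(round(posterior, 3))  # 0.12 / 0.20 = 0.6
```

Even with a low prior on spam, a word six times more likely under spam than under ham triples the posterior relative to the prior.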

Maximum a posteriori Hypothesis
h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h)
(the P(D) denominator is the same for every h, so it can be dropped)

Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:
h_ML = argmax_h P(D | h)

Naive Bayes Classifiers
Task: classify a new instance, described by a tuple of attribute values (x1, x2, …, xn), into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn) = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)

Naïve Bayes Classifier: Assumptions
P(cj) can be estimated from the frequency of classes in the training examples.
P(x1, x2, …, xn | cj) would need a very, very large number of training examples to estimate directly.
Conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

The Naïve Bayes Classifier
[Diagram: class node Flu with feature nodes X1 … X5 = fever, sinus, cough, runny nose, muscle ache]
Conditional independence assumption: features are independent of each other given the class:
P(X1, …, X5 | C) = ∏i P(Xi | C)

Learning the Model
Common practice: maximum likelihood, i.e. simply use the frequencies in the data:
P̂(cj) = N(C = cj) / N and P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
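The training and prediction steps above can be sketched as a minimal multinomial naive Bayes classifier with add-one (Laplace) smoothing. The toy corpus and function names are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Collect class and per-class word counts."""
    class_count = Counter()
    word_count = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_count[label] += 1
        word_count[label].update(tokens)
        vocab.update(tokens)
    return class_count, word_count, vocab

def predict_nb(model, tokens):
    """Pick argmax_c log P(c) + sum_t log P(t | c), with add-one smoothing."""
    class_count, word_count, vocab = model
    n_docs = sum(class_count.values())
    best, best_lp = None, float("-inf")
    for c in class_count:
        lp = math.log(class_count[c] / n_docs)          # log prior
        total = sum(word_count[c].values())
        for t in tokens:
            # Laplace-smoothed log likelihood; unseen words get count 0 + 1
            lp += math.log((word_count[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills free offer".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("project meeting notes".split(), "ham")]
model = train_nb(docs)
print(predict_nb(model, "cheap pills offer".split()))  # → spam
```

Working in log space avoids floating-point underflow when the product runs over many terms, which is why the code sums logs rather than multiplying probabilities.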

Feature selection via Mutual Information
We might not want to use all words, but just reliable, good discriminators.
From the training set, choose the k words which best discriminate the categories.
One way is in terms of mutual information, computed for each word w and each category c:
I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]
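A sketch of this mutual-information score computed from a 2×2 document contingency table; the count layout (n11 … n00, mirroring the F11 … F00 counts used for chi-square later) is an assumption for illustration:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between word occurrence and class membership from document counts.
    n11: docs in class containing the word, n10: docs outside the class
    containing the word, n01: in class without the word, n00: neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each triple is (joint count, word-marginal count, class-marginal count).
    for n_wc, n_w, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_wc:  # 0 * log 0 is taken as 0
            mi += (n_wc / n) * math.log2(n * n_wc / (n_w * n_c))
    return mi

# A word concentrated in one class scores high; an evenly spread word scores 0.
print(mutual_information(40, 10, 10, 40), mutual_information(25, 25, 25, 25))
```

Ranking the vocabulary by this score and keeping the top k words per category implements the selection step described on the slide.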

OTHER APPROACHES TO FEATURE SELECTION
t-test
Chi-square
tf·idf (cf. the IR lectures)
Yang & Pedersen 1997: eliminating features leads to improved performance.

Tf·idf
Computed over the term-document matrix: tfidf(t, d) = tf(t, d) · idf(t), where
tf(t, d) = N_{t,d} / Σ_{t'} N_{t',d}, with N_{t,d} the number of occurrences of a term t in a document d, and the denominator the sum of occurrences of all terms in that document d;
idf(t) = log( N / W(t) ), where N is the number of documents and W(t) is the number of documents containing the term t.
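The tf·idf definition above can be sketched directly; the toy three-document corpus and the function name are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tf = occurrences of term in doc / doc length,
    idf = log(N / number of docs containing the term)."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["gene", "expression", "analysis"],
          ["protein", "expression", "data"],
          ["gene", "sequence", "alignment"]]
# "gene" occurs in 2 of 3 docs, "analysis" in only 1 of 3,
# so "analysis" is weighted as the more specific term for corpus[0].
doc = corpus[0]
print(tfidf("gene", doc, corpus), tfidf("analysis", doc, corpus))
```

This is why common corpus-wide words get low weights even when they are frequent in a document: the idf factor shrinks toward log(1) = 0 as W(t) approaches N.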

Chi-Square Statistics
Used to evaluate the independence between two events.
The relevance of a term t in a class c can be estimated from four document counts:
F11: # documents belonging to c and containing t
F10: # documents not in c but containing t
F01: # documents belonging to c but not containing t
F00: # documents not in c and not containing t
χ²(t, c) = N (F11·F00 − F10·F01)² / [ (F11+F10)(F01+F00)(F11+F01)(F10+F00) ], where N = F11+F10+F01+F00.
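Using the four counts F11 … F00 described above, the score can be sketched with the standard 2×2 contingency-table chi-square formula (stated here as an assumption, since the slide gives only the count definitions):

```python
def chi_square(f11, f10, f01, f00):
    """Chi-square score of term t for class c from the four document counts:
    f11 = in c with t, f10 = not in c with t,
    f01 = in c without t, f00 = not in c without t."""
    n = f11 + f10 + f01 + f00
    num = n * (f11 * f00 - f10 * f01) ** 2
    den = (f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00)
    return num / den if den else 0.0

# A term concentrated in class c scores high; an evenly spread term scores 0.
print(chi_square(45, 5, 5, 45), chi_square(25, 25, 25, 25))
```

As with mutual information, terms are ranked by this score per class and the top-ranked ones are kept as features.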

MAP Estimates
Classification task: decide which class to choose by measuring the importance of a term t for a class c.
The term estimate uses N_{t|c} and N_t, the numbers of occurrences of term t in the class c and in the entire corpus respectively, together with N_c, the number of distinct classes.
The class prior uses N_{d|c}, the number of documents in the given class c, and N_d, the total number of documents.
Note that α1 and α2 are smoothing parameters that are typically determined empirically.

OTHER CLASSIFICATION METHODS
k-NN
Decision trees
Logistic regression
Support vector machines
Cf. Manning 8