
1 Information extraction from bioinformatics-related documents

2 Introduction
Extracting structured information from unstructured and/or semi-structured machine-readable documents. Processing human-language texts by means of NLP methods. Sources include text, images, audio, and video.

3 Goals
Computation on previously unstructured data, e.g. extracting an acquisition event from an online news sentence such as: "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp." Logical reasoning to draw inferences. Text simplification.

4 Subtasks
Named entity recognition. Coreference resolution. Relationship extraction. Language and vocabulary analysis. Terminology extraction. Audio extraction.

5 Approaches
Hand-written regular expressions. Classifiers: generative (naïve Bayes), discriminative (maximum entropy models). Sequence models: hidden Markov models.
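The first approach on the slide, hand-written regular expressions, can be illustrated with the acquisition sentence from the Goals slide. The pattern below is a toy rule written for this sketch, not a production extraction rule.

```python
import re

# Hand-written rule: extract an (acquirer, acquired) pair from an
# acquisition sentence. Company names are approximated as capitalized
# words ending in Inc/Corp/Ltd -- an illustrative assumption.
ACQUISITION = re.compile(
    r"(?P<acquirer>[A-Z]\w+(?:\s[A-Z]\w+)*\s(?:Inc|Corp|Ltd)\.?)"
    r".*?acquisition of\s"
    r"(?P<acquired>[A-Z]\w+(?:\s[A-Z]\w+)*\s(?:Inc|Corp|Ltd)\.?)"
)

def extract_acquisition(sentence):
    """Return (acquirer, acquired) or None if the pattern does not match."""
    m = ACQUISITION.search(sentence)
    return (m.group("acquirer"), m.group("acquired")) if m else None

sentence = ("Yesterday, New York based Foo Inc. announced "
            "their acquisition of Bar Corp.")
print(extract_acquisition(sentence))  # ('Foo Inc.', 'Bar Corp.')
```

Rules like this are precise on the sentences they anticipate but brittle elsewhere, which is exactly the trade-off that motivates the classifier and sequence-model approaches listed above.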

6 Natural Language Processing (NLP): Introduction
A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and natural languages. Major focus: human-computer interaction (HCI), natural language understanding (NLU), and natural language generation (NLG).

7 NLP Methods
Hand-written rules. Statistical inference algorithms that produce models robust to unfamiliar input (e.g., words or structures not seen before) and to erroneous input (e.g., misspelled or accidentally omitted words). The methods used are stochastic, probabilistic, and statistical; methods for disambiguation often involve the use of corpora and Markov models.

8 Major tasks in NLP Automatic summarization Discourse analysis
Machine translation Morphological segmentation Named entity recognition (NER) Natural language generation Natural language understanding

9 Applications of NLP Native Language Identification Stemming
Text simplification Text-to-speech Text-proofing Natural language search Query expansion Automated essay scoring Truecasing

10 NLP Techniques for Bioinformatics
Biomedical text mining (BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain. It is a relatively recent research field at the intersection of NLP, bioinformatics, medical informatics, and computational linguistics.

11 Motivation There is increasing interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature, driven by the growing number of electronically available publications stored in databases such as PubMed.


13 General Framework of NLP
Input sentence: "John runs." Processing pipeline: Morphological and Lexical Processing -> Syntactic Analysis -> Semantic Analysis -> Context Processing / Interpretation.

14 General Framework of NLP
Morphological and lexical processing: "John runs." is analyzed as John run+s (John: P-N, proper noun; run+s: V, present tense, or N, plural).

15 General Framework of NLP
Syntactic analysis: S -> NP VP, with NP -> P-N (John) and VP -> V (run).

16 General Framework of NLP
Semantic analysis: predicate-argument structure Pred: RUN, Agent: John.

17 General Framework of NLP
Context processing / interpretation: "John is a student. He runs." (the pronoun "He" is resolved to John).

18 General Framework of NLP
Stage components: Morphological and Lexical Processing (tokenization, part-of-speech tagging, inflection/derivation, compounding, term recognition); Syntactic Analysis; Semantic Analysis; Context Processing / Interpretation (domain analysis). (Appelt, 1999)

19 Difficulties of NLP
(1) Robustness: incomplete knowledge, at every stage of the framework (morphological and lexical processing, syntactic analysis, semantic analysis, context processing / interpretation).

20 Difficulties of NLP
Incomplete lexicons (morphological and lexical processing): open-class words; terms (term recognition); named entities such as company names, locations, and numerical expressions.

21 Difficulties of NLP
Incomplete grammar (syntactic analysis): limited syntactic coverage; domain-specific constructions; ungrammatical input.

22 Difficulties of NLP
Incomplete domain knowledge (semantic analysis and interpretation): only predefined aspects of information are captured; interpretation rules are incomplete.

23 Difficulties of NLP
(2) Ambiguities: combinatorial explosion of candidate analyses across the pipeline.

24 Difficulties of NLP
Most words in English are ambiguous in their part of speech: runs (verb, 3rd-person present / noun, plural); clubs (verb, 3rd-person present / noun, plural, with two noun meanings).

25 Difficulties of NLP
Structural ambiguities and predicate-argument ambiguities (syntactic and semantic analysis).

26 Noun and verb extraction from textual documents
A number of methods exist for determining context, including automatic topic detection / theme extraction: identifying "what" is being discussed. Nouns and noun phrases are used to define context. Named entity recognition and extraction.
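The noun-based view of context can be sketched in a few lines. Real systems use a trained part-of-speech tagger (e.g. NLTK or spaCy); the tiny hand-made lexicon below is an assumption purely for illustration.

```python
# Toy POS lexicon standing in for a trained tagger (an assumption).
LEXICON = {
    "the": "DET", "a": "DET",
    "gene": "N", "protein": "N", "expression": "N", "cell": "N",
    "regulates": "V", "binds": "V",
    "in": "P",
}

def extract_nouns(sentence):
    """Return the nouns in a sentence, using the toy lexicon as tagger."""
    tokens = sentence.lower().rstrip(".").split()
    return [t for t in tokens if LEXICON.get(t) == "N"]

# The extracted nouns define "what" the sentence discusses.
print(extract_nouns("The protein regulates gene expression in the cell."))
```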

27 WordNet for synonym finding
WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with a browser. WordNet is freely and publicly available for download, and its structure makes it a useful tool for computational linguistics and NLP work.
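WordNet's organization, senses grouped into synsets and synsets linked by semantic relations, can be illustrated with a tiny hand-made fragment. The three synsets below are invented toy data, not real WordNet entries (the real database is queried via e.g. `nltk.corpus.wordnet`).

```python
# Toy fragment mimicking WordNet's structure: each synset groups the lemmas
# of one sense and links to its hypernym (the is-a relation).
SYNSETS = {
    "car.n.01":      {"lemmas": ["car", "auto", "automobile"], "hypernym": "vehicle.n.01"},
    "vehicle.n.01":  {"lemmas": ["vehicle"], "hypernym": "artifact.n.01"},
    "artifact.n.01": {"lemmas": ["artifact", "artefact"], "hypernym": None},
}

def synonyms(word):
    """All lemmas sharing a synset with `word` (the thesaurus-like view)."""
    return sorted({l for s in SYNSETS.values() if word in s["lemmas"]
                   for l in s["lemmas"]} - {word})

def hypernym_chain(synset_id):
    """Walk the is-a links: the kind of navigation a thesaurus cannot offer."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return chain

print(synonyms("car"))            # ['auto', 'automobile']
print(hypernym_chain("car.n.01"))
```

The hypernym chain is what the next slide means by explicit semantic relations among words, as opposed to a thesaurus's flat similarity groupings.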

28 WordNet's similarity to a thesaurus (words and meanings)
WordNet interlinks not just word forms (strings of letters) but specific senses of words; words found in close proximity to one another in the network are semantically disambiguated. WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus follow no explicit pattern other than meaning similarity.


30 CATEGORIZATION / CLASSIFICATION
Given: a description of an instance x ∈ X, where X is the instance language or instance space (e.g., how to represent text documents), and a fixed set of categories C = {c1, c2, ..., cn}. Determine: the category of x, c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

31 A GRAPHICAL VIEW OF TEXT CLASSIFICATION
(Figure: documents grouped into classes labeled NLP, Graphics, AI, Theory, Arch.)

32 TEXT CLASSIFICATION
An example message to classify: "This concerns you as a patient. Our medical records indicate you have had a history of illness. We are now encouraging all our patients to use this highly effective and safe solution. Proven worldwide, feel free to read the many reports on our site from the BBC & ABC News. We highly recommend you try this Anti-Microbial Peptide as soon as possible since its world supply is limited. The results will show quickly. Regards," (85% of all !!)

33 EXAMPLES OF TEXT CATEGORIZATION
LABELS = BINARY: "spam" / "not spam". LABELS = TOPICS: "finance" / "sports" / "asia". LABELS = OPINION: "like" / "hate" / "neutral". LABELS = AUTHOR: "Shakespeare" / "Marlowe" / "Ben Jonson" (the Federalist Papers).

34 Methods (1)
Manual classification: used by Yahoo!, Looksmart, about.com, ODP, Medline. Very accurate when the job is done by experts; consistent when the problem size and team are small; difficult and expensive to scale. Automatic document classification: hand-coded rule-based systems (Reuters, CIA, Verity, ...). Commercial systems have complex query languages (everything in IR query languages plus accumulators).

35 Methods (2) Supervised learning of a document-label assignment function: Autonomy, Kana, MSN, Verity, ... Naive Bayes (simple, common method); k-nearest neighbors (simple, powerful); support vector machines (newer, more powerful); plus many other methods. No free lunch: requires hand-classified training data, but such data can be built (and refined) by amateurs.

36 Bayesian Methods Learning and classification methods based on probability theory (cf. spelling correction / POS tagging). Bayes' theorem plays a critical role: build a generative model that approximates how the data are produced. The prior probability of each category is used given no information about an item; categorization produces a posterior probability distribution over the possible categories given a description of an item.

37 Bayes' Rule
P(h | D) = P(D | h) P(h) / P(D)

38 Maximum a Posteriori Hypothesis
h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h) / P(D) = argmax_h P(D | h) P(h), since P(D) is constant over h.

39 Maximum Likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term: h_ML = argmax_h P(D | h)

40 Naive Bayes Classifiers
Task: classify a new instance based on a tuple of attribute values (x1, x2, ..., xn): c_MAP = argmax_{cj in C} P(cj | x1, x2, ..., xn) = argmax_{cj in C} P(x1, x2, ..., xn | cj) P(cj)

41 Naïve Bayes Classifier: Assumptions
P(cj) can be estimated from the frequency of classes in the training examples. Estimating P(x1, x2, ..., xn | cj) directly would need a very, very large number of training examples. Conditional independence assumption: assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities.

42 The Naïve Bayes Classifier
Example: class Flu with features X1 ... X5 = fever, sinus, cough, runny nose, muscle ache. Conditional independence assumption: features are independent of each other given the class, so P(X1, ..., X5 | Flu) = P(X1 | Flu) × ... × P(X5 | Flu).

43 Learning the Model
Common practice: maximum likelihood, i.e., simply use the relative frequencies in the data: P(cj) = N(C = cj) / N and P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj).
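Putting the last few slides together, a minimal multinomial naive Bayes sketch: priors and conditionals are frequency estimates from the data, with add-one smoothing added so an unseen word does not zero out a class (the smoothing and the toy spam/ham data are assumptions of this sketch).

```python
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (list_of_words, label). Returns (priors, counts, vocab)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab):
    """argmax_c [ log P(c) + sum_i log P(x_i | c) ], add-one smoothed."""
    def score(c):
        total = sum(word_counts[c].values())
        return math.log(priors[c]) + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in words if w in vocab)
    return max(priors, key=score)

docs = [(["cheap", "pills", "buy"], "spam"),
        (["buy", "now", "cheap"], "spam"),
        (["meeting", "agenda", "notes"], "ham")]
priors, wc, vocab = train(docs)
print(classify(["cheap", "buy"], priors, wc, vocab))      # spam
print(classify(["meeting", "notes"], priors, wc, vocab))  # ham
```

Working in log space avoids floating-point underflow when many per-word probabilities are multiplied.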

44 Feature Selection via Mutual Information
We might not want to use all words, but only reliable, good discriminators. In the training set, choose the k words which best discriminate the categories. One way is in terms of mutual information, computed for each word w and each category c.
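A sketch of the mutual information between a word w and a category c, computed from document counts over the four cells of the 2x2 word/class contingency table (the variable naming is this sketch's convention):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI in bits. n11: docs in c containing w; n10: docs not in c containing w;
    n01: docs in c without w; n00: docs not in c without w."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for nw, nc, joint in [
        (n11 + n10, n11 + n01, n11),   # w present, class c
        (n11 + n10, n10 + n00, n10),   # w present, not c
        (n01 + n00, n11 + n01, n01),   # w absent,  class c
        (n01 + n00, n10 + n00, n00),   # w absent,  not c
    ]:
        if joint:  # a zero cell contributes 0 to the sum
            mi += (joint / n) * math.log2(n * joint / (nw * nc))
    return mi

print(mutual_information(50, 0, 0, 50))   # perfect association: 1.0 bit
print(mutual_information(25, 25, 25, 25)) # independence: 0.0
```

For feature selection, one computes this score for every (word, category) pair and keeps the k highest-scoring words.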

45 OTHER APPROACHES TO FEATURE SELECTION
t-test; chi-square; tf-idf (cf. IR lectures). Yang & Pedersen (1997): eliminating features leads to improved performance.

46 Tf·idf and the term-document matrix
tfidf(t, d) = tf(t, d) · idf(t), where tf(t, d) = N_{t,d} divided by the sum of occurrences of all terms in document d, N_{t,d} is the number of occurrences of term t in d, and idf(t) is computed from W(t), the number of documents containing the term t.
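The weighting above can be sketched directly. The slide only names the quantities, so the concrete idf form idf(t) = log(N / W(t)) used below is the standard choice and an assumption of this sketch:

```python
import math

def tf(t, doc):
    """N_{t,d} / total number of term occurrences in d (doc is a token list)."""
    return doc.count(t) / len(doc)

def idf(t, corpus):
    """log(N / W(t)), with W(t) = number of documents containing t."""
    w = sum(1 for doc in corpus if t in doc)
    return math.log(len(corpus) / w)

def tfidf(t, doc, corpus):
    return tf(t, doc) * idf(t, corpus)

corpus = [["gene", "expression", "gene"],
          ["protein", "binding"],
          ["gene", "protein"]]
d = corpus[0]
# "expression" is rarer across the corpus than "gene", so despite a lower
# term frequency in d it receives the higher tf-idf weight.
print(tfidf("gene", d, corpus), tfidf("expression", d, corpus))
```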

47 Chi-Square Statistics
Used to evaluate the independence between two events. The relevance of a term t in a class c can be estimated from a 2x2 contingency table: F11 = #documents belonging to c and containing t; F10 = #documents not in c but containing t; F01 = #documents belonging to c but not containing t; F00 = #documents not in c and not containing t.
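The slide's formula image did not survive extraction; the closed-form chi-square statistic for a 2x2 contingency table, expressed in the F-counts defined above, is a standard result and can be sketched as:

```python
def chi_square(f11, f10, f01, f00):
    """Chi-square for the 2x2 table: N (F11*F00 - F10*F01)^2 /
    ((F11+F01)(F11+F10)(F10+F00)(F01+F00))."""
    n = f11 + f10 + f01 + f00
    num = n * (f11 * f00 - f10 * f01) ** 2
    den = (f11 + f01) * (f11 + f10) * (f10 + f00) * (f01 + f00)
    return num / den

# Term and class independent -> statistic is 0.
print(chi_square(25, 25, 25, 25))
# Term occurs mostly inside class c -> large statistic (strong dependence).
print(chi_square(40, 0, 10, 50))
```

A large value means term t and class c are far from independent, i.e., t is a relevant feature for c.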

48 MAP Estimates Classification task: decide which class to choose by measuring the importance of term t for a class c, where N_{t|c} and N_t are the numbers of occurrences of term t in the class c and in the entire corpus, respectively, and N_c is the number of distinct classes. Similarly, N_{d|c} is the number of documents in the given class c, and N_d is the total number of documents. Note that α1 and α2 are smoothing parameters that are typically determined empirically.

49 OTHER CLASSIFICATION METHODS
k-NN; decision trees; logistic regression; support vector machines. Cf. Manning, Chapter 8.

