BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen.

PGES upregulates PGE2 production in human thyrocytes (GeneRIF: )

Syntax: what are the relationships between words/phrases?
Parsing: figuring out the structure
– Full parse
– Shallow parse (also called partial parse or syntactic chunking)

Full parse PGES upregulates PGE2 production in human thyrocytes

Shallow parse
[NounGroup PGES] [VerbGroup upregulates] [NounGroup PGE2 production] [PrepositionalGroup in human thyrocytes]

Shallow vs. full parsing
Different depths
– Full parse goes down to the level of individual words
– Shallow parse doesn’t go down any further than the base phrase
Different “heights”
– Full parse goes “up” to the root node
– Shallow parse doesn’t (generally) go further up than the base phrase

Shallow vs. full parsing
Different number of levels of structure
– Full parse has many levels
– Shallow parse has far fewer

Shallow vs. full parsing Either way, you need POS information…

POS tagging: why you need it
– All syntax is built on it
– Overcomes the sparseness problem by abstracting away from specific words
– Helps you decide how to stem
– Potential basis for entity identification

What “POS tagging” is
POS: part of speech
– School: 8 (noun, verb, adjective, interjection…)
– Real life: 40 or more

How do you get from 8 to 80?
Noun:
– NN (noun, singular or mass)
– NNS (plural noun)
– NNP (proper noun)
– NNPS (plural proper noun)

How do you get from 8 to 80?
Verb:
– VB (base form)
– VBD (past tense)
– VBG (gerund)
– VBN (past participle)
– VBP (non-3rd-person singular present tense)
– VBZ (3rd-person singular present tense)

Others that are good to recognize
Adjective:
– JJ (adjective)
– JJR (comparative adjective)
– JJS (superlative adjective)

Others that are good to recognize
– Coordinating conjunctions: CC
– Determiners: DT
– Prepositions: IN
– To: TO
– Punctuation: , (comma), . (sentence-final), : (sentence-medial)

POS tagging
Definition: assigning POS “tags” to a string of tokens
Input:
– string of tokens
– tag set
Output:
– Best tag for each token

How do you define noun, verb, etc.?
Semantic:
– “A noun is a person, place, or thing…”
– “A verb is…”
Distributional characteristics:
– “A noun can take the plural and genitive morphemes”
– “A noun can appear in the environment All of my twelve hairy ___ left before noon”

Why’s it hard? Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

POS tagging: rule-based
1. Assign each word its list of potential parts of speech
2. Use rules to remove potential tags from the list
The EngCG system:
– 56,000-item dictionary
– 3,744 rules
Note that all taggers need a way to deal with unknown words (OOV or “out-of-vocabulary”).
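
The two steps above can be sketched as follows. This is a minimal illustration, not EngCG: the three-word lexicon, the single constraint rule, and the function names are all hypothetical, and unknown words are simply given the candidate tag NN.

```python
# Hypothetical toy lexicon: each known word maps to its set of possible tags.
LEXICON = {"the": {"DT"}, "running": {"VBG", "NN"}, "of": {"IN"}}


def no_verb_after_determiner(candidates, i):
    """Hypothetical constraint: drop verb readings right after an unambiguous DT."""
    prev_is_det = i > 0 and candidates[i - 1] == {"DT"}
    if prev_is_det and len(candidates[i]) > 1:
        kept = {t for t in candidates[i] if not t.startswith("VB")}
        if kept:  # never remove the last remaining reading
            candidates[i] = kept


def tag_by_elimination(tokens, lexicon, rules, unknown_tags=("NN",)):
    # Step 1: assign each word its list of potential tags.
    candidates = [set(lexicon.get(tok.lower(), unknown_tags)) for tok in tokens]
    # Step 2: let each rule remove readings, position by position.
    for rule in rules:
        for i in range(len(candidates)):
            rule(candidates, i)
    return candidates


result = tag_by_elimination(["The", "running", "of"], LEXICON, [no_verb_after_determiner])
print(result)
```

A real constraint-based tagger has thousands of such rules and a far larger lexicon; the point here is only the assign-then-eliminate shape.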

As always, (about) two approaches…
– Rule-based
– Learning-based

An aside: tagger input formats
apoptosis in a human tumor cell line.
apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.
apoptosis in a human tumor cell line.
NN IN DT JJ NN.
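
As a small illustration of the slash-delimited format, here is a sketch of a reader that turns a `word/TAG` line into (word, tag) pairs. The function name is made up for this example, and it assumes tokens are separated by whitespace.

```python
def parse_slash_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST slash, so a token that itself
        # contains a slash (e.g. "and/or/CC") keeps its full text.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_slash_tagged(
    "apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./."
)
print(tagged)
```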

Just how ambiguous is natural language?
– Most English words are not ambiguous… but many of the most common ones are.
– Brown corpus: only 11.5% of word types are ambiguous… but > 40% of tokens are ambiguous.
– A dictionary doesn’t give you a good estimate of the problem space… but corpus data does.
– Empirical question: how ambiguous is biomedical text?
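
The type/token distinction above can be measured directly on any tagged corpus. The sketch below is one way to do it; the function name and the ten-token toy corpus are invented for illustration, and lowercasing words before counting is an assumption, not something the slide specifies.

```python
from collections import defaultdict


def ambiguity_rates(tagged_tokens):
    """Return (fraction of ambiguous word types, fraction of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    token_hits = sum(1 for word, _ in tagged_tokens if word.lower() in ambiguous)
    token_rate = token_hits / len(tagged_tokens)
    return type_rate, token_rate


# Toy corpus (hypothetical), not real Brown corpus data.
toy = [("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"),
       ("arrow", "NN"), ("fruit", "NN"), ("flies", "NNS"), ("like", "VBP"),
       ("a", "DT"), ("banana", "NN")]
type_rate, token_rate = ambiguity_rates(toy)
print(type_rate, token_rate)
```

Running the same function over a tagged biomedical corpus would be one way to approach the empirical question on the slide.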

A statistical approach: TnT
– Second-order Markov model
– Smoothing by linear interpolation of ngrams
– λ estimated by deleted interpolation
– Tag probabilities learned for word endings; used for unknown words
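
To make "smoothing by linear interpolation" concrete: the smoothed trigram probability is a weighted sum of the unigram, bigram, and trigram estimates, with weights λ that sum to 1. The sketch below assumes the probability tables and the λ values are already given (the numbers are made up); it does not show deleted interpolation, the procedure used to estimate λ.

```python
def interpolated_trigram_prob(t1, t2, t3, unigram, bigram, trigram, lambdas):
    """P(t3 | t1, t2) = l1*P(t3) + l2*P(t3 | t2) + l3*P(t3 | t1, t2)."""
    l1, l2, l3 = lambdas
    return (l1 * unigram.get(t3, 0.0)
            + l2 * bigram.get((t2, t3), 0.0)
            + l3 * trigram.get((t1, t2, t3), 0.0))


# Hypothetical maximum-likelihood estimates, keyed by tag n-gram.
unigram = {"NN": 0.3}
bigram = {("DT", "NN"): 0.6}
trigram = {}  # the trigram (IN, DT, NN) was never seen in training

p = interpolated_trigram_prob("IN", "DT", "NN", unigram, bigram, trigram,
                              lambdas=(0.2, 0.3, 0.5))
print(p)
```

Even though the trigram was unseen, the result is non-zero because the lower-order estimates contribute; that is the purpose of the smoothing.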

TnT
Ngram: an n-tag or n-word sequence
Unigrams (N = 1)
– DET
– NOUN
– role
Bigrams
– DET NOUN
– NOUN PREPOSITION
– a role
Trigrams
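
A short sketch of n-gram extraction, which works the same way for tag sequences and word sequences. The function name is illustrative only.

```python
def ngrams(seq, n):
    """Return all contiguous length-n subsequences of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


tags = ["DET", "NOUN", "PREPOSITION"]
print(ngrams(tags, 1))  # unigrams
print(ngrams(tags, 2))  # bigrams
print(ngrams(tags, 3))  # trigrams
```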

The Brill Tagger

The Brill tagger
– Uses rules
– …but the set of rules is induced.

The Brill tagger
Iterative error reduction
1. Assign most common tags, then
2. Evaluate performance, then
3. Propose rules to fix errors
4. Evaluate performance, then
5. If you’ve improved, GOTO 3, else END

The Brill tagger
Change Determiner Verb “of” …to… Determiner Noun “of”
– Before: The/Determiner running/Verb of/IN
– After: The/Determiner running/Noun of/IN
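
The error-reduction loop can be sketched in a few dozen lines. This is a heavily simplified illustration, not Brill's implementation: it assumes a single rule template ("change tag A to B when the previous tag is C"), uses Penn-style tags rather than the Determiner/Verb labels on the slide, and the function names and three-sentence training set are hypothetical.

```python
from collections import Counter, defaultdict


def most_common_tag_lexicon(train):
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag): retag from->to after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_errors(current, gold):
    return sum(c != g for cs, gs in zip(current, gold) for c, g in zip(cs, gs))


def learn_rules(train, default_tag="NN", max_rules=10):
    lexicon = most_common_tag_lexicon(train)
    gold = [[t for _, t in sent] for sent in train]
    # Step 1: assign most common tags.
    current = [[lexicon.get(w, default_tag) for w, _ in sent] for sent in train]
    learned = []
    while len(learned) < max_rules:
        # Step 3: propose one candidate rule per remaining error.
        proposals = {(c[i], g[i], c[i - 1])
                     for c, g in zip(current, gold)
                     for i in range(1, len(c)) if c[i] != g[i]}
        # Steps 2 and 4: evaluate, keeping the rule that lowers errors most.
        best_rule, best_err = None, count_errors(current, gold)
        for rule in sorted(proposals):
            err = count_errors([apply_rule(c, rule) for c in current], gold)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:  # Step 5: no improvement, so stop.
            break
        current = [apply_rule(c, best_rule) for c in current]
        learned.append(best_rule)
    return lexicon, learned


train = [
    [("the", "DT"), ("running", "NN"), ("of", "IN")],
    [("she", "PRP"), ("is", "VBZ"), ("running", "VBG")],
    [("he", "PRP"), ("was", "VBD"), ("running", "VBG")],
]
lexicon, rules = learn_rules(train)
print(rules)
```

On this toy data "running" is initially tagged VBG everywhere, and the one learned rule corrects it to NN after a determiner, mirroring the example on the slide.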

An aside: evaluating POS taggers
– Accuracy
– Confusion matrix
How hard is the task? Domain/genre-specific…
– Baseline: give each word its most common tag
– Ceiling: interannotator agreement, usually high 90’s; low 90’s on some corpora!
– State of the art: 96-97% total accuracy; lower for non-punctuation

Confusion matrix
Rows = right answer, columns = tagger output, over the tags JJ, NN, VBD.
[Table: most cell values are missing; the NN row reads .5 and --]
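
A minimal sketch of both evaluation measures, using the same orientation as the slide (rows = right answer, columns = tagger output). The gold and predicted tag lists are made-up toy data, and the matrix here holds raw counts rather than percentages.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def confusion_matrix(gold, predicted):
    """Count (right answer, tagger output) pairs."""
    return Counter(zip(gold, predicted))


gold = ["JJ", "NN", "NN", "VBD"]
predicted = ["JJ", "JJ", "NN", "VBD"]

acc = accuracy(gold, predicted)
cm = confusion_matrix(gold, predicted)

tags = ["JJ", "NN", "VBD"]
print("acc =", acc)
print("\t" + "\t".join(tags))          # header: tagger output
for right in tags:                      # one row per right answer
    print(right + "\t" + "\t".join(str(cm[(right, out)]) for out in tags))
```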

An aside: unknown words
– Call them all nouns
– Learn most common POS from training data
– Use morphology
– Suffix trees
– Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
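
One simple way to combine "call them all nouns" with "use morphology" is to back off from the longest known word ending to a default tag. The sketch below is an assumption-laden illustration (function names, the maximum suffix length of 3, and the four training words are all invented); it is not how TnT or any particular tagger handles unknown words.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_len=3):
    """Count which tags occur with each word ending of length 1..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(max_len, len(word)) + 1):
            counts[word[-k:].lower()][tag] += 1
    return counts


def guess_tag(word, counts, max_len=3, default="NN"):
    """Use the longest ending seen in training; otherwise call it a noun."""
    for k in range(min(max_len, len(word)), 0, -1):
        suffix = word[-k:].lower()
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return default


training = [("regulated", "VBD"), ("activated", "VBD"),
            ("kinase", "NN"), ("protease", "NN")]
suffix_counts = train_suffix_guesser(training)

print(guess_tag("phosphorylated", suffix_counts))
print(guess_tag("lipase", suffix_counts))
print(guess_tag("xyz", suffix_counts))
```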

POS tagging: extension(s)
– Entity identification
– What else??

First step in any POS tagging effort:
– Tokenization
– …maybe sentence segmentation
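
For reference, a deliberately naive regular-expression tokenizer is sketched below. The pattern is an assumption chosen for illustration: it keeps runs of letters and digits together (optionally joined by hyphens or apostrophes) and splits off every other punctuation mark. Real biomedical text will break it in many ways.

```python
import re

# Word characters, optionally joined by internal hyphens/apostrophes,
# or any single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


tokens = tokenize("apoptosis in a human tumor cell line.")
print(tokens)
print(tokenize("PGES upregulates PGE2 production"))
```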

First programming assignment: tokenization
– What was hard?
– What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?