
1 BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen

2 PGES upregulates PGE2 production in human thyrocytes (GeneRIF: 12145315)

3 Syntax: what are the relationships between words/phrases? Parsing: figuring out the structure
– Full parse
– Shallow parse (also known as a partial parse, or syntactic chunking)

4 Full parse (tree figure): PGES upregulates PGE2 production in human thyrocytes

5 Shallow parse: [PGES]NounGroup [upregulates]VerbGroup [PGE2 production]NounGroup [in human thyrocytes]PrepositionalGroup
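A shallow parse like this can be produced with a handful of regular expressions over POS tags. Here is a minimal sketch using NLTK’s RegexpParser; the chunk grammar and the hand-supplied tags are illustrative, not the course’s actual chunker:

```python
import nltk

# Chunk grammar over POS tags. Rules apply in order, so the prepositional
# group is chunked first (otherwise the noun-group rule would swallow
# "human thyrocytes" before the preposition could attach to it).
grammar = r"""
  PG: {<IN><JJ>*<NN.*>+}      # prepositional group
  NG: {<DT>?<JJ>*<NN.*>+}     # noun group
  VG: {<VB.*>+}               # verb group
"""
chunker = nltk.RegexpParser(grammar)

# Hand-supplied tags for the example sentence.
tagged = [("PGES", "NNP"), ("upregulates", "VBZ"), ("PGE2", "NNP"),
          ("production", "NN"), ("in", "IN"), ("human", "JJ"),
          ("thyrocytes", "NNS")]
print(chunker.parse(tagged))
# (S (NG PGES/NNP) (VG upregulates/VBZ) (NG PGE2/NNP production/NN)
#    (PG in/IN human/JJ thyrocytes/NNS))
```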

6 Shallow vs. full parsing
Different depths
– Full parse goes down to the level of individual words
– Shallow parse doesn’t go down any further than the base phrase
Different “heights”
– Full parse goes “up” to the root node
– Shallow parse doesn’t (generally) go further up than the base phrase

7 Shallow vs. full parsing
Different number of levels of structure
– Full parse has many levels
– Shallow parse has far fewer

8 Shallow vs. full parsing
Either way, you need POS information…

9 POS tagging: why you need it
– All syntax is built on it
– Overcome the sparseness problem by abstracting away from specific words
– Help you decide how to stem
– Potential basis for entity identification

10 What “POS tagging” is
POS: part of speech
In school: 8 parts of speech (noun, verb, adjective, interjection…)
In real life: 40 or more

11 How do you get from 8 to 80?
Noun: NN (noun, singular or mass), NNS (plural noun), NNP (proper noun), NNPS (plural proper noun)

12 How do you get from 8 to 80?
Verb: VB (base form), VBD (past tense), VBG (gerund), VBN (past participle), VBP (non-3rd-person singular present tense), VBZ (3rd-person singular present tense)

13 Others that are good to recognize
Adjective: JJ (adjective), JJR (comparative adjective), JJS (superlative adjective)

14 Others that are good to recognize
Coordinating conjunctions: CC
Determiners: DT
Prepositions: IN
To: TO
Punctuation: , (comma), . (sentence-final), : (sentence-medial)

15 POS tagging
Definition: assigning POS “tags” to a string of tokens
Input:
– string of tokens
– tag set
Output:
– best tag for each token
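As a concrete illustration of that input/output contract, a minimal sketch using NLTK’s default tagger (this assumes NLTK and its tagger model are installed; the output shown is what a Penn-Treebank-style tagger would plausibly produce):

```python
import nltk
# One-time model downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("PGES upregulates PGE2 production in human thyrocytes.")
print(nltk.pos_tag(tokens))
# Plausible output:
# [('PGES', 'NNP'), ('upregulates', 'VBZ'), ('PGE2', 'NNP'), ('production', 'NN'),
#  ('in', 'IN'), ('human', 'JJ'), ('thyrocytes', 'NNS'), ('.', '.')]
```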

16 How do you define noun, verb, etc.?
Semantic definitions:
– “A noun is a person, place, or thing…”
– “A verb is…”
Distributional characteristics:
– “A noun can take the plural and genitive morphemes”
– “A noun can appear in the environment All of my twelve hairy ___ left before noon”

17 Why’s it hard? Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

18 POS tagging: rule-based
1. Assign each word its list of potential parts of speech
2. Use rules to remove potential tags from the list
The EngCG system: 56,000-item dictionary, 3,744 rules
Note that all taggers need a way to deal with unknown words (OOV, or “out-of-vocabulary”).
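A toy constraint-based pass in this spirit; the three-word lexicon and the single elimination rule are invented for illustration (EngCG’s real dictionary and rule set are vastly larger):

```python
# Look up every potential tag, then apply hand-written rules that
# eliminate candidates in context.
LEXICON = {
    "the": {"DT"},
    "running": {"VBG", "NN"},
    "of": {"IN"},
}

def tag(tokens):
    # 1. Assign each word its full set of potential parts of speech
    #    (unknown/OOV words default to noun here).
    candidates = [set(LEXICON.get(t.lower(), {"NN"})) for t in tokens]
    # 2. Eliminate: a word directly after a determiner is not a verb form.
    for i in range(1, len(tokens)):
        if candidates[i - 1] == {"DT"}:
            candidates[i] -= {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
    return [(t, sorted(c)) for t, c in zip(tokens, candidates)]

print(tag("the running of".split()))
# [('the', ['DT']), ('running', ['NN']), ('of', ['IN'])]
```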

19 As always, (about) two approaches…
– Rule-based
– Learning-based

20 An aside: tagger input formats
Raw text: apoptosis in a human tumor cell line.
Inline (“slash”) format: apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.
Two-row format:
apoptosis in a human tumor cell line .
NN IN DT JJ NN NN NN .
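Converting between the inline format and a list of (word, tag) pairs is a one-liner in each direction; a minimal sketch (the helper names are my own):

```python
def from_slash(line):
    # rsplit on the last "/" so tokens like "./." split correctly
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

def to_slash(pairs):
    return " ".join(f"{w}/{t}" for w, t in pairs)

pairs = from_slash("apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.")
print(pairs[0])         # ('apoptosis', 'NN')
print(to_slash(pairs))  # round-trips to the original line
```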

21 Just how ambiguous is natural language?
Most English words are not ambiguous…
…but many of the most common ones are.
Brown corpus: only 11.5% of word types are ambiguous…
…but > 40% of tokens are ambiguous.
A dictionary doesn’t give you a good estimate of the problem space…
…but corpus data does.
Empirical question: how ambiguous is biomedical text?

22 A statistical approach: TnT
– Second-order Markov model
– Smoothing by linear interpolation of ngrams
– λ’s estimated by deleted interpolation
– Tag probabilities learned for word endings; used for unknown words
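Written out, the smoothed trigram estimate TnT uses is the standard linear interpolation from Brants (2000), with the λ’s summing to one and set by deleted interpolation:

```latex
P(t_i \mid t_{i-2}, t_{i-1}) \approx
  \lambda_1 \hat{P}(t_i)
+ \lambda_2 \hat{P}(t_i \mid t_{i-1})
+ \lambda_3 \hat{P}(t_i \mid t_{i-2}, t_{i-1}),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```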

23 TnT
Ngram: an n-tag or n-word sequence
Unigrams (N = 1): DET; NOUN; role
Bigrams (N = 2): DET NOUN; NOUN PREPOSITION; a role
Trigrams (N = 3): …
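A minimal sketch of ngram extraction over a tag sequence (the function is my own illustration):

```python
def ngrams(seq, n):
    """All contiguous n-item windows over a sequence of words or tags."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

tags = ["DET", "NOUN", "PREPOSITION", "NOUN"]
print(ngrams(tags, 1))  # [('DET',), ('NOUN',), ('PREPOSITION',), ('NOUN',)]
print(ngrams(tags, 2))  # [('DET', 'NOUN'), ('NOUN', 'PREPOSITION'), ('PREPOSITION', 'NOUN')]
print(ngrams(tags, 3))  # [('DET', 'NOUN', 'PREPOSITION'), ('NOUN', 'PREPOSITION', 'NOUN')]
```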

24 The Brill Tagger

25 The Brill tagger
Uses rules…
…but the set of rules is induced.

26 The Brill tagger
Iterative error reduction
1. Assign most common tags, then
2. Evaluate performance, then…

27 The Brill tagger
Iterative error reduction
1. Assign most common tags, then
2. Evaluate performance, then
3. Propose rules to fix errors, then
4. Evaluate performance, then
5. If you’ve improved, GOTO 3, else END
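That loop fits in a few lines of code. A toy sketch of Brill-style transformation-based learning, where the tiny corpus, the initial-tag dictionary, and the two candidate rules are all invented for illustration:

```python
gold = [("the", "DT"), ("running", "NN"), ("of", "IN")]

# Step 1: initial-state annotation -- give every word its most common tag.
most_common = {"the": "DT", "running": "VBG", "of": "IN"}
current = [(w, most_common[w]) for w, _ in gold]

def errors(tagged):
    return sum(1 for (_, t), (_, g) in zip(tagged, gold) if t != g)

def apply_rule(tagged, rule):
    # rule = (from_tag, to_tag, prev_tag): change from_tag to to_tag after prev_tag
    frm, to, prev = rule
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == frm and out[i - 1][1] == prev:
            out[i] = (out[i][0], to)
    return out

# Steps 2-5: keep applying whichever candidate rule most reduces errors.
candidates = [("VBG", "NN", "DT"), ("NN", "VBG", "DT")]
while True:
    best = min(candidates, key=lambda r: errors(apply_rule(current, r)))
    if errors(apply_rule(current, best)) >= errors(current):
        break  # no rule improves performance: END
    current = apply_rule(current, best)

print(current)  # [('the', 'DT'), ('running', 'NN'), ('of', 'IN')]
```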

28 The Brill tagger
Example induced rule: change Determiner Verb “of” to Determiner Noun “of”
Before: The/Determiner running/Verb of/IN
After: The/Determiner running/Noun of/IN

29 An aside: evaluating POS taggers
Metrics: accuracy; confusion matrix
How hard is the task? Domain/genre-specific…
– Baseline: give each word its most common tag
– Ceiling: interannotator agreement (usually high 90’s, but low 90’s on some corpora!)
– State of the art: 96-97% total accuracy (lower for non-punctuation tokens)
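A minimal sketch of computing both numbers, using the fruit-flies sentence from slide 17 as a toy test set (the tagger output here is invented):

```python
from collections import Counter

def evaluate(predicted, gold):
    """Per-token accuracy plus a (gold tag, tagger output) confusion count."""
    confusion = Counter((g, p) for (_, g), (_, p) in zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    return correct / len(gold), confusion

gold = [("fruit", "NN"), ("flies", "NNS"), ("like", "VBP"),
        ("a", "DT"), ("banana", "NN")]
pred = [("fruit", "NN"), ("flies", "VBZ"), ("like", "IN"),
        ("a", "DT"), ("banana", "NN")]
acc, confusion = evaluate(pred, gold)
print(acc)                        # 0.6
print(confusion[("NNS", "VBZ")])  # 1 -- gold NNS mis-tagged as VBZ
```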

30 Confusion matrix (columns = tagger output, rows = right answer)

      JJ    NN    VBD
JJ    --   .64    .6
NN    .5    --
VBD  5.4   .01    --

31 An aside: unknown words
– Call them all nouns
– Learn the most common POS for unknowns from training data
– Use morphology: suffix trees
– Other features, e.g. hyphenation (a JJ cue in Brown; biomed?), capitalization…
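A toy guesser combining these cues; the suffix list and the ordering of the checks are invented for illustration:

```python
SUFFIX_TAGS = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]

def guess_tag(word):
    if word[0].isupper():
        return "NNP"        # capitalization cue
    if "-" in word:
        return "JJ"         # hyphenation: a JJ cue in the Brown corpus
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag      # morphology: first matching suffix wins
    return "NN"             # fallback: call it a noun

print([guess_tag(w) for w in ["thyrocytes", "upregulated", "slowly"]])
# ['NNS', 'VBD', 'RB']
```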

32 POS tagging: extension(s)
Entity identification
What else?

33 First step in any POS tagging effort:
– Tokenization
– …maybe sentence segmentation
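A naive regex tokenizer as a sketch of that first step; the pattern is illustrative, and real biomedical tokenizers need many more rules (e.g. for names like PGE2 or hyphenated chemicals):

```python
import re

def tokenize(text):
    # Words (optionally with internal hyphens/slashes), or single punctuation marks.
    return re.findall(r"\w+(?:[-/]\w+)*|[^\w\s]", text)

print(tokenize("PGES upregulates PGE2 production in human thyrocytes."))
# ['PGES', 'upregulates', 'PGE2', 'production', 'in', 'human', 'thyrocytes', '.']
```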

34 First programming assignment: tokenization
What was hard?
What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?

