Information Extraction Lecture 5 – Named Entity Recognition III CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.


Information Extraction Lecture 5 – Named Entity Recognition III CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS

Outline
– IE end-to-end
– Introduction: named entity detection as a classification problem

CMU Seminars task
Given an email about a seminar, annotate:
– Speaker
– Start time
– End time
– Location

CMU Seminars - Example
Type: cmu.cs.proj.mt
Topic: Nagao Talk
Dates: 26-Apr-93
Time: 10: :00 AM
PostedBy: jgc+ on 24-Apr-93 at 20:59 from NL.CS.CMU.EDU (Jaime Carbonell)
Abstract: This Monday, 4/26, Prof. Makoto Nagao will give a seminar in the CMT red conference room am on recent MT research results.

IE Template
– Speaker: Prof. Makoto Nagao
– Start time: :00
– End time: :00
– Location: CMT red conference room
– Message Identifier: EDU (Jaime Carbonell).0
The template contains the *canonical* version of the information. There are several "mentions" of speaker, start time and end time (see previous slide), but only one value for each slot. Location could probably also be canonicalized. Important: also keep a link back to the original text.
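
To make the slot/mention distinction concrete, here is a minimal sketch (not from the original slides; all field names and character offsets are invented for illustration) of how such a filled template might be represented:

```python
# A filled IE template as a plain dict: one canonical value per slot,
# plus "mentions" with (start, end) character offsets into the source text.
# Slot names and offsets here are made up for illustration.
template = {
    "speaker":    {"value": "Prof. Makoto Nagao",
                   "mentions": [(152, 170)]},
    "location":   {"value": "CMT red conference room",
                   "mentions": [(200, 223)]},
    "start_time": {"value": None, "mentions": []},  # to be filled/canonicalized
}

# The link back to the original text lets us recover every mention:
# source_text[m[0]:m[1]] for m in template["speaker"]["mentions"]
```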

How many database entries?
In the CMU seminars task, one message generally results in one database entry
– Or no database entry, if you process an email that is not about a seminar
In other IE tasks, you can get multiple database entries from a single document or web page
– A page of concert listings -> many database entries
– Entries in a timeline -> many database entries

Summary
IR: end-user
– Starts with an information need
– Gets relevant documents; hopefully the information need is solved
– Important difference: traditional IR vs. Web IR
IE: analyst (you)
– Starts with template design and corpus
– Gets a database of filled-out templates
– Followed by subsequent processing (e.g., data mining, or user browsing, etc.)

IE: what we've seen so far
So far we have looked at:
– Source issues (selection, tokenization, etc.)
– Extracting regular entities
– Rule-based extraction of named entities
– Learning rules for rule-based extraction of named entities
We also jumped ahead and looked briefly at end-to-end IE for the CMU Seminars task.

Information Extraction
Information Extraction (IE) is the process of extracting structured information from unstructured machine-readable documents.
Pipeline: Source Selection -> Tokenization & Normalization -> Named Entity Recognition -> Instance Extraction -> Fact Extraction -> Ontological Information Extraction -> ? ...and beyond
(The original slide illustrates the stages with examples such as "Elvis Presley: singer", "Angela Merkel: politician", and the fact "married Elvis on 05/01/67".)

IE, where we are going
We will stay with the named entity recognition (NER) topic for a while:
– How to formulate this as a machine learning problem (later in these slides)
– Next time: brief introduction to machine learning

Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities (people, cities, organizations, dates, ...) in a text.
Example: Elvis Presley was born in 1935 in East Tupelo, Mississippi.

Extracting Named Entities
– Person: Mr. Hubert J. Smith, Adm. McInnes, Grace Chan
– Title: Chairman, Vice President of Technology, Secretary of State
– Country: USSR, France, Haiti, Haitian Republic
– City: New York, Rome, Paris, Birmingham, Seneca Falls
– Province: Kansas, Yorkshire, Uttar Pradesh
– Business: GTE Corporation, FreeMarkets Inc., Acme
– University: Bryn Mawr College, University of Iowa
– Organization: Red Cross, Boys and Girls Club
Slide from J. Lin

More Named Entities
– Currency: 400 yen, $100, DM 450,000
– Linear: 10 feet, 100 miles, 15 centimeters
– Area: a square foot, 15 acres
– Volume: 6 cubic feet, 100 gallons
– Weight: 10 pounds, half a ton, 100 kilos
– Duration: 10 days, five minutes, 3 years, a millennium
– Frequency: daily, biannually, 5 times, 3 times a day
– Speed: 6 miles per hour, 15 feet per second, 5 kph
– Age: 3 weeks old, 10-year-old, 50 years of age
Slide from J. Lin

Information extraction approaches
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Extracted (Name | Title | Organization):
– Bill Gates | CEO | Microsoft
– Bill Veghte | VP | Microsoft
– Richard Stallman | Founder | Free Soft..
Slide from Kauchak

IE Posed as a Machine Learning Task
– Training data: documents marked up with ground truth
– Extract features around words/information
– Pose as a classification problem
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
prefix | contents | suffix
Slide from Kauchak

Sliding Windows Information Extraction: Tuesday 10:00 am, Rm 407b For each position, ask: Is the current window a named entity? Window size = 1 Slide from Suchanek

Sliding Windows Information Extraction: Tuesday 10:00 am, Rm 407b For each position, ask: Is the current window a named entity? Window size = 2 Slide from Suchanek
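
As a rough sketch of this candidate-generation step (my own illustration, not code from the slides), enumerating every window up to a maximum width looks like this:

```python
def sliding_windows(tokens, max_width=4):
    """Yield every candidate window (start, end, tokens) up to max_width tokens.

    Each window is one classification instance: is this span a named entity?
    """
    for start in range(len(tokens)):
        for width in range(1, max_width + 1):
            if start + width <= len(tokens):
                yield start, start + width, tokens[start:start + width]

# Example from the slide, roughly tokenized:
tokens = "Tuesday 10:00 am , Rm 407b".split()
for start, end, window in sliding_windows(tokens, max_width=2):
    print(start, end, window)
```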

Features
Information Extraction: Tuesday 10:00 am, Rm 407b (prefix window / content window / postfix window)
Choose certain features (properties) of windows that could be important:
– window contains colon, comma, or digits
– window contains week day, or certain other words
– window starts with lowercase letter
– window contains only lowercase letters
– ...
Slide from Suchanek

Feature Vectors
Information Extraction: Tuesday 10:00 am, Rm 407b (prefix window / content window / postfix window)
Features -> Feature Vector:
Prefix colon = 1, Prefix comma = 0, …, Content colon = 1, Content comma = 0, …, Postfix colon = 0, Postfix comma = 1
The feature vector represents the presence or absence of features of one content window (and its prefix window and postfix window).
Slide from Suchanek
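
A hedged sketch of how such a binary feature vector could be computed; the feature names mirror the slides (colon, comma, digits, lowercase), but the function itself is my illustration:

```python
def window_features(tokens, start, end, context=2):
    """Binary features for one content window and its prefix/postfix windows."""
    windows = {
        "prefix": tokens[max(0, start - context):start],
        "content": tokens[start:end],
        "postfix": tokens[end:end + context],
    }
    feats = {}
    for name, win in windows.items():
        feats[name + "_colon"] = any(":" in t for t in win)
        feats[name + "_comma"] = any("," in t for t in win)
        feats[name + "_digit"] = any(c.isdigit() for t in win for c in t)
        feats[name + "_all_lower"] = bool(win) and all(t == t.lower() for t in win)
    return feats

tokens = "Information Extraction : Tuesday 10:00 am , Rm 407b".split()
print(window_features(tokens, 3, 6))  # content window = "Tuesday 10:00 am"
```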

Sliding Windows Corpus
Now we need a corpus (set of documents) in which the entities of interest have been manually labeled, e.g.: NLP class: Wednesday, 7:30am and Thursday all day, rm 667 (with time and location annotated).
From this corpus, compute the feature vectors with labels: Nothing, Time, Nothing, Location, ...
Slide from Suchanek

Machine Learning
Use the labeled feature vectors as training data for machine learning. The result is a classifier that labels new windows, e.g. Nothing and Time in "Information Extraction: Tuesday 10:00 am, Rm 407b".
Slide from Suchanek
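
A minimal training sketch, assuming scikit-learn and reusing window_features from the sketch above; the tiny labeled corpus here is invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: (tokens, start, end, label) per window.
toks = "NLP class : Wednesday , 7:30am and Thursday all day , rm 667".split()
labeled = [
    (toks, 5, 6, "Time"),        # "7:30am"
    (toks, 7, 10, "Time"),       # "Thursday all day"
    (toks, 11, 13, "Location"),  # "rm 667"
    (toks, 0, 2, "Nothing"),     # "NLP class"
    (toks, 2, 3, "Nothing"),     # ":"
]

vec = DictVectorizer()
X = vec.fit_transform([window_features(t, s, e) for t, s, e, _ in labeled])
y = [label for _, _, _, label in labeled]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a window from a new sentence:
new = "Information Extraction : Tuesday 10:00 am , Rm 407b".split()
print(clf.predict(vec.transform([window_features(new, 3, 6)])))
```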

Sliding Windows Exercise
Elvis Presley married Ms. Priscilla at the Aladdin Hotel.
What features would you use to recognize person names? E.g., UpperCase, hasDigit, …
Slide from Suchanek

Good Features for Information Extraction
Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
More example features: begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
Slide from Kauchak

Good Features for Information Extraction
Word features:
– Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial
– Punctuation: Period, Comma, Apostrophe, Dash; Preceded by HTML tag
– Character n-gram classifier says string is a person name (80% accurate)
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list, segmented by P(name); in Census firstname list, segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list ("J. C. Penny"); in list of company suffixes (Inc, & Associates, Foundation)
– Lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases
HTML/formatting features:
– {begin, end, in} x {HTML tags such as <b>, <i>, <a>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line
Slide from Kauchak

NER Classification in more detail
In the previous slides, we covered the basic idea of how NER classification works. In the next slides, I will go into more detail; in particular, I will compare sliding windows with boundary detection. Machine learning itself will be presented in more detail in the next lecture.

How can we pose this as a classification (or learning) problem?
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
prefix | contents | suffix
Data + Label -> train a predictive model -> classifier
Slide from Kauchak

Lots of possible techniques
Any of these models can be used to capture words, formatting, or both:
– Classify Candidates: "Abraham Lincoln was born in Kentucky." -> Classifier: which class?
– Sliding Window: "Abraham Lincoln was born in Kentucky." -> Classifier: which class? Try alternate window sizes.
– Boundary Models: "Abraham Lincoln was born in Kentucky." -> Classifier: which class? (BEGIN / END labels)
– Finite State Machines: "Abraham Lincoln was born in Kentucky." -> Most likely state sequence?
– Wrapper Induction: "Abraham Lincoln was born in Kentucky." -> Learn and apply a pattern (PersonName) for a website
Slide from Kauchak

Information Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slide from Kauchak

(This slide is repeated several times, stepping the sliding window through the announcement.)

Information Extraction by Sliding Window
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
w_{t-m} … w_{t-1} [ w_t … w_{t+n} ] w_{t+n+1} … w_{t+n+m}
prefix | contents | suffix
Standard supervised learning setting:
– Positive instances?
– Negative instances?
Slide from Kauchak

Information Extraction by Sliding Window
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
w_{t-m} … w_{t-1} [ w_t … w_{t+n} ] w_{t+n+1} … w_{t+n+m}
prefix | contents | suffix
Standard supervised learning setting:
– Positive instances: windows with a real label
– Negative instances: all other windows
– Features based on candidate, prefix and suffix
Slide from Kauchak
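
As an illustrative sketch of this labeling scheme (assumptions: gold annotations given as exact (start, end) token spans, and sliding_windows from the earlier sketch):

```python
def label_windows(tokens, gold_spans, max_width=4):
    """Pair every candidate window with a 0/1 label.

    Positive (1): windows that exactly match an annotated span.
    Negative (0): all other windows, which vastly outnumber the positives.
    """
    data = []
    for start, end, _ in sliding_windows(tokens, max_width):
        data.append(((start, end), int((start, end) in gold_spans)))
    return data

toks = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
gold = {(2, 6), (8, 10)}  # "Wean Hall Rm 5409", "Sebastian Thrun" (illustrative)
print(sum(lab for _, lab in label_windows(toks, gold)))  # 2 positives
```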

IE by Boundary Detection GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slide from Kauchak

(This slide is repeated several times, stepping through candidate boundaries in the announcement.)

IE by Boundary Detection
Input: linear sequence of tokens
Date : Thursday, October 25 Time : 4 : : 30 PM
How can we pose this as a machine learning problem?
Data + Label -> train a predictive model -> classifier
Slide from Kauchak

IE by Boundary Detection
Input: linear sequence of tokens
Date : Thursday, October 25 Time : 4 : : 30 PM
Method: identify start and end token boundaries
Output: the tokens between the identified start/end boundaries
(The slide marks which boundaries are start/end of content and which are unimportant.)
Slide from Kauchak

Learning: IE as Classification
Learn TWO binary classifiers, one for the beginning and one for the end:
Begin(i) = 1 if i begins a field, 0 otherwise
Date : Thursday, October 25 Time : 4 : : 30 PM
The Begin and End positions are POSITIVE (1); ALL OTHERS are NEGATIVE (0).
Slide from Kauchak
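
A small sketch (my illustration) of how the two sets of training labels could be derived from annotated spans, using half-open (start, end) token spans:

```python
def boundary_labels(n_tokens, gold_spans):
    """Per-position 0/1 labels for the two classifiers.

    begin[i] = 1 iff some annotated field starts at token i;
    end[j]   = 1 iff some annotated field ends just before position j.
    """
    begin = [0] * n_tokens
    end = [0] * (n_tokens + 1)   # boundary positions sit between tokens
    for s, e in gold_spans:
        begin[s] = 1
        end[e] = 1
    return begin, end

toks = "Date : Thursday , October 25".split()
begin, end = boundary_labels(len(toks), {(2, 6)})  # field "Thursday , October 25"
print(begin)  # [0, 0, 1, 0, 0, 0]
print(end)    # [0, 0, 0, 0, 0, 0, 1]
```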

One approach: Boundary Detectors
A "boundary detector" is a pair of token sequences ⟨p, s⟩:
– A detector matches a boundary if p matches the text before the boundary and s matches the text after the boundary
– Detectors can contain wildcards, e.g. "capitalized word", "number", etc.
Example text: Date: Thursday, October 25. Would this boundary detector match anywhere?
Slide from Kauchak

Combining Detectors
Begin boundary detector: ⟨Prefix = <a href=", Suffix = http⟩
End boundary detector: ⟨Prefix = (empty), Suffix = ">⟩
Example text: <a href="http...">text... Match(es)?
Slide from Kauchak

Combining Detectors
Begin boundary detector: ⟨Prefix = <a href=", Suffix = http⟩
End boundary detector: ⟨Prefix = (empty), Suffix = ">⟩
The Begin detector marks the start of the field and the End detector marks its end; the extracted text lies between the two matched boundaries.
Slide from Kauchak
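
A hedged sketch of detector matching under the definition above; the wildcard convention (<CAP>, <NUM>) is my own, and the detector in the usage example is illustrative:

```python
CAP, NUM = "<CAP>", "<NUM>"   # hypothetical wildcard tokens

def tok_match(pat, tok):
    if pat == CAP:
        return tok[:1].isupper()
    if pat == NUM:
        return tok.isdigit()
    return pat == tok

def detector_matches(prefix, suffix, tokens, i):
    """True iff the boundary before token i matches the <prefix, suffix> pair:
    prefix matches the tokens just before the boundary, suffix those after."""
    if len(prefix) > i or i + len(suffix) > len(tokens):
        return False
    before = tokens[i - len(prefix):i]
    after = tokens[i:i + len(suffix)]
    return (all(tok_match(p, t) for p, t in zip(prefix, before)) and
            all(tok_match(s, t) for s, t in zip(suffix, after)))

toks = "Date : Thursday , October 25".split()
print(detector_matches(["Date", ":"], [CAP], toks, 2))  # True: before "Thursday"
```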

Learning: IE as Classification
Learn TWO binary classifiers, one for the beginning and one for the end.
Date : Thursday, October 25 Time : 4 : : 30 PM
The Begin and End positions are POSITIVE (1); ALL OTHERS are NEGATIVE (0).
Say we learn Begin and End: will this be enough? Any improvements? Any ambiguities?
Slide from Kauchak

Some concerns
The independently predicted Begin and End labels must be paired up, and the pairing can be ambiguous:
Begin End Begin … End Begin End … Begin End …
Slide from Kauchak

Learning to detect boundaries
Learn three probabilistic classifiers:
– Begin(i) = probability position i starts a field
– End(j) = probability position j ends a field
– Len(k) = probability an extracted field has length k
Score a possible extraction (i, j) by Begin(i) * End(j) * Len(j - i).
Len(k) is estimated from a histogram of field lengths in the training data.
Begin(i) and End(j) may combine multiple boundary detectors!
Slide from Kauchak
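
A sketch of the scoring step, assuming per-position probabilities from the two classifiers and a length histogram from training; the smoothing and all names are my choices, not from the slides:

```python
from collections import Counter

def best_extraction(begin_p, end_p, len_counts, max_len=8):
    """Pick the span (i, j), inclusive of token j, maximizing
    Begin(i) * End(j) * Len(j - i + 1)."""
    total = sum(len_counts.values()) + max_len          # add-one smoothing
    len_p = lambda k: (len_counts.get(k, 0) + 1) / total
    best, best_score = None, 0.0
    for i, b in enumerate(begin_p):
        for j in range(i, min(i + max_len, len(end_p))):
            score = b * end_p[j] * len_p(j - i + 1)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy inputs: 6 tokens, classifiers fairly sure a field spans tokens 2..5
begin_p = [0.1, 0.1, 0.9, 0.2, 0.1, 0.1]
end_p   = [0.1, 0.1, 0.1, 0.2, 0.1, 0.8]
print(best_extraction(begin_p, end_p, Counter({4: 10, 2: 3})))  # ((2, 5), ...)
```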

Problems with Sliding Windows and Boundary Finders
– Decisions in neighboring parts of the input are made independently of each other.
– A sliding window may predict a "seminar end time" before the "seminar start time".
– It is possible for two overlapping windows to both be above threshold.
– In a boundary-finding system, left boundaries are laid down independently of right boundaries.
Slide from Kauchak

Slide sources
Slides were taken from a wide variety of sources (see the attribution at the bottom right of each slide). I'd particularly like to mention Dave Kauchak of Pomona College.

Next time: machine learning
We will take a break from NER and look at classification in general. We will first focus on learning decision trees from training data, a powerful mechanism for encoding general decisions. Example on the next slide.

Decision Trees
"Should I play tennis today?"
The tree: Outlook branches to Sunny, Overcast, and Rain. Under Sunny, test Humidity (High -> No, Normal -> Yes); Overcast -> Yes; under Rain, test Wind (Strong -> No, Weak -> Yes).
A decision tree can be expressed as a disjunction of conjunctions:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Slide from Mitchell/Ponce
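
To make the correspondence concrete, here is the same tree as a small function (my sketch, not from the slides); each root-to-Yes path is one conjunct of the disjunction above:

```python
def play_tennis(outlook, humidity, wind):
    """The decision tree as nested conditionals.

    Yes-paths: (Sunny and Normal), (Overcast), (Rain and Weak).
    """
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    if outlook == "Rain":
        return wind == "Weak"
    return False

print(play_tennis("Sunny", "Normal", "Strong"))  # True: first conjunct
print(play_tennis("Rain", "High", "Strong"))     # False: no conjunct matches
```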

Thank you for your attention!