Information Extraction PengBo Dec 2, 2010

Topics of today: IE (Information Extraction); Techniques: Wrapper Induction, Sliding Windows, From FST to HMM

What is IE?

Example: The Problem. Martin Baker, a person; genomics jobs; employers' job posting forms

Example: A Solution

Extracting Job Openings from the Web. foodscience.com-Job2: JobTitle: Ice Cream Guru; Employer: foodscience.com; JobCategory: Travel/Hospitality; JobFunction: Food Services; JobLocation: Upper Midwest; Contact Phone: ; DateExtracted: January 8, 2001; Source: ; OtherCompanyJobs: foodscience.com-Job1

Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

Data Mining the Extracted Job Information

Two ways to manage information. Retrieval: issue a query against the raw documents and get back matching text as the answer. Inference: first run IE over the text to produce structured facts such as advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,"William"), fn(vc,"Vitor"), and then answer structured queries such as X: advisor(wc,Y) & affil(X,lti)? {X=em; X=vc}. IE is the step that turns the text into those structured facts.

What is Information Extraction? Recovering structured data from formatted text Identifying fields (e.g. named entity recognition) Understanding relations between fields (e.g. record association) Normalization and deduplication Today, focus mostly on field identification & a little on record association

Applications

IE from Research Papers

IE from Chinese Documents regarding Weather Chinese Academy of Sciences 200k+ documents several millennia old - Qing Dynasty Archives - memos - newspaper articles - diaries

Wrapper Induction

"Wrappers": If we think of things from the database point of view, we want to be able to pose database-style queries. But we have data in some horrid textual form / content management system that doesn't allow such querying. We need to "wrap" the data in a component that understands database-style querying. Hence the term "wrappers".

Title: Schulz and Peanuts: A Biography Author: David Michaelis List Price: $34.95

Wrappers: Simple Extraction Patterns. Specify an item to extract for a slot using a regular expression pattern. Price pattern: "\b\$\d+(\.\d{2})?\b". May require a preceding (pre-filler) pattern and a succeeding (post-filler) pattern to identify the end of the filler. Amazon list price: Pre-filler pattern: "List Price:"; Filler pattern: "\b\$\d+(\.\d{2})?\b"; Post-filler pattern: ""
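A minimal sketch of how such pre-filler / filler / post-filler patterns might be combined in practice; the sample HTML string and the word-boundary post-filler are assumptions made purely for illustration.

```python
import re

# Sketch of the pre-filler / filler / post-filler idea for the list-price slot.
pre_filler  = r"List Price:\s*"           # text expected just before the slot
filler      = r"\$\d+(?:\.\d{2})?"        # the price itself
post_filler = r"\b"                       # end-of-filler boundary (assumed)

pattern = re.compile(pre_filler + "(" + filler + ")" + post_filler)

html = "Title: Schulz and Peanuts: A Biography ... List Price: $34.95 ..."
match = pattern.search(html)
if match:
    print(match.group(1))   # -> $34.95
```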

Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand. Some resources: Wrapper Development Tools; LAPIS

Wrapper Induction. Problem description: Task: learn extraction rules based on labeled examples. Hand-writing rules is tedious, error-prone, and time consuming. Learning wrappers automatically is called wrapper induction.

Induction Learning. Rule induction: formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data. INPUT: labeled examples (training & testing data); admissible rules (hypothesis space); search strategy. Desired output: a rule that performs well both on training and testing data.

Wrapper induction: highly regular source documents → relatively simple extraction patterns → efficient learning algorithm. Build a training set of documents paired with human-produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm.
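As a small illustration of what such a training set might look like (the documents and slot names below are invented for the sketch), each example pairs a document with its human-filled template:

```python
# Invented illustration of "documents paired with filled extraction templates".
training_set = [
    {
        "document": "<B>Congo</B> <I>242</I><BR>",
        "template": {"country": "Congo", "code": "242"},
    },
    {
        "document": "<B>Egypt</B> <I>20</I><BR>",
        "template": {"country": "Egypt", "code": "20"},
    },
]
# A learner would induce, per slot, patterns that map document -> filled slot.
```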

Goal: learn from a human teacher how to extract certain database records from a particular web site.

Learner: the user gives the first K positive examples, and thus many implicit negative examples.

Kushmerick's WIEN system: earliest wrapper-learning system (published IJCAI '97). Special things about WIEN: treats the document as a string of characters; learns to extract a relation directly, rather than extracting fields and then associating them together in some way; each example is a completely labeled page.

WIEN system: a sample wrapper

l1, r1, …, lK, rKl1, r1, …, lK, rK Example: Find 4 strings ,,,   l 1, r 1, l 2, r 2  labeled pages wrapper Some Country Codes Congo 242 Egypt 20 Belize 501 Spain 34 Learning LR wrappers

LR wrapper: left delimiters L1, L2 and right delimiters R1, R2 (one ⟨l, r⟩ pair per extracted field; in the country-codes example these are the markup strings immediately surrounding each country name and code).
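A concrete sketch of how such a wrapper executes, assuming purely for illustration that the country-codes page marks country names with <B>…</B> and codes with <I>…</I>; the page string below is an invented example.

```python
# Minimal LR-wrapper execution sketch; the <B>/<I> markup is an assumption.
def execute_lr(page, delimiters):
    """delimiters = [(l1, r1), (l2, r2), ...], one pair per column."""
    records, pos = [], 0
    while True:
        record = []
        for l, r in delimiters:
            start = page.find(l, pos)
            if start == -1:
                return records          # no more records
            start += len(l)
            end = page.find(r, start)
            record.append(page[start:end])
            pos = end + len(r)
        records.append(tuple(record))

page = ("<HTML>Some Country Codes "
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR></HTML>")

print(execute_lr(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# -> [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```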

LR: Finding r1. Some Country Codes: Congo 242, Egypt 20, Belize 501, Spain 34. r1 can be any common prefix of the text that follows each country name.

LR: Finding l1, l2 and r2. Some Country Codes: Congo 242, Egypt 20, Belize 501, Spain 34. r2 can be any common prefix of the text following each code; l2 can be any common suffix of the text preceding each code; l1 can be any common suffix of the text preceding each country name.

WIEN system Assumes items are always in fixed, known order … Name: J. Doe; Address: 1 Main; Phone: Name: E. Poe; Address: 10 Pico; Phone: … Introduces several types of wrappers LR

Learning LR extraction rules Admissible rules: prefixes & suffixes of items of interest Search strategy: start with shortest prefix & suffix, and expand until correct
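As a rough illustration of that search (a simplification sketched here, not WIEN's exact procedure), the snippet grows a candidate right delimiter while it stays a prefix of the text following every labeled field instance; the example strings assume the country-codes page is marked up with HTML tags as in the earlier sketch.

```python
# Rough, simplified sketch (not WIEN's exact algorithm): grow a candidate
# right delimiter character by character while it remains a prefix of the
# text that follows every labeled instance of the field; WIEN would stop at
# the shortest candidate that also passes its consistency checks.
def learn_right_delimiter(following_texts, max_len=20):
    best = ""
    for length in range(1, max_len + 1):
        candidate = following_texts[0][:length]
        if all(t.startswith(candidate) for t in following_texts):
            best = candidate
        else:
            break
    return best

# Text following each labeled country name (markup assumed as in the sketch above):
print(learn_right_delimiter(["</B> <I>242", "</B> <I>20", "</B> <I>501"]))
# -> "</B> <I>"
```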

Summary of WIEN Advantages: Fast to learn & extract Drawbacks: Cannot handle permutations and missing items Must label entire page Requires large number of examples

Sliding Windows

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

A "Naïve Bayes" Sliding Window Model [Freitag 1997]. Example: "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun", with window positions w_{t-m} … w_{t-1} | w_t … w_{t+n} | w_{t+n+1} … w_{t+n+m} labelled prefix, contents, suffix. If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it. Estimate Pr(LOCATION | window) using Bayes rule. Try all "reasonable" windows (vary length, position). Assume independence for length, prefix words, suffix words, content words. Estimate from data quantities like Pr("Place" in prefix | LOCATION).

A "Naïve Bayes" Sliding Window Model [Freitag 1997]. 1. Create a dataset of examples like these: + (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …); - (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …). 2. Train a Naive Bayes classifier. 3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict the content window is a location. To think about: what if the extracted entities aren't consistent, e.g. if the location overlaps with the speaker?
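A minimal sketch of this recipe; the feature counts, smoothing and scoring details below are invented for illustration and are not Freitag's actual model.

```python
import math
from collections import defaultdict

# Represent a candidate window by prefix/content/suffix word features and
# score it with a Naive Bayes log-odds estimate (invented counts).
def window_features(tokens, start, length, k=2):
    return ([("prefix", w) for w in tokens[max(0, start - k):start]] +
            [("content", w) for w in tokens[start:start + length]] +
            [("suffix", w) for w in tokens[start + length:start + length + k]])

def nb_log_odds(feats, pos, neg, alpha=1.0):
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    return sum(math.log(((pos[f] + alpha) / (total_pos + alpha)) /
                        ((neg[f] + alpha) / (total_neg + alpha))) for f in feats)

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
pos = defaultdict(float, {("prefix", "Place"): 5, ("content", "Hall"): 4})
neg = defaultdict(float, {("content", "Speaker"): 6})

# Score the candidate window "Wean Hall Rm 5409"; in practice every
# reasonable (start, length) pair is tried and thresholded.
print(nb_log_odds(window_features(tokens, 5, 4), pos, neg))
```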

"Naïve Bayes" Sliding Window Results. Domain: CMU UseNet Seminar Announcements (the announcement shown above). Field / F1: Person Name 30%; Location 61%; Start Time 98%.

Finite State Transducers

Finite State Transducers for IE Basic method for extracting relevant information IE systems generally use a collection of specialized FSTs Company Name detection Person Name detection Relationship detection

Finite State Transducers for IE. "Frodo Baggins works for Hobbit Factory, Inc." Text Analyzer: Frodo – Proper Name; Baggins – Proper Name; works – Verb; for – Prep; Hobbit – UnknownCap; Factory – NounCap; Inc – CompAbbr.

Finite State Transducers for IE Frodo Baggins works for Hobbit Factory, Inc. Some regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator” CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
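One lightweight way to run such a pattern is over the tag sequence rather than the raw text. The sketch below encodes each tag as a single letter (an encoding chosen here just for illustration) and applies an ordinary regular expression equivalent to the pattern above.

```python
import re

# Match CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr over token tags.
tagged = [("Frodo", "ProperName"), ("Baggins", "ProperName"), ("works", "Verb"),
          ("for", "Prep"), ("Hobbit", "UnknownCap"), ("Factory", "NounCap"),
          (",", "Comma"), ("Inc", "CompAbbr")]

code = {"ProperName": "P", "UnknownCap": "C", "NounCap": "C",
        "Comma": "M", "CompAbbr": "A"}
tag_string = "".join(code.get(tag, "w") for _, tag in tagged)

for m in re.finditer(r"[PC]+M?A", tag_string):
    print(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
# -> Hobbit Factory , Inc
```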

Finite State Transducers for IE. "Frodo Baggins works for Hobbit Factory, Inc." Company Name Detection FSA, with transitions labelled word, (CAP | PN), CAB, comma, CAB, word; CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string.

Finite State Transducers for IE. "Frodo Baggins works for Hobbit Factory, Inc." Company Name Detection FST, with input → output transitions word → word, (CAP | PN) → ε, CAB → CN, comma → ε, CAB → CN, word → word; CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName. Non-deterministic!!!

Finite State Transducers for IE Several FSTs or a more complex FST can be used to find one type of information (e.g. company names) FSTs are often compiled from regular expressions Probabilistic (weighted) FSTs

Finite State Transducers for IE FSTs mean different things to different researchers in IE. Based on lexical items (words) Based on statistical language models Based on deep syntactic/semantic analysis

Example: FASTUS Finite State Automaton Text Understanding System (SRI International) Cascading FSTs Recognize names Recognize noun groups, verb groups etc Complex noun/verb groups are constructed Identify patterns of interest Identify and merge event structures

Hidden Markov Models

Hidden Markov Models formalism. HMM = states s1, s2, … (special start state s1, special end state sn); token alphabet a1, a2, …; state transition probs P(si | sj); token emission probs P(ai | sj). Widely used in many language processing tasks, e.g., speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, 1998]. HMM = probabilistic FSA.

Applying HMMs to IE. Document: generated by a stochastic process modelled by an HMM. Token: word. State: "reason/explanation" for a given token. The 'Background' state emits tokens like 'the', 'said', …; the 'Money' state emits tokens like 'million', 'euro', …; the 'Organization' state emits tokens like 'university', 'company', …. Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.

HMM for research papers: transitions [Seymore et al., 99]

HMM for research papers: emissions [Seymore et al., 99]. States such as author, title, institution and note, trained on 2 million words of BibTeX data from the Web. Example emissions include: "ICML submission to…", "to appear in…", "stochastic optimization", "reinforcement learning", "model building mobile robot", "carnegie mellon university", "university of california", "dartmouth college", "supported in part…", "copyright…".

What is an HMM? Graphical Model Representation: Variables by time Circles indicate states Arrows indicate probabilistic dependencies between states

What is an HMM? Green circles are hidden states. Dependent only on the previous state: Markov process. "The past is independent of the future given the present."

What is an HMM? Purple nodes are observed states Dependent only on their corresponding hidden state

HMM Formalism {S, K, Π, A, B}: S = {s1 … sN} are the values for the hidden states; K = {k1 … kM} are the values for the observations.

HMM Formalism {S, K, Π, A, B}: Π = {π_i} are the initial state probabilities; A = {a_ij} are the state transition probabilities; B = {b_ik} are the observation (emission) probabilities.
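The {S, K, Π, A, B} formalism maps directly onto simple data structures. Below is a minimal sketch with a two-state, IE-flavored HMM whose states, vocabulary and probabilities are all invented for illustration.

```python
# Toy {S, K, pi, A, B} with a Background and a Location state (invented numbers).
states = ["Background", "Location"]                          # S
vocab  = ["the", "said", "Mumbai", "street"]                 # K
pi = {"Background": 0.8, "Location": 0.2}                    # initial probs
A  = {"Background": {"Background": 0.7, "Location": 0.3},
      "Location":   {"Background": 0.4, "Location": 0.6}}    # transitions a_ij
B  = {"Background": {"the": 0.5, "said": 0.3, "Mumbai": 0.1, "street": 0.1},
      "Location":   {"the": 0.05, "said": 0.05, "Mumbai": 0.5, "street": 0.4}}  # emissions b_ik

# Sanity check: every distribution sums to 1.
assert abs(sum(pi.values()) - 1) < 1e-9
assert all(abs(sum(row.values()) - 1) < 1e-9 for row in list(A.values()) + list(B.values()))
```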

Need to provide the structure of the HMM & vocabulary. Training the model: Baum-Welch algorithm. Efficient dynamic programming algorithms exist for: finding Pr(K); finding the highest probability path S that maximizes Pr(K, S) (Viterbi). [Figure: a small HMM over bibliographic states (Title, Author, Journal, Year) with transition probabilities between states and emission probabilities over example symbols.]

Using the HMM to segment: find the highest probability path through the HMM. Viterbi: quadratic dynamic programming algorithm. [Figure: a trellis of states House, Road, City, Pin over observation tokens such as "115 Grant street Mumbai …".]

Most Likely Path for a Given Sequence: the probability that the path is taken and the sequence is generated is a product of the transition probabilities and the emission probabilities along the path.
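In standard notation (a common way of writing this joint probability, with a for transitions and e for emissions):

```latex
P(x_1 \dots x_L, \pi) \;=\; a_{\mathrm{begin},\,\pi_1}
  \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i,\,\pi_{i+1}}
```

Here π = π1 … πL is the state path, e_{π_i}(x_i) are the emission probabilities, a_{π_i, π_{i+1}} the transition probabilities, and π_{L+1} is taken to be the end state.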

Example: a small HMM with begin and end states and four emitting states over the alphabet {A, C, G, T}, with per-state emission tables (A 0.1, C 0.4, G 0.4, T 0.1), (A 0.4, C 0.1, G 0.1, T 0.4), (A 0.4, C 0.1, G 0.2, T 0.3) and (A 0.2, C 0.3, G 0.3, T 0.2).
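A worked instance of that product, reusing two of the emission tables above (relabelled as states 1 and 2 for the sketch); the 0.5 transition probabilities are assumed, since the figure's arc labels are not given here.

```python
# Worked example of P(path, sequence) = product of transition * emission terms.
emit = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
trans = {("begin", 1): 0.5, (1, 2): 0.5, (2, "end"): 0.5}   # assumed values

path, seq = [1, 2], "AG"
p = trans[("begin", path[0])]
for i, (state, sym) in enumerate(zip(path, seq)):
    p *= emit[state][sym]                                   # emission term
    nxt = path[i + 1] if i + 1 < len(path) else "end"
    p *= trans[(state, nxt)]                                 # transition term
print(round(p, 6))   # 0.5 * 0.4 * 0.5 * 0.4 * 0.5 = 0.02
```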

Finding the most probable path: find the state sequence that best explains the observations o1 … ot … oT. Viterbi algorithm (1967).

Viterbi Algorithm: for each state j and time t, consider the state sequence x1 … x_{t-1}, j which maximizes the probability of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t.

Viterbi Algorithm, recursive computation: δ_{t+1}(j) = max_i [δ_t(i) · a_{ij}] · b_{j, o_{t+1}}, with back-pointers ψ_{t+1}(j) = argmax_i δ_t(i) · a_{ij}.
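A compact sketch of this recursion, reusing the invented two-state HMM from the earlier parameter sketch (all numbers are illustrative).

```python
# Viterbi sketch: most probable state sequence for an observation sequence.
def viterbi(obs, states, pi, A, B):
    delta = {s: pi[s] * B[s][obs[0]] for s in states}        # initialisation
    back = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev, score = max(((t, delta[t] * A[t][s]) for t in states),
                              key=lambda x: x[1])
            new_delta[s], pointers[s] = score * B[s][o], prev
        delta, back = new_delta, back + [pointers]
    best = max(states, key=lambda s: delta[s])               # best final state
    path = [best]
    for pointers in reversed(back):                          # follow back-pointers
        path.append(pointers[path[-1]])
    return list(reversed(path)), delta[best]

states = ["Background", "Location"]
pi = {"Background": 0.8, "Location": 0.2}
A  = {"Background": {"Background": 0.7, "Location": 0.3},
      "Location":   {"Background": 0.4, "Location": 0.6}}
B  = {"Background": {"the": 0.5, "said": 0.3, "Mumbai": 0.1, "street": 0.1},
      "Location":   {"the": 0.05, "said": 0.05, "Mumbai": 0.5, "street": 0.4}}

print(viterbi(["the", "Mumbai", "street"], states, pi, A, B))
# -> (['Background', 'Location', 'Location'], ...)
```

The same trellis, with sums in place of the maxima, gives the forward computation of Pr(K) mentioned earlier.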

Viterbi: Dynamic Programming. [Figure: the House / Road / City / Pin trellis over tokens such as "115 Grant street Mumbai …", filled column by column.]

Viterbi Algorithm: compute the most likely state sequence x1 … xT by working backwards from the best final state.

Hidden Markov Models Summary. Popular technique to detect and classify a linear sequence of information in text. Disadvantage: the need for large amounts of training data. Related work: a system for extraction of gene names and locations from scientific abstracts (Leek, 1997); NERC (Bikel et al., 1997); McCallum et al. (1999) extracted document segments that occur in a fixed or partially fixed order (title, author, journal); Ray and Craven (2001): extraction of proteins, locations, genes and disorders and their relationships.

IE Technique Landscape

IE with Symbolic Techniques. Conceptual Dependency Theory (Schank, 1972; Schank, 1975): mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action). Frame Theory (Minsky, 1975): a frame stores the properties or characteristics of an entity, action or event; it typically consists of a number of slots referring to the properties named by the frame. Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001): an online lexical resource for English, based on frame semantics and supported by corpus evidence. FASTUS (Finite State Automaton Text Understanding System) (Hobbs, 1996): uses a cascade of FSAs in a frame-based information extraction approach.

IE with Machine Learning Techniques. Training data: documents marked up with ground truth. In contrast to text classification, local features are crucial. Features of: contents; text just before the item; text just after the item; begin/end boundaries.

Good Features for Information Extraction. Example word features: identity of the word; is in all caps; ends in "-ski"; is part of a noun phrase; is in a list of city names; is under node X in WordNet or Cyc; is in bold font; is in a hyperlink anchor; features of past & future (e.g. last person name was female, next two words are "and Associates"); begins-with-number; begins-with-ordinal; begins-with-punctuation; begins-with-question-word; begins-with-subject; blank; contains-alphanum; contains-bracketed-number; contains-http; contains-non-space; contains-number; contains-pipe; contains-question-mark; contains-question-word; ends-with-question-mark; first-alpha-is-capitalized; indented; indented-1-to-4; indented-5-to-10; more-than-one-third-space; only-punctuation; prev-is-blank; prev-begins-with-ordinal; shorter-than-30. Creativity and domain knowledge required!

Good Features for Information Extraction (continued). Word features: Is Capitalized; Is Mixed Caps; Is All Caps; Initial Cap; Contains Digit; All lowercase; Is Initial; Punctuation; Period; Comma; Apostrophe; Dash; Preceded by HTML tag; character n-gram classifier says string is a person name (80% accurate); in stopword list (the, of, their, etc); in honorific list (Mr, Mrs, Dr, Sen, etc); in person suffix list (Jr, Sr, PhD, etc); in name particle list (de, la, van, der, etc); in Census lastname list, segmented by P(name); in Census firstname list, segmented by P(name); in locations lists (states, cities, countries); in company name list ("J. C. Penny"); in list of company suffixes (Inc, & Associates, Foundation); lists of job titles; lists of prefixes; lists of suffixes; 350 informative phrases. HTML/Formatting features: {begin, end, in} x {HTML tag types} x {lengths 1, 2, 3, 4, or longer}; {begin, end} of line. Creativity and domain knowledge required!
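A small sketch of what a per-token feature extractor in this spirit might look like; it covers only a handful of the features above, and the function name, feature names and tiny gazetteers are invented for the illustration.

```python
import re

HONORIFICS = {"Mr", "Mrs", "Dr", "Sen"}
CITY_NAMES = {"Pittsburgh", "Mumbai", "Kentucky"}   # stand-in gazetteer

def token_features(tokens, i):
    """Return a few word-identity, shape, lexicon and context features."""
    w = tokens[i]
    return {
        "is_capitalized": w[:1].isupper(),
        "is_all_caps": w.isupper(),
        "contains_digit": any(c.isdigit() for c in w),
        "ends_with_ski": w.lower().endswith("ski"),
        "in_honorific_list": w.strip(".") in HONORIFICS,
        "in_city_list": w in CITY_NAMES,
        "begins_with_number": bool(re.match(r"\d", w)),
        "prev_token": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_token": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
    }

tokens = "Abraham Lincoln was born in Kentucky .".split()
print(token_features(tokens, 5))   # features for "Kentucky"
```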

Landscape of ML Techniques for IE: any of these models can be used to capture words, formatting, or both. Illustrated on "Abraham Lincoln was born in Kentucky.": Classify Candidates (classifier: which class?); Sliding Window (classifier: which class? try alternate window sizes); Boundary Models (classifiers predict BEGIN and END boundaries); Finite State Machines (most likely state sequence?); Wrapper Induction (learn and apply a pattern for a website, e.g. for PersonName).

IE History. Pre-Web: mostly news articles. De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire. Message Understanding Conference (MUC): DARPA ['87-'95], TIPSTER ['92-'96]. Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs. But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]. Web: AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni). Tom Mitchell's WebKB, '96: build KBs from the Web. Wrapper Induction: initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Summary: Information Extraction; Sliding Window; From FST (Finite State Transducer) to HMM; Wrapper Induction: wrapper toolkits, LR wrapper. [Figures: Finite State Machines – "Abraham Lincoln was born in Kentucky.", most likely state sequence?; Sliding Window – "Abraham Lincoln was born in Kentucky.", classifier: which class? Try alternate window sizes.]

Readings: [1] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical approach to wrapper induction," in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999.

Thank You! Q&A