WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 18.

Today’s Topic: Information Extraction. Information extraction: intro. Hidden Markov models. Evaluation and outlook.

Text classification (TC) vs. information extraction (IE)?

Information Extraction: Definition. Given: unstructured text, or slightly structured text such as HTML; a template with “slots” (common slots: author, date, location, company). Information extraction task: analyze the document and fill the template slots with values extracted from it, e.g. author: Smith, date: 30. Aug, location: Rome, company: IBM, etc.

Classified Advertisements (Real Estate). Background: advertisements are plain text. Text is the lowest common denominator: the only thing that 70+ newspapers with 20+ publishing systems can all handle. Example ad: 2067206v1 March 02, 1998 MADDINGTON $89,000 OPEN 1.00 - 1.45 U 11 / 10 BERTRAM ST NEW TO MARKET Beautiful 3 brm freestanding villa, close to shops & bus Owner moved to Melbourne ideally suit 1st home buyer, investor & 55 and over. Brian Hazelden 0418 958 996 R WHITE LEEMING 9332 3477

Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

‘Change of Address’ email Modify address book, etc. For email messages that communicate a change of email address: Automatically extract the new email

Product information

Product info: this is valuable information that is sold by some companies. How do they get most of it? Phone calls. Typing.

Other applications of IE systems: Job resumes (BurningGlass, Mohomine). Seminar announcements. Molecular biology information from MEDLINE, e.g., extracting gene-drug interactions from biomedical texts. Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, and test results. Gathering earnings, profits, board members, etc. (corporate information) from the web and company reports. Verification of construction industry specification documents (are the quantities correct/reasonable?). Extraction of political/economic/business changes from newspaper articles.

Why doesn’t text search (IR) work? What you search for in real estate advertisements: Location: which suburb. You might think this is easy, but: the suburb is not mentioned; phrases such as “only 45 minutes from Parramatta”; multiple properties in different suburbs in one ad. Money: you want a range, not a textual match. Multiple amounts: was $155K, now $145K. Variations: offers in the high 700s [but not rents for $270]. Bedrooms: similar issues (br, bdr, beds, B/R).
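A minimal sketch of why plain keyword matching falls short here: even the easy cases need normalization before a range query is possible. The patterns and helper names below are illustrative assumptions, not part of any system described in the lecture, and they still miss phrasings such as "offers in the high 700s".

```python
import re

# Hypothetical patterns covering a few of the surface variants listed above.
BEDROOMS = re.compile(r"(\d+)\s*(?:brm|bdr|br|beds?|b/r)\b", re.IGNORECASE)
PRICE = re.compile(r"\$\s*(\d[\d,]*)(k\b)?", re.IGNORECASE)


def bedrooms(text):
    """Return every bedroom count mentioned in the ad text."""
    return [int(n) for n in BEDROOMS.findall(text)]


def prices(text):
    """Return every dollar amount as an integer, expanding a trailing K."""
    return [int(num.replace(",", "")) * (1000 if k else 1)
            for num, k in PRICE.findall(text)]


ad = "MADDINGTON $89,000 Beautiful 3 brm freestanding villa"
print(bedrooms(ad), prices(ad))
print(prices("was $155K, now $145K"))
```

Once the values are numeric, a range condition such as "under $150,000" becomes an ordinary comparison instead of a string match.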

Why doesn’t text search (IR) work?
Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor
Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million
Image sensor Total Pixels: Approx. 2.11 million-pixel
Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)
CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] )
Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] )
Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] )
These all came off the same manufacturer’s website! And this is a very technical domain. Try sofa beds.

Task: Information Extraction. Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources. Identify specific pieces of information in an unstructured or semi-structured textual document. Transform this unstructured information into structured relations in a database/ontology. Suppositions: a lot of information that could be represented in a structured, semantically clear format isn’t. It may be costly, not desired, or not in one’s control (screen scraping) to change this.

Knowledge Extraction Vision: Multi-dimensional Meta-data Extraction

Knowledge Extraction Vision. The vision is to have relational metadata associated with each document / web page. A search engine could then combine the power of information retrieval with the power of an RDBMS. Some search engine queries that this would enable: Find web pages about John Russ. Find web pages about books authored by Queen Elizabeth. Find web pages about the person who assassinated Lee Harvey Oswald. Exercise: Why are these hard queries for current search engine technology?

What is an HMM? Graphical model representation: variables by time. Circles indicate states. Arrows indicate probabilistic dependencies between states.

What is an HMM? Green circles are hidden states. Each is dependent only on the previous state: an order-1 Markov process. “The past is independent of the future given the present.”

What is an HMM? Purple nodes are observed states. Each is dependent only on its corresponding hidden state.

HMM Formalism: $\{S, K, \Pi, A, B\}$. $\Pi = \{\pi_i\}$ are the initial state probabilities. $A = \{a_{ij}\}$ are the state transition probabilities. $B = \{b_{ik}\}$ are the observation state probabilities.
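As a rough illustration, the five components can be held in one small container. This is a sketch under assumptions: S is read as the set of hidden state values and K as the set of observation values (the usual reading of this notation), emissions depend only on the current state, and the two-state toy numbers are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMM:
    states: List[str]                 # S: hidden state values
    vocab: List[str]                  # K: observation values
    pi: Dict[str, float]              # initial state probabilities
    A: Dict[str, Dict[str, float]]    # A[i][j] = P(next state j | state i)
    B: Dict[str, Dict[str, float]]    # B[i][k] = P(observation k | state i)

    def is_valid(self, tol: float = 1e-9) -> bool:
        """Check that pi and every row of A and B sum to 1."""
        rows = [self.pi]
        rows += [self.A[s] for s in self.states]
        rows += [self.B[s] for s in self.states]
        return all(abs(sum(r.values()) - 1.0) < tol for r in rows)

    def joint(self, xs: List[str], obs: List[str]) -> float:
        """Probability of a state sequence together with its observations."""
        p = self.pi[xs[0]] * self.B[xs[0]][obs[0]]
        for t in range(1, len(xs)):
            p *= self.A[xs[t - 1]][xs[t]] * self.B[xs[t]][obs[t]]
        return p


# Invented toy parameters: a 'bg' (background) state and a 'money' state.
toy = HMM(
    states=["bg", "money"],
    vocab=["the", "said", "million"],
    pi={"bg": 0.8, "money": 0.2},
    A={"bg": {"bg": 0.7, "money": 0.3}, "money": {"bg": 0.6, "money": 0.4}},
    B={"bg": {"the": 0.5, "said": 0.4, "million": 0.1},
       "money": {"the": 0.1, "said": 0.1, "million": 0.8}},
)
print(toy.is_valid())
print(toy.joint(["bg", "money"], ["the", "million"]))  # 0.8 * 0.5 * 0.3 * 0.8
```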

Applying HMMs to IE. Document: generated by a stochastic process modelled by an HMM. Token: word. State: “reason/explanation” for a given token. The ‘Background’ state emits tokens like ‘the’, ‘said’, … The ‘Money’ state emits tokens like ‘million’, ‘euro’, … The ‘Organization’ state emits tokens like ‘university’, ‘company’, … Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.
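The generative story can be sketched directly: pick a start state, emit a token, move to the next state, repeat. The state names, tokens, and probabilities below are invented for illustration, and `sample_document` is a hypothetical helper name.

```python
import random

# Invented toy model: two states, each emitting its typical tokens.
pi = {"background": 0.8, "money": 0.2}
A = {"background": {"background": 0.7, "money": 0.3},
     "money": {"background": 0.6, "money": 0.4}}
B = {"background": {"the": 0.5, "said": 0.4, "million": 0.1},
     "money": {"the": 0.1, "said": 0.1, "million": 0.8}}


def draw(dist, rng):
    """Sample one key of `dist` in proportion to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def sample_document(length, rng):
    """Generate (token, hidden state) pairs from the toy HMM."""
    state = draw(pi, rng)
    doc = []
    for _ in range(length):
        doc.append((draw(B[state], rng), state))
        state = draw(A[state], rng)
    return doc


doc = sample_document(6, random.Random(0))
print(doc)
```

Extraction runs this story in reverse: only the tokens are seen, and the states that best explain them are recovered.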

Create Bibliographic Entry. Example: Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, pages 237-285, May 1996. [Figure labels: Headers, Bibliographic Entry.]

HMM for research papers: emissions [Seymore et al., 99]. Trained on 2 million words of BibTeX data from the Web. States shown include author, title, institution, and note. Sample emissions: ICML 1997...; submission to…; to appear in…; stochastic optimization...; reinforcement learning…; model building mobile robot...; carnegie mellon university…; university of california; dartmouth college; supported in part…; copyright...

HMM for research papers: transitions [Seymore et al., 99]

Inference for an HMM. Analyze a sequence: compute the probability of a given observation sequence. Applying the model: given an observation sequence, compute the most likely hidden state sequence. Learning the model: given an observation sequence and a set of possible models, which model most closely fits the data?

Sequence Probability: given an observation sequence and a model, compute the probability of the observation sequence.

[Figure: chain of hidden states x_1 … x_T, each emitting one of the observations o_1 … o_T.]

Sequence Probability: is this a good method?

Sequence Probability: special structure gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations is the same for all possible state sequences of length t + 1. Define:

Forward Procedure
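A small sketch contrasting the two ways of computing the sequence probability: summing over every hidden state sequence (exponential in the sequence length) and the forward procedure (one pass over the trellis). The indexing convention, with `alpha[t][j]` covering the observations up to and including position t, is an assumption, as are the toy parameters.

```python
from itertools import product

states = ["bg", "money"]
pi = {"bg": 0.8, "money": 0.2}
A = {"bg": {"bg": 0.7, "money": 0.3}, "money": {"bg": 0.6, "money": 0.4}}
B = {"bg": {"the": 0.5, "said": 0.4, "million": 0.1},
     "money": {"the": 0.1, "said": 0.1, "million": 0.8}}


def brute_force(obs):
    """Sum the joint probability over all len(states) ** len(obs) paths."""
    total = 0.0
    for xs in product(states, repeat=len(obs)):
        p = pi[xs[0]] * B[xs[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[xs[t - 1]][xs[t]] * B[xs[t]][obs[t]]
        total += p
    return total


def forward(obs):
    """Return alpha, where alpha[t][j] = P(obs[0..t], state at t is j)."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha


def sequence_probability(obs):
    """Probability of the observations: sum of the last forward column."""
    return sum(forward(obs)[-1].values())


obs = ["the", "million", "said", "the"]
print(brute_force(obs), sequence_probability(obs))
```

Both functions return the same value; the forward version does work proportional to T times N squared rather than N to the power T.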

Backward Procedure: probability of the rest of the states given the first state.

Sequence Probability: Forward Procedure, Backward Procedure, Combination.
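The three routes to the sequence probability can be checked against each other numerically. This sketch assumes a 0-based convention in which `alpha[t][i]` includes the observation at position t and `beta[t][i]` covers the observations after position t; other texts shift these indices. The toy parameters are invented.

```python
states = ["bg", "money"]
pi = {"bg": 0.8, "money": 0.2}
A = {"bg": {"bg": 0.7, "money": 0.3}, "money": {"bg": 0.6, "money": 0.4}}
B = {"bg": {"the": 0.5, "said": 0.4, "million": 0.1},
     "money": {"the": 0.1, "said": 0.1, "million": 0.8}}


def forward(obs):
    """alpha[t][j] = P(obs[0..t], state at t is j)."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha


def backward(obs):
    """beta[t][i] = P(obs[t+1..] | state at t is i)."""
    beta = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta


obs = ["the", "million", "said"]
alpha, beta = forward(obs), backward(obs)

p_forward = sum(alpha[-1].values())
p_backward = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in states)
p_combined = [sum(alpha[t][i] * beta[t][i] for i in states)
              for t in range(len(obs))]
print(p_forward, p_backward, p_combined)
```

The combination gives the same probability at every position t, which is what makes the per-position expectations in Baum-Welch well defined.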

Best State Sequence: find the state sequence that best explains the observations. Viterbi algorithm (1967).

Viterbi Algorithm: the state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.

Viterbi Algorithm: recursive computation.

Viterbi Algorithm: compute the most likely state sequence by working backwards.
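A compact sketch of the recursion and the backward trace, on an invented two-state model. The function name and the dictionary-based layout are assumptions made for readability; a production tagger would normally work in log space to avoid underflow on long documents.

```python
states = ["bg", "money"]
pi = {"bg": 0.8, "money": 0.2}
A = {"bg": {"bg": 0.7, "money": 0.3}, "money": {"bg": 0.6, "money": 0.4}}
B = {"bg": {"the": 0.5, "said": 0.4, "million": 0.1},
     "money": {"the": 0.1, "said": 0.1, "million": 0.8}}


def viterbi(obs):
    """Return (most likely state path, its joint probability with obs)."""
    # delta[t][j]: probability of the best path that ends in state j at t.
    delta = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = []  # back[t][j]: best predecessor of state j at position t + 1
    for o in obs[1:]:
        prev = delta[-1]
        row, ptr = {}, {}
        for j in states:
            best_i = max(states, key=lambda i: prev[i] * A[i][j])
            ptr[j] = best_i
            row[j] = prev[best_i] * A[best_i][j] * B[j][o]
        delta.append(row)
        back.append(ptr)
    # Find the biggest value at the last time step, then trace backwards.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[-1][last]


path, prob = viterbi(["the", "million", "said"])
print(path, prob)
```

For extraction, the tokens whose decoded state is a target state (here `money`) are the ones that fill the slot.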

Best State Sequence: Viterbi Algorithm, Trellis View. [Figure: trellis with one row per state 1 … K and one column per time step 1 … T.] Find the biggest value at the last time step, and then trace backwards.

Learning HMMs. Good news: if training data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission probabilities. Easy. (Use smoothing to avoid zero probabilities for emissions/transitions absent in the training data.) Great news: the Baum-Welch algorithm trains an HMM using partially labeled or unlabeled training data. Bad news: How many states should the HMM contain? How are transitions constrained? There are only semi-good answers to finding these automatically. Insufficiently expressive: unable to model important distinctions (long-distance correlations, other features). Overly expressive: sparse training data, overfitting.
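The "good news" case can be sketched in a few lines: count, smooth, normalize. Add-k smoothing is used here only because it is the simplest choice; the lecture does not say which smoothing method to use, and the tiny tagged corpus is invented.

```python
from collections import Counter


def normalize(counts, keys, k):
    """Turn raw counts into add-k smoothed probabilities over `keys`."""
    total = sum(counts[x] for x in keys) + k * len(keys)
    return {x: (counts[x] + k) / total for x in keys}


def train_supervised(tagged_docs, states, vocab, k=1.0):
    """Estimate (pi, A, B) from documents given as (token, state) pairs."""
    init = Counter()
    trans = {s: Counter() for s in states}
    emit = {s: Counter() for s in states}
    for doc in tagged_docs:
        prev = None
        for token, state in doc:
            emit[state][token] += 1
            if prev is None:
                init[state] += 1
            else:
                trans[prev][state] += 1
            prev = state
    pi = normalize(init, states, k)
    A = {s: normalize(trans[s], states, k) for s in states}
    B = {s: normalize(emit[s], vocab, k) for s in states}
    return pi, A, B


docs = [[("the", "bg"), ("million", "money")],
        [("said", "bg"), ("the", "bg")]]
pi, A, B = train_supervised(docs, ["bg", "money"], ["the", "said", "million"])
print(pi, A, B)
```

Note that `B["bg"]["million"]` is nonzero even though the background state never emitted that token in training, which is the point of the smoothing.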

Learning: Supervised vs Unsupervised. Supervised: if you have a training set, computation of parameters is simple (but you need to use smoothing). Unsupervised: EM / Forward-Backward. You usually need to start from a model trained in a supervised manner; unlabeled data can further improve a good model.

Learning = Parameter Estimation: EM/Forward-Backward algorithm. Given an observation sequence, find the model that is most likely to produce that sequence: find parameters so that P(O | model) is maximized. There is no analytic method, so: given a model and an observation sequence, update the model parameters to better fit the observations; hill climb so that P(O | model) goes up.

Parameter Estimation: Baum-Welch or Forward-Backward. Compute the expectation of traversing an arc and the expectation of being in state i.

Parameter Estimation: Baum-Welch or Forward-Backward. Now we can compute the new estimates of the model parameters, using ratios of expectations.
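One re-estimation step can be sketched as follows, for a single observation sequence and without smoothing or scaling. The names `gamma` (expected state occupancy) and `xi` (expected arc traversal) follow common textbook usage and are assumptions here, as are the toy starting parameters.

```python
def forward(obs, states, pi, A, B):
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha


def backward(obs, states, A, B):
    beta = [{i: 1.0 for i in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][o] * nxt[j] for j in states)
                        for i in states})
    return beta


def baum_welch_step(obs, states, vocab, pi, A, B):
    """One EM update; returns new (pi, A, B) and the old likelihood."""
    T = len(obs)
    alpha = forward(obs, states, pi, A, B)
    beta = backward(obs, states, A, B)
    like = sum(alpha[-1].values())
    # E-step: expectation of being in state i at each position.
    gamma = [{i: alpha[t][i] * beta[t][i] / like for i in states}
             for t in range(T)]
    # E-step: expectation of traversing the arc i -> j at each position.
    xi = [{i: {j: alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
               * beta[t + 1][j] / like for j in states} for i in states}
          for t in range(T - 1)]
    # M-step: ratios of expectations.
    new_pi = dict(gamma[0])
    new_A = {i: {j: sum(x[i][j] for x in xi)
                 / sum(gamma[t][i] for t in range(T - 1)) for j in states}
             for i in states}
    new_B = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][i] for t in range(T)) for k in vocab}
             for i in states}
    return new_pi, new_A, new_B, like


states, vocab = ["bg", "money"], ["the", "said", "million"]
pi0 = {"bg": 0.8, "money": 0.2}
A0 = {"bg": {"bg": 0.7, "money": 0.3}, "money": {"bg": 0.6, "money": 0.4}}
B0 = {"bg": {"the": 0.5, "said": 0.4, "million": 0.1},
      "money": {"the": 0.1, "said": 0.1, "million": 0.8}}
obs = ["the", "million", "said", "the", "million"]

pi1, A1, B1, like0 = baum_welch_step(obs, states, vocab, pi0, A0, B0)
_, _, _, like1 = baum_welch_step(obs, states, vocab, pi1, A1, B1)
print(like0, like1)
```

The second call reports the likelihood under the updated parameters, which should not be lower than the first: this is the hill-climbing behaviour described above.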

EM Algorithm. K-Means / Gaussian Mixtures: E-step (expectation): reassignment; M-step (maximization): centroid recomputation. Baum-Welch / Forward-Backward: E-step: expectation of traversal / states; M-step: recompute transition and emission probabilities.

Is this all there is to it? As often with text, the biggest problem is the sparseness of observations (words). Need to use many techniques to do it well: Smoothing (as in NB) to give suitable nonzero probability to unseen events. Featural decomposition (capitalized?, number?, etc.) gives a better estimate. Shrinkage allows pooling of estimates over multiple states of the same type (e.g., prefix states). Well designed (or learned) HMM topology. Partially annotated data: constrained EM.

Statistical generative models. Rapier uses explicit extraction patterns/rules. Hidden Markov Models are a powerful alternative based on statistical token sequence generation models rather than explicit extraction patterns. Pros: a well-understood underlying statistical model makes it easy to use a wide range of tools from statistical decision theory; portable, broad coverage, robust, good recall. Cons: the range of features and patterns usable is limited; memory of 1 for customarily used HMMs; not necessarily as good for complex multi-slot patterns.

More About Information Extraction

Three generations of IE systems. Hand-Built Systems – Knowledge Engineering [1980s– ]: rules written by hand; require experts who understand both the systems and the domain; iterative guess-test-tweak-repeat cycle. Automatic, Trainable Rule-Extraction Systems [1990s– ]: rules discovered automatically using predefined templates and methods like ILP; require huge, labeled corpora (effort is just moved!). Statistical Generative Models [1997 – ]: one decodes the statistical model to find which bits of the text were relevant, using HMMs or statistical parsers; learning is usually supervised and may be partially unsupervised.

Trainable IE systems. Pros: Annotating text is simpler and faster than writing rules. Domain independent. Domain experts don’t need to be linguists or programmers. Learning algorithms ensure full coverage of examples. Cons: Hand-crafted systems perform better, especially at hard tasks. Training data might be expensive to acquire. May need a huge amount of training data. Hand-writing rules isn’t that hard!

MUC: the genesis of IE. DARPA funded significant efforts in IE in the early to mid 1990s. The Message Understanding Conference (MUC) was an annual event/competition where results were presented. It focused on extracting information from news articles: terrorist events, industrial joint ventures, company management changes. Information extraction is of particular interest to the intelligence community (CIA, NSA).

Natural Language Processing. If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP helps. Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc. Syntactic parsing: identify phrases (NP, VP, PP). Semantic word categories (e.g. from WordNet): PRICE: price, amount, cost, … Extraction patterns can use POS or phrase tags. Example, company location: Prefiller: [POS: IN, Word: “in”]; Postfiller: [POS:, Semantic Category: US State].
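For the machine-generated case, a sketch of what "simple regex patterns" can look like. The table markup below is a made-up stand-in for a generated job page, and the pattern would need adjusting for any real site.

```python
import re

# Hypothetical machine-generated markup; real pages will differ.
page = (
    "<tr><td>JobTitle</td><td>Ice Cream Guru</td></tr>"
    "<tr><td>Employer</td><td>foodscience.com</td></tr>"
)

# Each row holds a slot name in the first cell and its value in the second.
ROW = re.compile(r"<td>([^<]+)</td>\s*<td>([^<]+)</td>")

slots = dict(ROW.findall(page))
print(slots)
```

The same approach breaks down on human-written text, where the slot value is not delimited by predictable markup and POS or semantic-class constraints are needed instead.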

What about XML? Don’t XML, RDF, OIL, SHOE, DAML, XSchema, the Semantic Web … obviate the need for information extraction?!??! Yes: IE is sometimes used to “reverse engineer” HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML. Ontology-aware editors will make it easier to enrich content with metadata. No: Terabytes of legacy HTML. Data consumers are forced to accept the ontological decisions of data providers (e.g., John Smith vs. ). Will you annotate every email you send? Every memo you write? Every photograph you scan?

Evaluating IE Accuracy. Always evaluate performance on independent, manually-annotated test data not used during system development. Example: extract job ads from the web. Measure for each test document: N = total number of job titles occurring in the test data; E = total number of job title candidates extracted by the system; C = number of extracted job titles that are correct (i.e. in the solution template). Compute the average value of metrics adapted from IR: Recall = C/N; Precision = C/E; F-Measure = harmonic mean of recall and precision.
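The three metrics can be sketched as below, computed per document and then averaged. Treating a candidate as correct only on an exact string match with the solution template is a simplifying assumption, and the two test documents are invented.

```python
def prf(gold, extracted):
    """Precision, recall, and F-measure for one document's slot fillers."""
    n, e = len(gold), len(extracted)
    c = len(gold & extracted)          # correct = in the solution template
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def macro_average(documents):
    """Average each metric over (gold, extracted) pairs, one per document."""
    scores = [prf(gold, extracted) for gold, extracted in documents]
    return tuple(sum(col) / len(scores) for col in zip(*scores))


documents = [
    ({"Ice Cream Guru", "Food Chemist"}, {"Ice Cream Guru", "Upper Midwest"}),
    ({"Baker"}, {"Baker"}),
]
print(macro_average(documents))
```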

MUC Information Extraction: State of the Art c. 1997. Tasks: NE (named entity recognition), CO (coreference resolution), TE (template element construction), TR (template relation construction), ST (scenario template production).

Take Away. Information extraction (IE) vs text classification: text classification assigns a class to an entire document; information extraction extracts phrases from a document and classifies the function of each phrase (author, title, …). We’ve looked at the “fragment extraction” task. Future? Better ways of using domain knowledge. More NLP, e.g. syntactic parsing. Information extraction beyond fragment extraction: anaphora resolution, discourse processing, ... Fragment extraction is good enough for many Web information services!

Take Away (2). Learning IE extractors with HMMs: HMMs are generative models. Training: Forward/Backward; application: Viterbi. HMMs can be trained on unlabeled text, but in practice labeled text is usually needed; indirect / partial labels are key. HMM topology can also be learned. Applications: What exactly is IE good for? Is there a use for today’s “60%” results? 67% in recent KDD Cup. 90% accurate IE could be a revolutionizing technology.

Good Basic IE References
Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/
Kushmerick, Weld, Doorenbos. 1997. Wrapper Induction for Information Extraction. IJCAI 1997. http://www.cs.ucd.ie/staff/nick/
Stephen Soderland. 1999. Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272.

More References
Mary Elaine Califf and Raymond J. Mooney. 1999. Relational Learning of Pattern-Match Rules for Information Extraction. In AAAI 1999: 328-334.
Leek, T. R. 1997. Information Extraction using Hidden Markov Models. Master’s thesis, UCSD.
Bikel, D. M.; Miller, S.; Schwartz, R.; and Weischedel, R. 1997. Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, 194-201. [Also in MLJ 1999]
Kristie Seymore, Andrew McCallum, Ronald Rosenfeld. 1999. Learning Hidden Markov Model Structure for Information Extraction. In Proceedings of the AAAI-99 Workshop on ML for IE.
Dayne Freitag and Andrew McCallum. 2000. Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI-2000.

Rapier: A Rule-Based System

Machine Learning Approach. Motivation: writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. The alternative is to use machine learning: build a training set of documents paired with human-produced filled extraction templates, then learn extraction patterns for each slot using an appropriate machine learning algorithm.

Automatic Pattern-Learning Systems. Pros: Portable across domains. Tend to have broad coverage. Automatically find appropriate patterns. System knowledge is not needed by those who supply the domain knowledge. Cons: We need annotated training data, and lots of it. Isn’t necessarily better or cheaper than a hand-built solution. Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas): learn lexico-syntactic patterns from templates. [Figure labels: Trainer, Decoder, Model, Language Input, Answers.]

Rapier [Califf & Mooney, AAAI-99]. Rapier learns three regex-style patterns for each slot: a pre-filler pattern, a filler pattern, and a post-filler pattern. It is one of several recent trainable IE systems that incorporate linguistic constraints. (See also: SIFT [Miller et al, MUC-7]; SRV [Freitag, AAAI-98]; Whisk [Soderland, MLJ-99].) RAPIER rules for extracting “transaction price”: “…paid $11M for the company…” “…sold to the bank for an undisclosed amount…” “…paid Honeywell an undisclosed price…”

Part-of-speech tags & Semantic classes. Part of speech: the syntactic role of a specific word: noun (nn), proper noun (nnp), adjective (jj), adverb (rb), determiner (dt), verb (vb), “.” (“.”), … NLP: well-known algorithms exist for automatically assigning POS tags to English, French, Japanese, … (>95% accuracy). Semantic classes: synonyms or other related words. “Price” class: price, cost, amount, … “Month” class: January, February, March, …, December. “US State” class: Alaska, Alabama, …, Washington, Wyoming. WordNet: a large on-line thesaurus containing (among other things) semantic classes.

Rapier rule matching example
“…sold to the bank for an undisclosed amount…”
POS: vb pr det nn pr det jj nn
SClass: price
“…paid Honeywell an undisclosed price…”
POS: vb nnp det jj nn
SClass: price

Rapier Rules: Details
Rapier rule := pre-filler pattern, filler pattern, post-filler pattern
pattern := subpattern +
subpattern := constraint +
constraint := Word (exact word that must be present) | Tag (matched word must have given POS tag) | Class (semantic class of matched word)
Can specify disjunction with “{…}”
List length N: between 0 and N words satisfying other constraints
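A simplified sketch of how such a rule can be matched against tagged text. It is not Rapier's implementation: every pattern element here matches exactly one token (no list-length constructs), the class and field names are invented, and the example rule is hand-written for illustration rather than learned.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Token = Tuple[str, str, Optional[str]]  # (word, POS tag, semantic class)


@dataclass(frozen=True)
class Constraint:
    """Each field is a disjunction; None means 'no constraint'."""
    words: Optional[FrozenSet[str]] = None
    tags: Optional[FrozenSet[str]] = None
    classes: Optional[FrozenSet[str]] = None

    def matches(self, token: Token) -> bool:
        word, tag, sclass = token
        return ((self.words is None or word.lower() in self.words)
                and (self.tags is None or tag in self.tags)
                and (self.classes is None or sclass in self.classes))


@dataclass(frozen=True)
class Rule:
    pre: Tuple[Constraint, ...]
    filler: Tuple[Constraint, ...]
    post: Tuple[Constraint, ...]

    def extract(self, tokens: List[Token]) -> List[str]:
        """Return the filler text of every position where the rule matches."""
        pattern = self.pre + self.filler + self.post
        found = []
        for start in range(len(tokens) - len(pattern) + 1):
            window = tokens[start:start + len(pattern)]
            if all(c.matches(t) for c, t in zip(pattern, window)):
                lo = start + len(self.pre)
                hi = lo + len(self.filler)
                found.append(" ".join(w for w, _, _ in tokens[lo:hi]))
        return found


# Hand-written illustrative rule for the "transaction price" slot.
rule = Rule(
    pre=(Constraint(tags=frozenset({"det"})),),
    filler=(Constraint(words=frozenset({"undisclosed"}),
                       tags=frozenset({"jj"})),),
    post=(Constraint(classes=frozenset({"price"})),),
)

sentence = [("paid", "vb", None), ("Honeywell", "nnp", None),
            ("an", "det", None), ("undisclosed", "jj", None),
            ("price", "nn", "price")]
print(rule.extract(sentence))
```

Because the post-filler constrains the semantic class rather than the word, the same rule also fires on "an undisclosed amount".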

Rapier’s Learning Algorithm
Input: set of training examples (list of documents annotated with “extract this substring”)
Output: set of rules
Init: Rules = a rule that exactly matches each training example
Repeat several times:
Seed: select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = “true”).
Grow: repeat for N = 1, 2, 3, …: try to improve the K best rules by adding N context words of prefiller or postfiller context.
Keep: Rules = Rules ∪ the best of the K rules − subsumed rules

Learning example (one iteration). 2 examples: “… located in Atlanta, Georgia…” and “… offices in Kansas City, Missouri…”. Init: maximally specific rules (high precision, low recall). Seed: maximally general rules (low precision, high recall). Grow: appropriately general rule (high precision, high recall).

Rapier: conceptually simple but powerful; it is easy to incorporate linguistic and other constraints. Advantages of rules: Easy to understand (but can be deceptive). One can “trace” the extraction of a piece of information step by step. Rules can be edited manually. Disadvantages: Difficult to incorporate quantitative / probabilistic constraints, e.g. “with offices in San Francisco in Northern California”.
