WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14.


1 WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 14

2 Today’s Topic: Information Extraction Information extraction: Intro Hidden Markov models Evaluation and outlook

3 Text classification vs. information extraction ? TC IE

4 Information Extraction: Definition Given Unstructured text or slightly structured such as html A template with “slots” Common slots: author, date, location, company Information extraction task Analyze document Fill template slots with values extracted from document Author: Smith, date: 30. Aug, location: Rome, company: IBM etc.

5 Classified Advertisements (Real Estate) Background: Advertisements are plain text Text is lowest common denominator: only thing that 70+ newspapers with 20+ publishing systems can all handle 2067206v1 March 02, 1998 MADDINGTON $89,000 OPEN 1.00 - 1.45 U 11 / 10 BERTRAM ST NEW TO MARKET Beautiful 3 brm freestanding villa, close to shops & bus Owner moved to Melbourne ideally suit 1st home buyer, investor & 55 and over. Brian Hazelden 0418 958 996 R WHITE LEEMING 9332 3477

6

7 Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

8 ‘Change of Address’ email Modify address book, etc. For email messages that communicate a change of email address: Automatically extract the new email

9 Product information

10 Product info This is valuable information that is sold by some companies How do they get most of it? Phone calls Typing

11 Other applications of IE Systems Job resumes: BurningGlass, Mohomine Seminar announcements Molecular biology information from MEDLINE, e.g., extracting gene-drug interactions from biomed texts Summarizing medical patient records by extracting diagnoses, symptoms, physical findings, test results. Gathering earnings, profits, board members, etc. [corporate information] from web, company reports Verification of construction industry specifications documents (are the quantities correct/reasonable?) Extraction of political/economic/business changes from newspaper articles

12 Why doesn’t text search (IR) work? What you search for in real estate advertisements: Location: which Suburb. You might think easy, but: Suburb not mentioned Phrases: Only 45 minutes from Parramatta Multiple properties in different suburbs in one ad Money: want a range not a textual match Multiple amounts: was $155K, now $145K Variations: offers in the high 700s [but not rents for $270] Bedrooms: similar issues (br, bdr, beds, B/R)

13 Why doesn’t text search (IR) work? Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor Image Capture Device Total Pixels Approx. 3.34 million Effective Pixels Approx. 3.24 million Image sensor Total Pixels: Approx. 2.11 million-pixel Imaging sensor Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V) CCD Total Pixels: Approx. 3,340,000 (2,140[H] x 1,560 [V] ) Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V] ) Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V] ) These all came off the same manufacturer’s website!! And this is a very technical domain. Try sofa beds.

14 Task: Information Extraction Goal: being able to answer semantic queries (a.k.a. “database queries”) using “unstructured” natural language sources Identify specific pieces of information in an unstructured or semi-structured textual document. Transform this unstructured information into structured relations in a database/ontology. Suppositions: A lot of information that could be represented in a structured semantically clear format isn’t It may be costly, not desired, or not in one’s control (screen scraping) to change this.

15 Amazon Book Description …. The Age of Spiritual Machines : When Computers Exceed Human Intelligence by Ray Kurzweil List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) …

16 Extracted Book Template Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence Author: Ray Kurzweil List-Price: $14.95 Price: $11.96 :

17 Web Extraction Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of pages is fairly specific and regular (semi-structured). However, output is intended for human consumption, not machine interpretation. An IE system for such generated pages allows the web site to be viewed as a structured database. An extractor for a semi-structured web site is sometimes referred to as a wrapper. Process of extracting from such pages is sometimes referred to as screen scraping.

18 Simple Extraction Patterns Specify an item to extract for a slot using a regular expression pattern. Price pattern: “\$\d+(\.\d{2})?\b” May require preceding (pre-filler) pattern to identify proper context. Amazon list price: Pre-filler pattern: “List Price: ” Filler pattern: “\$\d+(\.\d{2})?\b” May require succeeding (post-filler) pattern to identify the end of the filler. Amazon list price: Pre-filler pattern: “List Price: ” Filler pattern: “.+” Post-filler pattern: “ ”
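The pre-filler / filler / post-filler idea can be sketched in a few lines of Python. This is only an illustration: the page fragment, the helper name `extract`, and the "You Save" post-filler are assumptions made for the example, not patterns taken from a real site.

```python
import re

# Hypothetical page fragment; the wording is assumed for illustration.
PAGE = "... List Price: $14.95 Our Price: $11.96 You Save: $2.99 (20%) ..."

# Filler pattern for a price: dollar sign, digits, optional cents.
PRICE = r"\$\d+(?:\.\d{2})?\b"

def extract(text, pre_filler, filler, post_filler=""):
    """Return the first filler found between pre_filler and post_filler, or None."""
    pattern = pre_filler + "(" + filler + ")" + post_filler
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Pre-filler gives the context, the filler pattern itself is specific.
list_price = extract(PAGE, r"List Price:\s*", PRICE)

# Generic filler ".+?" relies on the post-filler to mark where the value ends.
our_price = extract(PAGE, r"Our Price:\s*", r".+?", r"\s+You Save")
```

With a specific filler pattern the post-filler can be left empty; with a generic filler such as `.+?` the post-filler is what bounds the extracted text.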

19 Simple Template Extraction Extract slots in order, starting the search for the filler of the n+1 slot where the filler for the nth slot ended. Assumes slots always in a fixed order. Title Author List price … Make patterns specific enough to identify each filler always starting from the beginning of the document.
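A minimal sketch of in-order slot filling, assuming a fixed slot order of title, author, list price. The three patterns and the name `fill_template` are hypothetical and would need to be adapted to any real page layout.

```python
import re

# Slots in their assumed fixed order; each pattern is a hypothetical example
# whose single capture group is the filler.
SLOT_PATTERNS = [
    ("title", r"(.+?)\s+by\s"),
    ("author", r"(.+?)\s+List Price:"),
    ("list_price", r"\s*(\$\d+(?:\.\d{2})?)"),
]

def fill_template(text):
    """Fill slots in order, resuming each search where the previous match ended."""
    template, pos = {}, 0
    for slot, pattern in SLOT_PATTERNS:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            break  # later slots cannot be located once the order is broken
        template[slot] = match.group(1).strip()
        pos = match.end()
    return template

DOC = ("The Age of Spiritual Machines : When Computers Exceed Human Intelligence "
       "by Ray Kurzweil List Price: $14.95 Our Price: $11.96")
template = fill_template(DOC)
```

Because each search starts at the end of the previous match, the individual patterns can stay short; the cost is that one missed slot derails every slot after it.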

20 Pre-Specified Filler Extraction If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot. Job category Company type Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.

21 Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP may help. Part-of-speech (POS) tagging Mark each word as a noun, verb, preposition, etc. Syntactic parsing Identify phrases: NP, VP, PP Semantic word categories (e.g. from WordNet) KILL: kill, murder, assassinate, strangle, suffocate Extraction patterns can use POS or phrase tags. Crime victim: Prefiller: [POS: V, Hypernym: KILL] Filler: [Phrase: NP]

22 Learning for IE Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: Build a training set of documents paired with human-produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm. Rapier system learns three regex-style patterns for each slot: Pre-filler pattern Filler pattern Post-filler pattern

23 Evaluating IE Accuracy Always evaluate performance on independent, manually-annotated test data not used during system development. Measure for each test document: Total number of correct extractions in the solution template: N Total number of slot/value pairs extracted by the system: E Number of extracted slot/value pairs that are correct (i.e. in the solution template): C Compute average value of metrics adapted from IR: Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision
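The per-document metrics can be computed directly from N, E and C. The sketch below assumes that a solution template and a system output are both represented as sets of (slot, value) pairs; the example values are made up.

```python
def score(gold, extracted):
    """gold, extracted: sets of (slot, value) pairs for one test document."""
    n = len(gold)               # N: correct extractions in the solution template
    e = len(extracted)          # E: slot/value pairs produced by the system
    c = len(gold & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if c else 0.0
    return precision, recall, f1

gold = {("author", "Smith"), ("date", "30. Aug"),
        ("location", "Rome"), ("company", "IBM")}
extracted = {("author", "Smith"), ("location", "Rome"), ("company", "Intel")}
precision, recall, f1 = score(gold, extracted)
```

Here two of the three extracted pairs are correct and two of the four gold pairs are found, so precision is 2/3 and recall is 1/2.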

24 XML and IE If relevant documents were all available in standardized XML format, IE would be unnecessary. But… Difficult to develop a universally adopted DTD format for the relevant domain. Difficult to manually annotate documents with appropriate XML tags. Commercial industry may be reluctant to provide data in easily accessible XML format. IE provides a way of automatically transforming semi-structured or unstructured data into an XML compatible format.

25 Web Extraction using DOM Trees Web extraction may be aided by first parsing web pages into DOM trees. Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract. May still need regex patterns to identify proper portion of the final CharacterData node.

26 Sample DOM Tree Extraction [Figure: DOM tree with Element nodes HTML, HEADER, BODY, B, FONT, A and Character-Data nodes “Age of Spiritual Machines”, “by”, “Ray Kurzweil”] Title: HTML → BODY → B → CharacterData Author: HTML → BODY → FONT → A → CharacterData
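Path-based extraction can be sketched with Python's standard `xml.etree.ElementTree`. The page below is a made-up, well-formed fragment shaped like the sample tree; real HTML is often not well-formed XML and would need a more tolerant parser, so treat this as an illustration of the path idea only.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page mirroring the sample tree.
PAGE = """<html><body>
<b>Age of Spiritual Machines</b>
<font>by <a>Ray Kurzweil</a></font>
</body></html>"""

def text_at(root, path):
    """Follow a path of child tags from the root and return the node's text."""
    node = root.find(path)
    return node.text.strip() if node is not None and node.text else None

root = ET.fromstring(PAGE)          # root is the <html> element
title = text_at(root, "body/b")     # HTML -> BODY -> B -> CharacterData
author = text_at(root, "body/font/a")  # HTML -> BODY -> FONT -> A -> CharacterData
```

A regex can still be applied to the returned string when only part of the character data is wanted.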

27 Shop Bots One application of web extraction is automated comparison shopping systems. System must be able to extract information on items (product specs and prices) from multiple web stores. User queries a single site, which integrates information extracted from multiple web stores and presents overall results to user in a uniform format, e.g. ordered by price. Several commercial systems: MySimon Cnet BookFinder

28 Shop Bots (cont.) Construct wrapper for each source web store. Accept shopping query from user. For each source web store: Submit query to web store. Obtain resulting HTML page. Extract information from page and store in local DB. Sort items in resulting DB by price. Format results into HTML and return result. Alternative is to extract information from all web stores in advance and store in a uniform global DB for subsequent query processing.
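The query-time loop can be sketched as follows. Everything here is a stand-in: the two "stores" are hard-coded strings instead of fetched pages, and a single hypothetical wrapper is shared by both, whereas a real shop bot would have one wrapper per store.

```python
import re

# Stand-ins for result pages; a real bot would submit the query over HTTP.
FAKE_STORES = {
    "store-a": "<li>Age of Spiritual Machines - $11.96</li>",
    "store-b": "<li>Age of Spiritual Machines - $10.50</li>"
               "<li>Other Book - $3.00</li>",
}

def wrapper(html):
    """Hypothetical wrapper: yield (title, price) pairs from one result page."""
    for title, price in re.findall(r"<li>(.+?) - \$(\d+\.\d{2})</li>", html):
        yield title, float(price)

def shop(query):
    """Collect matching items from every store and sort them by price."""
    results = []
    for store, html in FAKE_STORES.items():
        for title, price in wrapper(html):
            if query.lower() in title.lower():
                results.append((price, store, title))
    return sorted(results)

offers = shop("spiritual machines")
```

The alternative design mentioned on the slide would run the wrappers ahead of time and answer queries from the stored results.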

29 Information Integration Answering certain questions using the web requires integrating information from multiple web sites. Information integration concerns methods for automating this integration. Requires wrappers to accurately extract specific information from web pages from specific sites. Treat each wrapped site as a database table and answer complex queries using a database query language (e.g. SQL).

30 Information Integration Example Question: What is the closest theater to my home where I can see both Monsters Inc. and Harry Potter? From austin360.com, extract theaters and their addresses where Harry Potter and Monsters Inc. are playing. Intersect the two to find the theaters playing both. Query mapquest.com for driving directions from your home address to the address of each of these theaters. Extract distance and driving instructions for each. Sort results by driving distance. Present driving instructions for closest theater.

31 Knowledge Extraction Vision Multi-dimensional Meta-data Extraction

32 Knowledge Extraction Vision The vision is to have relational metadata associated with each document / web page. A search engine could then combine the power of information retrieval with the power of an RDBMS. Some search engine queries that this would enable: Find web pages about John Russ Find web pages about books authored by Queen Elizabeth Find web pages about the person who assassinated Lee Harvey Oswald Exercise: Why are these hard queries for current search engine technology?

33 HMMs

34 What is an HMM? Graphical Model Representation: Variables by time Circles indicate states Arrows indicate probabilistic dependencies between states

35 What is an HMM? Green circles are hidden states Dependent only on the previous state: Order-1 Markov process “The past is independent of the future given the present.”

36 What is an HMM? Purple nodes are observed states Dependent only on their corresponding hidden state

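37 HMM Formalism An HMM is a tuple $\{S, K, \Pi, A, B\}$ over hidden states $S$ and observations $K$: $\Pi = \{\pi_i\}$ are the initial state probabilities, $A = \{a_{ij}\}$ are the state transition probabilities, $B = \{b_{ik}\}$ are the observation state probabilities. [Figure: chain of hidden states S linked by transitions A, each emitting an observation K through B]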

38 Applying HMMs to IE Document: generated by a stochastic process modelled by an HMM Token: word State: “reason/explanation” for a given token ‘Background’ state emits tokens like ‘the’, ‘said’, … ‘Money’ state emits tokens like ‘million’, ‘euro’, … ‘Organization’ state emits tokens like ‘university’, ‘company’, … Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.
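A compact Viterbi sketch over the three states named on the slide. All probabilities are made-up illustration values, not trained estimates, and the `UNSEEN` floor is a crude stand-in for proper smoothing.

```python
import math

STATES = ["background", "money", "organization"]
# Made-up parameters for illustration only.
START = {"background": 0.8, "money": 0.1, "organization": 0.1}
TRANS = {
    "background": {"background": 0.6, "money": 0.2, "organization": 0.2},
    "money": {"background": 0.5, "money": 0.4, "organization": 0.1},
    "organization": {"background": 0.5, "money": 0.1, "organization": 0.4},
}
EMIT = {
    "background": {"the": 0.5, "said": 0.3, "paid": 0.2},
    "money": {"million": 0.5, "euro": 0.5},
    "organization": {"university": 0.5, "company": 0.5},
}
UNSEEN = 1e-6  # crude floor for tokens a state never emits

def viterbi(tokens):
    """Return the most likely state sequence for tokens (log-space DP)."""
    def emit(state, tok):
        return math.log(EMIT[state].get(tok, UNSEEN))

    # best[s] = (log-prob of the best path ending in s, that path)
    best = {s: (math.log(START[s]) + emit(s, tokens[0]), [s]) for s in STATES}
    for tok in tokens[1:]:
        new = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda x: x[0],
            )
            new[s] = (score + emit(s, tok), path + [s])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

tags = viterbi(["the", "company", "paid", "million", "euro"])
```

Tokens labelled with a non-background state are the extracted fragments; here "company" falls under `organization` and "million euro" under `money`.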

39 Create Bibliographic Entry Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, pages 237-285, May 1996. Bibliographic Entry Headers

40 HMM for research papers: emissions [Seymore et al., 99] authortitleinstitution Trained on 2 million words of BibTeX data from the Web... note ICML 1997... submission to… to appear in… stochastic optimization... reinforcement learning… model building mobile robot... carnegie mellon university… university of california dartmouth college supported in part… copyright...

41 HMM for research papers: transitions [Seymore et al., 99]

42 Inference for an HMM Analyze a sequence: Compute the probability of a given observation sequence Applying the model: Given an observation sequence, compute the most likely hidden state sequence Learning the model: Given an observation sequence and set of possible models, which model most closely fits the data?

43 Sequence Probability Given an observation sequence $o_1, \dots, o_{t-1}, o_t, o_{t+1}, \dots, o_T$ and a model, compute the probability of the observation sequence
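One standard way to compute this probability is the forward algorithm, which sums over all hidden state paths without enumerating them. The sketch below uses a made-up two-state model and includes a brute-force enumeration purely as a cross-check.

```python
from itertools import product

STATES = ["background", "money"]
# Made-up parameters for illustration only.
START = {"background": 0.7, "money": 0.3}
TRANS = {"background": {"background": 0.8, "money": 0.2},
         "money": {"background": 0.6, "money": 0.4}}
EMIT = {"background": {"the": 0.6, "paid": 0.3, "euro": 0.1},
        "money": {"the": 0.1, "paid": 0.1, "euro": 0.8}}

def forward(obs):
    """P(obs | model) via the forward recursion, O(T * |S|^2)."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[p] * TRANS[p][s] for p in STATES)
                 for s in STATES}
    return sum(alpha.values())

def brute_force(obs):
    """Same quantity by summing over every state path, O(|S|^T)."""
    total = 0.0
    for path in product(STATES, repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][o]
        total += p
    return total

obs = ["the", "paid", "euro"]
prob = forward(obs)
```

For long documents the products underflow, so practical implementations work in log space or rescale `alpha` at each step.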

44 Learning HMMs Good news: If training data tokens are tagged with their generating states, then simple frequency ratios are a maximum-likelihood estimate of transition/emission probabilities. Easy. (Use smoothing to avoid zero probs for emissions/transitions absent in the training data.) Great news: Baum-Welch algorithm trains an HMM using partially labeled or unlabelled training data. Bad news: How many states should the HMM contain? How are transitions constrained? Only semi-good methods exist for finding the answer automatically Insufficiently expressive → unable to model important distinctions (long distance correlations, other features) Overly expressive → sparse training data, overfitting
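For the fully tagged case, the frequency-ratio estimates can be sketched as below. Add-k smoothing is one simple choice among many and is an assumption of this example, as are the function name `train` and the tiny training document; initial state probabilities are omitted for brevity.

```python
from collections import Counter

def train(tagged_docs, states, vocab, k=1.0):
    """Estimate transition and emission probabilities from tagged tokens.

    tagged_docs: list of documents, each a list of (token, state) pairs.
    k: add-k smoothing constant so unseen events keep non-zero probability.
    """
    t_counts, e_counts = Counter(), Counter()
    for doc in tagged_docs:
        for tok, s in doc:
            e_counts[(s, tok)] += 1
        for (_, a), (_, b) in zip(doc, doc[1:]):
            t_counts[(a, b)] += 1
    trans = {a: {b: (t_counts[(a, b)] + k) /
                    (sum(t_counts[(a, x)] for x in states) + k * len(states))
                 for b in states} for a in states}
    emit = {s: {w: (e_counts[(s, w)] + k) /
                   (sum(e_counts[(s, x)] for x in vocab) + k * len(vocab))
                for w in vocab} for s in states}
    return trans, emit

docs = [[("the", "bg"), ("price", "bg"), ("was", "bg"),
         ("5", "money"), ("euro", "money")]]
states = ["bg", "money"]
vocab = ["the", "price", "was", "5", "euro"]
trans, emit = train(docs, states, vocab)
```

With `k=0` these are the plain maximum-likelihood ratios, which assign zero probability to anything absent from the training data.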

45 Learning: Supervised vs Unsupervised Supervised If you have a training set Computation of parameters is simple (but need to use smoothing) Unsupervised EM / Forward-Backward Usually need to start from a model trained in a supervised manner Unlabeled data can further improve a good model

46 Statistical generative models Rapier uses explicit extraction patterns/rules Hidden Markov Models are a powerful alternative based on statistical token sequence generation models rather than explicit extraction patterns. Pros: Well-understood underlying statistical model makes it easy to use wide range of tools from statistical decision theory Portable, broad coverage, robust, good recall Cons: Range of features and patterns usable is limited Memory of 1 for customarily used HMMs Not necessarily as good for complex multi-slot patterns

47 More About Information Extraction

48 Three generations of IE systems Hand-Built Systems – Knowledge Engineering [1980s– ] Rules written by hand Require experts who understand both the systems and the domain Iterative guess-test-tweak-repeat cycle Automatic, Trainable Rule-Extraction Systems [1990s– ] Rules discovered automatically using predefined templates and methods like ILP Require huge, labeled corpora (effort is just moved!) Statistical Generative Models [1997 – ] One decodes the statistical model to find which bits of the text were relevant, using HMMs or statistical parsers Learning usually supervised; may be partially unsupervised

49 Trainable IE systems Pros Annotating text is simpler & faster than writing rules. Domain independent Domain experts don’t need to be linguists or programmers. Learning algorithms ensure full coverage of examples. Cons Hand-crafted systems perform better, especially at hard tasks. Training data might be expensive to acquire May need huge amount of training data Hand-writing rules isn’t that hard!!

50 MUC: the genesis of IE DARPA funded significant efforts in IE in the early to mid 1990’s. Message Understanding Conference (MUC) was an annual event/competition where results were presented. Focused on extracting information from news articles: Terrorist events Industrial joint ventures Company management changes Information extraction of particular interest to the intelligence community (CIA, NSA).

51 Natural Language Processing If extracting from automatically generated web pages, simple regex patterns usually work. If extracting from more natural, unstructured, human-written text, some NLP helps. Part-of-speech (POS) tagging Mark each word as a noun, verb, preposition, etc. Syntactic parsing Identify phrases: NP, VP, PP Semantic word categories (e.g. from WordNet) PRICE: price, amount, cost, … Extraction patterns can use POS or phrase tags. Company location Prefiller: [POS: IN, Word: “in”] Postfiller: [POS:, Semantic Category: US State]

52 What about XML? Don’t XML, RDF, OIL, SHOE, DAML, XSchema, the Semantic Web … obviate the need for information extraction?!??! Yes: IE is sometimes used to “reverse engineer” HTML database interfaces; extraction would be much simpler if XML were exported instead of HTML. Ontology-aware editors will make it easier to enrich content with metadata. No: Terabytes of legacy HTML. Data consumers forced to accept ontological decisions of data providers (eg, John Smith vs. ). Will you annotate every email you send? Every memo you write? Every photograph you scan?

53 Evaluating IE Accuracy Always evaluate performance on independent, manually-annotated test data not used during system development. Example: extract job ads from web Measure for each test document: Total number of job titles occurring in the test data: N Total number of job title candidates extracted by the system: E Number of extracted job titles that are correct (i.e. in the solution template): C Compute average value of metrics adapted from IR: Recall = C/N Precision = C/E F-Measure = Harmonic mean of recall and precision

54 MUC Information Extraction: State of the Art c. 1997 NE – named entity recognition CO – coreference resolution TE – template element construction TR – template relation construction ST – scenario template production

55 Take Away Information extraction (IE) vs Text classification Text classification assigns a class to entire doc Information extraction Extracts phrases from a document Classifies the function of the phrase (author, title,…) We’ve looked at the “fragment extraction” task. Future? Better ways of using domain knowledge More NLP, e.g. syntactic parsing Information extraction beyond fragment extraction: Anaphora resolution, discourse processing,... Fragment extraction is good enough for many Web information services!

56 Take Away (2) Learning IE extractors with HMMs HMMs are generative models Training: Forward/Backward; Application: Viterbi HMMs can be trained on unlabeled text In practice, labeled text is usually needed Indirect / partial labels are key HMM topology can also be learned Applications: What exactly is IE good for? Is there a use for today’s “60%” results? 67% in recent KDD Cup 90% accurate IE could be a revolutionizing technology

57 Good Basic IE References Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/ Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction, IJCAI 1997. http://www.cs.ucd.ie/staff/nick/ Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)

58 More References Mary Elaine Califf and Raymond J. Mooney: Relational Learning of Pattern-Match Rules for Information Extraction. In AAAI 1999: 328-334. Leek, T. R. 1997, Information Extraction using Hidden Markov Models, Master’s thesis, UCSD Bikel, D. M.; Miller, S; Schwartz, R.; and Weischedel, R. 1997, Nymble: a high-performance learning name-finder. In Proceedings of ANLP-97, 194-201. [Also in MLJ 1999] Kristie Seymore, Andrew McCallum, Ronald Rosenfeld, 1999, Learning Hidden Markov Model Structure for Information Extraction, In Proceedings of the AAAI-99 Workshop on ML for IE. Dayne Freitag and Andrew McCallum, 2000, Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI-2000.

59 Rapier: A Rule-Based System

60 Machine Learning Approach Motivation: Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering. Alternative is to use machine learning: Build a training set of documents paired with human-produced filled extraction templates. Learn extraction patterns for each slot using an appropriate machine learning algorithm.

61 Automatic Pattern-Learning Systems Pros: Portable across domains Tend to have broad coverage Automatically finds appropriate patterns System knowledge not needed by those who supply the domain knowledge. Cons: We need annotated training data, and lots of it Isn’t necessarily better or cheaper than hand-built solution Examples: Riloff et al., AutoSlog (UMass); Soderland WHISK (UMass); Mooney et al. Rapier (UTexas): learn lexico-syntactic patterns from templates [Figure: Trainer, Model and Decoder, with language input and answers]

62 Rapier [Califf & Mooney, AAAI-99] Rapier learns three regex-style patterns for each slot: pre-filler pattern, filler pattern, post-filler pattern. One of several recent trainable IE systems that incorporate linguistic constraints. (See also: SIFT [Miller et al, MUC-7]; SRV [Freitag, AAAI-98]; Whisk [Soderland, MLJ-99].) RAPIER rules for extracting “transaction price” “…paid $11M for the company…” “…sold to the bank for an undisclosed amount…” “…paid Honeywell an undisclosed price…”

63 Part-of-speech tags & Semantic classes Part of speech: syntactic role of a specific word noun (nn), proper noun (nnp), adjective (jj), adverb (rb), determiner (dt), verb (vb), “.” (“.”), … NLP: Well-known algorithms for automatically assigning POS tags to English, French, Japanese, … (>95% accuracy) Semantic Classes: Synonyms or other related words “Price” class: price, cost, amount, … “Month” class: January, February, March, …, December “US State” class: Alaska, Alabama, …, Washington, Wyoming WordNet: large on-line thesaurus containing (among other things) semantic classes

64 Rapier rule matching example “…sold to the bank for an undisclosed amount…” POS: vb pr det nn pr det jj nn SClass: price “…paid Honeywell an undisclosed price…” POS: vb nnp det jj nn SClass: price

65 Rapier Rules: Details Rapier rule := pre-filler pattern filler pattern post-filler pattern pattern := subpattern + subpattern := constraint + constraint := Word - exact word that must be present Tag - matched word must have given POS tag Class - semantic class of matched word Can specify disjunction with “{…}” List length N - between 0 and N words satisfying other constraints
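A much-simplified matcher for rules of this shape, to make the constraint types concrete. It is not Rapier's implementation: the list-length constraint is left out, each pattern element consumes exactly one token, and the example rule is hypothetical rather than one Rapier learned.

```python
def satisfies(token, constraint):
    """token: dict with 'word', 'tag' and optionally 'sclass'.
    constraint: dict mapping those keys to a set of allowed values
    (the set plays the role of the {...} disjunction)."""
    return all(token.get(key) in allowed for key, allowed in constraint.items())

def match_rule(tokens, pre, filler, post):
    """Return the words of the first filler span matched in context, else None."""
    pattern = pre + filler + post
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(satisfies(t, c) for t, c in zip(window, pattern)):
            lo = start + len(pre)
            return [t["word"] for t in tokens[lo:lo + len(filler)]]
    return None

def tok(word, tag, sclass=None):
    return {"word": word, "tag": tag, "sclass": sclass}

# Hypothetical rule: a determiner, then an adjective followed by a "price" word.
PRE = [{"tag": {"det"}}]
FILLER = [{"tag": {"jj"}}, {"sclass": {"price"}}]
POST = []

s1 = [tok("paid", "vb"), tok("Honeywell", "nnp"), tok("an", "det"),
      tok("undisclosed", "jj"), tok("price", "nn", "price")]
s2 = [tok("sold", "vb"), tok("to", "pr"), tok("the", "det"), tok("bank", "nn"),
      tok("for", "pr"), tok("an", "det"), tok("undisclosed", "jj"),
      tok("amount", "nn", "price")]
```

Using the semantic class instead of the exact word lets the same rule cover both "undisclosed price" and "undisclosed amount".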

66 Rapier’s Learning Algorithm Input: set of training examples (list of documents annotated with “extract this substring”) Output: set of rules Init: Rules = a rule that exactly matches each training example Repeat several times: Seed: Select M examples randomly and generate the K most-accurate maximally-general filler-only rules (prefiller = postfiller = “true”). Grow: Repeat For N = 1, 2, 3, … Try to improve K best rules by adding N context words of prefiller or postfiller context Keep: Rules = Rules ∪ {the best of the K rules} − subsumed rules

67 Learning example (one iteration) 2 examples: ‘… located in Atlanta, Georgia…’ ‘… offices in Kansas City, Missouri…’ Init: maximally specific rules (high precision, low recall) Seed: maximally general rules (low precision, high recall) Grow: appropriately general rule (high precision, high recall)

68 Rapier Conceptually simple But powerful: it is easy to incorporate linguistic and other constraints Advantages of rules Easy to understand (but can be deceptive) One can “trace” the extraction of a piece of information step by step Rules can be edited manually. Disadvantages Difficult to incorporate quantitative / probabilistic constraints “with offices in San Francisco in Northern California”

