CS246 Extracting Structured Information from the Web.

1 CS246 Extracting Structured Information from the Web

2 Junghoo "John" Cho (UCLA Computer Science) A Story of Nightmare: Spam Inc. Task from your boss: from 10M Web pages, find all [person name, email] pairs. Big salary cut unless you collect 100,000 “quality records” in a week.

3 How? Any idea? Why such a task? Information is already there… → To use it for other programs: use the addresses to send emails. For now, let us ignore the techniques in the papers and see how we can approach the problem.

4 Solution 1: Manual approach. 10 sec/record → 8,640 records/day → 60,480 records/week, short of the 100,000 target. Okay if 5 sec/record (120,960 records/week).
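
The arithmetic on this slide can be checked with a minimal sketch. It assumes, as the slide's numbers imply, that the person works around the clock (86,400 seconds per day); the function name `records_per_week` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: nonstop work, as the slide's figures imply


def records_per_week(seconds_per_record: int) -> int:
    """Records one person can collect in 7 days at a fixed pace."""
    return (7 * SECONDS_PER_DAY) // seconds_per_record


for pace in (10, 5):
    print(f"{pace} sec/record -> {records_per_week(pace)} records/week")
```

At 10 sec/record the weekly total stays below the 100,000-record target; at 5 sec/record it clears it.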

5 Solution 2: Write an “extraction rule” as a regular expression. Email: [A-Za-z]+@([A-Za-z]+\.)+[A-Za-z]+ Name: [A-Z][a-z]* [A-Z][a-z]* Find all matches using the rule. Maybe “filter out” bad matches manually.
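
A minimal sketch of this rule-based approach, using Python's `re` module. The email rule is written with the dot escaped (`\.`) and a trailing `+` so that it matches a literal dot and a full final label; the sample sentence and the helper name `extract` are assumptions made for illustration.

```python
import re

# Extraction rules from the slide, with the dot escaped and a trailing "+".
EMAIL_RE = re.compile(r"[A-Za-z]+@(?:[A-Za-z]+\.)+[A-Za-z]+")
NAME_RE = re.compile(r"[A-Z][a-z]* [A-Z][a-z]*")


def extract(text):
    """Return (names, emails) found by the two rules, in order of appearance."""
    return NAME_RE.findall(text), EMAIL_RE.findall(text)


# Hypothetical sample text, not taken from the lecture.
sample = "Please email John Cho (cho@cs.ucla.edu) or Eric Smith (eric@cs.ucla.edu)."
names, emails = extract(sample)
print(names)
print(emails)
```

The name rule is deliberately naive: on a string such as "Contact John Cho" it pairs the capitalized first word with the first name and returns "Contact John", which is the kind of match the manual filtering step would have to remove.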

6 Question: Do we have to construct an “extraction rule” for every task? Can we automate “rule construction”?

7 General Problem: Web pages or plain text → extraction rule or pattern → structured data, e.g., (John, john@cs), (Eric, eric@cs), (James, james@cs). How to generate the rule?

8 Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data.

9 Tagging: Name, Email

10 Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data. The system “generalizes” the examples and derives a “rule” or “patterns”: find common patterns among the tagged pairs.

11 Pattern Generation: Chu chu@cs … Cong cong@cs … Cho cho@cs … → #Name #Email !

12 Basic Idea: Users provide small “examples” or a “training set”: tag some [name, email] pairs from the data. The system “generalizes” the examples and derives a “rule” or “patterns”: find common patterns among the tagged pairs. Use the rule to extract other instances.

13 Fundamental Questions: How to generalize? Examples → patterns: how? (Pattern construction algorithm.) How to express “patterns” or “rules”? Regular expression? Context-free grammar? (Pattern language.) How to select the right pattern? Many possible patterns; which one to choose? (Evaluation function.)

14 Dual Questions: What kind of sources? Unstructured vs. regular (plain text vs. table); noisy vs. clean. What kind of data to extract? Difficult to identify vs. easy to describe (name vs. email); single occurrence vs. multiple occurrences (email vs. song title).

15 Questions?

16 Book and Author paper: How many people understood it? What is the problem? What is the basic idea? How many people got it? How many people liked it? What did you like/hate about the paper?

17 Basic Algorithm (1): Start with a small set of examples: (Isaac Asimov, The Robots of Dawn), (David Brin, Startide Rising). Find all matches from Web pages (with surrounding text): “… Startide Rising by David Brin (2nd …”, “… book The Robots of Dawn by Isaac Asimov (19…”. Derive common patterns among matches → #Book by #Author (

18 Basic Algorithm (2): Find more examples using the pattern #Book by #Author ( → “… The Time Machine by H.G. Wells (…”, “… The Lurker at the Threshold by H.P. Lovecraft (…” → (H.G. Wells, The Time Machine), (H.P. Lovecraft, The Lurker at the Threshold)

19 Basic Algorithm (3): Find more occurrences of the new examples: “… book The Time Machine by H.G. Wells (…”, “… The Lurker at the Threshold by H.P. Lovecraft (…”. Derive more rules based on the matches → #Book by #Author. Repeat the process.

20 Basic Algorithm (Summary): Examples, e.g., (Asimov, Dawn) → Matching strings, e.g., “Dawn by Asimov (” → Patterns, e.g., “#Book by #Author (” → More examples, e.g., (Brin, Star) → repeat.
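
The loop summarized above can be sketched on a toy in-memory corpus. This is a simplified illustration under stated assumptions, not the paper's implementation: a pattern here is a fixed-width context (3 characters before the title, the text between title and author, 2 characters after the author), there is no URL prefix or specificity test, and the corpus strings and function names are made up.

```python
import re


def induce_patterns(corpus, pairs, ctx=3, tail=2):
    """Turn every occurrence of a known (author, title) pair into a pattern.

    A pattern is (prefix, middle, suffix): `ctx` characters before the title,
    the text between title and author, and `tail` characters after the author.
    """
    patterns = set()
    for author, title in pairs:
        rx = re.compile(re.escape(title) + r"(.{1,10}?)" + re.escape(author))
        for doc in corpus:
            for m in rx.finditer(doc):
                prefix = doc[max(0, m.start() - ctx):m.start()]
                suffix = doc[m.end():m.end() + tail]
                if len(prefix) == ctx and len(suffix) == tail:
                    patterns.add((prefix, m.group(1), suffix))
    return patterns


def apply_patterns(corpus, patterns):
    """Extract (author, title) pairs from every string that fits a pattern."""
    pairs = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(
            re.escape(prefix) + r"(.+?)" + re.escape(middle)
            + r"(.+?)" + re.escape(suffix)
        )
        for doc in corpus:
            for m in rx.finditer(doc):
                pairs.add((m.group(2), m.group(1)))
    return pairs


def bootstrap(corpus, seeds, max_iters=5):
    """Alternate examples -> patterns -> examples until nothing new appears."""
    pairs = set(seeds)
    for _ in range(max_iters):
        grown = pairs | apply_patterns(corpus, induce_patterns(corpus, pairs))
        if grown == pairs:
            break
        pairs = grown
    return pairs


# Hypothetical toy corpus: two page layouts that mention books.
corpus = [
    "<li>The Robots of Dawn by Isaac Asimov (hardcover)</li>",
    "<li>Startide Rising by David Brin (paperback)</li>",
    "<li>The Time Machine by H.G. Wells (paperback)</li>",
    "<p><i>The Time Machine</i>, H.G. Wells.</p>",
    "<p><i>Foundation</i>, Isaac Asimov.</p>",
]
seeds = {("Isaac Asimov", "The Robots of Dawn")}
result = bootstrap(corpus, seeds)
print(sorted(result))
```

Starting from one seed, the first layout yields two more pairs; one of those pairs also occurs in the second layout, which produces a second pattern and, through it, a pair that the first pattern could never have found.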

22 Result: 23M Web pages, 5 examples, 5 iterations, 1 manual filtering → 15,257 pairs with few errors.

23 What’s New? No tagging; simple examples. (Pattern, Relation) duality: conceptually elegant. Feedback loop: why don’t we use learned examples? Small initial sample. Promising results.

24 Problems of Feedback Loop: What if there are erroneous examples? Expand to meaningless data?

25 What Did the Author Do? Manual filtering in the 4th iteration. Stopped after 5 iterations. Specificity factor: |middle| × |prefix| × |suffix| × |urlprefix|; adopt a pattern if it has a long prefix, suffix and/or mid-string. Limit rules to a very specific URL space: the rule includes a URL prefix.

26 Divergence? Another experiment. Initial examples: baseball team names. Data: newspaper articles. Results: all sports team names. Given a set of examples, where would it converge?

27 How to Control Divergence? Example → Pattern: require more than k examples. Pattern → Example: require more than k patterns.

28 Matrix Interpretation: Rows: examples (items); we assume a hypothetical set of all examples occurring in the data. Columns: patterns; we assume a hypothetical set of all patterns that can be derived. Cell[i, j] = 1 iff the j-th pattern matches the i-th example. E.g., Row[i] = (Book of worm, Asimov), Column[j] = #Book by #Author, Cell[i, j] = 1 if “Book of worm by Asimov” exists.
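
A minimal sketch of this matrix, assuming patterns are written as Python format strings with `{book}` and `{author}` placeholders and that "matches" means the filled-in string occurs somewhere in the corpus. The corpus, items, and the function name `build_matrix` are made up for illustration.

```python
def build_matrix(items, patterns, corpus):
    """Cell[i][j] = 1 iff pattern j, filled in with item i, occurs in some document."""
    return [
        [
            int(any(p.format(book=book, author=author) in doc for doc in corpus))
            for p in patterns
        ]
        for book, author in items
    ]


# Hypothetical data: rows are (book, author) items, columns are patterns.
corpus = [
    "The Time Machine by H.G. Wells (",
    "H.G. Wells wrote The Time Machine",
    "Startide Rising by David Brin (",
]
items = [("The Time Machine", "H.G. Wells"), ("Startide Rising", "David Brin")]
patterns = ["{book} by {author}", "{author} wrote {book}"]

matrix = build_matrix(items, patterns, corpus)
for item, row in zip(items, matrix):
    print(item, row)
```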

29 Matrix Example (rows: items; columns: patterns):
(A, B)  1 1 0 1 0
(C, D)  1 0 0 1 0
(C, A)  1 1 0 0 1
(D, E)  1 0 0 1 0
(S, L)  1 1 0 1 0
(N, U)  0 0 1 0 0
…

30 How to Control Divergence? Example → Pattern: require more than k examples. Pattern → Example: require more than k patterns. Fix the matrix!
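
One way to read the two "more than k" rules is as thresholds on the item-pattern matrix: keep a pattern only if more than k known items support it, then accept an item only if more than k kept patterns match it. The sketch below follows that reading, which is an interpretation rather than something the slides spell out. It reuses the 0/1 strings from the Matrix Example slide, read as one row per item (also an assumption); the function names and the choice of known rows are hypothetical.

```python
def supported_patterns(matrix, known_rows, k):
    """Columns (patterns) matched by more than k of the known items."""
    n_cols = len(matrix[0])
    return {j for j in range(n_cols) if sum(matrix[i][j] for i in known_rows) > k}


def supported_items(matrix, patterns, k):
    """Rows (items) matched by more than k of the accepted patterns."""
    return {i for i, row in enumerate(matrix) if sum(row[j] for j in patterns) > k}


# 0/1 rows from the Matrix Example slide, assumed to be one row per item.
matrix = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
]
known = {0, 1}  # hypothetical seed items
k = 1

patterns = supported_patterns(matrix, known, k)
items = supported_items(matrix, patterns, k)
print(sorted(patterns), sorted(items))
```

With these numbers, only the patterns shared by both seed items survive, and the item whose single match comes from an unrelated pattern (the last row) is never reached.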

31 How to Change Matrix? Change rows? Filter out noise from data: use only the pages mentioning “books”; classify pages based on word frequency; identify only the “relevant” part of pages; identify only the “structured” part of pages (lists? tables?).

32 How to Change Matrix? Change columns? Use a different pattern language, e.g., the author used “url prefix”. Context-free grammar? What would be a good pattern space?

33 Fundamental Questions: How to express “patterns” or “rules”? Pattern language. How to go from examples → patterns? Pattern construction algorithm. How to select the right one? Evaluation function.

34 Pattern Language? Very limited regular expression with a URL filter: [prefix] #book [midstring] #author [suffix]. The URL filter seems to be important to minimize noise.

35 Pattern Construction Algorithm? 1. Group matching strings based on “mid-string”. 2. Find the longest common prefix, suffix and URL prefix. 3. If the pattern is long enough, adopt it.

36 Evaluation Function? The longer, the better. Specificity factor: |middle| × |prefix| × |suffix| × |urlprefix|. To minimize noise.
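
The three construction steps and the specificity factor can be sketched together. Several details below are assumptions made for illustration: an occurrence is a tuple (url, text before the book, mid-string, text after the author); the pattern's prefix is taken as the longest common suffix of the "before" texts; groups with a single occurrence are skipped; and the threshold value and all sample data are made up.

```python
import os
from collections import defaultdict


def common_prefix(strings):
    """Longest common leading substring (character-wise)."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    """Longest common trailing substring, via reversed common prefix."""
    return common_prefix(s[::-1] for s in strings)[::-1]


def specificity(url_prefix, prefix, middle, suffix):
    """|middle| x |prefix| x |suffix| x |urlprefix|, as on the slide."""
    return len(middle) * len(prefix) * len(suffix) * len(url_prefix)


def construct_patterns(occurrences, min_specificity):
    # Step 1: group occurrences by mid-string.
    groups = defaultdict(list)
    for occ in occurrences:
        groups[occ[2]].append(occ)

    patterns = []
    for middle, group in groups.items():
        if len(group) < 2:  # assumption: one occurrence cannot be generalized
            continue
        # Step 2: longest shared URL prefix, prefix, and suffix.
        candidate = (
            common_prefix(url for url, _, _, _ in group),
            common_suffix(before for _, before, _, _ in group),
            middle,
            common_prefix(after for _, _, _, after in group),
        )
        # Step 3: adopt the pattern only if it is specific ("long") enough.
        if specificity(*candidate) >= min_specificity:
            patterns.append(candidate)
    return patterns


# Hypothetical occurrences: (url, before, mid-string, after).
occurrences = [
    ("http://example.com/books/a.html", "<ul><li>", " by ", " (hardcover)"),
    ("http://example.com/books/b.html", "<p><li>", " by ", " (paperback)"),
    ("http://example.org/x", "see ", ", written by ", "."),
]
patterns = construct_patterns(occurrences, min_specificity=100)
print(patterns)
```

Because the factor is a product, a pattern with any empty component scores zero and is rejected, which is one way the "longer is better" rule keeps overly general patterns out.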

37 Dual Question: Regular vs. unstructured source: relatively regular source required. Noisy vs. clean source: general noise okay. Single vs. multiple occurrences: multiple occurrences.

38 Would It Work? [name, phone number]

39 Would It Work? [name, phone number]? No: [mid-string] not fixed. More expressive pattern language? HTML parse-tree based?

40 Any Other Questions?

41 RoadRunner: What is the problem? What is the main idea?

42 Key Observation: Many Web pages are generated from a structured database. These pages are based on “templates” and thus follow an extremely regular structure. We can extract data by identifying the “different parts”.

43 Key Idea: Compare two pages. Extract the different parts.

44 Simplest Case: Page 1: Books of: John Smith; Title: DB Primer. Page 2: Books of: Paul Jones; Title: XML at Work. Mismatch!

45 Simplest Case: Template (identical in both pages): Books of: Title:

46 Simplest Case: Data: (John Smith, DB Primer), (Paul Jones, XML at Work).

47 What Other Cases?

48 Repeated Items (from Amazon)

49 Missing Items (from Amazon): No image!

50 Varying Items (from Amazon): Item varies!

51 Other Cases: Repeated items: number of items may vary. Missing items: optional. Varying items: multiple choices. How can we express these cases? Pattern language.

52 Pattern Language: What patterns can express the previous cases? Regular expression? Repeated items (+), optional items (?), varying items ( | ). Why not a context-free grammar? More expressive, but not necessary.
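
As a rough illustration of how the three operators cover the three cases, here is a sketch using Python's `re` module on a made-up page layout. The markup, the regular expression, and the sample pages are all hypothetical; they are not taken from the Amazon pages in the lecture.

```python
import re

# One list item: an optional image (?), then a title in either <b> or <i> ( | ).
ITEM = r"<li>(?:<img>)?(?:<b>|<i>)[^<]+(?:</b>|</i>)</li>"
# A page: one or more items (+) inside a list.
PAGE_RE = re.compile(r"<ul>(?:" + ITEM + r")+</ul>")

two_items = "<ul><li><img><b>DB Primer</b></li><li><i>XML at Work</i></li></ul>"
no_items = "<ul></ul>"

print(bool(PAGE_RE.fullmatch(two_items)))
print(bool(PAGE_RE.fullmatch(no_items)))
```

The first page has two items, one with an image and bold title and one without an image and with an italic title, and both fit the same expression; the empty list does not, because `+` requires at least one item.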

53 One Step Back: What are we doing here? How can we formalize the problem? Given a set of strings (instances), find a regular language/grammar that includes the strings: the grammar inference problem. (One of the most important contributions of the paper.)

54 Grammar Inference: T: all possible strings. Example strings. Which one?

55 Minimal Regular Language: Pick the minimal language. Conservative approach; may minimize bogus tuples. Is it the right choice? May not match the actual semantics, but easier to solve and looks fancy! Do the authors actually pick the minimal language? No. They prefer list over optional, and list is larger than optional.

56 Why Union-free? Union is ugly: a major source of exponential blow-up, e.g., (a|b)(c|d)(e|f)(g|h): 2 × 2 × 2 × 2 = 16 combinations. Limited expressive power, but easier to work with.
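
The blow-up can be made concrete by enumerating the strings that (a|b)(c|d)(e|f)(g|h) accepts. A minimal sketch with `itertools.product`; the variable names are arbitrary.

```python
from itertools import product

# Each union contributes an independent two-way choice.
choices = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
strings = ["".join(combo) for combo in product(*choices)]

print(len(strings))
print(strings[:4])
```

Every additional two-way union doubles the count, so n unions in sequence give 2^n alternatives to consider.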

57 Pattern Space of RoadRunner: Minimal union-free regular expression. List (+) and optional (?); no choice ( | ). List has precedence over optional, so not exactly minimal.

58 Language Inference Algorithm: String mismatches: replace with #PCDATA. Tag mismatches: try list first and then optional. Heavily depends on tag mismatches.

59 String Mismatch: Page 1: Books of: John Smith; Title: DB Primer. Page 2: Books of: Paul Jones; Title: XML at Work. → Books of: #PCDATA; Title: #PCDATA
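
The string-mismatch step can be sketched as a token-by-token comparison of two pages that share the same tag structure. This is a minimal illustration, not RoadRunner's algorithm: it handles only string mismatches and raises an error on tag mismatches, and the HTML markup around the slide's text is an assumption, since the transcript shows only the visible text.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text between tags


def tokenize(page):
    """Split a page into tags and non-empty, whitespace-trimmed text strings."""
    return [t.strip() for t in TOKEN_RE.findall(page) if t.strip()]


def generalize(page_a, page_b):
    """Return (template, data_a, data_b) for pages that differ only in strings."""
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("different token counts: not the simplest case")
    template, data_a, data_b = [], [], []
    for ta, tb in zip(a, b):
        if ta == tb:
            template.append(ta)
        elif ta.startswith("<") or tb.startswith("<"):
            raise ValueError("tag mismatch: needs list/optional handling")
        else:
            # String mismatch: a data field. Generalize it to #PCDATA.
            template.append("#PCDATA")
            data_a.append(ta)
            data_b.append(tb)
    return template, data_a, data_b


# Hypothetical markup around the slide's example text.
page_1 = "<p>Books of: <b>John Smith</b></p><p>Title: <i>DB Primer</i></p>"
page_2 = "<p>Books of: <b>Paul Jones</b></p><p>Title: <i>XML at Work</i></p>"

template, data_1, data_2 = generalize(page_1, page_2)
print(" ".join(template))
print(data_1, data_2)
```

The shared tokens become the template, and the mismatching strings are exactly the extracted records.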

60 Tag Mismatches: Try to generalize by list. If that does not work, consider optional.

61 List Identification: Title DB Primer 1; Title DB Primer 2; Title DB Primer 3; Title XML Primer 1; Title XML Primer 2; Missing! Search for the previous tag to identify the end of the item. Verify it by matching with the previous item.

62 Recursive Mismatch: Title DB Primer 1: 1st Edition, 1996. Title DB Primer 2: 1st Edition, 2000; 2nd Edition, 2001. Title XML Primer 1: 1st Edition, 1996. Missing! Apply the matching algorithm recursively.

63 Optional: If list does not work, use optional. For multiple choices, what to choose? Many different choices to consider. The authors do not explain… Some heuristic pruning criteria.

64 Multiple Choices: Mismatch! Potential wrappers: (( )? )+ (( )? )+… and many others

65 Fundamental Questions: Pattern space: union-free regular expression. Example → pattern algorithm: just described. Evaluation function: supposedly minimal language… but not really; exact evaluation function not explained…

66 Dual Question: Regular vs. unstructured source: very regular. Noisy vs. clean source: very clean. Single vs. multiple occurrences: does not matter.

67 Limitations: Heavily dependent on HTML tags: cannot extract data in free text, even if the format is regular, e.g., “John is the author of Great Book”. Very fragile to noise. Of course, limitations from union-free regular expressions. No recursive items: …

68 Potential Improvements? Consider multiple pages simultaneously: may provide more evidence to select one choice over the other.

69 One More Consideration: Is Section 3 necessary? Read the paper without Section 3. Is it still as impressive? Generalization and theoretical background study are always helpful to make a paper more “impressive”.

