David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge.


1 David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge

2 A Web of Pages → A Web of Facts
– Birthdate of my great grandpa Orson
– Price and mileage of red Nissans, 1990 or newer
– Location and size of chromosome 17
– US states with property crime rates above 1%

3 Toward a Web of Knowledge
Fundamental questions
– What is knowledge?
– What are facts?
– How does one know?
Philosophy
– Ontology
– Epistemology
– Logic and reasoning

4 Ontology
Existence → asks “What exists?”
Concepts, relationships, and constraints with formal foundation

5 Epistemology
The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?”
Populated conceptual model

6 Logic and Reasoning
Principles of valid inference – asks: “What is known?” and “What can be inferred?”
For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or newer

7

8 Making this Work → How?
Distill knowledge from the wealth of digital web data
Annotate web pages
Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Fact Annotation … …

9 Turning Raw Symbols into Knowledge
Symbols: $ 11,500 117K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
– Car(C123) has Price($11,500)
– Car(C123) has Mileage(117,000)
– Car(C123) has Make(Nissan)
– Car(C123) has Feature(AC)
Knowledge
– “Correct” facts
– Provenance
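
The progression on this slide, from raw symbols to data to conceptualized data, can be illustrated with a minimal Python sketch. The recognizer patterns, the `conceptualize` function, and the tuple layout are assumptions made for illustration; they are not the extraction ontology used in the talk.

```python
import re

# Hypothetical recognizers, one per concept. The patterns are illustrative
# assumptions, not the data frames from the presentation.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d{1,3})K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def conceptualize(car_id, text):
    """Turn raw symbols into (object, relationship, value) facts."""
    facts = []
    for concept, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            value = match.group(1)
            if concept == "Price":
                value = int(value.replace(",", ""))   # "11,500" -> 11500
            elif concept == "Mileage":
                value = int(value) * 1000             # "117" (K) -> 117000
            facts.append((f"Car({car_id})", f"has {concept}", value))
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(fact)
```

Turning these facts into knowledge in the slide's sense would also require recording provenance (for example, the source page and character offsets of each match), which this sketch leaves out.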

10 Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.

11 Data Extraction Demo

12 Semantic Annotation Demo

13 Free-Form Query Demo

14 Explanation: How it Works Extraction Ontologies Semantic Annotation Free-Form Query Interpretation

15 Extraction Ontologies Object sets Relationship sets Participation constraints Lexical Non-lexical Primary object set Aggregation Generalization/Specialization

16 Extraction Ontologies
Data Frame:
Internal Representation: float
Values
– External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
– Left Context: $
– Key Word Phrase
– Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
– Operator: >
– Key Words: (more\s*than)|(more\s*costly)|…
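
A data frame of this kind bundles an internal representation, external-representation patterns, context keywords, and operator keywords. The following Python sketch shows one way such a bundle might look. The class name `DataFrameSketch`, its fields, and the slightly tightened price pattern are assumptions for illustration, not the system's actual declaration language.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataFrameSketch:
    """Illustrative stand-in for a data frame of one lexical object set."""
    name: str
    external_rep: re.Pattern                # how instances appear in text
    keywords: re.Pattern                    # context words signalling the concept
    to_internal: Callable[[str], float]     # string -> internal representation
    operators: dict = field(default_factory=dict)  # operator symbol -> keyword pattern


price = DataFrameSketch(
    name="Price",
    # Assumed variant of the slide's pattern that also accepts thousands separators.
    external_rep=re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    keywords=re.compile(r"[Pp]rice|[Cc]ost"),
    to_internal=lambda s: float(s.replace(",", "")),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)


def recognize(frame, text):
    """Return every instance found in text, converted to the internal type."""
    return [frame.to_internal(m.group(1)) for m in frame.external_rep.finditer(text)]


def recognize_operators(frame, text):
    """Return the operator symbols whose keywords occur in text."""
    return [op for op, pattern in frame.operators.items() if pattern.search(text)]


print(recognize(price, "Price: $11,500 or best offer"))
print(recognize_operators(price, "more than $5,000"))
```

Because the frame is purely declarative, the same object can serve both extraction from pages and interpretation of free-form queries.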

17 Generality & Resiliency of Extraction Ontologies
Generality: assumptions about web pages
– Data rich
– Narrow domain
– Document types: single-record documents (hard, but doable); multiple-record documents (harder); records with scattered components (even harder)
Resiliency: declarative
– Still works when web pages change
– Works for new, unseen pages in the same domain
– Scalable, but takes work to declare the extraction ontology

18 Semantic Annotation

19 Free-Form Query Interpretation
Parse Free-Form Query (with respect to data extraction ontology)
Select Ontology
Formulate Query Expression
Run Query Over Semantically Annotated Data

20 Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms: price, mileage, red, Nissan, 1996, or newer (>= Operator)
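
Parsing a free-form query against an extraction ontology amounts to running the ontology's recognizers over the query text. The sketch below is a simplified Python illustration. The keyword tables, the `parse_query` function, and the rule that an operator keyword modifies the value it directly follows are all assumptions, not the system's actual parser.

```python
import re

# Hypothetical recognizers for a car-ad ontology (illustrative assumptions).
OBJECT_SET_KEYWORDS = {
    "Price": r"\bprice\b",
    "Mileage": r"\bmileage\b",
}
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_KEYWORDS = {">=": r"or\s+newer\b"}


def parse_query(query):
    """Return (requested object sets, list of (object set, operator, value))."""
    requested = [name for name, keyword in OBJECT_SET_KEYWORDS.items()
                 if re.search(keyword, query, re.IGNORECASE)]
    conditions = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query)
        if not match:
            continue
        operator_symbol = "="          # default: equality with the recognized value
        rest = query[match.end():]
        for symbol, keyword in OPERATOR_KEYWORDS.items():
            # Assumed rule: an operator keyword right after a value modifies it.
            if re.match(r"\s*" + keyword, rest):
                operator_symbol = symbol
        conditions.append((name, operator_symbol, match.group(1)))
    return requested, conditions


query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = parse_query(query)
print(requested)
print(conditions)
```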

21 Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

22 Formulate Query Expression
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest.
Values and operator keywords determine conditions.
– Color = “red”
– Make = “Nissan”
– Year >= 1996 (>= Operator)

23 Formulate Query Expression
For Let Where Return

24 Run Query Over Semantically Annotated Data
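
The for/where/return shape of the formulated query can be mimicked in Python to show how the conditions select records from annotated data. The records below are made-up sample data, and `run_query` is a hypothetical helper; the actual system runs a query language over semantically annotated pages.

```python
import operator

# Map operator symbols from the query conditions to Python comparisons.
OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}

# Made-up annotated records, standing in for semantically annotated car ads.
annotated_cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 1900, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 6200, "Mileage": 88000},
]


def run_query(records, requested, conditions):
    """Conjunctive query: keep records satisfying every condition."""
    results = []
    for record in records:                                          # "for"
        if all(name in record and OPS[op](record[name], value)
               for name, op, value in conditions):                  # "where"
            results.append({name: record.get(name) for name in requested})  # "return"
    return results


conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
result = run_query(annotated_cars, ["Price", "Mileage"], conditions)
print(result)
```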

25 Great! But Problems Still Need Resolution
How do we create extraction ontologies?
– Manual creation requires several dozen person hours
– Semi-automatic creation: TISP (Table Interpretation by Sibling Pages); TANGO (Table ANalysis for Generating Ontologies); Nested Schemas with Regular Expressions; Synergistic Bootstrapping; Form-based Information Harvesting
How do we scale up?
– Practicalities of technology transfer and usage
– Millions of queries over zillions of facts for thousands of ontologies

26 Manual Creation

27

28
– Library of instance recognizers
– Library of lexicons

29 Automatic Annotation with TISP (Table Interpretation by Sibling Pages)
Recognize tables (discard non-tables)
Locate table labels
Locate table values
Find label/value associations

30 Recognize Tables Data Table Layout Tables (discard) Nested Data Tables

31 Locate Table Labels Examples: Identification.Gene model(s).Protein Identification.Gene model(s).2

32 Locate Table Labels Examples: Identification.Gene model(s).Gene Model Identification.Gene model(s).2

33 Locate Table Values Value

34 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918

35 Interpretation Technique: Sibling Page Comparison

36 Same

37 Interpretation Technique: Sibling Page Comparison Almost Same

38 Interpretation Technique: Sibling Page Comparison Different Same

39 Technique Details
Unnest tables
Match tables in sibling pages
– “Perfect” match (table for layout → discard)
– “Reasonable” match (sibling table)
Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
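
The core idea of sibling page comparison is that cells that stay the same across sibling pages are likely labels and cells that vary are likely values. A rough Python sketch of that idea follows. The function names, the 0.3 threshold for a "reasonable" match, and the sample cell contents are assumptions for illustration; the real technique also handles unnesting and dynamic pattern adjustment, which are omitted here.

```python
def match_ratio(table_a, table_b):
    """Fraction of aligned cells whose text is identical in both tables."""
    cells = [(a, b) for row_a, row_b in zip(table_a, table_b)
             for a, b in zip(row_a, row_b)]
    same = sum(1 for a, b in cells if a == b)
    return same / len(cells) if cells else 0.0


def interpret(table_a, table_b, reasonable=0.3):
    """Classify a table pair; the threshold is an assumed value."""
    ratio = match_ratio(table_a, table_b)
    if ratio == 1.0:
        return "layout"      # "perfect" match: nothing varies, so discard
    if ratio >= reasonable:
        return "sibling"     # "reasonable" match: labels stay, values vary
    return "unrelated"


def associate(table_a, table_b):
    """Pair each varying cell in table_a with the closest unchanged cell to its left."""
    pairs = {}
    for row_a, row_b in zip(table_a, table_b):
        label = None
        for cell_a, cell_b in zip(row_a, row_b):
            if cell_a == cell_b:
                label = cell_a           # unchanged across siblings: a label
            elif label is not None:
                pairs[label] = cell_a    # changed across siblings: a value
    return pairs


# Hypothetical cell contents for two sibling pages.
page_a = [["Gene model", "F47G6.1"], ["Protein", "WP:CE28918"]]
page_b = [["Gene model", "T05B4.3"], ["Protein", "WP:CE00000"]]

print(interpret(page_a, page_b))
print(associate(page_a, page_b))
```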

40 Generated RDF

41 WoK Demo (via TISP)

42 Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)
Recognize and normalize table information
Construct mini-ontologies from tables
Discover inter-ontology mappings
Merge mini-ontologies into a growing ontology

43 Recognize Table Information
Country     | Population (July 2001 est.) | Religion
            |                             | Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057                  |                   |        |                | 15%          | 84%          | 1%
Albania     | 3,510,484                   | 20%               | 70%    | 10%            |              |              |

44 Construct Mini-Ontology
Country     | Population (July 2001 est.) | Religion
            |                             | Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057                  |                   |        |                | 15%          | 84%          | 1%
Albania     | 3,510,484                   | 20%               | 70%    | 10%            |              |              |
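
Once the table is normalized, a mini-ontology can be read off as a set of relationships: Country has Population, and Country has Religion with Percent. The Python sketch below illustrates this under stated assumptions: the assignment of percentages to religion columns is a reconstruction of the flattened slide table, and `to_mini_ontology` is a hypothetical helper, not part of TANGO.

```python
# Normalized table. Column assignment of the percentages is an assumed
# reconstruction of the slide's table; empty strings mark empty cells.
header = ["Country", "Population (July 2001 est.)", "Albanian Orthodox", "Muslim",
          "Roman Catholic", "Shi'a Muslim", "Sunni Muslim", "other"]
rows = [
    ["Afghanistan", "26,813,057", "", "", "", "15%", "84%", "1%"],
    ["Albania", "3,510,484", "20%", "70%", "10%", "", "", ""],
]


def to_mini_ontology(header, rows, first_category_col=2):
    """Derive two relationship sets from the normalized table.

    population:  Country -> Population
    percentages: (Country, Religion, Percent) triples, one per filled cell
    """
    population = {}
    percentages = []
    for row in rows:
        country = row[0]
        population[country] = int(row[1].replace(",", ""))
        for label, cell in zip(header[first_category_col:], row[first_category_col:]):
            if cell:  # skip empty cells
                percentages.append((country, label, int(cell.rstrip("%"))))
    return population, percentages


population, percentages = to_mini_ontology(header, rows)
print(population)
print(percentages)
```

The two resulting relationship sets correspond to the object sets Country, Population, Religion, and Percent that a mini-ontology for this table would need before mappings to other ontologies are discovered.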

45 Discover Mappings

46 Merge

47 Bootstrapping Cost-effective and Accurate Extraction
Focus on semi-structured elements first
Bootstrap synergistically
– Extract from semi-structured elements
– Learn extraction ontologies
– Extract from plain text

48 ListReader: Wrapper Induction for Lists

49 Part I: Semi-supervised

50 OCR newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline newline Captain Donald "Dude" Bakken............... Right Half Back newline LeRoy "Sonny' Johnson..................,.... Lcft Half Back newline Orley Bakken...........,...........,.......... Quarter Back newline Roger Myhrum................................... Full Back newline Bill "Schnozz" Krohg.............................. Center newline Howard "Little Huby" Megorden................ Right Guard newline Royce "Shorty" Norgaard....................... Left Guard newline Eugene "Mad Russian" Easthind............... Right Tackle newline Alvin "Stuben" Hagen......................... Left Tackle newline Richard "Dick" Nienabcr........................ Right End newline James "Oakie" Wogsland.......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

51 Hand Form Creation & Labeling

52 Hand Form Creation & Labeling √

53 Hand Form Creation & Labeling Donald√

54 Hand Form Creation & Labeling DonaldBakken√

55 Hand Form Creation & Labeling DonaldBakkenDude√

56 Hand Form Creation & Labeling DonaldBakkenDude Right Half Back √

57 Generate Wrapper for First Record Captain Donald "Dude" Bakken............... Right Half Back newline LeRoy "Sonny' Johnson..................,.... Lcft Half Back newline Orley Bakken...........,...........,.......... Quarter Back newline Roger Myhrum................................... Full Back newline Bill "Schnozz" Krohg.............................. Center newline Howard "Little Huby" Megorden................ Right Guard newline Royce "Shorty" Norgaard....................... Left Guard newline Eugene "Mad Russian" Easthind............... Right Tackle newline Alvin "Stuben" Hagen......................... Left Tackle newline Richard "Dick" Nienabcr........................ Right End newline James "Oakie" Wogsland.......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline 1. Captain, 2. Given Name, 3. Nickname, 4. Surname, 5. Position (Captain) (\w{6,6}) "(\w{4,4})" (\w{6,6}) \.{14,14} ((\w{4,5}){3,3})\n

58 Update Wrapper & Annotate Records Captain Donald "Dude" Bakken............... Right Half Back newline LeRoy "Sonny' Johnson..................,.... Lcft Half Back newline Orley Bakken...........,...........,.......... Quarter Back newline Roger Myhrum................................... Full Back newline Bill "Schnozz" Krohg.............................. Center newline Howard "Little Huby" Megorden................ Right Guard newline Royce "Shorty" Norgaard....................... Left Guard newline Eugene "Mad Russian" Easthind............... Right Tackle newline Alvin "Stuben" Hagen......................... Left Tackle newline Richard "Dick" Nienabcr........................ Right End newline James "Oakie" Wogsland.......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline 2. Captain, 3. Given Name, 5. Nickname, 6. Surname, 7. Position ((Captain) )?(\w{5,6})( "(\w{4,5}) ['"] )? (\w{6,7}) [\.,]{14,34} ((\w{4,7} ){2,3})\n

59 Final Wrapper and Annotation Captain Donald "Dude" Bakken............... Right Half Back newline LeRoy "Sonny' Johnson..................,.... Lcft Half Back newline Orley Bakken...........,...........,.......... Quarter Back newline Roger Myhrum................................... Full Back newline Bill "Schnozz" Krohg.............................. Center newline Howard "Little Huby" Megorden................ Right Guard newline Royce "Shorty" Norgaard....................... Left Guard newline Eugene "Mad Russian" Easthind............... Right Tackle newline Alvin "Stuben" Hagen......................... Left Tackle newline Richard "Dick" Nienabcr........................ Right End newline James "Oakie" Wogsland.......................... Lcft End newline 2. Captain, 3. Given Name, 5. Nickname, 7. Surname, 8. Position ((Captain) )?(\w{4,7})( "((\w{4,7}){1,2})['"] )? (\w{5,8} ) [\.,]{14,34} ((\w{4,7} ){1,3})\n
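
A regular-expression wrapper of this kind can be tried out directly. The Python sketch below applies a loosened pattern, adapted from the wrappers on these slides, to a few of the OCR records. The named groups, the unbounded quantifiers, and the `annotate` helper are assumptions for illustration; the induced wrapper on the slide uses bounded repetition counts learned from the labeled records.

```python
import re

# A few OCR records from the slide, OCR errors included (e.g. "Lcft").
OCR_TEXT = '''Captain Donald "Dude" Bakken............... Right Half Back
LeRoy "Sonny' Johnson..................,.... Lcft Half Back
Orley Bakken...........,...........,.......... Quarter Back
Bill "Schnozz" Krohg.............................. Center'''

# Assumed, loosened variant of the slide's wrapper with named groups.
WRAPPER = re.compile(
    r"""^(?:(?P<captain>Captain) )?(?P<given>\w+)(?: "(?P<nickname>[\w ]+)['"])? (?P<surname>\w+)[.,]+ (?P<position>(?:\w+ ?){1,3})$"""
)


def annotate(text):
    """Apply the wrapper to each line; return one dict of fields per matched record."""
    records = []
    for line in text.splitlines():
        match = WRAPPER.match(line)
        if match:
            records.append(match.groupdict())
    return records


records = annotate(OCR_TEXT)
for record in records:
    print(record)
```

Optional groups (the captain marker and the nickname) come back as `None` when absent, and accepting either quote character as the closing nickname delimiter tolerates the OCR error in the second record.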

60 Part II: Weakly-supervised

61 Apply Extraction Ontologies

62 Find List and Generate Wrapper Base list finding on whether a wrapper can be generated. Base wrapper generation on best-labeled record.

63 Extract Synergistically from Text

64

65 Form Creation
Basic form-construction facilities:
– single-entry field
– multiple-entry field
– nested form
– …

66 Created Sample Form

67 Generated Ontology View

68 Source-to-Form Mapping

69

70

71

72 Almost Ready to Harvest Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection

73 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

74 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

75 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

76 Almost Ready to Harvest … Need reading path: DOM-tree structure Need to resolve mapping problems – Split/Merge – Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

77 Can Now Harvest Name

78 Can Now Harvest Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E

79 Can Now Harvest Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

80 Can Now Harvest Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS

81 Harvesting Populates Ontology

82 Also helps adjust ontology constraints

83 Can Harvest from Additional Sites Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

84 Automating Extraction Ontology Creation
Lexicons
Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E
Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15
Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS
…
14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E
…
T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15
…
Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS
…

85 Automating Extraction Ontology Creation Instance Recognizers Number Patterns Context Keywords and Phrases
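
Harvested values can feed both kinds of recognizer named here: a lexicon compiled from the harvested names, and a number pattern anchored by a context keyword. The Python sketch below shows one plausible construction. The helper `build_lexicon_recognizer`, the choice of names, the sample sentence, and the EC-number pattern are assumptions for illustration.

```python
import re

# A few protein names taken from the harvesting slides.
harvested_names = [
    "14-3-3 protein epsilon", "KCIP-1", "VDAC-3", "hVDAC3",
    "TCP-1-theta", "CCT-theta", "TrpRS",
]


def build_lexicon_recognizer(entries):
    """Compile lexicon entries into one pattern, longest entries first."""
    alternatives = sorted(set(entries), key=len, reverse=True)
    pattern = "|".join(re.escape(entry) for entry in alternatives)
    # Lookarounds keep a short entry from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)")


# Number pattern with a context keyword: "EC" followed by four dotted numbers.
EC_NUMBER = re.compile(r"\bEC\s+(\d+(?:\.\d+){3})\b")

name_recognizer = build_lexicon_recognizer(harvested_names)
found_names = name_recognizer.findall("Binding of KCIP-1 to hVDAC3 was reduced, unlike VDAC-3.")
found_ec = EC_NUMBER.findall("Tryptophanyl-tRNA synthetase (EC 6.1.1.2)")
print(found_names)
print(found_ec)
```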

86 Automatic Source-to-Form Mapping

87 Automatic Semantic Annotation Recognize and annotate with respect to an ontology

88 Practicalities: WoK Query Interfaces (Future Work)
Advanced free-form queries with disjunction and negation
Form-based query language
Table-based query languages
Graphical query languages

89 Practicalities: Bootstrapping the WoK (Future Work)
Won’t just happen without sufficient content
Niche applications
– Historical Data (e.g. Genealogy)
– Topical Blogs
Local WoKs
– Intra-organizational effort
– Individual interests

90 Practicalities: Scalability (Future Work)
Potential rapid growth
– Thousands of ontologies
– Millions of simultaneous queries
– Billions of annotated pages
– Trillions of facts
Search-engine-like caching & query processing

91 Key to Success: Simplicity via Automation
Automatic (or near automatic) creation of extraction ontologies
Automatic (or near automatic) annotation of web pages
Simple but accurate query specification without specialized training
www.deg.byu.edu

