WoK: A Web of Knowledge
David W. Embley
Brigham Young University, Provo, Utah, USA



A Web of Pages → A Web of Facts
- Birthdate of my great grandpa Orson
- Price and mileage of red Nissans, 1990 or newer
- Location and size of chromosome 17
- US states with property crime rates above 1%

Toward a Web of Knowledge
Fundamental questions
- What is knowledge?
- What are facts?
- How does one know?
Philosophy
- Ontology
- Epistemology
- Logic and reasoning

Ontology
Existence → asks “What exists?”
Concepts, relationships, and constraints with formal foundation

Epistemology
The nature of knowledge → asks “What is knowledge?” and “How is knowledge acquired?”
Populated conceptual model

Logic and Reasoning
Principles of valid inference: asks “What is known?” and “What can be inferred?”
For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or newer

Making this Work → How?
- Distill knowledge from the wealth of digital web data
- Annotate web pages
- Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Figure labels: Fact, Annotation

Turning Raw Symbols into Knowledge
Symbols: $11,500 117K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
- Car(C123) has Price($11,500)
- Car(C123) has Mileage(117,000)
- Car(C123) has Make(Nissan)
- Car(C123) has Feature(AC)
Knowledge
- “Correct” facts
- Provenance
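
The progression on this slide, from raw symbols to data to conceptualized data, can be sketched in Python. This is only a minimal illustration and not the WoK implementation: the recognizer patterns, the small lexicon of makes, and the function names `to_data` and `conceptualize` are all assumptions made for the example.

```python
import re


def to_data(symbols: str) -> dict:
    """Turn raw symbols into attribute/value data using simple recognizers (assumed patterns)."""
    data = {}
    price = re.search(r"\$\s*(\d{1,3}(?:,\d{3})*)", symbols)
    if price:
        data["Price"] = int(price.group(1).replace(",", ""))
    mileage = re.search(r"(\d+)\s*K\b", symbols)
    if mileage:
        data["Mileage"] = int(mileage.group(1)) * 1000
    make = re.search(r"\b(Nissan|Toyota|Honda)\b", symbols)  # hypothetical lexicon
    if make:
        data["Make"] = make.group(1)
    return data


def conceptualize(car_id: str, data: dict) -> list:
    """Attach each value to a Car object, mirroring 'Car(C123) has Price(...)'."""
    return [("Car", car_id, "has", attr, value) for attr, value in data.items()]


facts = conceptualize("C123", to_data("$11,500 117K Nissan CD AC"))
for fact in facts:
    print(fact)
```

Turning these conceptualized facts into knowledge would additionally require checking that they are correct and recording their provenance, which the sketch does not attempt.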

Actualization (with Extraction Ontologies)
Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Data Extraction Demo

Semantic Annotation Demo

Free-Form Query Demo

Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation

Extraction Ontologies
- Object sets (lexical, non-lexical)
- Relationship sets
- Participation constraints
- Primary object set
- Aggregation
- Generalization/Specialization

Extraction Ontologies
Data Frame:
- Internal Representation: float
- Values
  - External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  - Left Context: $
- Key Word Phrase
  - Key Words: ([Pp]rice)|([Cc]ost)| …
- Operators
  - Operator: >
  - Key Words: (more\s*than)|(more\s*costly)|…
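
To make the parts of a data frame concrete, here is a hedged Python sketch of a recognizer for a Price object set. The class name `PriceFrame`, its method names, and the exact value pattern are assumptions for illustration; the value pattern is a simplified stand-in, not the expression shown on the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PriceFrame:
    """Hypothetical data frame for a Price object set."""
    # External representation: a dollar amount such as "$11,500" or "$5,000.00".
    external_rep: str = r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"
    # Context keywords signalling that the text talks about a price.
    keywords: str = r"[Pp]rice|[Cc]ost"
    # Operator keywords mapped to the comparison operator they denote.
    operators: dict = field(
        default_factory=lambda: {">": r"more\s*than|more\s*costly"}
    )

    def values(self, text: str) -> list:
        """Return recognized values in the internal representation (float)."""
        return [
            float(m.group(1).replace(",", ""))
            for m in re.finditer(self.external_rep, text)
        ]

    def mentioned(self, text: str) -> bool:
        """True if a context keyword for this object set occurs in the text."""
        return re.search(self.keywords, text) is not None

    def operators_in(self, text: str) -> list:
        """Return the operators whose keyword phrases occur in the text."""
        return [op for op, pat in self.operators.items() if re.search(pat, text)]


frame = PriceFrame()
text = "Price: $11,500. I do not want anything that costs more than $12,000."
print(frame.values(text), frame.mentioned(text), frame.operators_in(text))
```

Because the recognizers are declared as data rather than written as page-specific code, the same frame can be applied to any page in the domain.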

Generality & Resiliency of Extraction Ontologies
Generality: assumptions about web pages
- Data rich
- Narrow domain
- Document types
  - Single-record documents (hard, but doable)
  - Multiple-record documents (harder)
  - Records with scattered components (even harder)
Resiliency: declarative
- Still works when web pages change
- Works for new, unseen pages in the same domain
- Scalable, but takes work to declare the extraction ontology

Semantic Annotation

Free-Form Query Interpretation
- Parse Free-Form Query (with respect to data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data

Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms: price, mileage, red, Nissan, 1996
“or newer”: >= Operator

Select Ontology
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest.
- Values and operator keywords determine conditions.
  - Color = “red”
  - Make = “Nissan”
  - Year >= 1996 (>= Operator)
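
The way recognized values and operator keywords become query conditions can be sketched as follows. This is a simplified illustration under stated assumptions: the recognizer patterns, the function names `formulate` and `run`, and the sample car records are hypothetical, and a plain Python filter stands in for whatever query language the system actually generates.

```python
import operator
import re

OPS = {"=": operator.eq, ">=": operator.ge}


def formulate(query: str) -> list:
    """Build a conjunctive list of (attribute, operator, value) conditions."""
    conditions = []
    color = re.search(r"\b(red|blue|white|black)\b", query)
    if color:
        conditions.append(("Color", "=", color.group(1)))
    make = re.search(r"\b(Nissan|Toyota|Honda)", query)
    if make:
        conditions.append(("Make", "=", make.group(1)))
    year = re.search(r"\b(19\d{2}|20\d{2})\b(\s+or\s+newer)?", query)
    if year:
        # The operator keyword phrase "or newer" selects >= instead of =.
        op = ">=" if year.group(2) else "="
        conditions.append(("Year", op, int(year.group(1))))
    return conditions


def run(conditions: list, records: list, wanted: list) -> list:
    """Return the wanted attributes of every record that meets all conditions."""
    hits = [
        r for r in records
        if all(OPS[op](r[attr], value) for attr, op, value in conditions)
    ]
    return [{attr: r[attr] for attr in wanted} for r in hits]


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
conditions = formulate(query)

# Hypothetical annotated data, invented for the example.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1992, "Price": 3000, "Mileage": 180000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000, "Mileage": 90000},
]
result = run(conditions, cars, ["Price", "Mileage"])
print(conditions)
print(result)
```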

Formulate Query Expression
For, Let, Where, Return

Run Query Over Semantically Annotated Data

Great! But Problems Still Need Resolution
How do we create extraction ontologies?
- Manual creation requires several dozen person hours
- Semi-automatic creation
  - TISP (Table Interpretation by Sibling Pages)
  - TANGO (Table ANalysis for Generating Ontologies)
  - Nested Schemas with Regular Expressions
  - Synergistic Bootstrapping
  - Form-based Information Harvesting
How do we scale up?
- Practicalities of technology transfer and usage
- Millions of queries over zillions of facts for thousands of ontologies

Manual Creation

- Library of instance recognizers
- Library of lexicons

Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
- Recognize tables (discard non-tables)
- Locate table labels
- Locate table values
- Find label/value associations

Recognize Tables
- Data Table
- Layout Tables (discard)
- Nested Data Tables

Locate Table Labels
Examples:
- Identification.Gene model(s).Protein
- Identification.Gene model(s).2

Locate Table Labels
Examples:
- Identification.Gene model(s).Gene Model
- Identification.Gene model(s)

Locate Table Values Value

Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE

Interpretation Technique: Sibling Page Comparison

Same

Interpretation Technique: Sibling Page Comparison Almost Same

Interpretation Technique: Sibling Page Comparison Different Same

Technique Details
- Unnest tables
- Match tables in sibling pages
  - “Perfect” match (table for layout → discard)
  - “Reasonable” match (sibling table)
- Determine & use table-structure pattern
  - Discover pattern
  - Pattern usage
  - Dynamic pattern adjustment
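
The core idea of sibling page comparison can be sketched in a few lines of Python: cells that are identical across two sibling tables are likely labels, while cells that differ are likely values. This is a deliberately simplified illustration, not the TISP algorithm itself; the function name `classify_cells` and the cell contents are hypothetical, and the sketch assumes the two tables have already been unnested and aligned cell by cell.

```python
def classify_cells(table_a: list, table_b: list) -> tuple:
    """Compare two aligned sibling tables cell by cell.

    Returns (labels, values) as lists of (row, column) positions:
    cells that are the same in both tables are treated as labels,
    cells that differ are treated as values.
    """
    labels, values = [], []
    for i, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for j, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            (labels if cell_a == cell_b else values).append((i, j))
    return labels, values


# Hypothetical sibling tables: same layout, different record.
page_a = [["Gene model", "Protein"], ["gene-A", "protein-A"]]
page_b = [["Gene model", "Protein"], ["gene-B", "protein-B"]]

labels, values = classify_cells(page_a, page_b)
print("labels:", labels)
print("values:", values)
```

By the same comparison, a table whose cells all match across siblings carries no varying data, which corresponds to the "perfect" match case the slide says to discard as a layout table.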

Generated RDF

WoK Demo (via TISP)

Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies)
- Recognize and normalize table information
- Construct mini-ontologies from tables
- Discover inter-ontology mappings
- Merge mini-ontologies into a growing ontology

Recognize Table Information

                                Religion
             Population         Albanian           Roman     Shi’a    Sunni
Country      (July 2001 est.)   Orthodox   Muslim  Catholic  Muslim   Muslim   other
Afghanistan  26,813,057                                      15%      84%      1%
Albania      3,510,484          20%        70%     10%

Construct Mini-Ontology

                                Religion
             Population         Albanian           Roman     Shi’a    Sunni
Country      (July 2001 est.)   Orthodox   Muslim  Catholic  Muslim   Muslim   other
Afghanistan  26,813,057                                      15%      84%      1%
Albania      3,510,484          20%        70%     10%

Discover Mappings

Merge

Bootstrapping Cost-effective and Accurate Extraction
- Focus on semi-structured elements first
- Bootstrap synergistically
  - Extract from semi-structured elements
  - Learn extraction ontologies
  - Extract from plain text

ListReader: Wrapper Induction for Lists

Part I: Semi-supervised

OCR newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline newline Captain Donald "Dude" Bakken Right Half Back newline LeRoy "Sonny' Johnson ,.... Lcft Half Back newline Orley Bakken , , Quarter Back newline Roger Myhrum Full Back newline Bill "Schnozz" Krohg Center newline Howard "Little Huby" Megorden Right Guard newline Royce "Shorty" Norgaard Left Guard newline Eugene "Mad Russian" Easthind Right Tackle newline Alvin "Stuben" Hagen Left Tackle newline Richard "Dick" Nienabcr Right End newline James "Oakie" Wogsland Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

Hand Form Creation & Labeling

Hand Form Creation & Labeling
Donald | Bakken | Dude | Right Half Back | √

Generate Wrapper for First Record
Record: Captain Donald "Dude" Bakken Right Half Back
Capture groups: 1. Captain, 2. Given Name, 3. Nickname, 4. Surname, 5. Position
Wrapper: (Captain) (\w{6,6}) "(\w{4,4})" (\w{6,6}) \.{14,14} ((\w{4,5}){3,3})\n

Update Wrapper & Annotate Records
Capture groups: 2. Captain, 3. Given Name, 5. Nickname, 6. Surname, 7. Position
Wrapper: ((Captain) )?(\w{5,6})( "(\w{4,5}) ['"] )? (\w{6,7}) [\.,]{14,34} ((\w{4,7} ){2,3})\n

Final Wrapper and Annotation
Capture groups: 2. Captain, 3. Given Name, 5. Nickname, 7. Surname, 8. Position
Wrapper: ((Captain) )?(\w{4,7})( "((\w{4,7}){1,2})['"] )? (\w{5,8} ) [\.,]{14,34} ((\w{4,7} ){1,3})\n
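
As a rough illustration of what a finished list wrapper does, the Python sketch below applies one regular expression with optional parts to roster lines and labels each captured field. The pattern is a simplified stand-in written for this example, not the induced wrapper from the slide: it assumes clean input with no dot leaders or OCR errors, and the names `WRAPPER` and `annotate` are hypothetical.

```python
import re

# Simplified wrapper: optional "Captain" prefix, given name, optional quoted
# nickname, surname, then the rest of the line as the position.
WRAPPER = re.compile(
    r'^(?:(?P<captain>Captain) )?(?P<given>\w+)'
    r'(?: "(?P<nickname>[^"]+)")? (?P<surname>\w+) (?P<position>.+)$'
)


def annotate(line: str):
    """Return the labeled fields of a record, or None if the line is not a record."""
    match = WRAPPER.match(line)
    return match.groupdict() if match else None


records = [
    'Captain Donald "Dude" Bakken Right Half Back',
    'Howard "Little Huby" Megorden Right Guard',
    "Roger Myhrum Full Back",
    "Page 26",
]
annotations = [annotate(line) for line in records]
for line, fields in zip(records, annotations):
    print(line, "->", fields)
```

Making the captain prefix and the nickname optional plays the same role as the `?` groups added when the slide's wrapper is updated: one pattern then covers records that lack those fields. A pattern this loose will also accept some non-record lines, which is one reason the induced wrappers constrain field lengths and delimiters.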

Part II: Weakly-supervised

Apply Extraction Ontologies

Find List and Generate Wrapper
- Base list finding on whether a wrapper can be generated.
- Base wrapper generation on best-labeled record.

Extract Synergistically from Text

Form Creation
Basic form-construction facilities:
- single-entry field
- multiple-entry field
- nested form
- …

Created Sample Form

Generated Ontology View

Source-to-Form Mapping

Almost Ready to Harvest
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
  - Split/Merge
  - Union/Selection

Almost Ready to Harvest …
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
  - Split/Merge
  - Union/Selection
Name: Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3; Outer mitochondrial membrane Protein porin 3


Almost Ready to Harvest …
- Need reading path: DOM-tree structure
- Need to resolve mapping problems
  - Split/Merge
  - Union/Selection
Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15


Can Now Harvest Name

Can Now Harvest Name protein epsilon Mitochondrial import stimulation factor Lsubunit Protein kinase C inhibitor protein-1 KCIP E

Can Now Harvest
Name: Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3; Outer mitochondrial membrane Protein porin 3

Can Now Harvest Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC Tryptophan—tRNA ligase TrpRS (Mt)TrpRS

Harvesting Populates Ontology

Also helps adjust ontology constraints

Can Harvest from Additional Sites
Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15

Automating Extraction Ontology Creation
Lexicons
Harvested Name values:
- protein epsilon Mitochondrial import stimulation factor Lsubunit Protein kinase C inhibitor protein-1 KCIP E
- T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15
- Tryptophanyl-tRNA synthetase, mitochondrial precursor EC Tryptophan—tRNA ligase TrpRS (Mt)TrpRS
Resulting lexicon:
… protein epsilon Mitochondrial import stimulation factor Lsubunit Protein kinase C inhibitor protein-1 KCIP E … T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 … Tryptophanyl-tRNA synthetase, mitochondrial precursor EC Tryptophan—tRNA ligase TrpRS (Mt)TrpRS …
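
One way to picture how harvested values become a lexicon is the small Python sketch below. It is an assumption-laden illustration rather than the system's method: the function names `build_lexicon` and `recognize` are hypothetical, the sample sentence is invented, and naive case-insensitive substring matching stands in for a real instance recognizer.

```python
def build_lexicon(harvested_names: list) -> set:
    """Collect harvested name strings into a lexicon, dropping empty entries."""
    return {name.strip() for name in harvested_names if name.strip()}


def recognize(text: str, lexicon: set) -> list:
    """Return the lexicon entries that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(name for name in lexicon if name.lower() in lowered)


# Names taken from the harvested Name values on the slides.
lexicon = build_lexicon(["VDAC-3", "hVDAC3", "TCP-1-theta", "CCT-theta"])

# Hypothetical new text to annotate.
found = recognize("Both VDAC-3 and TCP-1-theta were detected.", lexicon)
print(found)
```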

Automating Extraction Ontology Creation
Instance Recognizers
- Number Patterns
- Context Keywords and Phrases

Automatic Source-to-Form Mapping

Automatic Semantic Annotation
Recognize and annotate with respect to an ontology

Practicalities: WoK Query Interfaces (Future Work)
- Advanced free-form queries with disjunction and negation
- Form-based query language
- Table-based query languages
- Graphical query languages

Practicalities: Bootstrapping the WoK (Future Work)
- Won’t just happen without sufficient content
- Niche applications
  - Historical Data (e.g. Genealogy)
  - Topical Blogs
- Local WoKs
  - Intra-organizational effort
  - Individual interests

Practicalities: Scalability (Future Work)
- Potential rapid growth
  - Thousands of ontologies
  - Millions of simultaneous queries
  - Billions of annotated pages
  - Trillions of facts
- Search-engine-like caching & query processing

Key to Success: Simplicity via Automation
- Automatic (or near automatic) creation of extraction ontologies
- Automatic (or near automatic) annotation of web pages
- Simple but accurate query specification without specialized training