
Toward Tomorrow’s Semantic Web
An Approach Based on Information Extraction Ontologies
David W. Embley, Brigham Young University
Funded in part by the National Science Foundation

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

Grand Challenge Semantic Understanding Can we quantify & specify the nature of this grand challenge?

Grand Challenge Semantic Understanding “If ever there were a technology that could generate trillions of dollars in savings worldwide …, it would be the technology that makes business information systems interoperable.” (Jeffrey T. Pollock, VP of Technology Strategy, Modulant Solutions)

Grand Challenge Semantic Understanding “The Semantic Web: … content that is meaningful to computers [and that] will unleash a revolution of new possibilities … Properly designed, the Semantic Web can assist the evolution of human knowledge …” (Tim Berners-Lee, …, Weaving the Web)

Grand Challenge Semantic Understanding “20th Century: Data Processing. 21st Century: Data Exchange. The issue now is mutual understanding.” (Stefano Spaccapietra, Editor in Chief, Journal on Data Semantics)

Grand Challenge Semantic Understanding “The Grand Challenge [of semantic understanding] has become mission critical. Current solutions … won’t scale. Businesses need economic growth dependent on the web working and scaling (cost: $1 trillion/year).” (Michael Brodie, Chief Scientist, Verizon Communications)

What is Semantic Understanding? Understanding: “To grasp or comprehend [what’s] intended or expressed.” Semantics: “The meaning or the interpretation of a word, sentence, or other language form.” - Dictionary.com

Can We Achieve Semantic Understanding? “A computer doesn’t truly ‘understand’ anything.” But computers can manipulate terms “in ways that are useful and meaningful to the human user.” - Tim Berners-Lee Key Point: it only has to be good enough. And that’s our challenge and our opportunity! …

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

Information Value Chain: Data → Information → Knowledge → Meaning (translating data into meaning)

Foundational Definitions  Meaning: knowledge that is relevant or activates  Knowledge: information with a degree of certainty or community agreement (ontology)  Information: data in a conceptual framework  Data: attribute-value pairs - Adapted from [Meadow92]

Data  Attribute-Value Pairs Fundamental for information Thus, fundamental for knowledge & meaning

Data  Attribute-Value Pairs Fundamental for information Thus, fundamental for knowledge & meaning  Data Frame Extensive knowledge about a data item ̶ Everyday data: currency, dates, time, weights & measures ̶ Textual appearance, units, context, operators, I/O conversion Abstract data type with an extended framework

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

? Olympus C-750 Ultra Zoom: Sensor Resolution: 4.2 megapixels; Optical Zoom: 10x; Digital Zoom: 4x; Installed Memory: 16 MB; Lens Aperture: F/8-2.8/3.7; Focal Length min: 6.3 mm; Focal Length max: 63.0 mm

Digital Camera: Olympus C-750 Ultra Zoom: Sensor Resolution: 4.2 megapixels; Optical Zoom: 10x; Digital Zoom: 4x; Installed Memory: 16 MB; Lens Aperture: F/8-2.8/3.7; Focal Length min: 6.3 mm; Focal Length max: 63.0 mm

? Year: 2002; Make: Ford; Model: Thunderbird; Mileage: 5,500 miles; Features: Red, ABS, 6 CD changer, keyless entry; Price: $33,000; Phone: (916)

Car Advertisement: Year: 2002; Make: Ford; Model: Thunderbird; Mileage: 5,500 miles; Features: Red, ABS, 6 CD changer, keyless entry; Price: $33,000; Phone: (916)

? Flight # | Class | From Time/Date | To Time/Date | Stops
Delta 16 | Coach | JFK 6:05 pm | CDG 7:35 am |
Delta 119 | Coach | CDG 10:20 am | JFK 1:00 pm |

Airline Itinerary: Flight # | Class | From Time/Date | To Time/Date | Stops
Delta 16 | Coach | JFK 6:05 pm | CDG 7:35 am |
Delta 119 | Coach | CDG 10:20 am | JFK 1:00 pm |

? Monday, October 13, 2003. Group A (W L T GF GA Pts.): USA, Sweden, North Korea, Nigeria. Group B (W L T GF GA Pts.): Brazil …

World Cup Soccer: Monday, October 13, 2003. Group A (W L T GF GA Pts.): USA, Sweden, North Korea, Nigeria. Group B (W L T GF GA Pts.): Brazil …

? Calories: 250 cal; Distance: 2.50 miles; Time: 23.35 minutes; Incline: 1.5 degrees; Speed: 5.2 mph; Heart Rate: 125 bpm

Treadmill Workout: Calories: 250 cal; Distance: 2.50 miles; Time: 23.35 minutes; Incline: 1.5 degrees; Speed: 5.2 mph; Heart Rate: 125 bpm

? Place: Bonnie Lake; County: Duchesne; State: Utah; Type: Lake; Elevation: 10,000 feet; USGS Quad: Mirror Lake; Latitude: 40.711ºN; Longitude: ºW

Maps: Place: Bonnie Lake; County: Duchesne; State: Utah; Type: Lake; Elevation: 10,100 feet; USGS Quad: Mirror Lake; Latitude: 40.711ºN; Longitude: ºW

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

Information Extraction Ontologies: Source → Target (information extraction and information exchange)

What is an Extraction Ontology?  Augmented Conceptual-Model Instance Object & relationship sets Constraints Data frame value recognizers  Robust Wrapper (Ontology-Based Wrapper) Extracts information Works even when site changes or when new sites come on-line

CarAds Extraction Ontology: data-frame value recognizers include patterns such as [1-9]\d{0,2}[kK] …

Extraction Ontologies: An Example of Semantic Understanding  “Intelligent” Symbol Manipulation  Gives the “Illusion of Understanding”  Obtains Meaningful and Useful Results

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

A Variety of Applications  Information Extraction  Semantic Web Page Annotation  Free-Form Semantic Web Queries  Task Ontologies for Free-Form Service Requests  High-Precision Classification  Schema Mapping for Ontology Alignment  Record Linkage  Accessing the Hidden Web  Ontology Discovery and Generation  Challenging Applications (e.g. BioInformatics)

Application #1 Information Extraction

Constant/Keyword Recognition
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Descriptor/String/Position(start/end): Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155
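A minimal sketch of how such Descriptor|String|Position tuples could be produced with regular expressions. The patterns below are simplified stand-ins, not the ontology's actual recognizers, and the offsets are Python's 0-based offsets rather than the slide's numbering.

```python
import re

# Simplified stand-ins for the ontology's recognizers; offsets are
# Python's 0-based character offsets, not the slide's numbering.
RECOGNIZERS = {
    "Year": r"'(\d\d)\b",
    "Make": r"\b(CHEVY?)\b",
    "Model": r"\b(Cavalier)\b",
    "Mileage": r"\b(\d{1,3}(?:,\d{3})+)(?=\s*miles)",
    "Price": r"\$(\d{1,3}(?:,\d{3})+)",
}

def recognize(text):
    """Emit (Descriptor, String, Start, End) tuples, ordered by position."""
    found = []
    for name, pat in RECOGNIZERS.items():
        for m in re.finditer(pat, text):
            found.append((name, m.group(1), m.start(1), m.end(1)))
    return sorted(found, key=lambda t: t[2])

ad = "'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Asking only $11,995."
tuples = recognize(ad)
```

Note that the real recognizers emit overlapping candidates (both CHEV and CHEVY, and 11,995 as both Price and Mileage); resolving among them is the job of the heuristics on the next slide.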

Heuristics  Keyword proximity  Subsumed and overlapping constants  Functional relationships  Nonfunctional relationships  First occurrence without constraint violation

Keyword Proximity
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Subsumed/Overlapping Constants '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Functional Relationships
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Nonfunctional Relationships '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

First Occurrence without Constraint Violation '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Database-Instance Generator
Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155
insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “ ”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
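A sketch of the generator step, assuming the accepted attribute-value pairs are already in hand; the PhoneNr value, blanked in this transcript, is omitted.

```python
# Sketch of the database-instance generator: accepted attribute-value
# pairs become SQL inserts in the slide's Car/CarFeature shape. The
# blanked PhoneNr value from the transcript is omitted here.
def to_inserts(oid, car, features):
    vals = ", ".join(f'"{v}"' for v in car.values())
    stmts = [f"insert into Car values({oid}, {vals})"]
    for feature in features:
        stmts.append(f'insert into CarFeature values({oid}, "{feature}")')
    return stmts

stmts = to_inserts(
    1001,
    {"Year": "97", "Make": "CHEVY", "Model": "Cavalier",
     "Mileage": "7,000", "Price": "11,995"},
    ["Red", "5 spd"],
)
```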

Application #2 Semantic Web Page Annotation

Annotated Web Page (Demo)

OWL markup for the CarAds annotations (Mileage excerpt)

Application #3 Free-Form Semantic Web Queries

Find Ontology “Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”

Formulate Query: selection constants (San Francisco Bay; Friday, Oct. 29th), projections (scheduled times, cost, duration), and a join path produce the result.

Start Time | Price | Duration | Source
10:45 am, 12:00 pm, 1:15, 2:30, 4:00 | $20.00, $16.00, $
:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm | $17.00, $16.00, $ | Hour2

Application #4 Task Ontologies for Free-Form Service Requests

Basic Idea  Service Request  Match with Task Ontology Domain Ontology Process Ontology  Complete, Negotiate, Finalize I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.

Domain Ontology

Appointment … context keywords/phrase: “appointment |want to see a |…” Dermatologist … context keywords/phrases: “([D|d]ermatologist) | …” I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.

Appointment … context keywords/phrase: “appointment |want to see a |…” Dermatologist … context keywords/phrases: “([D|d]ermatologist) | …” I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance. Date … NextWeek(d1: Date, d2: Date) returns (Boolean{T,F}) context keywords/phrases: next week | week from now | … Distance internal representation: real; input (s: String) context keywords/phrases: miles | mile | mi | kilometers | kilometer | meters | meter | centimeter | … Within(d1: Distance, “20”) returns (Boolean {T or F}) context keywords/phrases: within | not more than | ≤ | … return (d1 ≤ d2) … end;
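Matching such context keywords against the request can be sketched with regular expressions; the concept names and word lists below are illustrative assumptions, and the Distance.Within pattern also captures the stated limit.

```python
import re

# Illustrative context-keyword patterns in the spirit of the slide;
# the concept names and word lists here are assumptions.
CONTEXT = {
    "Appointment": r"appointment|want to see a",
    "Dermatologist": r"[Dd]ermatologist",
    "Date.NextWeek": r"next week|week from now",
    "Distance.Within": r"within\s+(\d+)\s+(?:miles?|mi\b|kilometers?)",
}

request = ("I want to see a dermatologist next week; any day would be ok "
           "for me, at 4:00 p.m. The dermatologist must be within 20 miles "
           "from my home and must accept my insurance.")

matched = {name for name, pat in CONTEXT.items() if re.search(pat, request)}
limit = int(re.search(CONTEXT["Distance.Within"], request).group(1))
```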

Process Ontology

Specification Satisfaction Date(“28 Dec 04”) and NextWeek(“28 Dec 04”, “5 Jan 05”) Dermatologist(Dermatologist0) is at Address(“Orem 600 State St.”) and Within(DistanceBetween(“Provo 300 State St.”, “Orem 600 State St.”), “22”) ∃i2 (Dermatologist(Dermatologist0) accepts Insurance(i2) and Equal(“IHC”, i2))

Application #5 High-Precision Classification

An Extraction Ontology Solution

Density Heuristic: Document 1 (Car Ads) vs. Document 2 (Items for Sale or Rent)

Expected Values Heuristic
Document 1 (Car Ads): Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2 (Items for Sale or Rent): Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4

Vector Space of Expected Values: vectors ov, D1, and D2 over the dimensions Year, Make, Model, Mileage, Price, Feature, PhoneNr
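The comparison behind this vector space is cosine similarity between a document's hit counts and the ontology's expected hits per record. The d1 and d2 counts come from the expected-values slide; the ov vector below is an assumption for illustration, since its values are lost in this transcript.

```python
import math

# Expected-values heuristic as cosine similarity. The counts d1 and d2
# come from the expected-values slide; the expected-hits vector ov is
# an illustrative assumption (the transcript's actual values are lost).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

#     Year Make Model Mileage Price Feature PhoneNr
ov = [1,   1,   1,    1,      1,    5,      1]   # expected per car ad
d1 = [3,   2,   3,    1,      1,    15,     3]   # car-ads document
d2 = [1,   0,   0,    1,      0,    0,      4]   # items-for-sale document

sim1, sim2 = cosine(ov, d1), cosine(ov, d2)
```

The car-ads document lines up with the ontology's expectations far better than the for-sale document, which is exactly what the classifier exploits.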

Grouping Heuristic: Document 1 (Car Ads) yields groups such as {Year Make Model Price} {Year Model} {Year Make Model Mileage} …; Document 2 (Items for Sale or Rent) yields groups such as {Year Mileage …} {Mileage Year Price …}

Grouping (Car Ads): Year Make Model Price | Year Model | Year Make Model Mileage | Year Model Mileage Price | Year …
Grouping (Sale Items): Year Mileage | Mileage Year Price | Year Price | Year Price …
Expected Number in a Group = floor(∑ Ave) = 4 (for our example)
1-Max grouping factor = (Sum of Distinct 1-Max Object Sets in each Group) / (Number of Groups × Expected Number in a Group) = 0.500 in the example shown

Application #6 Schema Mapping for Ontology Alignment

Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas {Run #, Yr, Make, Model, Tran, Color, Dr} {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} {Vehicle, Distance, Price, Mileage} {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ(Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*) then μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* applied to the table (excerpt: ACURA, Legend)
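The unnest (μ) step can be sketched as filling a factored column down into its group's rows; the three-row table here is an illustrative stand-in for the slide's ACURA table.

```python
# Sketch of the unnest (mu) step: fill a factored column such as Make,
# shown once per group in the source table, down into every row.
# The three rows are an illustrative stand-in for the ACURA table.
def unnest(rows, col):
    """Propagate the last non-empty value of `col` into empty cells."""
    current, flat = None, []
    for row in rows:
        if row.get(col):
            current = row[col]
        flat.append({**row, col: current})
    return flat

factored = [
    {"Make": "ACURA", "Model": "Legend",  "Year": "1994"},
    {"Make": "",      "Model": "Legend",  "Year": "1995"},
    {"Make": "",      "Model": "Integra", "Year": "1998"},
]
flat = unnest(factored, "Make")
```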

Solution: Replace Boolean Values: β CD, β Auto, β Air Cond., β AM/FM applied to the table (a “Yes” cell becomes its attribute name, e.g. Yes → CD, Yes → AM/FM; excerpt: ACURA, Legend)

Solution: Form Attribute-Value Pairs (table excerpt: ACURA Legend with CD, AM/FM, Air Cond., Auto)

Solution: Adjust Attribute-Value Pairs (table excerpt: ACURA Legend with CD, AM/FM, Air Cond., Auto)

Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto

Solution: Infer Mappings Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Each row is a car. π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table π Make μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table π Year Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

Solution: Infer Mappings Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} π Price Table

Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Yes, ρ Colour←Feature π Colour Table U ρ Auto ← Feature π Auto β Auto Table U ρ Air Cond. ← Feature π Air Cond. β Air Cond. Table U ρ AM/FM ← Feature π AM/FM β AM/FM Table U ρ CD ← Feature π CD β CD Table Yes,

Application #7 Record Linkage

“Kelly Flanagan” Query

 Gather evidence from each of several different facets Attributes Links Page Similarity  Combine the evidence A Multi-faceted Approach

 Phone number, address, state, city, zip code  Data-frame recognizers Attributes

Links

Page Similarity: “adjacent cap-word pairs”: Cap-Word (Connector | Preposition (Article)? | Capital-Letter Dot)? Cap-Word
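A sketch of that pattern as a regular expression; the connector and preposition word lists are assumptions, since the slide elides them.

```python
import re

# The cap-word pair pattern as a regex; the connector/preposition word
# lists are assumptions, since the slide elides them.
CAP = r"[A-Z][a-z]+"
MIDDLE = r"(?:and|of|in|the|for|[A-Z]\.)"   # Connector | Preposition | Initial
PAIR = re.compile(rf"\b({CAP})\s+(?:{MIDDLE}\s+)?({CAP})\b")

text = "Kelly J. Flanagan teaches at Brigham Young University in Provo."
pairs = [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]
```

The extracted pairs (here including the name pair spanning the middle initial) are what two pages are compared on.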

Confidence Matrix for Each Facet: an n × n matrix over citations C1 … Cn with diagonal entries 1 and Cij = P(Ci and Cj refer to the same person | evidence for a facet f), or 0 if there is no evidence for facet f; a training set is used to compute the conditional probabilities.

Final Matrix: the confidence matrices for Attributes, Links, and Page Similarity are combined entry by entry (e.g. a combined entry of 0.78)

Grouping Algorithm
 Input: final confidence matrix
 Output: citations grouped by same person
 The idea: {Ci, Cj} and {Cj, Ck} then {Ci, Cj, Ck}. The threshold we use for “highly confident” is 0.8.
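The transitive-closure grouping can be sketched with union-find over the pairs whose combined confidence clears the 0.8 threshold; the confidence values below are illustrative.

```python
# Union-find sketch of the grouping step: pairs whose combined
# confidence clears the 0.8 threshold are linked, and the transitive
# closure forms the groups. The confidence values are illustrative.
def group(n, confidence, threshold=0.8):
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), c in confidence.items():
        if c >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())

conf = {(0, 1): 0.9, (1, 2): 0.85, (3, 4): 0.95, (2, 3): 0.3}
clusters = group(5, conf)
```

Citations 0 and 2 end up in the same group even though their direct evidence (via pair (2, 3)) is weak, exactly the {Ci, Cj} and {Cj, Ck} behavior described above.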

Experimental Results

Application #8 Accessing the Hidden Web

Obtaining Data Behind Forms Web information is stored in databases Databases are accessed through forms Forms are designed in various ways

Hidden Web Extraction System: a user query (e.g., “Find green cars costing no more than $9000.”) and an application extraction ontology drive an input analyzer that fills in the site form; an output analyzer then extracts information from the retrieved page(s).

Application #9 Ontology Discovery & Generation

TANGO: Table Analysis for Generating Ontologies  Recognize and normalize table information  Construct mini-ontologies from tables  Discover inter-ontology mappings  Merge mini-ontologies into a growing ontology

Recognize Table Information: a country/religion table with headers Country, Population (July 2001 est.), and Religion (Shi’a Muslim, Sunni Muslim, Albanian Orthodox, Muslim, Roman Catholic, other); rows include Afghanistan (26,813,057; 15%, 84%, 1%) and Albania (3,510,484; 20%, 70%, 30%).

Construct Mini-Ontology (from the same country/religion table)

Discover Mappings

Merge

Application #10 Challenging Applications (e.g. BioInformatics)

Large Extraction Ontologies

Complex Semi-Structured Pages

Additional Analysis Opportunities  Sibling Page Comparison  Semi-automatic Lexicon Update  Seed Ontology Recognition

Sibling Page Comparison

Attributes

Sibling Page Comparison

Semi-automatic Lexicon Update Additional Protein Names Additional Source Species or Organisms

Seed Ontology Recognition
nucleus; zinc ion binding; nucleic acid binding; zinc ion binding; nucleic acid binding; linear; NP_079345; 9606; Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo; NP_079345; Homo sapiens; human; GTTTTTGTGTT……….ATAAGTGCATTAAC GGCCCACATG; FLJ14299 msdspagsnprtpessgs gsgg………tagpy alygqrlasasalgyq; hypothetical protein FLJ14299; 8; eight; “8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; “37,?612,?680”; “37,?610,?585”

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

Limitations and Pragmatics  Data-Rich, Narrow Domain  Ambiguities ~ Context Assumptions  Incompleteness ~ Implicit Information  Common Sense Requirements  Knowledge Prerequisites  …

Busiest Airport in 2003? Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.) - 931,000 Landings (Federal Aviation Admin.) Atlanta - 58,875,694 Passengers (Sep., latest numbers available) Memphis - 2,494,190 Metric Tons (Airports Council Int’l.)

Busiest Airport in 2003? Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.) - 931,000 Landings (Federal Aviation Admin.) Atlanta - 58,875,694 Passengers (Sep., latest numbers available) Memphis - 2,494,190 Metric Tons (Airports Council Int’l.) Ambiguous Whom do we trust? (How do they count?)

Busiest Airport in 2003? Chicago - 928,735 Landings (Nat. Air Traffic Controllers Assoc.) - 931,000 Landings (Federal Aviation Admin.) Atlanta - 58,875,694 Passengers (Sep., latest numbers available) Memphis - 2,494,190 Metric Tons (Airports Council Int’l.) Important qualification

Graphics, Icons, …: a Dow Jones Industrial Average table (High, Low, Last, Chg for 30 Indus, Transp, Utils, Stocks)

Dow Jones Industrial Average (High, Low, Last, Chg for 30 Indus, Transp, Utils, Stocks), reported on the same date both Weekly and Daily. Implicit information: “weekly” is stated in the upper corner of the page; “daily” is not stated.

Presentation Outline  Grand Challenge  Meaning, Knowledge, Information, Data  Fun and Games with Data  Information Extraction Ontologies  Applications  Limitations and Pragmatics  Summary and Challenges

Some Key Ideas  Data, Information, and Knowledge  Data Frames Knowledge about everyday data items Recognizers for data in context  Ontologies Resilient Extraction Ontologies Shared Conceptualizations  Limitations and Pragmatics

Some Research Issues  Building a library of open source data recognizers  Precisely finding and gathering relevant information Subparts of larger data Scattered data (linked, factored, implied) Data behind forms in the hidden web  Improving concept matching Indirect matching Calculations, unit conversions, alternative representations, …  …

Some Research Challenges
 Web Page Understanding (Machine Learning): suppose extraction is ~85% accurate; generate a page grammar for increased recall (more extracted), increased precision (fewer false positives), and fast extraction from same-site sibling pages
 Universal Rules for Schema Matching: must rules be domain-specific? Can some rules be “universal”?
 Boundaries of Usefulness: when should machine learning not be used?
 Application to Significant Problems: like those above, and many more …