
Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.


1 Semiautomatic Generation of Resilient Data-Extraction Ontologies
Yihong Ding
Data Extraction Group, Brigham Young University
Sponsored by NSF

2 Introduction
Wrapper-driven data extraction
–Pros: data-source-specified, high performance
–Cons: lack of resiliency and scalability
Ontology-driven data extraction
–Pros: application-domain-specified, resilient and scalable
–Cons: hard to create
Objective
–Generating data-extraction ontologies

3 Generation Architecture
[Figure: architecture diagram. Stages: Knowledge Preparation, Application Specification, Domain Allocation, Ontology Generation. Labeled components: Knowledge Sources, Integrated Knowledge Base, training documents, test documents, pre-processing, clean records, Concept Selection, Relation Retrieval, Constraint Discovery, Data Extraction Ontology, Extraction Processing, Results Storage, Result Evaluation, "interact if necessary".]

4 Knowledge Base Construction
Knowledge Sources
–Mikrokosmos (μK) Ontology
–Data-Frame Library
–Additional Lexicons
–WordNet
Integration of Knowledge Base
[Figure: Data-Frame Library, μK Ontology, Synonym Dictionary (WordNet), and Lexicons integrated into the KNOWLEDGE BASE]

5 Application Specification
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

6 Domain Allocation: concept selection
Select concepts by string-matching against object values
Resolve conflicts by context or semantic meaning
Example: 02 Buick Century Pwr Seat, Nada Retail 13,695.
[Figure labels: Data Frame Library; "retail" by keyword identification]
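
The two steps on this slide (string matching, then conflict resolution by context) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the tiny `DATA_FRAMES` table, its patterns and keywords, and the helper names are all hypothetical. The idea is that one value string may match several concepts, and a keyword found just before the value decides between them.

```python
import re

# Hypothetical mini data-frame library: a value pattern plus context keywords
# per concept. These entries are assumptions made up for this sketch.
DATA_FRAMES = {
    "Make": {"pattern": r"\b(?:Buick|Pontiac|Ford)\b", "keywords": []},
    "Price": {"pattern": r"\d{1,3},\d{3}", "keywords": ["$", "only", "price"]},
    "RetailValue": {"pattern": r"\d{1,3},\d{3}", "keywords": ["retail", "nada"]},
}

def candidate_concepts(text):
    """Map each matched (position, value) to every concept whose pattern matches it."""
    hits = {}
    for concept, frame in DATA_FRAMES.items():
        for m in re.finditer(frame["pattern"], text):
            hits.setdefault((m.start(), m.group()), []).append(concept)
    return hits

def resolve(text, start, concepts, window=15):
    """Pick the concept whose keyword appears in the text just before the value."""
    if len(concepts) == 1:
        return concepts[0]
    context = text[max(0, start - window):start].lower()
    for concept in concepts:
        if any(k in context for k in DATA_FRAMES[concept]["keywords"]):
            return concept
    return None  # unresolved conflict: left for the user to decide

def select_concepts(text):
    """Return a mapping from each matched value to its selected concept."""
    return {value: resolve(text, start, concepts)
            for (start, value), concepts in candidate_concepts(text).items()}

record = "02 Buick Century Pwr Seat, Nada Retail 13,695."
print(select_concepts(record))
```

On the slide's example, `13,695` matches both the price and the retail-value patterns, and the nearby word "Retail" selects the latter.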

7 Domain Allocation: relationship retrieval
Record 1: 00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446
Record 2: 02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 Only $12,695. 221-1250
Record 3: 02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755
Record 4: 00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
Find paths among selected concept nodes
Retrieve cluster representing application domain
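
One way to picture "find paths among selected concept nodes" is a breadth-first search over the knowledge-base graph, taking the union of the nodes on those paths as the domain cluster. The sketch below assumes exactly that. The slides do not say which path-finding method is used, and the `GRAPH` fragment and function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical fragment of a knowledge-base graph (undirected adjacency lists).
GRAPH = {
    "Automobile": ["Make", "Price", "Feature", "Vehicle"],
    "Make": ["Automobile"],
    "Price": ["Automobile", "Apartment"],
    "Feature": ["Automobile"],
    "Vehicle": ["Automobile"],
    "Apartment": ["Price"],
}

def shortest_path(graph, src, dst):
    """Breadth-first search; returns the node list from src to dst, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

def domain_cluster(graph, selected):
    """Union of the selected concepts and all nodes on shortest paths between them."""
    cluster = set(selected)
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b)
        if path:
            cluster.update(path)
    return cluster

print(domain_cluster(GRAPH, {"Make", "Price", "Feature"}))
```

Here the connecting node `Automobile` is pulled into the cluster, while unrelated neighbors such as `Apartment` stay out.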

8 Domain Allocation: constraint discovery
Discover the number of times each object value participates
Use the discovered values as participation constraints
02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755
00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah
AUTOMOBILE [0:1] has MAKE [1:*]
AUTOMOBILE [0:*] has FEATURE [1:*]
AUTOMOBILE [0:1] has PRICE [1:1]
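
A rough sketch of counting participation per record is given below. It assumes the values have already been extracted and labeled per concept (the labeling here is done by hand for illustration), takes the minimum and maximum count seen across the training records, and writes any maximum above one as `*`. That generalization rule and the function name are assumptions, not something the slides specify, and the sketch only derives one side of each relationship.

```python
def participation_constraints(records, concepts):
    """Derive a [min:max] constraint per concept from counts across records.

    records: non-empty list of dicts mapping concept name -> list of values.
    A maximum above 1 is generalized to '*' (an assumption of this sketch).
    """
    constraints = {}
    for concept in concepts:
        counts = [len(r.get(concept, [])) for r in records]
        lo, hi = min(counts), max(counts)
        constraints[concept] = f"[{lo}:{'*' if hi > 1 else hi}]"
    return constraints

# Hand-labeled values for the two sample records on the slide.
records = [
    {"Make": ["Buick"], "Feature": ["green", "pwr seat"], "Price": ["11,999"]},
    {"Make": ["Buick"], "Feature": ["Green"], "Price": ["9,319"]},
]
print(participation_constraints(records, ["Make", "Feature", "Price"]))
```

With only two records the derived bounds are tight, so a user would still be expected to loosen or correct them during tuning.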

9 Ontology Generation
Initial ontology: automatically generated
Updated ontology: user tuning
Expectation
–Rejecting an existing element is much easier than adding a new one
–As little modification as possible

10 Evaluation and Results
Evaluation
–Compare: Generated vs. Expert-created
–POG (Precision of Ontology Generation)
–PROG (Pseudo-Recall of Ontology Generation)
–EPROG (Effective-PROG)
Results
–Three testing domains: Apt-Rental, Used-Auto-Ads, Nation-Essence
–Average POG less than 0.23
–Lowest EPROG is around 0.70, highest is almost 1.0
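
The slides do not define POG, PROG, or EPROG, so the sketch below shows only the ordinary set-based precision and recall that a generated-versus-expert comparison could start from. It should not be read as the authors' formulas. The function name and the sample element sets are hypothetical.

```python
def precision_recall(generated, expert):
    """Standard set-based precision and recall of generated vs. expert elements."""
    generated, expert = set(generated), set(expert)
    correct = generated & expert
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expert) if expert else 0.0
    return precision, recall

# Made-up element sets, for illustration only.
generated = {"Make", "Price", "Feature", "Color"}
expert = {"Make", "Price", "Year"}
print(precision_recall(generated, expert))
```

A generated ontology that contains many extra elements scores low on precision even when recall is high, which fits the slide's expectation that rejecting elements is easier than adding them.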

11 Conclusion
Exploits existing knowledge
Specifies application domain
Allocates domain inside the knowledge base
Generates a data-extraction ontology
Shows effective recall of more than 70% on average

