Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.

1 Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding

2 Querying the Web (Two Approaches)
Enhanced query language
–Examples: WebSQL, WebOQL
–Sources: structured, or restructured before parsing
Wrapper
–Enables querying in a database-like fashion
–Depends on source format, so it is not resilient: the same topic in different formats needs different wrappers

3 Data-Extraction Ontology
Beyond the wrapper approach
–Extraction technique for data-rich, unstructured, multiple-record Web documents
–Does not depend on source format, so it is resilient: the same topic in different formats uses the same ontology
Good experimental results

4 Main Difficulty (Creating the Data-Extraction Ontology)
Users must be experts
–database theory
–regular expression generation
Manual creation is impractical
–Very large information sources
–Frequently added sources of interest
–Many varying text formats

5 Semiautomatic Data-Extraction Generation
[Figure: diagram with elements labeled Generation & Updating Process, Input Knowledge Sources, Generated Data-Extraction Ontology, Training Document(s), Validation Documents]

6 Generation Process
For this research, three steps are expected:
–Gathering Knowledge
–Generating Initial Ontology
–Validation & Updating
Strategy Ontology Generation Performance Evaluation

7 Example: Extract Information from Country Library Web Site (http://www.tradeport.org/ts/countries/)
Car Advertisement XML Base
CIA Factbook XML Base

8 Learning & Discovering Algorithm
[Figure: tree with nodes All; Car (Make, Model, Mileage, …); Country (CountryName, Capital, Area, Population, Agriculture, …); sources labeled CIA Factbook XML Base and Car Advertisement XML Base]

9 Learning & Discovering Algorithm
[Figure: the same tree shown twice, with nodes All; Car (Make, Model, Mileage, …); Country (CountryName, Capital, Area, Population, Agriculture, …)]

10 Performance Evaluation
Measure precision and recall for each lexical object set in the generated extraction ontology
Measure what was generated with respect to what could have been generated
Measure what was generated with respect to what should not have been generated
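The evaluation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard definitions of precision (correct items over everything generated) and recall (correct items over everything that could have been generated), and it assumes each lexical object set can be compared as a set of strings against a hand-built reference. The function names and the sample data are hypothetical, not taken from the proposal.

```python
from typing import Dict, Set, Tuple


def precision_recall(generated: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one lexical object set.

    generated: items the generation process produced (assumed representation).
    expected:  items a hand-built reference says could have been generated.
    """
    correct = generated & expected
    # Precision drops when items were generated that should not have been.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall drops when items that could have been generated are missing.
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


def evaluate(
    generated_by_set: Dict[str, Set[str]],
    expected_by_set: Dict[str, Set[str]],
) -> Dict[str, Tuple[float, float]]:
    """Score every lexical object set that appears in either input."""
    names = sorted(set(generated_by_set) | set(expected_by_set))
    return {
        name: precision_recall(
            generated_by_set.get(name, set()),
            expected_by_set.get(name, set()),
        )
        for name in names
    }


# Hypothetical sample data for two object sets from the car-advertisement example.
generated = {"Make": {"Ford", "Honda", "Toyta"}, "Mileage": {"45,000"}}
expected = {"Make": {"Ford", "Honda", "Toyota", "BMW"}, "Mileage": {"45,000", "120K"}}
scores = evaluate(generated, expected)
```

Reporting the two numbers per object set, rather than one aggregate, keeps a weak recognizer for one set from being hidden by strong results on the others.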

11 Delimitation
Will not …
Consider all storage formats for existing knowledge
–XML
Consider all document formats
–HTML
–Plain Text
Let users update the input knowledge source at run-time

12 Contribution
Semi-automatically generate a data-extraction ontology
Exploit the existing knowledge
Link existing data-extraction tools
Create a partial library of regular expression recognizers
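To make the idea of a library of regular expression recognizers concrete, here is a minimal Python sketch. It is not the proposal's implementation: the `RECOGNIZERS` table, the `extract` function, the `Year` object set, and both patterns are assumptions chosen for illustration (only `Mileage` appears as a label on the slides). It also illustrates the format-independence claim from slide 3, since the same recognizers are applied to an HTML fragment and to plain text.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical recognizer library: one pattern per lexical object set.
# Each pattern has exactly one capturing group holding the extracted value.
RECOGNIZERS: Dict[str, Pattern[str]] = {
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
    "Mileage": re.compile(
        r"\b(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles|mi)\b", re.IGNORECASE
    ),
}

# Crude tag stripper so that HTML and plain text reach the recognizers alike.
TAG = re.compile(r"<[^>]+>")


def extract(document: str) -> Dict[str, List[str]]:
    """Apply every recognizer to the document, ignoring any HTML markup."""
    text = TAG.sub(" ", document)
    return {name: pattern.findall(text) for name, pattern in RECOGNIZERS.items()}


# The same advertisement in two source formats (made-up sample data).
html_ad = "<tr><td>1998 Honda Civic</td><td>45,000 miles</td></tr>"
plain_ad = "Honda Civic, 1998, 45,000 mi, call after 5pm"

html_result = extract(html_ad)
plain_result = extract(plain_ad)
```

Because the recognizers look only at the text, both inputs yield the same values without a format-specific wrapper. Real recognizers would need context keywords to avoid false matches, for example a price that happens to look like a year.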

