Semiautomatic Generation of Resilient Data Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.


1 Semiautomatic Generation of Resilient Data Extraction Ontologies. Yihong Ding, Data Extraction Group, Brigham Young University. Sponsored by NSF.

2 Data Extraction Ontology. Goal: extract data from web pages. Components: concepts, relations between the concepts, participation constraints. Resilient. Difficulty: manual ontology generation is costly.

3 Generation Procedure. [Diagram labels: Knowledge Sources; Data-Extraction Ontology; Knowledge Selection Processing; Extraction Processing; Database; Train; Test]

4 Knowledge Collection. Assumptions about the knowledge base: general; contains meaningful relationships; pre-existing; XML or easy to transfer to XML. Current input: Mikrokosmos ontology [Mik]; auxiliary data frame library.

5 Selection of Concepts. PROCEDURE ConceptSelection(Tdoc, Kbase): SourceDoc = Parse(Tdoc); PrimarySelectedConceptsList = MikroSelection(M-Ontology); SecondarySelectedConceptsList = DataFrameSelection(DF-Library); ConflictHandling(); SelectedSubgraphGeneration(). Many issues: selection strategies, conflict resolution, …

6 Basic Selection Strategy. Select from Mikrokosmos Ontology. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

7 Basic Selection Strategy. Select from Mikrokosmos Ontology: concept names and their synonyms. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

8 Basic Selection Strategy. Select from Mikrokosmos Ontology: concept names and their synonyms; concept values and their synonyms. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
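The name-and-value matching on slides 7 and 8 can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the tiny `KNOWLEDGE_BASE` dictionary stands in for the Mikrokosmos ontology, and `select_concepts` is a hypothetical helper, not the presenter's implementation.

```python
import re

# Hypothetical miniature knowledge base (NOT the real Mikrokosmos ontology):
# each concept lists its names/synonyms and some known values.
KNOWLEDGE_BASE = {
    "AREA": {"names": ["area", "size"], "values": []},
    "CAPITAL-CITY": {"names": ["capital"], "values": ["kabul"]},
    "POPULATION": {"names": ["population", "inhabitants"], "values": []},
    "NATION": {"names": ["nation", "country"], "values": ["afghanistan"]},
    "RIVER": {"names": ["river"], "values": ["nile"]},
}


def select_concepts(text, kbase):
    """Return {concept: matched terms} for every concept whose name,
    synonym, or known value appears as a word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    selected = {}
    for concept, entry in kbase.items():
        hits = [term for term in entry["names"] + entry["values"] if term in words]
        if hits:
            selected[concept] = sorted(hits)
    return selected


doc = ("Afghanistan smaller than Texas. Area: 648,000 sq. km. "
       "Capital--Kabul. Population: 17.7 million.")
result = select_concepts(doc, KNOWLEDGE_BASE)
for concept, terms in result.items():
    print(concept, terms)
```

In this sketch a concept is selected as soon as one term matches; a real selector would likely need multi-word terms and the conflict handling mentioned on slide 5.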

9 Basic Selection Strategy. Select from Mikrokosmos Ontology: concept names and their synonyms; concept values and their synonyms. Select from Data Frame Libraries. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

10 Basic Selection Strategy. Select from Mikrokosmos Ontology: concept names and their synonyms; concept values and their synonyms. Select from Data Frame Libraries: extract result based on the data frames. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
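Selection from a data frame library can be sketched as pattern-based recognition of values. This assumes that a data frame can be approximated by one regular expression per concept; the `DATA_FRAMES` entries and the `select_data_frames` helper are hypothetical and far simpler than a real data frame library.

```python
import re

# Hypothetical data frame library: one value-recognizing pattern per concept.
DATA_FRAMES = {
    "Area": re.compile(r"\d[\d,]*(?:\.\d+)?\s*sq\.\s*km\.?"),
    "Population": re.compile(r"\d+(?:\.\d+)?\s*(?:million|billion)"),
    "Price": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}


def select_data_frames(text, frames):
    """Return {concept: extracted values} for every data frame whose
    pattern matches at least once in the text."""
    selected = {}
    for concept, pattern in frames.items():
        matches = pattern.findall(text)
        if matches:
            selected[concept] = matches
    return selected


doc = "Area: 648,000 sq. km. Capital--Kabul. Population:17.7 million."
extracted = select_data_frames(doc, DATA_FRAMES)
for concept, values in extracted.items():
    print(concept, values)
```

Only data frames that recognize something in the training document are selected, so the unrelated `Price` frame is left out here.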

11 Document-Level Conflict. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

12 Concept-Level Conflict. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

13 Relation Retrieval. Theoretical solution: all paths in the subgraph; too expensive (NP-complete). Heuristic solution: find the shortest path between any two nodes; set a threshold distance.
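The heuristic on slide 13 can be sketched with breadth-first search, which finds shortest paths when every edge counts as one step. The assumptions here are that ontology relations are treated as undirected, unweighted edges and that the threshold is a maximum number of edges; the toy `GRAPH` and both function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical ontology fragment as an undirected adjacency list.
GRAPH = {
    "NATION": ["CAPITAL-CITY", "POPULATION", "GEOPOLITICAL-ENTITY"],
    "CAPITAL-CITY": ["NATION", "CITY"],
    "CITY": ["CAPITAL-CITY"],
    "POPULATION": ["NATION"],
    "GEOPOLITICAL-ENTITY": ["NATION"],
}


def shortest_path(graph, start, goal, max_dist):
    """Breadth-first search. Return the shortest path from start to goal as a
    list of nodes, or None if no path of at most max_dist edges exists."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_dist:
            continue  # one more edge would exceed the threshold
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None


def retrieve_relations(graph, selected, max_dist):
    """Keep, for each pair of selected concepts, the shortest connecting path
    if it is within the threshold distance."""
    relations = {}
    for a, b in combinations(sorted(selected), 2):
        path = shortest_path(graph, a, b, max_dist)
        if path is not None:
            relations[(a, b)] = path
    return relations


selected = {"CAPITAL-CITY", "NATION", "POPULATION"}
for pair, path in retrieve_relations(GRAPH, selected, max_dist=1).items():
    print(pair, path)
```

With a threshold of 1 only directly related concepts are linked; raising it to 2 would also connect CAPITAL-CITY and POPULATION through NATION, at the cost of exploring more of the graph.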

14 Participation Constraints. Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]

15 Participation Constraints (cont.). Afghanistan smaller than Texas. Area: 648,000 sq. km. Capital--Kabul, Other cities--Kandahar, Mazar-e-Sharif, Konduz. Terrain: Landlocked; mostly mountains and desert. Climate: Dry, with cold winters and hot summers. Population: 17.7 million. Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*]

16 Performance Evaluation. Speed of generation. Precision and recall of the generation process. Precision and recall of the generated ontology.
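For reference, precision and recall over sets of generated items can be computed as in the sketch below. It assumes the standard definitions (correct items divided by generated items, and correct items divided by expected items). The concept sets are made-up illustrations, not results from the presentation.

```python
def precision_recall(generated, expected):
    """Standard set-based precision and recall.
    precision = |generated ∩ expected| / |generated|
    recall    = |generated ∩ expected| / |expected|"""
    generated, expected = set(generated), set(expected)
    correct = len(generated & expected)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative data only: concepts a generator produced vs. a hand-built reference.
generated_concepts = {"Area", "Capital", "Population", "River"}
expected_concepts = {"Area", "Capital", "Population", "Climate", "Terrain"}
p, r = precision_recall(generated_concepts, expected_concepts)
print(f"precision={p:.2f} recall={r:.2f}")
```

The same function applies whether the items are generated concepts, relations, or values extracted by the generated ontology.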

17 Generation Time with Distance Threshold

18 P&R of Generation Process

19 Conclusion. Data Extraction Ontology generated. Knowledge sources exploited. Many issues applied. Many more to explore.

