
1 Cui Tao PhD Dissertation Defense: Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages

2 Motivation: Birth date of my great-grandpa? Price and mileage of red Nissans, 1990 or newer? Protein and amino-acid information for gene cdk-4? US states with property crime rates above 1%?

3 Search by Search Engine

4 Search the Hidden Web The Hidden Web: – Hidden behind forms – Hard to query “cdk-4”

5 Query for Data The Hidden Web: – Hidden behind forms – Hard to query Find the protein and the amino-acid information for gene “cdk-4”

6 A Web of Pages → A Web of Knowledge Web of Knowledge – Machine-“understandable” – Publicly accessible – Queryable by standard query languages Semantic annotation – Domain ontologies – Populated conceptual model Problems to resolve – How do we create ontologies? – How do we annotate pages for ontologies?

7 Contributions of Dissertation Work Web of Pages → Web of Knowledge – Knowledge & meta-knowledge extraction – Reformulation as machine-“understandable” knowledge Automatic & semi-automatic solutions via: – Sibling tables (TISP/TISP++) – User-created forms (FOCIH)

8 Automatic Annotation with TISP (Table Interpretation with Sibling Pages): recognize tables (discard non-tables); locate table labels; locate table values; find label/value associations

9 Recognize Tables: data table; layout tables (discard); nested data tables

10 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
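
The example on this slide pairs a dotted path of nested labels with a value. A minimal Python sketch of that idea, assuming the interpreted table is held as a nested dictionary (the dictionary layout and the function name `flatten` are illustrative assumptions, not TISP's own representation, and the row-index half of the slide's pair is left out):

```python
def flatten(table, prefix=()):
    """Yield (label_path, value) pairs from a nested dict of table labels."""
    for label, content in table.items():
        path = prefix + (label,)
        if isinstance(content, dict):
            # A label whose content is another (nested) table: descend one level.
            yield from flatten(content, path)
        else:
            # A label paired directly with a value.
            yield ".".join(path), content


# Hypothetical nested representation of the slide's example.
table = {"Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}}}
pairs = dict(flatten(table))
print(pairs)
```

Each key in `pairs` is the full label path, so values from differently nested tables stay distinguishable.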

11 Interpretation Technique: Sibling Page Comparison

12 Interpretation Technique: Sibling Page Comparison Same

13 Interpretation Technique: Sibling Page Comparison Almost Same

14 Interpretation Technique: Sibling Page Comparison Different Same

15 Technique Details Unnest tables Match tables in sibling pages – “Perfect” match (table for layout → discard) – “Reasonable” match (sibling table) Determine & use table-structure pattern – Discover pattern – Pattern usage – Dynamic pattern adjustment
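
The matching step can be illustrated with a small Python sketch. It assumes two sibling tables have already been unnested into equal-shaped lists of rows; cells that are identical across the siblings are treated as label candidates and cells that differ as values. The function names and the sample cell contents are made up for illustration, and real matching would also have to tolerate "almost same" tables.

```python
def classify_cells(table_a, table_b):
    """Mark each cell 'L' if identical in both sibling tables, else 'V'."""
    return [
        ["L" if a == b else "V" for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


def is_layout_table(marks):
    """A perfect match (no differing cell) suggests a layout table to discard."""
    return all(mark == "L" for row in marks for mark in row)


# Invented sample data: the same table position on two sibling pages.
page1 = [["Gene", "cdk-4"], ["Length", "406"]]
page2 = [["Gene", "cdk-5"], ["Length", "1024"]]

marks = classify_cells(page1, page2)
print(marks, is_layout_table(marks))
```

Comparing a table with an identical copy of itself yields only `L` marks, which is the "perfect match" case the slide says to discard.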

16 Table Unnesting

17 Table Structure Patterns Regularity expectations: ({L} {V})^n; ({L})^n (({V})^n)+; … Pattern combinations are also possible.

18 Table Structure Patterns: ({L})^n (({V})^n)+
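
The two structure patterns on slides 17 and 18 (label/value pairs repeated n times, and a row of n labels followed by one or more rows of n values) can be checked mechanically once each cell is marked `L` or `V`. The sketch below is one possible reading, not the dissertation's implementation: it assumes the first pattern applies within each row, and the function names are hypothetical.

```python
import re

# A row made of one or more label/value pairs.
ROW_PAIRS = re.compile(r"(LV)+")


def matches_row_pairs(rows):
    """True if every row consists of repeated label/value pairs."""
    return all(ROW_PAIRS.fullmatch("".join(row)) for row in rows)


def matches_header_then_values(rows):
    """True if a row of n labels is followed by 1+ rows of n values."""
    if len(rows) < 2 or not rows[0]:
        return False
    n = len(rows[0])
    header_ok = all(cell == "L" for cell in rows[0])
    body_ok = all(
        len(row) == n and all(cell == "V" for cell in row) for row in rows[1:]
    )
    return header_ok and body_ok


pairs_table = [["L", "V"], ["L", "V"]]
column_table = [["L", "L"], ["V", "V"], ["V", "V"]]
print(matches_row_pairs(pairs_table), matches_header_then_values(column_table))
```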

19 Pattern Usage

20 Dynamic Pattern Adjustment

21 TISP++: automatic ontology generation; automatic information annotation

22 Ontology Generation – OSM Object set: table labels – Lexical: labels that associate with actual values – Non-lexical: labels that associate with other tables Relationship set: table nesting Constraints: updates based on observation

23 Ontology Generation – OWL Object set: OWL class Relationship set: OWL object property Lexical object set: – OWL datatype property – Different annotation properties to keep track of the provenance
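
The OSM-to-OWL correspondence listed on this slide can be written down as a lookup table. The following Python sketch emits Turtle-style declarations for a few elements; the `ex:` prefix, the element names, and the `to_turtle` helper are hypothetical, prefix declarations are omitted, and the provenance annotation properties mentioned on the slide are not modeled.

```python
# Slide 23's mapping from OSM constructs to OWL constructs.
OWL_KIND = {
    "object_set": "owl:Class",
    "relationship_set": "owl:ObjectProperty",
    "lexical_object_set": "owl:DatatypeProperty",
}


def to_turtle(elements, prefix="ex"):
    """Render (name, OSM kind) pairs as Turtle-style declarations."""
    return "\n".join(
        f"{prefix}:{name} a {OWL_KIND[kind]} ." for name, kind in elements
    )


# Invented element names, for illustration only.
elements = [
    ("Gene", "object_set"),
    ("hasGeneModel", "relationship_set"),
    ("protein", "lexical_object_set"),
]
turtle = to_turtle(elements)
print(turtle)
```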

24 Generated Ontology

25

26 RDF Graph

27 Query the Data Find the protein and the amino-acid information for gene “cdk-4”

28 TISP Evaluation Applications – Commercial: car ads – Scientific: molecular biology – Geopolitical: US states and countries Data: > 2,000 tables in 35 sites Evaluation – Initial two sibling pages: correct separation of data tables from layout tables? Correct pattern recognition? – Remaining tables in site: information properly extracted? Able to detect and adjust for pattern variations?

29 Experimental Results Table recognition: correctly discarded 157 of 158 layout tables Pattern recognition: correctly found 69 of 72 structure patterns Extraction and adjustments: 5 path adjustments and 34 label adjustments → all correct
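
For readers who prefer rates to raw counts, the two ratios reported on this slide work out as follows. The short Python snippet only restates the slide's numbers; the dictionary keys are informal labels chosen here.

```python
# Counts taken directly from slide 29: (correct, total).
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

rates = {name: correct / total for name, (correct, total) in results.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

That is roughly 99.4% for table recognition and 95.8% for pattern recognition.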

30 TISP++ Performance Performance depends on TISP TISP test set – Generates all ontologies correctly – Annotates all information in tables correctly

31 Form-based Ontology Creation and Information Harvesting (FOCIH) Personalized ontology creation by form – General familiarity – Reasonable conceptual framework – Appropriate correspondence: transformable to ontological descriptions; capable of accepting source data Automated ontology creation Automated information harvesting

32 Form Creation

33 Created Sample Form

34 Generated Ontology View

35 Source-to-Form Mapping

36 Source-to-Form Mapping

37 Source-to-Form Mapping

38 Source-to-Form Mapping

39 Almost Ready to Harvest Need reading path: DOM-tree structure Need to resolve mapping problems – Pattern recognition – Instance recognition

40 Reading Path

41 Pattern & Instance Recognition

42 Pattern & Instance Recognition

43 Pattern & Instance Recognition: regular expression for decimal number, with left context and right context

44 Pattern & Instance Recognition: list pattern, delimiter is “,”

45 Pattern & Instance Recognition: list pattern, delimiter is a regular expression for percentage numbers and a comma

46 Pattern & Instance Recognition: list pattern, delimiter is a regular expression for percentage numbers and a comma
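
The recognizers named on slides 43 to 46 (a value expression bracketed by left and right context, and list patterns split on a plain or regular-expression delimiter) can be approximated with Python's standard `re` module. This is a sketch under assumptions, not FOCIH's code: the function names, the regular expressions, and the sample strings are all invented for illustration.

```python
import re

DECIMAL = r"\d+(?:\.\d+)?"  # assumed expression for a decimal number


def extract_with_context(text, left, right, value_pattern):
    """Return the substring matching value_pattern between left and right context."""
    pattern = re.escape(left) + "(" + value_pattern + ")" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


def split_list(text, delimiter_pattern):
    """Split a list-valued string on a delimiter given as a regular expression."""
    return [part.strip() for part in re.split(delimiter_pattern, text) if part.strip()]


# Substring value with left and right context (made-up sample string).
area = extract_with_context("Area: 36.5 sq km", "Area: ", " sq km", DECIMAL)

# List pattern with a comma delimiter.
languages = split_list("Alpha, Beta, Gamma", r",")

# List pattern whose delimiter is a percentage number plus an optional comma.
groups = split_list(
    "Alpha 50.5%, Beta 30%, Gamma 19.5%", r"\s*" + DECIMAL + r"%\s*,?"
)
print(area, languages, groups)
```

Treating the percentage as part of the delimiter keeps only the item names, which matches the slide's description of the third kind of list pattern.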

47 Can Now Harvest

48 Can Now Harvest

49 Can Now Harvest

50 Semantic Annotation

51 Semantic Annotation

52 Semantic Annotation

53 Semantic Annotation

54 Semantic Annotation

55 Semantic Query

56 FOCIH Performance Ontology creation Semantic annotation – Depends on TISP performance – Depends on pattern and instance recognition performance

57 FOCIH Performance Pattern and instance recognition: – Works with highly regular data – Tested 71 mappings – 25 full-string values (25/25 correct) – 38 substring values (29/38 correct) – 8 list patterns (6/8 correct)
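
The slide gives per-category counts but no overall figure. Adding them up in a few lines of Python (category names are informal labels chosen here; the counts are the slide's) confirms the 71-mapping total and gives the aggregate accuracy:

```python
# Counts from slide 57: (correct, tested) per mapping category.
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

correct = sum(c for c, _ in mappings.values())
total = sum(t for _, t in mappings.values())
overall = correct / total

for name, (c, t) in mappings.items():
    print(f"{name}: {c}/{t} = {c / t:.1%}")
print(f"overall: {correct}/{total} = {overall:.1%}")
```

The categories sum to the 71 mappings tested, with 60 correct, or about 84.5% overall.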

58 FOCIH Difficulties

59 FOCIH Difficulties

60 FOCIH Difficulties: no selection

61 WoK via TISP

62 WoK via TISP

63 WoK via FOCIH

64 WoK via FOCIH

65 Contributions TISP: automatic sibling table interpretation TISP++: – Automatic ontology generation based on interpreted tables – Automatic semantic annotation for interpreted tables FOCIH: – Semi-automatic personalized ontology creation – Automatic personalized information harvesting and semantic annotation All together: contributes to turning the current web of pages into a web of knowledge

66 Future Work Sibling pages in addition to sibling tables; reverse engineering from ontologies to forms as a basis for information harvesting for already defined ontologies

