David W. Embley Brigham Young University Provo, Utah, USA.

1 David W. Embley Brigham Young University Provo, Utah, USA

2 Birthdate of my great grandpa Orson; Price and mileage of red Nissans, 1990 or newer; Location and size of chromosome 17; US states with property crime rates above 1%

3 Fundamental questions: What is knowledge? What are facts? How does one know? Philosophy: Ontology; Epistemology; Logic and reasoning

4 Existence – asks: “What exists?” Concepts, relationships, and constraints

5 The nature of knowledge – asks: “What is knowledge?” and “How is knowledge acquired?” Populated conceptual model

6 Principles of valid inference – asks: “What is known?” and “What can be inferred?” For us, it answers: what can be inferred (in a formal sense) from conceptualized data. Find price and mileage of red Nissans, 1990 or newer

7 Distill knowledge from the wealth of digital web data. Annotate web pages. Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge. Fact Annotation … …

8 Symbols: $ 11,500 117K Nissan CD AC. Data: price(11,500) mileage(117K) make(Nissan). Conceptualized data: Car(C123) has Price($11,500); Car(C123) has Mileage(117,000); Car(C123) has Make(Nissan); Car(C123) has Feature(AC). Knowledge: “correct” facts; provenance
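The progression on this slide, from raw symbols to data to conceptualized data with provenance, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Fact` tuple, the `RECOGNIZERS` table, and the specific regular expressions are hypothetical and are not taken from the presentation.

```python
import re
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str   # non-lexical object, e.g. "Car(C123)"
    relation: str  # relationship set, e.g. "has Price"
    value: object  # normalized internal value
    source: str    # provenance: the symbols the fact was derived from


# Hypothetical recognizers (assumption): a real extraction ontology
# would declare much richer patterns, keywords, and lexicons.
RECOGNIZERS = [
    ("has Price", re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
     lambda s: int(s.replace(",", ""))),
    ("has Mileage", re.compile(r"\b(\d+)K\b"), lambda s: int(s) * 1000),
    ("has Make", re.compile(r"\b(Nissan|Toyota|Honda)\b"), str),
    ("has Feature", re.compile(r"\b(AC|CD)\b"), str),
]


def conceptualize(car_id, text):
    """Turn raw symbols into facts about one car, keeping provenance."""
    subject = f"Car({car_id})"
    facts = []
    for relation, pattern, convert in RECOGNIZERS:
        for match in pattern.finditer(text):
            facts.append(
                Fact(subject, relation, convert(match.group(1)), match.group(0))
            )
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(fact.subject, fact.relation, fact.value, "from", repr(fact.source))
```

Keeping the matched text alongside each fact is one simple way to retain the provenance the slide lists as part of knowledge.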

9 Find me the price and mileage of all red Nissans – I want a 1990 or newer.

10

11

12

13 Extraction Ontologies Semantic Annotation Free-Form Query Interpretation

14 Object sets; Relationship sets; Participation constraints; Lexical; Non-lexical; Primary object set; Aggregation; Generalization/Specialization
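As a rough illustration of how a few of these modeling constructs might be represented, the sketch below models object sets (lexical or non-lexical), relationship sets, and participation constraints. The class names, fields, and the 0..1 constraint on price are assumptions made for the example, not the presentation's own notation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectSet:
    name: str
    lexical: bool  # lexical sets hold strings/values; non-lexical hold object ids


@dataclass(frozen=True)
class RelationshipSet:
    source: ObjectSet
    target: ObjectSet
    min_card: int             # participation constraint: minimum
    max_card: Optional[int]   # maximum; None means unbounded

    def violations(self, pairs):
        """Return source objects whose participation count breaks the constraint.

        Only objects that appear in `pairs` are checked, so an object that
        participates zero times is not detected by this simple sketch.
        """
        counts = {}
        for source_obj, _ in pairs:
            counts[source_obj] = counts.get(source_obj, 0) + 1
        return sorted(
            obj for obj, n in counts.items()
            if n < self.min_card
            or (self.max_card is not None and n > self.max_card)
        )


CAR = ObjectSet("Car", lexical=False)      # primary, non-lexical object set
PRICE = ObjectSet("Price", lexical=True)
FEATURE = ObjectSet("Feature", lexical=True)

car_has_price = RelationshipSet(CAR, PRICE, min_card=0, max_card=1)
car_has_feature = RelationshipSet(CAR, FEATURE, min_card=0, max_card=None)

price_pairs = [("C123", 11500), ("C123", 12000), ("C124", 9000)]
print(car_has_price.violations(price_pairs))
```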

15 Data Frame: Internal Representation: float. Values: External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?; Left Context: $. Key Word Phrase: Key Words: ([Pp]rice)|([Cc]ost)| … Operators: Operator: >; Key Words: (more\s*than)|(more\s*costly)|…
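A data frame like the one on this slide bundles an external representation (a value pattern), an internal representation, keywords, and operators. The following is a minimal sketch of that idea, assuming a hypothetical `DataFrame` class. The value pattern is adapted from the slide's regular expression so that it also accepts thousands separators, which is an assumption rather than the original pattern.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    name: str
    value_pattern: str    # external representation; group 1 is the value
    keyword_pattern: str  # context keywords that signal this concept
    operators: dict = field(default_factory=dict)  # symbol -> keyword regex

    def to_internal(self, text):
        """Internal representation: float, with thousands separators removed."""
        return float(text.replace(",", ""))

    def recognize(self, text):
        """Return all values found in text, in internal representation."""
        return [self.to_internal(m.group(1))
                for m in re.finditer(self.value_pattern, text)]

    def has_keyword(self, text):
        return re.search(self.keyword_pattern, text) is not None

    def find_operator(self, text):
        """Return the first operator whose keywords occur in text, else None."""
        for symbol, keywords in self.operators.items():
            if re.search(keywords, text, re.IGNORECASE):
                return symbol
        return None


PRICE = DataFrame(
    name="Price",
    # Assumed pattern: "$" as left context, then digits with optional commas/cents.
    value_pattern=r"\$\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(PRICE.recognize("Asking $ 11,500 or best offer"))
print(PRICE.find_operator("cars that cost more than $5,000"))
```

Because the recognizers are declared as data rather than coded per page, the same frame can be applied to any page in the domain.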

16 Generality: assumptions about web pages: data rich; narrow domain. Document types: simple multiple-record documents (easiest); single-record documents (harder); records with scattered components (even harder). Resiliency: declarative; still works when web pages change; works for new, unseen pages in the same domain. Scalable, but takes work to declare the extraction ontology

17

18 Parse Free-Form Query (with respect to the data extraction ontology); Select Ontology; Formulate Query Expression; Run Query Over Semantically Annotated Data

19 “Find me the [price] and [mileage] of all [red] [Nissan]s – I want a [1996] [or newer]” (recognized: price, mileage, red, Nissan, 1996; “or newer”: >= Operator)

20 “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

21 Conjunctive queries and aggregate queries. Mentioned object sets are all of interest. Values and operator keywords determine conditions: Color = “red”; Make = “Nissan”; Year >= 1996 (>= Operator). Formulate Query Expression
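The step from a free-form query to a conjunctive query can be sketched as below: mentioned object sets become the requested output, and recognized values plus operator keywords become conditions. The function names, the tiny keyword lists, and the sample car records are all hypothetical. The presentation's own system works from declared extraction ontologies rather than hard-coded patterns.

```python
import operator
import re

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le, ">": operator.gt}


def parse_query(query):
    """Return (requested object sets, conditions) for a free-form query."""
    requested = [
        name
        for name, keyword in [("Price", r"\bprice\b"), ("Mileage", r"\bmileage\b")]
        if re.search(keyword, query, re.IGNORECASE)
    ]
    conditions = []
    color = re.search(r"\b(red|blue|white|black)\b", query, re.IGNORECASE)
    if color:
        conditions.append(("Color", "=", color.group(1).lower()))
    make = re.search(r"\b(Nissan|Toyota|Honda)s?\b", query)
    if make:
        conditions.append(("Make", "=", make.group(1)))
    # "or newer" after a year is treated as the >= operator.
    year = re.search(r"\b(19\d\d|20\d\d)\b(\s+or\s+newer)?", query)
    if year:
        op = ">=" if year.group(2) else "="
        conditions.append(("Year", op, int(year.group(1))))
    return requested, conditions


def run_query(requested, conditions, cars):
    """Apply all conditions conjunctively and project the requested values."""
    return [
        {name: car[name] for name in requested}
        for car in cars
        if all(OPS[op](car[attr], value) for attr, op, value in conditions)
    ]


# Hypothetical annotated data (assumption, for illustration only).
CARS = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 98000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2900, "Mileage": 150000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 7000, "Mileage": 60000},
]

QUERY = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = parse_query(QUERY)
results = run_query(requested, conditions, CARS)
print(conditions)
print(results)
```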

22 Formulate Query Expression: For; Let; Where; Return

23

24 Several dozen person-hours. Oodles of extraction ontologies needed. How can we resolve this problem?

25 Forms: general familiarity; reasonable conceptual framework; appropriate correspondence (transformable to ontological descriptions; capable of accepting source data). Instance recognizers: some pre-existing instance recognizers; lexicons. Automated extraction ontology creation?

26 Basic form-construction facilities: single-entry field multiple-entry field nested form …

27

28

29

30

31

32

33 Need reading path: DOM-tree structure Need to resolve mapping problems Split/Merge Union/Selection

34 Need reading path: DOM-tree structure Need to resolve mapping problems Split/Merge Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

35 Need reading path: DOM-tree structure Need to resolve mapping problems Split/Merge Union/Selection Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3 Name

36 Need reading path: DOM-tree structure Need to resolve mapping problems Split/Merge Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

37 Need reading path: DOM-tree structure Need to resolve mapping problems Split/Merge Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

38 Name

39 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E

40 Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

41 Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS

42

43 Also helps adjust ontology constraints

44 Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

45 Lexicons. Name: 14-3-3 protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP-1; 14-3-3E. Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15. Name: Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC 6.1.1.2; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS. Lexicon: … 14-3-3 protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP-1; 14-3-3E … T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15 … Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC 6.1.1.2; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS …
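The idea on this slide, that names harvested through a form accumulate into a lexicon that can then recognize instances on new pages, might be sketched as follows. The function names and the naive case-insensitive substring matching are assumptions for illustration. A practical recognizer would need tokenization and handling of overlapping entries.

```python
def build_lexicon(harvested_records):
    """Merge the Name values harvested from several pages into one lexicon."""
    lexicon = set()
    for names in harvested_records:
        lexicon.update(name.strip() for name in names if name.strip())
    return lexicon


def recognize(text, lexicon):
    """Return lexicon entries found in text, ordered by first occurrence.

    Matching is a naive case-insensitive substring test (assumption).
    """
    lowered = text.lower()
    hits = [(lowered.find(entry.lower()), entry) for entry in lexicon]
    return [entry for position, entry in sorted(hits) if position >= 0]


# Name values as they appear on the slides, one list per harvested page.
harvested = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
]
lexicon = build_lexicon(harvested)

# Hypothetical sentence from a new, unseen page.
found = recognize("Binding of KCIP-1 to TCP-1-theta was observed", lexicon)
print(found)
```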

46 Instance Recognizers: Number Patterns; Context Keywords and Phrases

47

48 Recognize and annotate with respect to an ontology

49 Automatic (or near automatic) creation of extraction ontologies; automatic (or near automatic) annotation of web pages; simple but accurate query specification without specialized training; “effortlessly” generate WoK content

50 Extraction-ontology generation: auto-enhancement of extraction ontologies; form-based specification; auto-generation based on table interpretation; sophisticated conceptualization with TANGO. Automated annotation: extraction ontologies; form-based information harvesting; generated pattern-based annotation. Simple query specification: free-form queries; generated form-based queries. www.deg.byu.edu

