Knowledge Enabled Information and Services Science Knowledge Acquisition on the Web Growing the amount of available knowledge from within Christopher Thomas.


1 Knowledge Enabled Information and Services Science
Knowledge Acquisition on the Web
Growing the amount of available knowledge from within
Christopher Thomas

2 Overview
Knowledge Representation
–GlycO – Complex Carbohydrates domain ontology
Information Extraction
–Taxonomy creation (Doozer/Taxonom.com)
–Fact Extraction (Doozer++)
Validation

3 Circle of knowledge on the Web

5 Circle of knowledge on the Web
What is knowledge?
How do we turn propositions/beliefs into knowledge?
How do we acquire knowledge?

6 Background Knowledge
[15] Christopher Thomas and Amit Sheth, “On the Expressiveness of the Languages for the Semantic Web – Making a Case for ‘A Little More’,” in Fuzzy Logic and the Semantic Web, Eli Sanchez (Ed.), Elsevier, 2006.
[11] Amit Sheth, Cartic Ramakrishnan, and Christopher Thomas, “Semantics for The Semantic Web: the Implicit, the Formal and the Powerful,” International Journal on Semantic Web & Information Systems, 1 (no. 1), 2005, pp. 1–18.

7 Different Angles
Social construction
–Large scale creation of knowledge vs.
–Small communities define their domains
Normative vs. Descriptive = Top-Down vs. Bottom-Up
Formal vs. Informal = Machine-readable vs. human-readable

8 Community-created knowledge
Descriptive
Bottom-up
Formally less rigid
May contain false information
If a statement in the world is in conflict with the Ontology, both may be wrong or both may be right
Good for broad, shallow domains
Good for human processing and IR tasks

9 Wikipedia and Linked Open Data
Created by large communities
Constantly growing
Domains within the linked data are not always easily discernible
Contain few axioms and restrictions
–Little value to evaluation using logics

10 Formal - Modeling deep domains
Prescriptive / Normative
Top-down
Contains “true knowledge”
If a statement in the world is in conflict with the Ontology, the statement is false
Good for scientific domains
Good for computational reasoning/inference
Usually created by small communities of experts
Usually static, little change is expected

11 Example: GlycO
Created in collaboration with the Complex Carbohydrate Research Center at the University of Georgia on an NCRR grant.
Deep modeling of glycan structures and metabolic pathways
[6] Christopher Thomas, Amit P. Sheth, and William S. York, “Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain,” in Formal Ontology in Information Systems (FOIS 2006)
[5] Satya S. Sahoo, Christopher Thomas, Amit P. Sheth, William York, and Samir Tartir, “Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies,” 15th International World Wide Web Conference (WWW2006)

12 GlycO

13 N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2
UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 → UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2
GNT-V attaches GlcNAc at position 6
UDP-N-acetyl-D-glucosamine + G00020 → UDP + G00021
Labels in the figure: N-acetyl-glucosaminyl_transferase_V, N-glycan_beta_GlcNAc_9, N-glycan_alpha_man_4

14 Glycan Structures for the ontology
Import structures from heterogeneous databases
Possible connections modeled in the form of GlycoTree
Match structures to archetypes
b-D-Manp-(1-6)+
               |
               b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
               |
b-D-Manp-(1-3)+
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251

15 Interplay of extraction and evaluation
Errors in the source databases are propagated through various new databases → comparing multiple sources fails for error correction
Less than 2% of incorrect information makes a database useless for automatic validation of hypotheses
The ontology contains rules on how carbohydrate structures are known to be composed
By mapping information in databases to the ontology and analyzing how successful the mapping was, we can identify possible errors.

16 Database Verification using GlycO
b-D-Manp-(1-6)+
               |
               a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc
               |
b-D-Manp-(1-3)+
a-D-Manp-(1-4) is not part of the identified canonical structure for N-Glycans, hence it is likely that the database entry is incorrect
N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251
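
The verification idea on this slide can be sketched as a membership check against a canonical structure. This is only an illustration: the set below is a flat, hypothetical stand-in for the canonical N-glycan structure (residue strings copied from the slides), the function name is invented, and the check ignores where a residue sits in the tree, which the real ontology-based mapping presumably does take into account.

```python
# Hypothetical, flattened stand-in for the canonical N-glycan structure.
# Residue strings are copied from the slide; the real GlycoTree is a tree, not a set.
CANONICAL_N_GLYCAN = {
    "D-GlcNAc",
    "b-D-GlcpNAc-(1-4)",
    "b-D-Manp-(1-4)",
    "b-D-Manp-(1-6)",
    "b-D-Manp-(1-3)",
}


def unmatched_residues(entry_residues, canonical):
    """Return the residues of a database entry that have no canonical counterpart."""
    return [residue for residue in entry_residues if residue not in canonical]


# The database entry shown on the slide, written as a flat residue list.
entry = [
    "b-D-Manp-(1-6)",
    "b-D-Manp-(1-3)",
    "a-D-Manp-(1-4)",
    "b-D-GlcpNAc-(1-4)",
    "D-GlcNAc",
]

suspicious = unmatched_residues(entry, CANONICAL_N_GLYCAN)
print(suspicious)  # ['a-D-Manp-(1-4)']
```

An entry with a non-empty result would be flagged for review rather than rejected outright, in line with the slide's wording that the entry is "likely" incorrect.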

17 Pathway Steps - Reaction
Evidence for this reaction from three experiments
Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

19 Summary - GlycO
The amount of accuracy and detail that can be found in ontologies such as GlycO could most likely not be acquired automatically
Only a small community of experts has the depth of knowledge to model such scientific ontologies

20 Summary - GlycO
However, the automatic population shows that a highly restrictive, expert-created rule set allows for automation or involvement of larger communities.
→ Frame-based population of knowledge
→ The formal knowledge encoded in the ontology serves to acquire new knowledge
→ The circle is completed

21 Summary Background Knowledge
Large amounts of information and knowledge are available
Some machine readable by default
Others need specific algorithms to extract information
The more available information we can use, the better the extraction of new information will be.

22 Circle of knowledge on the Web
What is knowledge?
How do we turn propositions into knowledge?
Part 2: How do we acquire knowledge?

23 Knowledge Acquisition through Model Creation
[3] Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth, Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction, Web Intelligence 2008, pp. 496-502
[2] Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, WebScience 2010
[1] Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chana, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report

24 First create a domain hierarchy
Example: a hierarchy for the domain of Human Performance and Cognition

25 Connect with learned facts

26 Example: strongly connected component

27 Excerpt: strongly connected component

28 Expert evaluation of facts in the ontology
1-2: Information that is overall incorrect
3-4: Information that is somewhat correct
5-6: Correct general Information
7-9: Correct Information not commonly known

29 Technical Details

30 Domain hierarchy creation
Step 1
Input terms, e.g. related to Human Performance and Cognition
Hierarchy is automatically carved from articles and categories on Wikipedia

31 Overview - conceptual
Expand and Reduce approach
–Start with “high recall” methods
  –Exploration - Full text search
  –Exploitation - Node Similarity Method
  –Category growth
–End with “high precision” methods
  –Apply restrictions on the concepts found
  –Remove unwanted terms and categories

32 Expand - conceptually
Graph-based expansion
Full text search on Article texts
Delete results with low confidence score

33 Collecting Instances

34 Creating a Hierarchy

36 Extracting from Plain text or hypertext
Informal, human-readable presentation of information
Vast amounts of information available
–Web
–Scientific publications
–Encyclopediae
Need sophisticated algorithms to extract information

37 Pattern-based Fact Extraction
Learn textual patterns that express known relationship types
Search the text corpus for occurrences of known entities (e.g. from domain hierarchy)
Semi-open
–Types are known and limited
–Types are automatically expanded when LOD grows
Vector-Space Model
Probabilistic representation

38 Training
Relationship data in the UMLS Metathesaurus or the Wikipedia Infobox data provide a large set of facts in RDF triple format
–Limited set of relationships that can be arranged in a schema
–Semi-open
  –Types are known and limited
  –Types are automatically expanded when LOD grows

39 Training procedure
Iterate through all facts (S->P->O triples)
Find evidence for the fact in a corpus
–Wikipedia, WWW, PubMed or any other collection
–If triple subject and triple object occur in close proximity in text, add the pattern in-between to the learned patterns
Combined evidence from many different patterns increases the certainty of a relationship between the entities
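
The training step described here (collect the text between a subject mention and an object mention) might look roughly like the sketch below. It is not the Doozer++ implementation: the function name, the character-based proximity limit `max_gap`, and the explicit list of surface forms per entity (for example "Australian" for Australia) are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_patterns(subject_forms, object_forms, sentences, max_gap=60):
    """Count the text found between a subject mention and an object mention.

    max_gap is a hypothetical proximity limit in characters; the slides only
    say "close proximity" without giving a number.
    """
    patterns = Counter()
    for sentence in sentences:
        for subj in subject_forms:
            for obj in object_forms:
                # subject, then a short non-greedy gap, then the object, on word boundaries
                regex = (
                    r"\b" + re.escape(subj) + r"\b"
                    + r"(.{0," + str(max_gap) + r"}?)"
                    + r"\b" + re.escape(obj) + r"\b"
                )
                for match in re.finditer(regex, sentence):
                    patterns["X" + match.group(1) + "Y"] += 1
    return patterns


# Evidence sentences for the fact Canberra -> capital_of -> Australia.
sentences = [
    "Canberra, the Australian capital city",
    "Canberra, capital of the Commonwealth of Australia",
]
learned = extract_patterns(["Canberra"], ["Australia", "Australian"], sentences)
for pattern, count in learned.items():
    print(pattern, count)
```

This sketch keeps only the text between the two mentions. Handling the words that follow the object, or resolving synonyms such as "Commonwealth of Australia", would need additional steps.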

40 Overview – initial computations
[Diagram labels: Fact Collection, Text Corpus, CP2P, CP2P mod, CP2P → R2P, R2P, R2P mod, Modifications (Entropy, SVD/LSI), Matrix Computations (Pertinence)]

41 Training procedure cont’d
Fact: Canberra::Australia
Evidence in text:
Canberra, the Australian capital city
Canberra, capital of the Commonwealth of Australia
Canberra, the Australian capital
Extracted patterns and counts:
, the capital city: 1
, capital of the Commonwealth of: 1
, the capital: 1

42 Relationship Patterns
Learned patterns (Capital_of counts):
X, the Y capital city: 1
X, capital of the Commonwealth of Y: 1
X, the Y capital: 1
Extracted Synonyms:
X, the Y capital city: 1
X, capital of Y: 1
X, the Y capital: 1
Generalize:
X, the Y capital *: 2
X, capital of Y: 1

43 Relationship Patterns
Counts:
Pattern               Capital_of  predecessor
X, the Y capital *    2           0
X, capital of Y       2           0
X, * * Y              2           2
X, predecessor of Y   0           2
Weights:
Pattern               Capital_of  predecessor
X, the Y capital *    1.0         0
X, capital of Y       1.0         0
X, * * Y              0.5         0.5
X, predecessor of Y   0           1.0
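
The step from counts to weights on this slide is consistent with dividing each pattern's count by that pattern's total over all relationships. That reading is inferred from the numbers rather than stated on the slide, so the sketch below, including the function name and the dictionary layout, should be taken as an assumption.

```python
def normalize_columns(counts):
    """Divide every count by the total of its pattern column.

    counts maps a relationship name to a list of per-pattern counts;
    all lists must have the same length and pattern order.
    """
    relations = list(counts)
    n_patterns = len(counts[relations[0]])
    column_totals = [sum(counts[rel][j] for rel in relations) for j in range(n_patterns)]
    return {
        rel: [
            counts[rel][j] / column_totals[j] if column_totals[j] else 0.0
            for j in range(n_patterns)
        ]
        for rel in relations
    }


# Pattern order: "X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"
counts = {
    "Capital_of": [2, 2, 2, 0],
    "predecessor": [0, 0, 2, 2],
}
r2p = normalize_columns(counts)
print(r2p["Capital_of"])   # [1.0, 1.0, 0.5, 0.0]
print(r2p["predecessor"])  # [0.0, 0.0, 0.5, 1.0]
```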

44 Resolve Relationships
Relationship-pattern weights multiplied by a pattern vector:
Pattern               Capital_of  predecessor  Pattern vector
X, the Y capital *    1.0         0            0.5
X, capital of Y       1.0         0            0.25
X, * * Y              0.5         0.5          0.25
X, predecessor of Y   0           1.0          0

45 Resolve Relationships
Result of the multiplication:
Capital_of   0.875
predecessor  0.125
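
The scores on this slide can be reproduced as a matrix-vector product: each relationship's score is the sum, over patterns, of its pattern weight times the corresponding entry of the pattern vector. The sketch below assumes the Capital_of row reads 1.0, 1.0, 0.5, 0 (the value that matches both the counts and the result); the function and variable names are illustrative.

```python
def resolve(r2p, pattern_vector):
    """Score each relationship as the dot product of its row with the pattern vector."""
    return {
        relation: sum(weight * observed for weight, observed in zip(row, pattern_vector))
        for relation, row in r2p.items()
    }


# Pattern order: "X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"
r2p = {
    "Capital_of": [1.0, 1.0, 0.5, 0.0],
    "predecessor": [0.0, 0.0, 0.5, 1.0],
}
pattern_vector = [0.5, 0.25, 0.25, 0.0]

scores = resolve(r2p, pattern_vector)
best = max(scores, key=scores.get)
print(scores)  # {'Capital_of': 0.875, 'predecessor': 0.125}
print(best)    # Capital_of
```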

46 Advanced Computations
[Diagram labels: Fact Collection, Text Corpus, CP2P, CP2P mod, CP2P → R2P, R2P, R2P mod, Modifications (Entropy, SVD/LSI), Matrix Computations (Pertinence)]

47 Advanced Computations
LSI to determine relationship similarities
–Reduces sparsity in the matrix and makes relationship rows more comparable
–Allows better use of pertinence computation
Entropy
–Increase weights for more unique patterns
Pertinence
–Smoothing of pattern occurrence frequencies
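
The slide does not give the entropy formula, so the following is only one plausible way to "increase weights for more unique patterns": weight each pattern by one minus the normalized Shannon entropy of its distribution over relationships. The formula, the function name, and the data layout are assumptions, not the method used in the presented system.

```python
import math


def entropy_weights(r2p):
    """Return one weight per pattern: 1.0 for a pattern seen with a single
    relationship, 0.0 for a pattern spread evenly over all relationships."""
    relations = list(r2p)
    n_patterns = len(r2p[relations[0]])
    max_entropy = math.log2(len(relations)) if len(relations) > 1 else 1.0
    weights = []
    for j in range(n_patterns):
        column = [r2p[rel][j] for rel in relations]
        total = sum(column)
        if total == 0:
            weights.append(0.0)  # pattern never observed
            continue
        probabilities = [value / total for value in column if value > 0]
        entropy = -sum(p * math.log2(p) for p in probabilities)
        weights.append(1.0 - entropy / max_entropy)
    return weights


# Pattern order: "X, the Y capital *", "X, capital of Y", "X, * * Y", "X, predecessor of Y"
r2p = {
    "Capital_of": [1.0, 1.0, 0.5, 0.0],
    "predecessor": [0.0, 0.0, 0.5, 1.0],
}
weights = entropy_weights(r2p)
print(weights)
```

Under this assumed weighting, the generic pattern "X, * * Y", which occurs equally with both relationships, receives the lowest weight.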

48 Example Output (DBPedia)

49 Pertinence for Relations
Looking at fact extraction as a classification of concept pairs into classes of relations
Class boundaries are not clear cut
–E.g. has_physical_part → has_part
→ don’t punish the occurrence of the same pattern with relationship types that are similar

50 Relationship Patterns
Counts:
Pattern               Capital_of  Located_in
X, the Y capital *    2           0
X, capital of Y       2           0
X, * * Y              2           2
X, located in Y       2           4
Weights:
Pattern               Capital_of  Located_in
X, the Y capital *    1.0         0
X, capital of Y       1.0         0
X, * * Y              0.2         0.2
X, located in Y       0.5         0.9

51 Resolve Relationships
Relationship-pattern weights multiplied by a pattern vector:
Pattern               Capital_of  Located_in  Pattern vector
X, the Y capital *    1.0         0           0.4
X, capital of Y       1.0         0           0.1
X, * * Y              0.2         0.2         0.3
X, located in Y       0.5         0.9         0.2
Result of the multiplication:
Capital_of  0.66
Located_in  0.24
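
As a check on the numbers of this slide, the two scores follow from multiplying each relationship's weights by the pattern vector (0.4, 0.1, 0.3, 0.2) and summing. The second Capital_of weight is not legible in the transcript; 1.0 is assumed here because it is the value that reproduces the stated result of 0.66.

```latex
\begin{aligned}
\text{Capital\_of} &= 0.4 \cdot 1.0 + 0.1 \cdot 1.0 + 0.3 \cdot 0.2 + 0.2 \cdot 0.5 = 0.66 \\
\text{Located\_in} &= 0.4 \cdot 0 + 0.1 \cdot 0 + 0.3 \cdot 0.2 + 0.2 \cdot 0.9 = 0.24
\end{aligned}
```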

52 Evaluation of the fact extraction - DBPedia
[Plot: precision and recall against confidence threshold]
Strict evaluation: Only the 1st-ranked extracted relation is compared to the gold standard. Averaged over relation types.
60% training set, 40% testing; DBPedia Infobox fact corpus, Wikipedia text corpus
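
The strict evaluation described here can be sketched as follows: for each concept pair only the top-ranked relation counts, and it is ignored if its confidence falls below the threshold. The data layout, the function name, and the example pairs are hypothetical. For brevity the sketch pools all pairs instead of averaging over relation types as the slide describes.

```python
def precision_recall(predictions, gold, threshold):
    """Strict precision/recall at a confidence threshold.

    predictions: {pair: [(relation, confidence), ...]}
    gold:        {pair: relation}
    """
    answered = 0
    correct = 0
    for pair, ranked in predictions.items():
        if not ranked:
            continue
        # Only the highest-confidence relation is considered (strict evaluation).
        relation, confidence = max(ranked, key=lambda item: item[1])
        if confidence < threshold:
            continue
        answered += 1
        if gold.get(pair) == relation:
            correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical extraction output and gold standard.
predictions = {
    ("Canberra", "Australia"): [("Capital_of", 0.875), ("predecessor", 0.125)],
    ("Sydney", "Australia"): [("Capital_of", 0.6), ("Located_in", 0.4)],
    ("Perth", "Australia"): [("Located_in", 0.3)],
}
gold = {
    ("Canberra", "Australia"): "Capital_of",
    ("Sydney", "Australia"): "Located_in",
    ("Perth", "Australia"): "Located_in",
}

for threshold in (0.2, 0.5):
    print(threshold, precision_recall(predictions, gold, threshold))
```

Raising the threshold drops low-confidence answers, which is the trade-off a precision/recall-over-threshold plot shows.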

53 Evaluation of the fact extraction - UMLS
[Plot: precision and recall against confidence threshold]
Strict evaluation: Only the 1st-ranked extracted relation is compared to the gold standard. Averaged over relation types.
60% training set, 40% testing; UMLS fact corpus, MedLine text corpus

54 Manual Evaluation strategy (DBPedia)

55 Manual Evaluation strategy (UMLS)
poisoning, fluoride::teeth [finding_site_of] finding_site_of 1
polyneuritis, endemic::vitamin b 1 [associated_with] has_form 0
polyp of cervix nos (disorder)::768 polyps [associated_with] associated_with 1
polyp of cervix nos (disorder)::neck of uterus [location_of] finding_site_of 1
polyp of colon::benign neoplasms [related_to] associated_with 0.5
brain::brain contusion [has_location] associated_morphology_of 0.25
brain::brain ischemia [has_finding_site] location_of 0.5
polyp of colon::gastrointestinal tract, nos [is_primary_anatomic_site_of_disease] location_of 0.5
polyvesicular vitelline tumor::gamete structure (cell structure) [is_normal_cell_origin_of_disease] is_normal_cell_origin_of_disease 1
proptosis::apert syndrome [has_manifestation] has_manifestation 1

56 Manually evaluated precision for different confidence values

57 Manually evaluated precision, confidence > 0.5 (on UMLS – MedLine corpus)

58 Summary Model Creation
Using background knowledge in the form of a fact corpus and a text corpus, we can suggest new facts/propositions
Possible to try all combinations of known concepts (e.g. Read-the-Web project), but huge validation backlog
Letting users drive the model creation focuses the creation on the parts that are of common interest → Willingness to help validate facts

59 Circle of knowledge on the Web
What is knowledge?
How do we turn propositions/beliefs into knowledge?
How do we acquire knowledge?
Part 3

60 Evaluation and Use
Current Work
Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, to appear in WebScience 2010

61 Explicit evaluation
“Evaluate for evaluation’s sake”
–Domain-experts rank the value of a proposition
–Committees of experts and/or laymen vote on the correctness of propositions

62 Explicit evaluation in the Semantic Browser
The user can vote on facts
Some facts are presented randomly
Most facts are presented after the user (by browsing) showed interest in
–The full triple
–Subject/Object of the triple

63 Implicit evaluation
Evaluation that does not explicitly involve a vote on the extracted information
Use the Wisdom of the Crowds
Users show support for a proposition by performing an action
Every action taken on a piece of information is recorded and analyzed
The cumulative behavior of the users gives an indication of which propositions are correct or interesting

64 Implicit evaluation in the Semantic Browser
The user simply searches and browses
The search history and the click-stream provide information about whether a page transition using an extracted triple was successful
Assumption: on average, a successful trail-browsing session includes valid triples
Problem: requires extensive use
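
One simple way to turn the assumption on this slide into a number is to record, for every extracted triple, the share of browsing sessions using it that ended successfully. The slides do not say how a session is judged successful or how evidence is aggregated, so the success flag, the function name, and the example triples below are all hypothetical.

```python
from collections import defaultdict


def triple_support(sessions):
    """Return, per triple, the fraction of sessions using it that were successful.

    sessions: iterable of (triples_used, was_successful) pairs, where
    triples_used is a list of (subject, predicate, object) tuples.
    """
    used = defaultdict(int)
    successful = defaultdict(int)
    for triples, was_successful in sessions:
        for triple in set(triples):  # count each triple once per session
            used[triple] += 1
            if was_successful:
                successful[triple] += 1
    return {triple: successful[triple] / used[triple] for triple in used}


# Hypothetical click-stream summary.
t1 = ("Canberra", "Capital_of", "Australia")
t2 = ("Canberra", "Located_in", "Australia")
t3 = ("Canberra", "predecessor", "Australia")
sessions = [
    ([t1, t2], True),
    ([t1], True),
    ([t2, t3], False),
]

support = triple_support(sessions)
for triple, share in support.items():
    print(triple, share)
```

With only a few sessions per triple such ratios are unreliable, which matches the slide's caveat that the approach requires extensive use.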

65 Implicit evaluation in the Semantic Browser
[Figure labels: Triples, 1st triple, 2nd triple]

66 Conclusion
Creating domain models gives a way of selectively adding knowledge to a system
We showed that it is possible to automatically create such models with high accuracy
The models immediately impact users → Willingness to help evaluate
Evaluation becomes an integral part of the knowledge lifecycle

67 ?

68 Thank you

