The Semantic Web: there and back again


1 The Semantic Web: there and back again
Tim Finin University of Maryland, Baltimore County Joint work with Lushan Han, Varish Mulwad, Anupam Joshi

2 LOD 123: Making the semantic web easier to use
Tim Finin University of Maryland, Baltimore County Joint work with Lushan Han, Varish Mulwad, Anupam Joshi

3 Semantic Web: then and now
Ten years ago we developed complex ontologies used to encode and reason over small datasets of thousands of facts. Recently the focus has shifted to simple ontologies and minimal reasoning over very large datasets of hundreds of millions of facts. Major companies are moving in: Google Knowledge Graph, Facebook Open Graph, Microsoft Satori, Apple Siri KB, IBM Watson KB. Linked open data, or “Things, not strings”.

4 Linked Open Data (LOD)
Linked data is just RDF data, lots of it, with a small schema. RDF data is a graph of triples (subject, predicate, object), in two patterns: URI URI String, e.g., dbr:Barack_Obama dbo:spouse “Michelle Obama”; and URI URI URI, e.g., dbr:Barack_Obama dbo:spouse dbr:Michelle_Obama. Best linked data practice prefers the 2nd pattern, using nodes rather than strings for “entities”. Things, not strings! Linked open data is just linked data freely accessible on the Web along with its ontologies.
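A hedged illustration of why the second pattern matters: with a node as the object, a query can keep traversing the graph. dbo:birthPlace is a real DBpedia property; the query itself is our sketch, not from the slides.

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    # Follow the spouse link to a further fact about the spouse.
    SELECT ?birthPlace WHERE {
      dbr:Barack_Obama dbo:spouse ?spouse .
      ?spouse dbo:birthPlace ?birthPlace .  # only possible when ?spouse is a node, not a string
    }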

5 Semantic Web
Use Semantic Web technology to publish shared data & knowledge. Semantic web technologies allow machines to share data and knowledge using common web languages and protocols. ~1997: Semantic Web beginning

6 Semantic Web => Linked Open Data
Use Semantic Web technology to publish shared data & knowledge. Data is interlinked to support integration and fusion of knowledge. 2007: LOD beginning

7 Semantic Web => Linked Open Data
Use Semantic Web technology to publish shared data & knowledge. Data is interlinked to support integration and fusion of knowledge. 2008: LOD growing

8 Semantic Web => Linked Open Data
Use Semantic Web technology to publish shared data & knowledge. Data is interlinked to support integration and fusion of knowledge. 2009: … and growing

9 Linked Open Data
Use Semantic Web technology to publish shared data & knowledge. Data is interlinked to support integration and fusion of knowledge. LOD is the new Cyc: a common source of background knowledge. 2010: … growing faster

10 Linked Open Data
Use Semantic Web technology to publish shared data & knowledge. Data is interlinked to support integration and fusion of knowledge. LOD is the new Cyc: a common source of background knowledge. 2011: 31B facts in 295 datasets interlinked by 504M assertions on ckan.net

11 Exploiting LOD not (yet) Easy
Publishing or using LOD data has inherent difficulties for the potential user. It’s difficult to explore LOD data and to query it for answers, and it’s challenging to publish data using appropriate LOD vocabularies and link it to existing data. The problem: O(10^4) schema terms and O(10^11) instances. I’ll describe two ongoing research projects that are addressing these problems.

12 GoRelations: Intuitive Query System for Linked Data Research with Lushan Han

13 DBpedia is the Stereotypical LOD
DBpedia is an important example of Linked Open Data. It extracts structured data from infoboxes in Wikipedia and stores it in RDF using a custom ontology and Yago terms. It is the major integration point for the entire LOD cloud: explorable as HTML, but harder to query in SPARQL.

14 Browsing DBpedia’s Mark Twain

15 Why it’s hard to query LOD
Querying DBpedia requires a lot of the user: understand the RDF model; master SPARQL, a formal query language; understand the ontology terms (320 classes & 1600 properties!); know instance URIs (>2M entities!); and cope with term heterogeneity (Place vs. PopulatedPlace). Querying large LOD sets is overwhelming, and natural language query systems are still a research goal.
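Even a trivial question such as “Where was Mark Twain born?” (the entity browsed on the previous slide) already demands most of this. A minimal sketch, assuming DBpedia’s standard namespaces:

    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?place WHERE {
      dbr:Mark_Twain dbo:birthPlace ?place .  # the user must already know both URIs
    }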

16 Goal
Let users with a basic understanding of RDF query DBpedia and other LOD collections: explore what data is in the system, get answers to questions, and create SPARQL queries for reuse or adaptation. Desiderata: easy to learn and to use, good accuracy (e.g., precision and recall), and fast.

17 Key Idea
Structured keyword queries reduce problem complexity: the user enters a simple graph and annotates the nodes and arcs with words and phrases.

18 Structured Keyword Queries
Nodes are entities and links are binary relations. Entities are described by two unrestricted terms: a name or value and a type or concept. Output nodes are marked with “?”. This is a compromise between a natural language Q&A system and a formal query: users provide the compositional structure of the question but are free to use their own terms to annotate it.

19 Translation – Step One: finding semantically similar ontology terms
For each graph concept or relation, generate the k most semantically similar ontology classes or properties, using a lexical similarity metric based on distributional similarity, LSA, and WordNet.

20 Semantic similarity: http://bit.ly/SEMSIM

21 Semantic Textual Similarity task
Shared task at the 2013 Joint Conference on Lexical and Computational Semantics (*SEM 2013). Task: rate how close two sentences are in meaning, on a 0–5 scale. Score 1: “The woman is playing the violin” vs. “The young lady enjoys listening to the guitar”. Score 4: “In May 2010, the troops attempted to invade Kabul” vs. “The US army invaded Kabul on May 7th last year, 2010”. Participation: 2012, 35 teams and 88 runs; 2013, 36 teams and 89 runs, with 2250 sentence pairs from four domains. Our three runs ranked #1, #2 and #4 overall. Ranks per dataset:
Dataset                PairingWords  Galactus  Saiyan
Headlines (750 pairs)  3             7         1
OnWN (561 pairs)       5             12        36
FNWN (189 pairs)       1             3         2
SMT (750 pairs)        8             11        16
Weighted mean          1             2         4

22 Another Example: football players who were born in the same place as their team’s president

23 Translation – Step Two: disambiguation algorithm
Assemble the best interpretation using statistics of the data. We use pointwise mutual information (PMI) between RDF terms in the LOD collection, which measures the degree to which two RDF terms co-occur in the knowledge base. In a good interpretation, ontology terms associate the way their corresponding user terms connect in the query.

24 Translation – Step Two: disambiguation algorithm
Three aspects are combined to derive an overall goodness measure for each candidate interpretation: joint disambiguation, resolving direction, and link reasonableness.

25 Translation result
Concepts: Place => Place, Author => Writer, Book => Book. Properties: born in => birthPlace, wrote => author (inverse direction).

26 SPARQL Generation
The translation of a semantic graph query to SPARQL is straightforward given the mappings. Concepts: Place => Place, Author => Writer, Book => Book. Relations: born in => birthPlace, wrote => author.
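A sketch of the SPARQL such a translation could produce (variable names are ours; the class and property URIs are the real DBpedia terms named in the mappings):

    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?author ?place ?book WHERE {
      ?author a dbo:Writer .            # Author  => Writer
      ?place  a dbo:Place .             # Place   => Place
      ?book   a dbo:Book .              # Book    => Book
      ?author dbo:birthPlace ?place .   # born in => birthPlace
      ?book   dbo:author ?author .      # wrote   => author, inverse direction
    }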

27 Evaluation
33 test questions from the 2011 Workshop on Question Answering over Linked Data that were answerable using DBpedia. Three human subjects unfamiliar with DBpedia translated the test questions into semantic graph queries. We compared with two top natural language QA systems: PowerAqua and True Knowledge.

28

29

30

31 Current work
The baseline system works well for DBpedia; we’re testing a second use case now. Current work: better entity matching, relaxing the need for type information, and a better Web interface with user feedback & advice. See for more information & try our alpha version at

32 Generating Linked Data by Inferring the Semantics of Tables Research with Varish Mulwad

33 Early work
Mapping tables to RDF led to early tools: D2RQ (2006) for relational tables to RDF, RDF123 (2007) for spreadsheets to RDF, and a recent W3C recommendation, R2RML (2012). But none of these can automatically generate high-quality linked data: they neither link to LOD classes and properties nor recognize entity mentions.

34 Goal: Table => LOD*
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11
Player height in meters; the Team column maps to dbprop:team. (* DBpedia)

35 Goal: Table => LOD*
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11
RDF Linked Data:
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
“Name” is rdfs:label of dbo:BasketballPlayer .
“Team” is rdfs:label of yago:NationalBasketballAssociationTeams .
“Michael Jordan” is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbo:BasketballPlayer .
“Chicago” is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
All this in a completely automated way. (* DBpedia)

36 Tables are everywhere !! … yet …
The web holds 154 million high-quality relational tables, yet fewer than 1% of the 400K tables at data.gov have rich semantic schemas. That is why we want to interpret tables.

37 2010 Preliminary System
T2LD framework pipeline: predict a class for each column, link the table cells, then identify and discover relations. Class prediction for columns: 77% accuracy. Entity linking for table cells: 66% accuracy. Examples of class label prediction results: Column – Nationality, Prediction – MilitaryConflict; Column – Birth Place, Prediction – PopulatedPlace.

38 Sources of Errors
The sequential approach lets errors percolate from one phase to the next. The system is biased toward predicting overly general classes over more appropriate specific ones. Heuristics largely drive the system: although we considered multiple sources of evidence, we did not use joint assignment.

39 A Domain Independent Framework
Pre-processing modules (sampling, acronym detection), then (1) query and generate initial mappings, (2) joint inference/assignment, then generate linked RDF, verify (optional), and store in a knowledge base & publish as LOD.

40 Query and Rank
Query the knowledge base for each cell string (e.g., Chicago, Boston, Allen Iverson) and rank the candidates by string similarity and popularity. Possible entities for Chicago: 1. Chicago_Bulls, 2. Chicago, 3. Judy_Chicago. The underlying knowledge base can be replaced by domain-specific or other LOD knowledge bases.

41 Generating candidate ‘types’ for Columns
Each cell’s candidate entities contribute their classes, and the union over the column’s cells gives the candidate types for the column. For the Team column (Chicago, Philadelphia, Houston, San Antonio), the candidates for Chicago (1. Chicago_Bulls, 2. Chicago, 3. Judy_Chicago) contribute {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}; the other cells contribute {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}.

42 PMI as an association measure
We use pointwise mutual information (PMI) to measure the association between two RDF resources (nodes). In text, PMI measures word association by comparing how often two words occur together to their expected co-occurrence if they were independent: pmi(x,y) = 0 means x and y are independent; pmi(x,y) > 0 means they are associated and tend to occur together.
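Written out, this is the standard PMI definition (the slide states only its behavior; the formula is the usual one):

    \mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}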

43 PMI for RDF instances
For text, the co-occurrence context is usually a window of some number of words (e.g., 50). For RDF instances, we count three graph patterns as instances of the co-occurrence of nodes N1 and N2. [Figure: the three graph patterns linking N1 and N2.] Other graph patterns can be added, but we’ve not evaluated their utility or cost to compute.
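The slide does not spell out the three patterns, but counting the simplest plausible one, a direct link in either direction, might look like this sketch (entities from the running example; the query shape is our assumption):

    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT (COUNT(*) AS ?cooccurrences) WHERE {
      { dbr:Michael_Jordan ?p dbr:Chicago_Bulls }   # N1 -> N2
      UNION
      { dbr:Chicago_Bulls ?p dbr:Michael_Jordan }   # N2 -> N1
    }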

44 PMI for RDF types
We also want to measure the association strength between RDF types, e.g., a dbo:Actor associated with a dbo:Film vs. a dbo:Place. We can likewise measure the association of an RDF property with types, e.g., dbo:author used with a dbo:Film vs. a dbo:Book. Such simple statistics can be efficiently computed in parallel for large RDF collections. (prefix dbo: <http://dbpedia.org/ontology/>)
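For instance, comparing how often dbo:author appears with a dbo:Book subject versus a dbo:Film subject reduces to grouped counts; a sketch, assuming a SPARQL 1.1 endpoint:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?type (COUNT(*) AS ?n) WHERE {
      ?work dbo:author ?who .                  # the property of interest
      ?work a ?type .
      FILTER (?type IN (dbo:Book, dbo:Film))   # the two candidate types
    }
    GROUP BY ?type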

45 Joint Inference / Assignment

46 A graphical model for tables
Joint inference over the evidence in a table: a class variable (C1, C2, C3) for each column and an instance variable (R11 … R33) for each cell, illustrated on the Team column with values Chicago, Philadelphia, Houston and San Antonio. [Figure: the table with its class and instance variables.]

47 Parameterized Graphical Model
Variable nodes: column headers (C1, C2, C3) and row values (R11 … R33). Factor nodes (ψ) capture the affinity between column headers and row values, the interaction between row values, and the interaction between column headers (e.g., ψ5). [Figure: the factor graph.]

48 Standard message passing
Joint assignment means computing P(C1, C2, C3, R11, R12, R13, R21, R22, R23, R31, R32, R33). Graphical models exploit conditional independences to factor this into smaller pieces such as P(C1, R11, R12, R13), P(C2, R21, R22, R23) and P(C3, R31, R32, R33). Still, a factor over 4 variables, each with 25 options, has 25^4 = 390,625 entries!
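Our reading of the slide’s factorization, as a sketch (one potential per column over its header and row values; the paper’s exact potentials may differ):

    P(C_1, C_2, C_3, R_{11}, \dots, R_{33}) \;\propto\; \prod_{i=1}^{3} \psi_i\left(C_i, R_{i1}, R_{i2}, R_{i3}\right)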

49 Semantic message passing
Variables exchange semantic messages rather than full probability tables: each row-value variable carries its current entity assignment (R11:[Michael_I_Jordan], R12:[Yao_Ming], R13:[Allen_Iverson], R21:[Chicago_Bulls], …, R31:[Shooting_Guard]) and each column variable carries its current class (C1:[BasketballPlayer], C2:[NBATeam], C3:[BasketballPositions]); factor nodes reply with “change” or “no change” messages, e.g., “change” to R11 because Michael_I_Jordan does not fit BasketballPlayer. [Figure: semantic message passing over the example table.]

50 Inference – Example
For the Name column, R11’s candidate list for “Michael Jordan” initially ranks 1. Michael_I_Jordan (the professor) above Michael_Jordan (the basketball player). Given the column’s other values (Allen_Iverson, Yao_Ming), the class message “BasketballPlayer” is sent; R11 receives “change” while R12 and R13 receive “no change”, so R11 is re-linked to Michael_Jordan.

51 Column header – row value agreement
Step 1: majority voting over the candidate entities’ classes for [Michael_I_Jordan, Allen_Iverson, Yao_Ming]. Michael_I_Jordan votes for LivingPeople and AI_Researchers; Allen_Iverson and Yao_Ming vote for BasketballPlayer, Athlete and LivingPeople, giving a tally such as 1. LivingPeople, 2. BasketballPlayer, 3. GeoPopulatedPlace, …. Step 2: choose the top class, with a tie-breaker/re-ordering that prefers the more “descriptive” class, e.g., BasketballPlayer over LivingPeople (ClassGranularityScore = 1 − […]). Result: top Yago class BasketballPlayer, top DBpedia class Athlete.

52 Column header – row value agreement
topClassScore = numberOfVotes / numberOfRows. Compute topScoreYago and topScoreDBpedia. If either score >= threshold, check the two top classes for alignment (is Athlete a sub/superclass of BasketballPlayer?); if so, the column header annotation becomes {BasketballPlayer, Athlete}. If both scores < threshold, update the column header annotation to “No-Annotation”.

53 Update row value entity annotations
When a row value receives a “change” message, its candidate entities are re-ranked using the column’s entity class (BasketballPlayer or Athlete) as evidence (factor ψ3): for R11, “Michael Jordan”, the candidate Michael_I_Jordan (LivingPeople, AI_Researchers) loses to Michael_Jordan, which fits the class.

54 Evaluation
Dataset of 80 tables (Wikipedia tables, part of a larger dataset released by IIT Bombay). We evaluated column header annotation accuracy (how good was the mapping of Team to NationalBasketballAssociationTeams?) and entity linking accuracy (mapping Michael Jordan to Michael_Jordan).

55 Column Header Annotation Accuracy
The system produced a ranked list of Yago & DBpedia classes, and human judges evaluated each class. For precision, judges scored each class 1 if it was accurate, 0.5 if it was okay but not the best (e.g., Place vs. City), and 0 if it was incorrect. For recall, the score was 1 if accurate/correct and 0 if incorrect.

56 Top K classes, F-measure

57 Entity linking accuracy
Correctly linked entities: 3022. Incorrectly linked entities: 959. Total entities: 3981. Accuracy: 3022/3981 ≈ 75.9%.

58 Challenge: Interpreting Literals
Many columns contain literals, e.g., numbers. A column of values such as 75, 65, 50, 25 could be age in years, a percentage, a population, or profit in $K, so we predict properties based on the cell values. Cyc had hand-coded rules (humans don’t live past 120); instead, we extract value distributions from LOD resources. Distributions differ across subclasses: the ages of people vs. political leaders vs. athletes. Values are represented as measurements (value + units), and the metric is the possibility/probability of the column’s values given the distribution.
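Pulling such a value distribution from DBpedia is a short query; a sketch using real DBpedia terms (dbo:BasketballPlayer, dbo:height), with the distribution-matching step itself left out:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    # Empirical distribution of basketball players' heights, to compare
    # against the values found in a table column.
    SELECT ?h WHERE {
      ?p a dbo:BasketballPlayer ;
         dbo:height ?h .
    }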

59 Other Challenges
Using table captions and other text in associated documents to provide context. The size of some data.gov tables (>400K rows!) makes using the full graphical model impractical; we sample the table and run the model on the subset. Achieving acceptable accuracy may require human input, since 100% accuracy is unattainable automatically: how best to let humans offer advice and/or correct interpretations?

60 Final Conclusions
Linked data is great for sharing structured and semi-structured data: it is backed by machine-understandable semantics and uses successful Web languages and protocols. But generating and exploring linked data resources is challenging: schemas are too large and there are too many URIs. New tools that map tables to linked data and translate structured natural language queries reduce the barriers.

61

