Linked DataTables Automatically Generating Linked Data from Tables Varish Mulwad University of Maryland, Baltimore County November 15, 2011.


1 Linked DataTables: Automatically Generating Linked Data from Tables
Varish Mulwad (@varish)
University of Maryland, Baltimore County
November 15, 2011

2 What?

3 Introduction
State | FIPS | County | FIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oil seeds, dry beans, and dry peas (farms) | 5
Arizona | … | Navajo | …
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | … | 19
Annotations on the table:
– http://dbpedia.org/class/AdministrativeRegion
– http://dbpedia.org/resource/Arizona
– dbpedia-owl:state
– Map literals as values of properties
Introduction  Related Work  Baseline  Results  Joint Inference  Conclusion

4 Contribution
@prefix dbpedia:.
@prefix dbpedia-owl:.
@prefix dbpprop:.
@prefix dgtwc:.
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS county code 000;
  dbpedia:Federal Information Processing Standard state code 001;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
  dbpedia-owl:number 6444].
All this in a completely automated way!

5 Why?

6 Tables are everywhere!! … yet …
The web – 154 million high-quality relational tables [1]

7 Evidence-based medicine
The idea behind evidence-based medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables. However, the rate at which meta-analyses are published remains very low … hampers effective health care treatment …
Figure (bars: # of clinical trials published in 2008; # of meta-analyses published in 2008). Source: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010

8 > 400,000 raw and geospatial datasets
~ < 1% in RDF

9 Current Systems
– Require users to have knowledge of the Semantic Web
– Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud
– RDF data in some cases is as useless as raw data
– Majority of the work focused on relational data where schema is available
– Web tables systems use ‘semantically poor knowledge bases’

10 How?

11 Building a table interpretation framework
– Preliminary work / Baseline system
– Analysis and Evaluation of baseline
– “Domain Independent” Framework grounded in graphical models and probabilistic reasoning

12 The System’s Brain (Knowledgebase)
– Yago
– Wikitology* – A hybrid knowledgebase where structured data meets unstructured data
* Wikitology was created as part of Zareen Syed’s Ph.D. dissertation
Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.

13 The Baseline System

14 T2LD Framework
– Predict Class for Columns
– Linking the table cells
– Identify and Discover relations

15 Predicting Class Labels for column
Column: State (class); cells: Alabama, Arizona, Arkansas, California (instances)
Candidate entities for the cell Alabama: 1. Alabama 2. Alabama_(band) 3. Alabama_(people)
Classes of the candidate entities, one set per cell:
– {dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, …}
– {dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, …}
– {…}
Combined set of candidate classes: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film, ...
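The slide does not say how the combined set of classes is ranked. As a minimal sketch, assuming a simple vote in which each cell contributes one vote to every class carried by its candidate entities, the ranking could look like the following. The function name `rank_column_classes`, the voting rule, and the pairing of the second class set with Arizona are all assumptions made for illustration, not the T2LD implementation.

```python
from collections import Counter

# Classes of the candidate entities for two cells of the "State" column.
# The sets are the abridged ones shown on the slide; pairing the second
# set with "Arizona" is an assumption made for illustration.
candidate_classes = {
    "Alabama": [
        "dbpedia-owl:Place",
        "dbpedia-owl:AdministrativeRegion",
        "yago:StatesOfTheUnitedStates",
        "dbpedia-owl:Band",
        "yago:NativeAmericanTribes",
    ],
    "Arizona": [
        "dbpedia-owl:Place",
        "yago:StatesOfTheUnitedStates",
        "dbpedia-owl:Film",
    ],
}


def rank_column_classes(cells, classes_by_cell):
    """Rank classes by the number of cells whose candidates carry them."""
    votes = Counter()
    for cell in cells:
        # Count each class at most once per cell.
        votes.update(set(classes_by_cell.get(cell, [])))
    # Most votes first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_column_classes(["Alabama", "Arizona"], candidate_classes)
for label, count in ranked:
    print(count, label)
```

Under this plain vote a general class such as dbpedia-owl:Place ties with a specific one such as yago:StatesOfTheUnitedStates, so a real ranking would need extra evidence to prefer the more specific label.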

16 Linking table cells to entities
Macon + County + Alabama + 1 + 87 + Farms with Black or African American operators + ... + dbpedia-owl:AdministrativeRegion
1. Macon County, Alabama 2. Macon County, Illinois
Classifier 1 – SVM Rank (Ranks the set of entities)
Classifier 2 – SVM (Computes Confidence)
Link to the top ranked entity / Don’t link
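The two-classifier decision on this slide can be sketched as rank-then-threshold logic. The sketch below uses plain scoring functions as stand-ins for the SVM Rank and SVM models; the function `link_cell`, the scores, and the 0.5 threshold are hypothetical values chosen only to show the control flow.

```python
def link_cell(candidates, rank_score, confidence, threshold=0.5):
    """Return the top-ranked candidate, or None when confidence is too low.

    rank_score stands in for classifier 1 (ranking) and confidence for
    classifier 2; both are hypothetical callables, not trained models.
    """
    if not candidates:
        return None
    best = max(candidates, key=rank_score)
    return best if confidence(best) >= threshold else None


# Hypothetical scores for the two candidates shown on the slide.
scores = {"Macon County, Alabama": 0.9, "Macon County, Illinois": 0.4}

linked = link_cell(list(scores), scores.get, scores.get)
not_linked = link_cell(list(scores), scores.get, scores.get, threshold=0.95)
print(linked)
print(not_linked)
```

Separating ranking from the link / don't-link decision lets the system decline to link a cell whose best candidate is still a poor match.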

17 Identify Relations
State | County | Candidate relations
Alabama | Macon | Rel ‘A’
Arizona | Navajo | Rel ‘A’, ‘C’
Arkansas | Union | Rel ‘A’, ‘B’, ‘C’
California | Humboldt | Rel ‘A’, ‘B’
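One plausible way to turn the per-row candidate relations into a single relation for the column pair is a majority vote. The slide does not state the selection rule, so the vote and the function name `pick_relation` below are assumptions; the relation sets are the ones shown on the slide.

```python
from collections import Counter


def pick_relation(relations_per_row):
    """Return the relation supported by the most rows (ties: alphabetical)."""
    counts = Counter()
    for relations in relations_per_row:
        # Each row votes at most once for each relation.
        counts.update(set(relations))
    if not counts:
        return None
    top = max(counts.values())
    return min(rel for rel, count in counts.items() if count == top)


# Candidate relations found between the State and County cells of each row.
rows = [{"A"}, {"A", "C"}, {"A", "B", "C"}, {"A", "B"}]
chosen = pick_relation(rows)
print(chosen)
```

Relation ‘A’ is the only one that appears in every row, which is why it wins the vote here.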

18 Generating a linked RDF representation
@prefix dbpedia:.
@prefix dbpedia-owl:.
@prefix dbpprop:.
@prefix dgtwc:.
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS county code 000;
  dbpedia:Federal Information Processing Standard state code 001;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
  dbpedia-owl:number 6444].

19 Evaluation of the baseline system

20 Dataset summary
Number of tables | 15
Total number of rows | 199
Total number of columns | 56 (52)
Total number of entities | 639 (611)
* The number in the brackets indicates # excluding columns that contained numbers

21 Evaluation # 1 (MAP)
– Compared the system’s ranked list of labels against a human-ranked list of labels
– Metric: Average Precision (a.p.) [Mean Average Precision gives a mean over a set of queries]
– Commonly used in the Information Retrieval domain to compare two ranked sets

22 Evaluation # 1 (MAP)
MAP = 0.411
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
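For readers unfamiliar with the metric, here is a minimal sketch of average precision applied to the example above. It assumes the textbook definition, in which the evaluator's labels form an unordered relevant set and the sum of precisions is divided by the size of that set. The slides do not give the exact formula used in the evaluation, so this is not guaranteed to reproduce the reported MAP of 0.411.

```python
def average_precision(system_ranked, relevant):
    """Average precision of a ranked list against a set of relevant labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            # Precision at the rank of each relevant label.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    return sum(average_precision(s, r) for s, r in runs) / len(runs)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system, evaluator)
print(round(ap, 4))
```

In this example Person at rank 1 is not in the evaluator's list, so the two matching labels contribute precisions of 1/2 and 2/3, and dividing by the three relevant labels gives 7/18.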

23 Accuracy for Entity Linking
Overall Accuracy: 66.12%

24 Lessons Learnt
– Sequential system: errors percolated from one phase to the next
– Current system favors general classes over specific ones (MAP score = 0.411)
– Largely, a system driven by “heuristics”
– Although we consider evidence, we don’t do assignment jointly
T2LD Framework: Predict Class for Columns, Linking the table cells, Identify and Discover relations

25 A “Domain Independent” Framework
Diagram labels: a,b,c,… / m,n,o,… / x,y,z,…; Probabilistic Graphical Model / Joint Inference Model; KB; Query; Linked Data
Domain Knowledge – Linked Data Cloud / Medical Domain / Open Govt. Domain

26 Joint Inference over evidence in a table
Probabilistic Graphical Models

27 Parameterized graphical model
– Variable node: column header (C1, C2, C3)
– Variable node: row value (R11, R12, R13, R21, R22, R23, R31, R32, R33)
– Factor node: function that captures the affinity between the column headers and row values
– Factor node: captures interaction between column headers
– Factor node: captures interaction between row values
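To make the idea of joint assignment concrete, the sketch below scores every combination of one column class and two cell entities with a header-to-cell affinity factor and keeps the best-scoring combination. The affinity values, the function `map_assignment`, and the use of exhaustive search are assumptions for illustration; the slides do not specify the factor functions or the inference procedure, and exhaustive search only works for toy-sized tables.

```python
from itertools import product


def map_assignment(variables, factors):
    """Exhaustively find the assignment that maximizes the product of factors.

    variables: dict mapping a variable name to its candidate values.
    factors: list of (scope, function) pairs; scope is a tuple of names.
    """
    names = list(variables)
    best, best_score = None, float("-inf")
    for values in product(*(variables[name] for name in names)):
        assignment = dict(zip(names, values))
        score = 1.0
        for scope, fn in factors:
            score *= fn(*(assignment[name] for name in scope))
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Hypothetical affinity between a column class and a cell entity.
AFFINITY = {
    ("dbpedia-owl:AdministrativeRegion", "Alabama"): 0.9,
    ("dbpedia-owl:AdministrativeRegion", "Alabama_(band)"): 0.1,
    ("dbpedia-owl:AdministrativeRegion", "Arizona"): 0.9,
    ("dbpedia-owl:Band", "Alabama"): 0.1,
    ("dbpedia-owl:Band", "Alabama_(band)"): 0.9,
    ("dbpedia-owl:Band", "Arizona"): 0.1,
}


def affinity(column_class, entity):
    return AFFINITY[(column_class, entity)]


variables = {
    "C1": ["dbpedia-owl:AdministrativeRegion", "dbpedia-owl:Band"],
    "R11": ["Alabama", "Alabama_(band)"],
    "R21": ["Arizona"],
}
factors = [(("C1", "R11"), affinity), (("C1", "R21"), affinity)]

best, score = map_assignment(variables, factors)
print(best, round(score, 2))
```

Because the class and both cells are chosen together, the unambiguous Arizona cell pulls the column toward AdministrativeRegion, which in turn resolves Alabama to the state rather than the band.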

28 Challenges

29 Challenges - Literals
Population: 690,000 | 345,000 | 510,020 | 120,000 (Population / Profit?)
Age: 75 | 65 | 50 | 25 (Age / Percentage?)
Use evidence from the rest of the table to decide

30 Challenges - Metadata

31 More Challenges!
Sampling and Interpretation
– Data set 1425 has > 400,000 rows!
Human in the Loop

32 Conclusion
– Presented a framework for inferring the semantics of tables and generating Linked Data
– Evaluation of the baseline system shows the feasibility of tackling the problem
– Work in progress on building a framework grounded in graphical models and probabilistic reasoning
– Working on tackling challenges posed by tables from domains such as medical and open government data

33 References
1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.
2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3):123–131.
3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.
4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.
5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

34 Thank You! Questions?
varish1@cs.umbc.edu
@varish
http://ebiq.org/h/Varish/Mulwad
Project Page: http://ebiq.org/j/96
finin@cs.umbc.edu
joshi@cs.umbc.edu

35 Backup slides

36 Evidence-based medicine (same content as slide 7)

37 Evaluation # 2 (Correctness)
– Evaluated whether our predicted class labels were “fair and correct”
– Class label may not be the most accurate one, but may be correct
– E.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
– Three human judges evaluated our predicted class labels

38 Evaluation # 2 (Correctness)
Column – Nationality; Prediction – MilitaryConflict
Column – Birth Place; Prediction – PopulatedPlace
Overall Accuracy: 76.92%

39 Querying Wikitology

40 A graphical model for tables
Nodes: C1, C2, C3; R11, R12, R13, R21, R22, R23, R31, R32, R33
Example column: State (class); Alabama, Arizona, Arkansas, California (instances)

41 Dataset 1425
6444 | Number of Farms | Farms with women principal operators | 000 | 01 | Alabama
<rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>

