Machines learnt how to understand tables. What happens next will shock you. Welcome to the PhD dissertation defense of Varish Mulwad!


1 Machines learnt how to understand tables. What happens next will shock you. Welcome to the PhD dissertation defense of Varish Mulwad!

2 TABEL – Domain Independent and Extensible Framework to Infer the Semantics of Tables. Varish Mulwad, Ph.D. Dissertation Defense. Adviser: Dr. Tim Finin. January 8, 2015.

3 3

4 Semantics of a Table
Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard | 1.98
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward | 2.11
Annotations shown on the table: NationalBasketballAssociationTeams (column class), http://dbpedia.org/resource/Allen_Iverson (cell entity), playsFor (relation between columns), and literals mapped as property values.
Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi, "Exploiting a Web of Semantic Data for Interpreting Tables", In 2nd Web Science Conference (WebSci 2010), Raleigh, NC, USA, Apr. 2010.

5 Semantics of a Table
Input: the same Name / Team / Position / Height table as on the previous slide. Output: Linked Data, all this in a completely automated way!
tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.
tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.

6 TABEL – Domain Independent & Extensible Framework to Infer the Semantics of Tables

7 Thesis Statement: It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.

8 Contributions
o A probabilistic graphical model to jointly infer the semantics, plus a novel inference technique: Semantic Message Passing
o A proof-of-concept, user-interactive application to generate meta-analysis reports automatically
o Development and exploration of the Human in the Loop paradigm
o A novel technique to generate candidate properties from literal values

9 Why | How | Evaluation | Application | Wrap up

10 Tables are everywhere! 154 million high quality relational tables on the web; ~400,000 CSVs on data.gov; healthcare, financial and other domains.

11 The Semantic Web & the Web: Spreadsheets/CSVs to RDF/OWL

12 Evidence Based Medicine
Meta-analysis report. Combine: all studies that compare organic milk v/s grass-fed cow milk. Produce a unified report: "Organic milk is better!"
Example meta-analyses: correlation between cardiovascular risk factors and venous thrombosis; duration of proton pump inhibitors as first line of treatment for Helicobacter pylori eradication.

13 Tables are valuable

14 Meta-Analysis: Today
Example: correlation between cardiovascular risk factors and venous thrombosis [1]. Initial search: 1949 studies. Final number of studies selected: 22!
Keyword-based search; the initial search yields a large number of results; irrelevant results are filtered out manually.
[1] W. Ageno, C. Becattini, T. Brighton, R. Selby, and P. W. Kamphuisen, "Cardiovascular risk factors and venous thromboembolism: a meta-analysis," Circulation, vol. 117, no. 1, pp. 93–102, 2008.

15 Not restricted to healthcare …

16 Related Work – Databases & Spreadsheets to RDF
Existing solutions: largely manual or semi-automatic; number of ontologies, classes, relations.
Automatic solutions: "Row as RDF node"; local mappings; no links to existing classes, properties, entities.

17 Related Work – Semantics of Tables
Infer semantics for only parts of the table [header cells; relation between headers; data cell values; or a combination of the two]. Fail to generate an RDF Linked Data representation. Poor support for literals.

18 Related Work
Limaye et al. [Sep. 2010] [Soumen Chakrabarti's group @ IIT-B]: RDF Linked Data representation; literal values.
Venetis et al. [Sep. 2011] [Alon Halevy's group @ Google]: column header and relation semantics; literal values; RDF Linked Data.
Knoblock et al. [May 2012] [Craig Knoblock's group @ USC – ISI]: largely focuses on header cell semantics & relation between headers; requires initial user input before automatic predictions from the system.

19 What TABEL brings to the "table"
o Infers the complete semantics of a table
o Generates an RDF Linked Data representation
o Supports tables with different structures over a variety of domains [medical tables]
o Incorporates user feedback to improve the quality of inferred semantics
o Infers the semantics of literal values* [numerical values]

20 Why | How | Evaluation | Application | Wrap up

21 TABEL – TABle Extracted as Linked Data
Pipeline: Pre-processing modules (DECODE, AAD, "Your module here!") -> Query and Rank -> Joint Inference -> Generate RDF Linked Data -> Verify (optional) -> Store / Publish. Example input: the Name / Team / Position / Height basketball table.
Varish Mulwad, Tim Finin and Anupam Joshi, "A Domain Independent Framework for Extracting Linked Semantic Data from Tables", In Search Computing, ISBN 978-3-642-34212-7, vol. 7538, 2012.

22 Query – Candidate Entities
Query string: Chicago + Context {Team} + Context {Michael Jordan, Shooting Guard, 1.98}
Ranked candidate lists shown: (1. Chicago 2. Judy_Chicago 3. Chicago_Bulls); (1. Chicago_Bulls 2. Chicago 3. Judy_Chicago); (1. Chicago 2. Judy_Chicago 3. Chicago_Bulls)
Re-rank – Classifier (String Similarity, Popularity)
Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, "Using linked data to interpret tables", In 1st Int. Workshop on Consuming Linked Data, held at the 9th Int. Semantic Web Conf. (ISWC 2010), Shanghai, China, Nov. 2010.
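The re-ranking step can be illustrated with a minimal sketch. The slide names a classifier over string similarity and popularity; the weighted sum below is only a stand-in for that classifier, and the popularity values, the weights and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical candidates with made-up popularity values in [0, 1].
CANDIDATES = {"Chicago": 0.9, "Judy_Chicago": 0.2, "Chicago_Bulls": 0.8}


def string_similarity(cell_text, entity_name):
    """Similarity in [0, 1] between the cell text and an entity label."""
    label = entity_name.replace("_", " ").lower()
    return SequenceMatcher(None, cell_text.lower(), label).ratio()


def rerank(cell_text, candidates, w_sim=0.5, w_pop=0.5):
    """Order candidates by a weighted sum of similarity and popularity.

    The weighted sum is an assumption; the slide only says that a
    classifier uses these two kinds of features.
    """
    scored = [
        (name, w_sim * string_similarity(cell_text, name) + w_pop * popularity)
        for name, popularity in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rerank("Chicago", CANDIDATES):
        print(f"{name}: {score:.3f}")
```

With these made-up numbers the string match dominates, so the city outranks the team; the context terms in the query string are what a real ranker would need in order to prefer Chicago_Bulls.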

23 Query – Candidate Classes
Column "Team": Chicago, Philadelphia, Houston, San Antonio.
Candidate instances for "Chicago": 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago.
Class sets retrieved: {Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams}; {Place, PopulatedPlace, Film, NationalBasketballAssociationTeams, …}; {…}.
Candidate classes for the column: Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film, …
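A minimal sketch of how the column's candidate class set could be assembled: take every candidate entity of every cell in the column and collect the union of their classes. The per-entity class assignments and the Philadelphia entity names are assumptions chosen to reproduce the sets on the slide.

```python
# Hypothetical class memberships; only the resulting union mirrors the slide.
CLASSES_OF = {
    "Chicago": ["Place", "City"],
    "Judy_Chicago": ["WomenArtist", "LivingPeople"],
    "Chicago_Bulls": ["NationalBasketballAssociationTeams"],
    "Philadelphia": ["Place", "PopulatedPlace"],
    "Philadelphia_(film)": ["Film"],
    "Philadelphia_76ers": ["NationalBasketballAssociationTeams"],
}

# Candidate entities per cell of the "Team" column (first two cells only).
COLUMN_CANDIDATES = [
    ["Chicago", "Judy_Chicago", "Chicago_Bulls"],
    ["Philadelphia", "Philadelphia_(film)", "Philadelphia_76ers"],
]


def candidate_classes(column_candidates, classes_of):
    """Union of the classes of all candidate entities, in first-seen order."""
    seen = []
    for cell_candidates in column_candidates:
        for entity in cell_candidates:
            for cls in classes_of.get(entity, []):
                if cls not in seen:
                    seen.append(cls)
    return seen


if __name__ == "__main__":
    print(candidate_classes(COLUMN_CANDIDATES, CLASSES_OF))
```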

24 Query – Candidate Relations
Columns: Name (Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan) and Team (Chicago, Philadelphia, Houston, San Antonio).
Candidate entities for "Michael Jordan": 1. Michael_Jordan 2. Michael_I_Jordan 3. Jordan_River. Candidate entities for "Chicago": 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago.
Relations between candidate pairs: playsFor, livesIn, … Candidate relations for the column pair: playsFor, livesIn, born, …

25 Query – Literals* [numerical data]
Team column (Chicago, Philadelphia, Houston, San Antonio). Candidate classes for "Chicago": Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film.

26 Query – Literals: ?

27 NumKB
Example properties: Population, Income, Height. Example domains: Person, BasketBallPlayer (?). Example literals: 250,000; 1.95.
NumKB:
o Encodes distributional features for Linked Data properties
o Allows query using literal values (and optionally property name)
o Provides information on property domains

28 Identify property domains
Property: seatingCapacity. Steps: get instances (e.g. Queen's_Film_Theatre, Restaurant_Gordon_Ramsay, M&T_Bank_Stadium); get instance types (Theatre, Restaurant, Stadium); order by frequency.
Result: 1. seatingCapacity_Stadium [1] 2. seatingCapacity_Theatre [0.70] 3. seatingCapacity_Restaurant [0.57]
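The three steps above (get instances, get instance types, order by frequency) can be sketched as follows. The slide does not say how the bracketed scores are computed; dividing each count by the largest count is an assumption that happens to give 1 for the top domain. The instance counts are made up.

```python
from collections import Counter


def rank_property_domains(prop, instance_types):
    """Rank the types of a property's instances by frequency.

    instance_types maps each instance that has the property to its types.
    The score (count divided by the largest count) is an assumed
    normalisation, not one stated on the slide.
    """
    counts = Counter(t for types in instance_types.values() for t in types)
    if not counts:
        return []
    top = max(counts.values())
    return [(f"{prop}_{t}", round(c / top, 2)) for t, c in counts.most_common()]


def make_example():
    """Made-up data: 100 stadiums, 70 theatres, 57 restaurants."""
    data = {}
    for type_name, n in (("Stadium", 100), ("Theatre", 70), ("Restaurant", 57)):
        for i in range(n):
            data[f"{type_name}_{i}"] = [type_name]
    return data


if __name__ == "__main__":
    print(rank_property_domains("seatingCapacity", make_example()))
```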

29 Identify property domain duplet values
Input: [property, domain], e.g. [seatingCapacity, Stadium]. Get property values (e.g. 17777, 20767, 500); sort; trim front & back tails; compute µ & σ.
Compute ranges µ - σ : µ + σ, µ - 2σ : µ + 2σ, and so on:
-212 : 25743 [86.66 %]
-13190 : 38721 [6.56 %]
-26168 : 51699 [4.67 %]
-39146 : 64677 [2.08 %]
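A minimal sketch of the range computation: sort the values, trim both tails, compute µ and σ, then build the nested ranges µ ± kσ. The trim fraction, the use of the population standard deviation and the number of ranges are assumptions; the slide does not state them.

```python
from statistics import mean, pstdev


def sigma_ranges(values, trim_fraction=0.05, max_k=4):
    """Return [(mu - k*sigma, mu + k*sigma) for k = 1..max_k].

    trim_fraction is the share of values dropped from each tail after
    sorting (an assumed parameter).
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    cut = int(len(ordered) * trim_fraction)
    trimmed = ordered[cut:len(ordered) - cut]
    mu = mean(trimmed)
    sigma = pstdev(trimmed)  # population standard deviation (assumption)
    return [(mu - k * sigma, mu + k * sigma) for k in range(1, max_k + 1)]


if __name__ == "__main__":
    for low, high in sigma_ranges([500, 17777, 20767], trim_fraction=0.0):
        print(f"{low:.0f} : {high:.0f}")
```

The ranges printed on the slide are consistent with this construction: their common midpoint is about 12765 and each step widens the range by about 12978 on both sides.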

30 Query – Literals
Query: 1.98, height -> NumKB -> 1. height 2. diameter 3. minimumElevation
Matching: minRange < 1.98 < maxRange; fuzzy string match (ColHeaderString, PropertyName)
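The lookup can be sketched as a range filter followed by an optional fuzzy match against the column header. The in-memory list standing in for NumKB, its ranges and the function name are all hypothetical; only the two matching rules come from the slide.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for NumKB: one (min, max) range per property.
NUMKB = [
    {"property": "population", "min": 1000.0, "max": 5000000.0},
    {"property": "diameter", "min": 0.1, "max": 900.0},
    {"property": "height", "min": 1.5, "max": 2.3},
    {"property": "minimumElevation", "min": -400.0, "max": 3000.0},
]


def query_numkb(value, header=None, kb=NUMKB):
    """Return properties whose range contains value (minRange < value < maxRange).

    If a column header string is given, hits are ordered by fuzzy string
    similarity between the header and the property name.
    """
    hits = [entry["property"] for entry in kb if entry["min"] < value < entry["max"]]
    if header is not None:
        hits.sort(
            key=lambda name: SequenceMatcher(None, header.lower(), name.lower()).ratio(),
            reverse=True,
        )
    return hits


if __name__ == "__main__":
    print(query_numkb(1.98, "Height"))
```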

31 Graphical Model for Tables
Variable nodes: column headers C1, C2, C3 and row values R11, R12, R13, R21, R22, R23, R31, R32, R33.
Examples shown: column Team (Chicago, Philadelphia, Houston, San Antonio) with its class and instances; headers Name, Vice-President, Office Held; values Beetle, Red, Gasoline.

32 Parameterized Graphical Model
Variable nodes: column headers (C1, C2, C3) and row values (R11 … R33).
Factor nodes: a function that captures the affinity between the column headers and row values; one that captures interaction between column headers; one that captures interaction between row values.

33 Semantic Message Passing
Column headers: C1: [BasketballPlayer], C2: [NBATeam], C3: [BasketBallPositions].
Example: factor nodes for the class BasketballPlayer and the relation playsFor send "Change" or "No Change" messages to the cells Michael_I_Jordan, Allen_Iverson, Yao_Ming and Chicago_Bulls.

34 Semantic Message Passing (V – variable nodes, F – factor nodes; semantically aware factor nodes)
[V] Pick new value
[V] Send current values
[F] Identify outliers
[F] Send semantics
Varish Mulwad, Tim Finin and Anupam Joshi, "Semantic Message Passing for Generating Linked Data from Tables", In 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, Oct. 2013.

35 Name column, current values: [Michael_I_Jordan, Allen_Iverson, Yao_Ming]
Classes of these entities include GeoPopulatedPlace, BasketBallPlayer, Art Work, Athlete, ArtificialIntelligenceResearchers.
Ranked classes: 1. BasketBallPlayer 2. GeoPopulatedPlace … Top class: BasketBallPlayer

36 Use the topClass in the message passing process; send topClassScore as the confidence score.
If topClassScore < threshold: update the column header annotation to "No-Annotation".
Otherwise, for each value in the Name column (Michael_I_Jordan, Allen_Iverson, Yao_Ming): is it of class BasketBallPlayer? Send "Change" or "No-Change" accordingly.
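The column header factor node described on the last two slides can be sketched roughly as follows. The scoring rule (share of the column's current entities that carry the class), the threshold value, the class memberships and the messages sent when no class passes the threshold are all assumptions; the slides only name topClass, topClassScore and the threshold test.

```python
from collections import Counter

# Hypothetical class memberships, for illustration only.
CLASSES_OF = {
    "Michael_I_Jordan": ["ArtificialIntelligenceResearchers"],
    "Allen_Iverson": ["BasketballPlayer", "Athlete"],
    "Yao_Ming": ["BasketballPlayer"],
}


def column_header_factor(entities, classes_of, threshold=0.5):
    """Pick the top class for a column and tell each cell whether to change.

    Returns (annotation, score, messages), where messages maps each entity
    to "Change" or "No-Change".
    """
    votes = Counter(cls for e in entities for cls in classes_of.get(e, []))
    if not votes:
        return "No-Annotation", 0.0, {e: "No-Change" for e in entities}
    top_class, count = votes.most_common(1)[0]
    top_score = count / len(entities)  # assumed definition of topClassScore
    if top_score < threshold:
        # Assumption: with no agreed class there is nothing to ask cells to fix.
        return "No-Annotation", top_score, {e: "No-Change" for e in entities}
    messages = {
        e: "No-Change" if top_class in classes_of.get(e, []) else "Change"
        for e in entities
    }
    return top_class, top_score, messages


if __name__ == "__main__":
    print(column_header_factor(list(CLASSES_OF), CLASSES_OF))
```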

37 4 – Relation between Columns
Column pair values: [Michael_I_Jordan, Chicago_Bulls], [Allen_Iverson, Philadelphia_76ers], [Yao_Ming, Houston_Rockets].
Relations per pair: playsFor, livesIn, …; No-rel; playsFor. Ranked relations: 1. playsFor 2. livesIn … Top relation: playsFor

38 4 – Relation between Columns
Use the topRel in the message passing process; send topRelScore as the confidence score.
If topRelScore < threshold: update the relation annotation to "No-Annotation".
Otherwise, for each value in the Name column (Michael_I_Jordan, Allen_Iverson, Yao_Ming): does the relation playsFor hold? Send "Change" or "No-Change" accordingly.

39 Variable Node Update
Node R11 (Michael Jordan), shown with (Team), (Chicago) and (Shooting Guard), receives: Change [BasketBallPlayer, 0.8]; Change [playsFor, 0.6]; No-Change [0.55].
Decision: avgChangeConfidenceScore > avgNoChangeConfidenceScore ?
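The decision rule on this slide is a comparison of two averages. A minimal sketch, using the three messages from the slide; the handling of the cases where one kind of message is missing is an assumption.

```python
from statistics import mean


def should_update(messages):
    """Decide whether a variable node should pick a new value.

    messages is a list of (kind, confidence) pairs with kind being
    "Change" or "No-Change". The node updates when the average
    confidence of the Change messages exceeds that of the No-Change ones.
    """
    change = [conf for kind, conf in messages if kind == "Change"]
    no_change = [conf for kind, conf in messages if kind == "No-Change"]
    if not change:
        return False  # assumption: nothing asks for a change
    if not no_change:
        return True  # assumption: nothing argues against the change
    return mean(change) > mean(no_change)


# Messages received by R11 (Michael Jordan) on the slide.
R11_MESSAGES = [("Change", 0.8), ("Change", 0.6), ("No-Change", 0.55)]

if __name__ == "__main__":
    print(should_update(R11_MESSAGES))
```

For the slide's example the average Change confidence is (0.8 + 0.6) / 2 = 0.7, which is greater than 0.55, so the node picks a new value.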

40 Variable Node Update
Node R11 (Michael Jordan) received constraints: (1) [Class: BasketBallPlayer, 0.8]; (2) [Relation: playsFor, 0.6]. Candidates: Michael_I_Jordan, …, Michael_Jordan, …
Try, in order, to find a candidate that satisfies constraints [1, 2, 3]; then [1, 2]; [1, 3]; [2, 3]; then [1]; [2]; [3]. Otherwise choose "No Annotation".
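The fallback order on this slide (all three constraints, then pairs, then single constraints, then "No Annotation") can be sketched with itertools.combinations. The fact table and the third constraint are hypothetical; the slide does not say what constraint 3 is.

```python
from itertools import combinations

# Hypothetical facts about the two candidate entities.
FACTS = {
    "Michael_I_Jordan": {"classes": {"Scientist"}, "playsFor": set()},
    "Michael_Jordan": {"classes": {"BasketballPlayer"}, "playsFor": {"Chicago_Bulls"}},
}

CONSTRAINTS = [
    lambda e: "BasketballPlayer" in FACTS[e]["classes"],  # (1) class constraint
    lambda e: "Chicago_Bulls" in FACTS[e]["playsFor"],  # (2) relation constraint
    lambda e: "Shooting_Guard" in FACTS[e].get("position", set()),  # (3) hypothetical
]


def pick_entity(candidates, constraints):
    """Return (entity, satisfied constraint numbers).

    Subsets are tried from largest to smallest, so with three constraints
    the order is [1,2,3], [1,2], [1,3], [2,3], [1], [2], [3].
    """
    n = len(constraints)
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            for candidate in candidates:
                if all(constraints[i](candidate) for i in subset):
                    return candidate, [i + 1 for i in subset]
    return "No Annotation", []


if __name__ == "__main__":
    print(pick_entity(["Michael_I_Jordan", "Michael_Jordan"], CONSTRAINTS))
```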

41 Halting Condition
Ideal case: no variable node receives a 'CHANGE' message.
Practical case: the fraction of variable nodes that receive a 'CHANGE' message < threshold.
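Both halting conditions fit in one check. A minimal sketch; the data structure (a mapping from variable node to whether it received a CHANGE message in the current iteration) and the example threshold are assumptions.

```python
def should_halt(received_change, threshold):
    """Return True when message passing can stop.

    received_change maps each variable node to True if it received a
    'CHANGE' message in this iteration.
    """
    if not received_change:
        return True
    fraction = sum(received_change.values()) / len(received_change)
    # Ideal case: fraction == 0. Practical case: fraction < threshold.
    return fraction == 0 or fraction < threshold


if __name__ == "__main__":
    state = {"R11": True, "R21": False, "R31": False, "R12": False}
    print(should_halt(state, threshold=0.3))
```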

42 Tables Ontology
Example terms: dbpedia-owl:BasketBallTeam, dbpedia:Michael_Jordan, dbpedia-owl:playsFor

43 RDF Linked Data Representation
tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.
tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.
tab:HeaderRelation_12 a tab:TableRelation;
    tab:relFromColumn tab:cell_01;
    tab:relToColumn tab:cell_02;
    tab:relLabel dbpedia-owl:team.
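A minimal sketch of how such statements could be emitted from inferred annotations, using plain string formatting rather than an RDF library. The helper names and the cell_<row><column> identifier scheme are assumptions read off the example; the sketch writes the XSD datatype names in lowercase (xsd:string, xsd:integer), which is how XML Schema spells them.

```python
def cell_id(row, col):
    """Identifier in the style of the example: tab:cell_<row><col>."""
    return f"tab:cell_{row}{col}"


def literal(text, datatype):
    """Typed Turtle literal with minimal escaping of backslash and quote."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^xsd:{datatype}'


def header_cell(col, label, value_type):
    """Statements for a column header cell (row index 0 by assumption)."""
    lines = [
        f"{cell_id(0, col)} a tab:ColumnHeader;",
        f"    tab:cellLabel {literal(label, 'string')};",
        f"    tab:columnIndex {literal(str(col), 'integer')};",
        f"    tab:valueType {value_type}.",
    ]
    return "\n".join(lines)


def data_cell(row, col, label, entity):
    """Statements for a data cell linked to an entity."""
    lines = [
        f"{cell_id(row, col)} a tab:DataCell;",
        f"    tab:cellLabel {literal(label, 'string')};",
        f"    tab:columnIndex {literal(str(col), 'integer')};",
        f"    tab:rowIndex {literal(str(row), 'integer')};",
        f"    tab:entity {entity}.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(header_cell(1, "Name", "dbpedia-owl:BasketballPlayer"))
    print(data_cell(1, 1, "Michael Jordan", "dbpedia:Michael_Jordan"))
```

A complete document would also need @prefix declarations for tab:, xsd:, dbpedia: and dbpedia-owl:, which the slide does not show.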

44 Human in the loop
Pipeline: AAD, DECODE -> Query and Rank -> Joint Inference -> Generate RDF Linked Data -> Verify (optional) -> Store / Publish. Example input: the Name / Team / Position / Height basketball table.
Feedback points: Before, During, After.

45 Human in the loop – Before
No. | Name | Team | Position | Height
1 | Michael Jordan | Chicago | Shooting Guard | 1.98
2 | Allen Iverson | Philadelphia | Point Guard | 1.83
3 | Yao Ming | Houston | Center | 2.29
4 | Tim Duncan | San Antonio | Power Forward | 2.11

46 Human in the loop – Before
Column Team: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …
Cell Michael Jordan: Michael_I_Jordan, Michael_Jordan, Michael_Jackson, Michael_Wodruff, …
Column pair Name, Team: livesIn, team, …
Assignments treated as "true values".

47 Human in the loop – During
Team [0.2]; Name, Team [0.1]. Candidate classes for Team: WomenArtist, BasketBallTeam, City, SportsTeam, …

48 Human in the Loop – Impact on Joint Inference
User fixes Name to [BasketballPlayer]: R11 (Michael Jordan) receives [Class: BasketBallPlayer, 1.0] [Fixed] and [Relation: playsFor, 0.6].
User fixes Name, Team to [playsFor]: R11 (Michael Jordan) receives [Class: BasketBallPlayer, 0.8] and [Relation: playsFor, 1.0] [Fixed].
In both cases the values of the Name column (Michael_I_Jordan, Allen_Iverson, Yao_Ming) still receive Change / No-Change messages.

49 Human in the Loop – Impact on Joint Inference
R11 Chicago [Chicago_Bulls]. Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, … Candidate relations: livesIn, team, …

50 Why | How | Evaluation | Application | Wrap up

51 Datasets (subset of the IIT-B dataset)
Dataset | # of tables used in Col. and Rel. Annotations | # of tables used in Data Cell Annotations | Average number of columns, rows
Web_Manual | 150 | 371 | 2, 36
Web_Relation | 28 | – | 4, 67
Wiki_Manual | 25 | 39 | 4, 35
Wiki_Links | – | 80 | 3, 16
Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. 36th VLDB (2010)

52 Ground Truth
Human annotators marked each class and relation as 'vital', 'okay' or 'incorrect'.
To compute precision, assign scores to the class & relation predicted by the system: 1 if the class was vital; 0.5 if the class was okay, but could have been better (e.g. Place v/s City); 0 if it was incorrect.
To compute recall, assign a score of 1 if vital or okay, 0 for incorrect.
Ground truth for data cell value annotations comes from the IIT-B dataset.
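The two scoring schemes can be written as lookup tables. The slide does not give the aggregation formula, so the sketch below only averages the per-prediction credits; that averaging and the function name are assumptions.

```python
# Credit per annotator judgement, as listed on the slide.
PRECISION_CREDIT = {"vital": 1.0, "okay": 0.5, "incorrect": 0.0}
RECALL_CREDIT = {"vital": 1.0, "okay": 1.0, "incorrect": 0.0}


def mean_credit(judgements, credit):
    """Average credit over a list of judgements (assumed aggregation)."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(credit[j] for j in judgements) / len(judgements)


if __name__ == "__main__":
    judgements = ["vital", "okay", "incorrect", "vital"]
    print(mean_credit(judgements, PRECISION_CREDIT))
    print(mean_credit(judgements, RECALL_CREDIT))
```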

53 Column Header Annotations
[Chart: % of relevant labels (vital, okay) at rank 1 for Web_Manual, Web_Relation and Wiki_Manual]

54 Column Header Annotations
[Chart: % of relevant labels at different ranks for Web_Manual, Web_Relation and Wiki_Manual]

55 Column Header Annotations
[Chart: precision, recall and F-score at rank 1]

56 Column Header Annotations
[Chart: precision v/s recall at ranks 1-10 for Web_Manual, Web_Relation and Wiki_Manual]

57 Column Header Annotations
[Chart: Semantic Message Passing v/s the rest, F-scores of SMP, IIT-B and GOOG on Web_Manual and Wiki_Manual]

58 Example Column Header Predictions
Column: Constituency. Predicted: N.A. DBpedia classes [Ranks 2-10]: OfficeHolder, PrimeMinister, Politician, Election, Event, AdministrativeRegion, PopulatedPlace, University, EducationalInstitution.
Column: Name of Elected M.P. Predicted: OfficeHolder. DBpedia classes [Ranks 2-10]: Election, Event, PrimeMinister, Politician, Country, PopulatedPlace, Settlement, University, EducationalInstitution.

59 Relation Annotations
[Chart: % of relevant relations (vital, okay) at rank 1]

60 Relation Annotations
[Chart: % of relevant relations at ranks 1-10 for Web_Manual, Web_Relation and Wiki_Manual; DBpedia and Yago]

61 Relation Annotations
[Chart: Semantic Message Passing v/s the rest, F-scores of SMP and IIT-B on Web_Manual, Wiki_Manual and Web_Relation]

62 Example Relation Predictions
Column: President – Birth state. Predicted: N.A. DBpedia rels [Ranks 2-10]: location, deathPlace, locatedInArea, birthPlace, isPartOf, largestCity, almaMater, region, state.
Column pair: Name of Elected M.P. – Party Affiliation. Predicted: party. DBpedia rels [Ranks 2-8]: affiliation, otherParty, primeMinister, deathPlace, birthPlace, region, NA.

63 Data Cell Value Annotations

64 How long did it run?
[Chart: each line represents a table; number of variables that received a "change" message at the end of an iteration]

65 Literals – Experimental Setup
Subset of 16 tables [17 literal value columns] from the Wiki_Link dataset. Generate the property candidate set by querying against NumKB. Each literal column was manually annotated with an appropriate DBpedia property.

66 Header Cell Annotations for Literals
[Chart: percentage of correct properties at ranks 1-10]

67 Human in the loop – Experimental Setup
Subset of 11 tables from the Wiki_Link dataset. User feedback: correct column header class [1 column in 9 tables and 2 for the remaining 2 tables]. The rest of the experimental setup is the same.

68 Data Cell Annotations: Human in the Loop (HIL) v/s No Human in the Loop
Setting | Correct Entities | Total | %
HIL | 286 | 402 | 71.14
No-HIL | 245 | 402 | 60.95

69 Why | How | Evaluation | Application | Wrap up

70 Interpreting Medical Tables as Linked Data for Generating Meta–Analysis Reports

71 TABEL – TABle Extracted as Linked Data
Pipeline: Pre-processing modules (AAD, DECODE, Normalize, "Your module here!") -> Query and Rank -> Joint Inference -> Generate RDF Linked Data -> Verify (optional) -> Store / Publish. Example input: the Name / Team / Position / Height basketball table.
Varish Mulwad, Tim Finin and Anupam Joshi, "Interpreting Medical Tables as Linked Data to Generate Meta–Analysis Reports", In 15th IEEE Int. Conf. on Information Reuse and Integration (IRI 2014), San Francisco, USA, Aug. 2014.

72 Preprocessing – Normalize

73 Preprocessing – Normalize
Split header cells into query string and metadata, e.g. "Patients with Secondary Thrombosis N = 146".
Normalize data cells; identify types or units, e.g. row "Smoker" under header "no. (%)": no. --> 49; % --> 33.6.
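A minimal sketch of the two normalization steps using regular expressions. The exact cell format ("49 (33.6)") and the function names are assumptions; the slide only shows the header text and the resulting values.

```python
import re


def split_header(header):
    """Split a header cell into (query string, metadata).

    Recognises only the 'N = <integer>' pattern from the slide's example.
    """
    match = re.search(r"\bN\s*=\s*(\d+)", header)
    if match is None:
        return header.strip(), {}
    return header[: match.start()].strip(), {"N": int(match.group(1))}


def normalize_cell(cell):
    """Parse a 'count (percent)' cell into typed values, or return None.

    Assumes the cell is formatted like '49 (33.6)' under a 'no. (%)' header.
    """
    match = re.fullmatch(r"\s*(\d+)\s*\(\s*(\d+(?:\.\d+)?)\s*%?\s*\)\s*", cell)
    if match is None:
        return None
    return {"no.": int(match.group(1)), "%": float(match.group(2))}


if __name__ == "__main__":
    print(split_header("Patients with Secondary Thrombosis N = 146"))
    print(normalize_cell("49 (33.6)"))
```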

74 Query – Candidate Classes* [DBpedia]
Query "Hypertension" returns: (1) Idiopathic intracranial hypertension (2) Pulmonary hypertension (3) Hypertension.
Re-rank – Classifier (String Similarity, Popularity): (1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension.
Also evaluated against SNOMED CT & UMLS.

75 Query – Candidate Classes [Hybrid]
Query "Hypertension". No results? -> SNOMED CT API. Candidates: (1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension.

76 Modeling Medical Tables as RDF
Classes and datatypes: PatientGroup, Value, owl:Thing, xsd:integer, xsd:string, xsd:double. Properties: numberOfIndividuals, hasGroupAttribute, hasType, hasRawValue. Example values: 146, umls:Secondary_Thrombosis, %, 33.6.

77 Interactive tool to generate Meta–Analysis reports
User interface to define meta-analysis parameters and select studies. The tool automatically generates the relevant SPARQL queries.

78 Evaluation

79 Header Cell Annotations
[Chart: distribution of header cell concepts at different ranks (1, 2-5, 6-10, 11-25, 26-101, NF) for SNOMED CT, UMLS, HYBRID and DBPEDIA. NF: correct concept not found in the candidate set.]
Dataset: 7 tables (122 header cells)

80 Retrieval (Find) Evaluation
Experimental setup: generated Linked Data from four tables; executed retrieval SPARQL queries to find tables that included correlations between venous thrombosis and four different cardiovascular risk factors.
Average Precision: 0.79; Average Recall: 0.75

81 Why | How | Evaluation | Application | Wrap up

82 Conclusions
I claimed: "It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud."

83 Conclusions
"It is possible to generate high quality linked data from tables by jointly inferring the semantics": TABEL jointly inferred the semantics; thorough evaluation showed promising results.
"… the semantics of column headers, values (string and literal) in table cells, and relations between columns": a novel technique to generate candidate properties from literal values.

84 Conclusions
"It is possible to generate high quality linked data from tables": Tables ontology to represent the inferred semantics; demonstrated domain independence, extensibility and support for tables with different structures; explored different models for Human in the loop.

85 Future Work
Schema + data driven approach; build on the work on inferring literals (NumKB); further develop Human in the loop; tool to generate meta-analysis reports.

86 Acknowledgements
Dr. Tim Finin, Dr. Anupam Joshi, Dr. Tim Oates, Dr. Yun Peng, Dr. L V Subramaniam, Dr. Indrajit Bhattacharya, Lab mates & Friends!

87 Thank You! Our papers on this research topic have garnered 93 citations!

