Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University
Paper ID: 02
This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
Motivation
Entities on the Web can be present in multiple datasets, e.g. HTML tables, text documents, etc.
Traditional systems: each entity is represented as a sparse vector of the IDs of the documents in which it occurs.
We propose a low-dimensional representation for such entities.
It helps to perform different tasks efficiently with a small number of primitive operations:
  Semi-Supervised Learning (SSL)
  Set Expansion (SE)
  Automatic Set Instance Acquisition (ASIA)
Entities in HTML tables

Country   Sports
India     Hockey
UK        Cricket
USA       Tennis

Country   Capital City
India     Delhi
USA       Washington DC
Canada    Ottawa
France    Paris

[Figure: Entity-Column bipartite graph. Entity nodes: USA, India, Hockey, Cricket, Tennis. Table-column nodes: TC-1, TC-2, TC-3, TC-4. Column labels shown next to the tables: TC-2, TC-3.]
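Note (illustrative sketch, not from the slides): one way to turn the two example tables into entity-to-table-column edges. The function name `build_entity_column_graph` and the rule of numbering columns TC-1, TC-2, ... in reading order are assumptions made for this example; the slide does not say how its TC labels were assigned.

```python
from collections import defaultdict

# The two example tables from the slide: a header row followed by data rows.
tables = [
    [["Country", "Sports"],
     ["India", "Hockey"], ["UK", "Cricket"], ["USA", "Tennis"]],
    [["Country", "Capital City"],
     ["India", "Delhi"], ["USA", "Washington DC"],
     ["Canada", "Ottawa"], ["France", "Paris"]],
]

def build_entity_column_graph(tables):
    """Map each entity (cell value) to the set of table-column ids it occurs in."""
    graph = defaultdict(set)
    column_id = 0
    for table in tables:
        header, rows = table[0], table[1:]
        for col_index in range(len(header)):
            column_id += 1
            node = f"TC-{column_id}"  # assumed naming: columns numbered in reading order
            for row in rows:
                graph[row[col_index]].add(node)
    return dict(graph)

entity_columns = build_entity_column_graph(tables)
```

Under this numbering, "India" is linked to both Country columns, which is the kind of shared neighbourhood the bipartite graph is meant to capture.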
Entities in unstructured text
Example sentences:
  Countries such as India are developing rapidly in terms of infrastructure.
  Outdoor sports include Tennis and Cricket.
[Figure: Entity-“Such as” bipartite graph. Entity nodes: USA, India, Hockey, Cricket, Tennis. “Such as” concept nodes: Country, Location, Sports.]
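Note (illustrative sketch, not from the slides): a simplified extractor for "<concept> such as <entities>" patterns. The regular expression, the function name `extract_such_as_edges`, and the second test sentence are assumptions for this example. It lowercases the concept word but does not normalise plurals ("Countries" to "Country"), which the graph on the slide appears to do, and it only recognises capitalised single-word entities.

```python
import re
from collections import defaultdict

# Assumed pattern: "<concept> such as <Entity>[, <Entity>]* [and|or <Entity>]".
SUCH_AS = re.compile(
    r"(\w+) such as ((?:[A-Z]\w*)(?:(?:, | and | or )[A-Z]\w*)*)"
)

def extract_such_as_edges(sentences):
    """Return {entity: set of concept words it was listed under}."""
    graph = defaultdict(set)
    for sentence in sentences:
        for concept, entity_list in SUCH_AS.findall(sentence):
            for entity in re.split(r", | and | or ", entity_list):
                graph[entity].add(concept.lower())
    return dict(graph)

sentences = [
    # First sentence is from the slide; the second is a made-up example.
    "Countries such as India are developing rapidly in terms of infrastructure.",
    "Sports such as Tennis, Cricket and Hockey are played outdoors.",
]
suchas_edges = extract_such_as_edges(sentences)
```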
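Resultant Tri-partite Graph
[Figure: tri-partite graph combining the Entity-Column bipartite graph and the “Such as” bipartite graph through their shared entity nodes. Table-column nodes: TC-1, TC-2, TC-3, TC-4. Entity nodes: USA, India, Hockey, Cricket, Tennis. “Such as” concept nodes: Country, Location, Sports.]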
Encoding the graph: “Entity-Column” bipartite graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010).

Entity    X1     X2
USA       0.43   0.66
India     0.41   0.69
Hockey    0.36   0.80
Cricket   0.35   0.82
Tennis    0.34   0.79

Entities with similar X1/X2 values should be ontologically similar; the values summarize tabular co-occurrence.
[Figure: Entity-Column bipartite graph. Entity nodes: USA, India, Hockey, Cricket, Tennis. Table-column nodes: TC-1, TC-2, TC-3, TC-4.]
Encoding the graph: “Such as” bipartite graph
Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010).

Entity    Y1     Y2
USA       0.23   0.76
India     0.21   0.79
Hockey    0.66   0.35
Cricket   0.16   0.92
Tennis    0.14   0.89

Entities with similar Y1/Y2 values should be ontologically similar; the values summarize “such as” pattern co-occurrence.
[Figure: Entity-“Such as” bipartite graph. Entity nodes: USA, India, Hockey, Cricket, Tennis. “Such as” concept nodes: Country, Location, Sports.]
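Note (illustrative sketch, not from the slides): a minimal pure-Python version of the power-iteration idea behind PIC on a bipartite graph. It assumes the affinity matrix is A = F F^T (F being the entity-by-feature incidence matrix), applied implicitly, with W = D^-1 A, and it runs a fixed small number of iterations from random starts instead of the stopping rule described in the PIC papers. The function name `pic_embedding`, its parameters, and the toy graph are hypothetical, and every entity is assumed to have at least one feature.

```python
import random

def pic_embedding(entity_features, dims=2, iterations=5, seed=0):
    """Return {entity: [x_1, ..., x_dims]} from truncated power iterations."""
    entities = sorted(entity_features)
    rng = random.Random(seed)

    def multiply(v):
        # A v with A = F F^T, computed as F (F^T v) without building A.
        feature_mass = {}
        for e in entities:
            for f in entity_features[e]:
                feature_mass[f] = feature_mass.get(f, 0.0) + v[e]
        return {e: sum(feature_mass[f] for f in entity_features[e])
                for e in entities}

    degree = multiply({e: 1.0 for e in entities})  # row sums of A

    columns = []
    for _ in range(dims):  # one random start per embedding dimension
        v = {e: rng.random() for e in entities}
        total = sum(v.values())
        v = {e: x / total for e, x in v.items()}
        for _ in range(iterations):
            av = multiply(v)
            v = {e: av[e] / degree[e] for e in entities}  # W v with W = D^-1 A
            norm = sum(abs(x) for x in v.values())        # L1 normalisation
            v = {e: x / norm for e, x in v.items()}
        columns.append(v)
    return {e: [col[e] for col in columns] for e in entities}

# Toy graph (assumed edges): two countries share columns, three sports share one.
toy_graph = {
    "USA": {"TC-1", "TC-3"}, "India": {"TC-1", "TC-3"},
    "Hockey": {"TC-2"}, "Cricket": {"TC-2"}, "Tennis": {"TC-2"},
}
embedding = pic_embedding(toy_graph, dims=2)
```

In this toy graph, entities with identical neighbourhoods end up with identical coordinates, which mirrors the claim that similar values indicate similar entities.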
Low-dimensional PIC3 embedding
n * t entity-tableColumn bipartite graph -> PIC -> n * m PIC embedding (m << t)
n * s entity-suchas bipartite graph -> PIC -> n * m PIC embedding (m << s)
Concatenate the two PIC embeddings -> n * 2m PIC3 embedding

Entity    X1     X2     Y1     Y2
USA       0.43   0.66   0.23   0.76
India     0.41   0.69   0.21   0.79
Hockey    0.36   0.80   0.66   0.35
Cricket   0.35   0.82   0.16   0.92
Tennis    0.34   0.79   0.14   0.89
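Note (illustrative sketch, not from the slides): the concatenation step written out with the X1/X2 and Y1/Y2 values shown on the slides. The function name `concatenate_embeddings` is hypothetical, and padding with zeros for an entity missing from one graph is an assumption; the slides only show entities present in both.

```python
def concatenate_embeddings(table_emb, suchas_emb):
    """Join two per-entity embeddings (both assumed non-empty) into n * 2m rows."""
    m_table = len(next(iter(table_emb.values())))
    m_suchas = len(next(iter(suchas_emb.values())))
    pic3 = {}
    for entity in sorted(set(table_emb) | set(suchas_emb)):
        # Assumption: an entity absent from one graph gets zeros for that half.
        x = table_emb.get(entity, [0.0] * m_table)
        y = suchas_emb.get(entity, [0.0] * m_suchas)
        pic3[entity] = x + y
    return pic3

table_emb = {"USA": [0.43, 0.66], "India": [0.41, 0.69], "Hockey": [0.36, 0.80],
             "Cricket": [0.35, 0.82], "Tennis": [0.34, 0.79]}
suchas_emb = {"USA": [0.23, 0.76], "India": [0.21, 0.79], "Hockey": [0.66, 0.35],
              "Cricket": [0.16, 0.92], "Tennis": [0.14, 0.89]}
pic3 = concatenate_embeddings(table_emb, suchas_emb)
```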
Using PIC3 Representation
Semi-Supervised Learning: given a few seed examples for each class, predict class labels for unlabeled data points.
Set Expansion: given a set of seed entities, find more entities similar to the seed entities.
Automatic Set Instance Acquisition (ASIA): given a concept name, automatically find instances of that concept.
Quantitative Evaluation: Datasets

Dataset                                  Toy_Apple   Delicious_Sports
#entities                                14,996      438
#table-columns                           156         925
#entity-table column edges               176,598     9,192
#suchas concepts                         2,348       1,649
#entity-suchas edges                     7,683       4,799
#general entity classes (NELL KB)        11          3
#entities in general classes             419         39
#hand-coded column types                 31          30
#columns in labeled types                156         925

Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online
SSL using PIC3
Input: a few seed examples for each class label
Output: class labels for unlabeled data points

Task                       Training                      Testing
Semi-Supervised Learning   PIC3 + train SVM classifier   Predict using learnt SVM model

PIC clusters similar entities together, which gives a better SVM classifier on unlabeled data (use of background data).
Set Expansion using PIC3
Input: a few seed entities, e.g. Football, Hockey, Tennis
Output: more entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, ...

Task            Training   Testing
Set Expansion   PIC3       Centroid(entity set) + K-NN(centroid)

The K-NN operation is extremely efficient using KD-trees.
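Note (illustrative sketch, not from the slides): the centroid plus K-NN step on the five example PIC3 vectors. For brevity this uses a brute-force nearest-neighbour scan with Euclidean distance rather than the KD-tree mentioned on the slide; the function name `expand_set` and the distance choice are assumptions.

```python
import math

# PIC3 vectors (X1, X2, Y1, Y2) as shown on the embedding slides.
pic3 = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Hockey":  [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}

def expand_set(pic3, seeds, k=1):
    """Return the k non-seed entities closest to the centroid of the seeds."""
    dims = len(next(iter(pic3.values())))
    centroid = [sum(pic3[s][d] for s in seeds) / len(seeds) for d in range(dims)]
    candidates = [e for e in pic3 if e not in seeds]
    # Brute-force scan; a KD-tree would replace this sort on large data.
    candidates.sort(key=lambda e: math.dist(pic3[e], centroid))
    return candidates[:k]

print(expand_set(pic3, ["USA"]))      # nearest neighbour of the single seed
print(expand_set(pic3, ["Cricket"]))
```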
Query Times
PIC3 preprocessing: 0.02 sec
Number of SE queries: 881

Method          Total Query Time (s)
K-NN + PIC3     12.7
K-NN-Baseline   80.1
MAD             38.2

Precision-recall curve: K-NN + PIC3 consistently beats K-NN-Baseline.
Modified Adsorption (MAD), a graph-based label propagation algorithm, is better on 2/5 query classes at the expense of larger query time.
Automatic Set Instance Acquisition (ASIA) using PIC3
Input: a class label, e.g. Country
Output: entities belonging to the given class label, e.g. India, China, USA, Canada, Japan, ...

Task       Automatic Set Instance Acquisition
Training   PIC3 + inverted index (suchasConcept -> entities)
Testing    seeds = top-k entities (look up concept in index) + Set Expansion(seeds)

The previously described Set Expansion algorithm is used as a subroutine here.
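Note (illustrative sketch, not from the slides): the ASIA pipeline as an inverted index lookup followed by set expansion. Ranking the top-k seeds by co-occurrence count is an assumption (the slide does not say how the top-k entities are chosen), the counts below are made up, and the names `build_inverted_index` and `asia` are hypothetical. Set expansion is again a brute-force centroid K-NN.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(suchas_edges):
    """suchas_edges: iterable of (entity, concept, count) triples."""
    index = defaultdict(Counter)
    for entity, concept, count in suchas_edges:
        index[concept][entity] += count
    return index

def asia(concept, index, pic3, num_seeds=1, k=1):
    """Look up seed entities for a concept, then expand them by centroid K-NN."""
    ranked = index.get(concept, Counter()).most_common(num_seeds)
    seeds = [entity for entity, _ in ranked if entity in pic3]
    if not seeds:
        return []
    dims = len(pic3[seeds[0]])
    centroid = [sum(pic3[s][d] for s in seeds) / len(seeds) for d in range(dims)]
    neighbours = sorted((e for e in pic3 if e not in seeds),
                        key=lambda e: math.dist(pic3[e], centroid))
    return seeds + neighbours[:k]

# PIC3 vectors from the embedding slides; edge counts are invented for the example.
pic3 = {
    "USA":     [0.43, 0.66, 0.23, 0.76],
    "India":   [0.41, 0.69, 0.21, 0.79],
    "Hockey":  [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis":  [0.34, 0.79, 0.14, 0.89],
}
index = build_inverted_index([
    ("USA", "country", 5), ("India", "country", 3), ("Tennis", "sports", 4),
])
print(asia("country", index, pic3))
```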
Query Times
PIC3 preprocessing: 0.02 sec
Number of ASIA queries: 25

Method          Total Query Time (s)
K-NN + PIC3     0.5
K-NN-Baseline   1.4
MAD             150.0

Precision-recall curve: K-NN + PIC3 consistently beats K-NN-Baseline.
Modified Adsorption (MAD) is better on 2/4 query classes at the expense of much larger query time.
Conclusions & Future Work
Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
Simple primitive operations on PIC3 perform the following tasks:
  Semi-Supervised Learning
  Set Expansion
  Automatic Set Instance Acquisition
Future work: use the PIC3 representation for named entity disambiguation and unsupervised class-instance pair acquisition.
Thank You!!
Please visit our poster ID: 02