Collectively Representing Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen and Jamie Callan Language Technologies Institute, Carnegie.




Collectively Representing Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen and Jamie Callan
Language Technologies Institute, Carnegie Mellon University

Motivation

- Entities on the Web can be present in multiple datasets.
- We propose a low-dimensional representation for such entities.
- With a small number of primitive operations on this representation we can do:
    - Semi-Supervised Learning (SSL)
    - Set Expansion (SE)
    - Automatic Set Instance Acquisition (ASIA)

Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class label.

Entities on the Web

Example: Table columns

    Country    Capital City
    India      Delhi
    USA        Washington DC
    Canada     Ottawa
    France     Paris

    Country    National Sport
    USA        Baseball
    India      Hockey
    Sweden     Football

Example: Hyponym Concept Dataset

    Hyponym     Concept:count
    USA         Country:1000, Location:500
    India       Country:450
    Hockey      Sports:100
    Baseball    Sports:60

[Figure: Entity-column bipartite graph and Entity-suchas bipartite graph. Entity nodes: USA, India, Football, Hockey, Baseball. Suchas concept nodes: Country, Location, Sports. Table column nodes: TC-1, TC-2, TC-3, TC-4.]

PIC3 Representation

    Entity-tableColumn bipartite graph (n * t)  ->  PIC  ->  n * m PIC embedding, m << t
    Entity-suchas bipartite graph (n * s)       ->  PIC  ->  n * m PIC embedding, m << s
    Concatenate the two embeddings              ->  n * 2m PIC3 embedding

Example: PIC3 embedding, m = 2

    Entity      X1     X2     Y1     Y2
    USA         0.23   0.76   0.43   0.66
    India       0.21   0.79   0.41   0.69
    Football    0.36   0.80   0.66   0.35
    Hockey      0.35   0.82   0.16   0.92
    Baseball    0.34   0.79   0.14   0.89

Experiments

Datasets: Publicly available semi-structured datasets (http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online)

    Property     Description                Toy_Apple    Delicious_Sports
    |X|          # Entities                 14,996       438
    |C|          # table columns            156          925
    |(x,c)|      # (x, c) edges             176,598      9,192
    |Ys|         # suchas concepts          2,348        1,649
    |(x, Ys)|    # (x, Ys) edges            7,683        4,799
    |Yn|         # NELL classes             11           3
    |(x, Yn)|    # (x, Yn) edges            419          39
    |Yc|         # manual column labels     31           30
    |(c, Yc)|    # (c, Yc) pairs            156          925

Tasks

    Task                                  Training                       Testing
    Semi-Supervised Learning              PIC3 + train SVM classifier    Predict using learnt SVM model
    Set Expansion                         PIC3                           Centroid(entity set) + K-NN(centroid)
    Automatic Set Instance Acquisition    PIC3 + Index HCD               seeds = top-k-entities(lookup concept in HCD) + Set Expansion(seeds)

Semi-Supervised Learning
    Input: PIC3 embedding, few labeled entities per class
    Output: Labels for unlabeled entities
    Hypothesis: PIC3 embeddings will cluster similar entities (entities belonging to the same class) together.

Set Expansion
    Input: PIC3 embedding, set of seed entities
    Output: Expanded set of entities

Automatic Set Instance Acquisition
    Input: PIC3 embedding, Hyponym Concept Dataset (HCD), query concept 'q'
    Output: Set of entities of type 'q'

Experiments II

    Method           Total Query Time (sec)
                     Set Expansion    ASIA
    K-NN + PIC3      12.7             0.5
    K-NN-Baseline    80.1             1.4
    MAD              38.2             150.0

    # Set Expansion queries = 881
    # ASIA queries = 25
    Creating PIC3 representation = 0.02 sec

Conclusions

- Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).
- Simple primitive operations on PIC3 perform the following tasks:
    - Semi-Supervised Learning
    - Set Expansion
    - Automatic Set Instance Acquisition
- Future work: Use the PIC3 representation for named entity disambiguation and unsupervised class-instance pair acquisition.

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.
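The PIC3 construction (two bipartite graphs, one PIC embedding each, then concatenation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the poster does not state how the m dimensions are obtained, how many iterations are run, or how vectors are normalised. The sketch assumes m random start vectors, a fixed iteration count `n_iter` in place of PIC's early stopping, and L1 normalisation. The function names `pic_embedding` and `pic3_embedding`, and the assignment of the example tables to columns TC-1 to TC-4, are hypothetical.

```python
import numpy as np

def pic_embedding(F, m, n_iter=10, seed=0):
    """Sketch: n x m embedding of the rows of a bipartite incidence matrix F (n x t).

    Assumption: the affinity between entities is A = F F^T, and each of the m
    dimensions comes from power iteration with a different random start vector.
    """
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    # Row sums of A = F F^T, computed without materialising the n x n matrix.
    degree = F @ (F.T @ np.ones(n))
    degree[degree == 0] = 1.0  # entities with no edges: avoid division by zero
    V = rng.random((n, m))
    V /= np.abs(V).sum(axis=0)
    for _ in range(n_iter):
        V = (F @ (F.T @ V)) / degree[:, None]  # V <- D^-1 A V
        V /= np.abs(V).sum(axis=0)             # L1-normalise each column
    return V

def pic3_embedding(F_columns, F_suchas, m):
    """Concatenate the entity-column and entity-suchas embeddings (n x 2m)."""
    return np.hstack([
        pic_embedding(F_columns, m, seed=0),
        pic_embedding(F_suchas, m, seed=1),
    ])

# Toy graphs loosely based on the poster's examples.
# Rows: USA, India, Baseball, Hockey, Football.
# Columns of F_columns: TC-1..TC-4 (hypothetical assignment of table columns).
F_columns = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)
# Columns of F_suchas: Country, Location, Sports (counts from the HCD example).
F_suchas = np.array([
    [1000, 500, 0],
    [450, 0, 0],
    [0, 0, 60],
    [0, 0, 100],
    [0, 0, 0],
], dtype=float)

E = pic3_embedding(F_columns, F_suchas, m=2)
```

Multiplying by `F` and `F.T` in sequence keeps the cost proportional to the number of edges, which is one plausible reason such a representation would be cheap to build; the poster itself reports only the 0.02 sec creation time.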
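The Set Expansion and ASIA primitives (centroid plus K-NN, and HCD lookup followed by Set Expansion) can be illustrated on the poster's own example values. The sketch below is hedged: the poster does not name the distance metric, so Euclidean distance is an assumption, and the function names `set_expansion` and `asia` and their parameters are hypothetical.

```python
import numpy as np

# PIC3 example embedding from the poster (m = 2; columns X1, X2, Y1, Y2).
ENTITIES = ["USA", "India", "Football", "Hockey", "Baseball"]
PIC3 = np.array([
    [0.23, 0.76, 0.43, 0.66],
    [0.21, 0.79, 0.41, 0.69],
    [0.36, 0.80, 0.66, 0.35],
    [0.35, 0.82, 0.16, 0.92],
    [0.34, 0.79, 0.14, 0.89],
])

# Hyponym Concept Dataset example from the poster: hyponym -> {concept: count}.
HCD = {
    "USA": {"Country": 1000, "Location": 500},
    "India": {"Country": 450},
    "Hockey": {"Sports": 100},
    "Baseball": {"Sports": 60},
}

def set_expansion(seeds, k):
    """Return the k non-seed entities nearest to the centroid of the seeds.

    Assumption: nearness is Euclidean distance in PIC3 space.
    """
    idx = [ENTITIES.index(s) for s in seeds]
    centroid = PIC3[idx].mean(axis=0)
    dist = np.linalg.norm(PIC3 - centroid, axis=1)
    ranked = [ENTITIES[i] for i in np.argsort(dist) if ENTITIES[i] not in seeds]
    return ranked[:k]

def asia(concept, top_k, k):
    """Pick the top_k hyponyms of `concept` in the HCD as seeds, then expand."""
    scored = [(counts[concept], e) for e, counts in HCD.items() if concept in counts]
    seeds = [e for _, e in sorted(scored, reverse=True)[:top_k]]
    return seeds, set_expansion(seeds, k)

expanded = set_expansion(["USA"], k=1)
seeds, acquired = asia("Sports", top_k=2, k=1)
```

With these example values, expanding the seed set {USA} returns India first, and the ASIA lookup for "Sports" selects Hockey and Baseball as seeds because they have the highest counts for that concept in the HCD.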

