Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Department of Computer Science, Stanford University
SIGIR 2008


1. Social Tag Prediction
   Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
   Department of Computer Science, Stanford University
   Presentation by Sangkeun Lee, Center for E-Business Technology, Seoul National University, Seoul, Korea

2. Introduction
   Copyright © 2009 by CEBT (footer on all slides: Semantic Tech & Context)
   - Social tags are keyword annotations, but tags are poorly understood.
   - Given a set of objects and a set of tags applied to those objects by users, can we predict whether a given tag could or should be applied to a particular object?
   - We call this problem 'social tag prediction'.

3. What Can We Use Social Tag Prediction For?
   - Increasing recall of single-tag queries/feeds
   - Inter-user agreement: sharing objects despite vocabulary differences
   - Tag disambiguation: is 'Apple' the company or the fruit?
   - Bootstrapping: the way users use tags is determined by previous usage, so predicted tags can pre-seed a system
   - System suggestion

4. Preliminaries
   - A social tagging system consists of users u ∈ U, tags t ∈ T, and objects o ∈ O.
   - A post is a set of tags applied to an object by a user; it is made up of one or more (t_i, u_j, o_k) triples.
   - R_p: tags that describe the object. A set of (t, o) pairs, where each pair means that tag t positively describes object o.
   - R_n: tags that do not describe the object. A set of (t, o) pairs, where each pair means that tag t negatively describes object o.
   - R_a: tags that users have applied to the object. A set of (t, u, o) triples, where each triple means that user u annotated object o with tag t.

5. Preliminaries (cont'd)
   - Examples:
     R_p = {(t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)}
     R_n = {(t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza), ...}
     R_a = {(t_pizzeria, u_sally, o_pizza)}
   - Projection & selection: to find all users who have tagged o_pizza, we would write π_u(σ_{o_pizza}(R_a)), and the result would be {u_sally}.
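The relational notation above can be sketched in a few lines of Python, using the slide's bagel/pizza example. The helper names `select_object` and `project_users` are ours, chosen only to mirror σ and π; they are not from the paper.

```python
# R_a: (tag, user, object) triples -- tags users actually applied.
R_a = {
    ("pizzeria", "sally", "pizza"),
}

# R_p / R_n: (tag, object) pairs that do / do not describe the object.
R_p = {("bagels", "bagels"), ("shop", "bagels"), ("downtown", "bagels"),
       ("pizza", "pizza"), ("pizzeria", "pizza")}
R_n = {("pizzeria", "bagels"), ("pizza", "bagels"), ("bagels", "pizza")}

def select_object(relation, obj):
    """sigma_{o=obj}: keep only triples about one object."""
    return {(t, u, o) for (t, u, o) in relation if o == obj}

def project_users(relation):
    """pi_u: keep only the user column."""
    return {u for (_t, u, _o) in relation}

# pi_u(sigma_{o_pizza}(R_a)): all users who have tagged o_pizza
users = project_users(select_object(R_a, "pizza"))
print(users)  # {'sally'}
```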

6. Dataset
   - The Stanford Tag Crawl Dataset: URLs gathered from the del.icio.us recent feed, plus pages linked from each URL and inlinks to the URL
     - 3,630,250 posts
     - 2,549,282 unique URLs
     - 301,499 active unique usernames, and about 2 TB of crawled data
   - T100: the top 100 tags in the dataset by frequency

7. Tradeoff
   - We only know R_a, so we must construct a dataset approximating R_p and R_n for experiments.
   - Heymann et al. [8] suggest that if (t_i, o_k) ∈ π_{(t,o)}(R_a) then (t_i, o_k) ∈ R_p. However, the case where (t_i, o_k) ∉ π_{(t,o)}(R_a) but (t_i, o_k) ∈ R_p occurs sufficiently often that measures of precision, recall, and accuracy can be heavily skewed.
   - The authors' method

8. Tag Prediction
   - Using page information: page text, anchor text, surrounding hosts
   - Using tags: tag prediction based on other tags

9. Using Page Information
   - Prediction as a binary classification task for each tag t_i ∈ T100
   - Evaluate prediction accuracy using page information on the top 100 tags
   - 2,145,593 of 9,414,275 triples (22.7%)
   - Train/test splits:
     - Full/Full: randomly select 11/16 of the positive examples and 11/16 of the negative examples as the training set; the remaining 5/16 of each become the test set
     - 200/200: randomly select 200 positive and 200 negative examples for the training set, and the same for the test set

10. #1: Using Page Information
   - Page text: all text present at the URL
   - Anchor text: all text within fifteen words of inlinks to the URL
   - Surrounding hosts: the sites linked to and from the URL, as well as the site of the URL itself
   - Penn Treebank tokenizer converts texts into bags of words (each token with its occurrence count)
   - Support vector machine for classification
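The feature-extraction step above can be illustrated with a minimal bag-of-words sketch. The paper uses the Penn Treebank tokenizer and an SVM; here we approximate tokenization with lowercased whitespace splitting purely for illustration, and the sample text is invented.

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to {token: count}, a crude stand-in for the
    Penn Treebank tokenization used in the paper."""
    return Counter(text.lower().split())

page_text = "Apple pie recipes and apple tarts"
features = bag_of_words(page_text)
print(features["apple"])  # 2
```

The resulting sparse count vectors would then be fed to a binary SVM classifier, one per tag in T100.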

11. Evaluation
   - PRBEP (precision-recall break-even point): a good single-number measure of how we can trade off precision for recall
   - For Full/Full, PRBEP:
     - for page text was about 60%
     - for surrounding hosts was about 51%
     - This is pretty good: we can get about 2/3 of the URLs labeled with a particular tag, with about 1/3 erroneous URLs in the resulting set
   - Prec@10% (recall fixed at 10%): about 90%
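One common way to compute a precision-recall break-even point is to sweep a threshold down the ranked predictions and take precision at the rank where precision and recall are closest. This is a generic sketch of that idea, not the paper's code, and the toy scores and labels below are invented.

```python
def prbep(scored):
    """scored: list of (score, is_positive) pairs.
    Returns precision at the rank where precision and recall
    are closest to equal (the break-even point)."""
    ranked = sorted(scored, key=lambda p: -p[0])
    total_pos = sum(1 for _, y in ranked if y)
    best, best_gap = 0.0, float("inf")
    tp = 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        precision, recall = tp / k, tp / total_pos
        gap = abs(precision - recall)
        if gap < best_gap:
            best_gap, best = gap, precision
    return best

scored = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.5, False), (0.4, False)]
print(prbep(scored))
```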

12. #2: Using Tags
   - Using association rules: software, tool, osx → add 'mac'
   - Market-basket data mining:
     - Support: the number of baskets containing both X and Y
     - Confidence: how likely is Y given X? P(Y|X)
     - Interest: how much more common is X&Y than expected by chance? P(Y|X) − P(Y)
   - Here, baskets are URLs and items are tags
     - Support > 500 at length 2, support > 1000 at length 3, support > 2000 at any length
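The three statistics above can be computed directly from tag co-occurrence, with baskets = URLs and items = tags as the slide says. The toy baskets below are invented for illustration; only the definitions of support, confidence, and interest come from the slide.

```python
# Each basket is the set of tags applied to one URL (toy data).
baskets = [
    {"software", "tool", "osx", "mac"},
    {"software", "osx", "mac"},
    {"osx", "mac"},
    {"software", "tool"},
    {"recipes", "food"},
]

def support(x, y):
    """Number of baskets containing both X and Y."""
    return sum(1 for b in baskets if x in b and y in b)

def confidence(x, y):
    """P(Y|X): how likely Y is, given X."""
    n_x = sum(1 for b in baskets if x in b)
    return support(x, y) / n_x

def interest(x, y):
    """P(Y|X) - P(Y): how much X lifts Y above its base rate."""
    p_y = sum(1 for b in baskets if y in b) / len(baskets)
    return confidence(x, y) - p_y

print(support("osx", "mac"))     # 3
print(confidence("osx", "mac"))  # 1.0
```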

13. Found Association Rules: Top 30

14. Found Association Rules: Random Sample

15. Tag Application Simulation
   - Rules: generate association rules from a training set of 50,000 URLs
   - Tags: sample n bookmarks from 10,000 test-set URLs
   - Stop applying association rules once a particular minimum confidence c is reached
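The simulation loop above can be sketched as: starting from the sampled tags, repeatedly fire any rule whose antecedent is already covered and whose confidence is at least c, until nothing above c applies. The specific rules and confidence values below are invented for illustration.

```python
# (antecedent tag set, consequent tag, rule confidence) -- toy rules
rules = [
    (frozenset({"software", "osx"}), "mac", 0.92),
    (frozenset({"mac"}), "apple", 0.80),
    (frozenset({"apple"}), "fruit", 0.30),
]

def apply_rules(tags, c):
    """Expand a tag set by firing rules with confidence >= c
    until a fixed point is reached."""
    tags = set(tags)
    applied_confidences = []
    changed = True
    while changed:
        changed = False
        for antecedent, consequent, conf in rules:
            if conf >= c and antecedent <= tags and consequent not in tags:
                tags.add(consequent)
                applied_confidences.append(conf)
                changed = True
    return tags, applied_confidences

tags, confs = apply_rules({"software", "osx"}, c=0.75)
print(sorted(tags))  # ['apple', 'mac', 'osx', 'software']
```

The low-confidence 'fruit' rule never fires at c = 0.75, which is exactly the stopping behavior the slide describes.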

16. Experimental Results
   - For n = 1, 2, 3, 5 and c = 0.5, 0.75, 0.9
   - Estimated precision: the average of the applied rules' confidence values
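The estimated-precision metric on this slide reduces to a simple mean over the confidences of the rules that actually fired. The confidence values below are invented for illustration.

```python
# Confidences of the association rules that were applied (toy values).
applied_confidences = [0.92, 0.80, 0.78]

# Estimated precision = average of the applied confidence values.
estimated_precision = sum(applied_confidences) / len(applied_confidences)
print(round(estimated_precision, 3))  # 0.833
```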

17. How Useful Are Predicted Tags?
   - Increasing recall for single-tag queries: treat each tag in the top 100 as a query
   - Recall increases!
   - Precision decreases, but remains high
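The tradeoff on this slide can be made concrete with a toy single-tag query: adding predicted tags to the index retrieves more of the truly relevant URLs (recall rises) at the cost of some false matches (precision falls). All sets and numbers below are invented.

```python
relevant = {"u1", "u2", "u3", "u4"}  # URLs truly about the query tag

retrieved_original = {"u1", "u2"}                     # via user-applied tags only
retrieved_with_predicted = {"u1", "u2", "u3", "u9"}   # plus predicted tags

def recall(retrieved):
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved):
    return len(retrieved & relevant) / len(retrieved)

print(recall(retrieved_original), precision(retrieved_original))              # 0.5 1.0
print(recall(retrieved_with_predicted), precision(retrieved_with_predicted))  # 0.75 0.75
```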

18. Discussion
   - This paper presents large-scale experiments on real data and shows what social tag prediction can be used for.
   - Prediction methods: using page information and using tags; both show quite reasonable results.
   - Well organized and written, with sound experimental design.
   - Not a new idea: an application of SVMs and market-basket data mining.
   - What could we crawl to run experiments like the authors'? Naver Knowin? Blogs & tags? Any interesting ideas?
