Social Tag Prediction. Paul Heymann, Daniel Ramage, and Hector Garcia-Molina. Stanford University. SIGIR 2008.

1 Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Stanford University
SIGIR 2008

2 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

3 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

4 Introduction
Social tagging allows users to contribute metadata to large and dynamic corpora
Social tag prediction problem
–Given a set of objects and a set of tags, can we predict whether a given tag could/should be applied to a particular object?

5 Benefits of Predicting Social Tags
At a fundamental level, we gain insights into the “information content” of tags
–If tags are easy to predict from other content, they add little value
At a practical level, a tag predictor can enhance a social tagging site in a variety of ways
–Increase recall of single tag queries/feeds
–Inter-user agreement
–Tag disambiguation
–Bootstrapping
–System suggestion

6 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

7 Preliminaries
A post is a set of triples (t_i, u_j, o_k) indicating that user u_j annotated object o_k with a set of tags
Assuming each tag either does or does not describe an object, there are three relations:
–R_p = a set of (t, o) pairs where each pair means that tag t positively describes object o
–R_n = a set of (t, o) pairs where each pair means that tag t negatively describes object o (t does not describe o)
–R_a = a set of (t, u, o) triples where each triple means that user u annotated object o with tag t
T_100 = the 100 most frequent tags

8 Operators and Examples
Two standard relational algebra operators
–σ_c selects tuples from a relation where a particular condition c holds (WHERE in SQL)
–π_p projects a relation onto a smaller number of attributes (SELECT in SQL)
Example: for a web page o_bagels about a downtown bagel shop and a web page o_pizza about a pizzeria,
R_p = (t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)
R_n = (t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza) …
π_t(σ_{o_bagels}(R_p)) = tags which positively describe o_bagels = (t_bagels, t_shop, t_downtown)
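The two operators on this slide can be sketched in a few lines of Python. This is only an illustration: modelling a relation as a set of tuples, and the helper names select and project, are assumptions made here and do not come from the slides.

```python
# Illustrative model (an assumption): a relation is a set of tuples.
R_p = {
    ("bagels", "o_bagels"),
    ("shop", "o_bagels"),
    ("downtown", "o_bagels"),
    ("pizza", "o_pizza"),
    ("pizzeria", "o_pizza"),
}


def select(relation, condition):
    """sigma_c: keep only the tuples for which condition(row) is true."""
    return {row for row in relation if condition(row)}


def project(relation, *positions):
    """pi_p: keep only the attributes at the given tuple positions."""
    return {tuple(row[i] for i in positions) for row in relation}


# pi_t(sigma_{o_bagels}(R_p)): the tags that positively describe o_bagels.
bagel_rows = select(R_p, lambda row: row[1] == "o_bagels")
bagel_tags = {tag for (tag,) in project(bagel_rows, 0)}
print(sorted(bagel_tags))  # ['bagels', 'downtown', 'shop']
```

Because relations are sets, the projection drops duplicate rows automatically, which matches the set semantics of relational algebra.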

9 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

10 Dataset
The base: Stanford Tag Crawl Dataset
–Gathered from del.icio.us
–Consists of 2,549,282 unique URLs with their posts
–Includes anchor text and link information for each URL
Experimental dataset construction
–Aiming to approximate R_p and R_n
–Assume that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p
  –The reverse is not true
–Filter the dataset by postcount(o_k) = |π_u(σ_{o_k}(R_a))|
  –Assume that as postcount(o_k) increases, R_a approximates R_p more closely
  –Filtering threshold = 100
–62,000 URLs in the filtered set
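The postcount filter can be sketched as follows. The toy triples, the function names, and the choice to treat the threshold as inclusive (postcount >= threshold) are assumptions for illustration. The slide gives only the threshold value of 100, and the toy data uses 2 so the effect is visible.

```python
from collections import defaultdict

# Toy R_a: (tag, user, object) triples. Hypothetical data, not from the crawl.
R_a = {
    ("python", "alice", "url1"),
    ("code", "alice", "url1"),
    ("python", "bob", "url1"),
    ("news", "carol", "url2"),
}


def postcount(r_a):
    """postcount(o) = |pi_u(sigma_o(R_a))|: distinct users who annotated o."""
    users = defaultdict(set)
    for _tag, user, obj in r_a:
        users[obj].add(user)
    return {obj: len(user_set) for obj, user_set in users.items()}


def filter_by_postcount(r_a, threshold):
    """Keep triples whose object has at least `threshold` posts (assumed inclusive)."""
    counts = postcount(r_a)
    return {(t, u, o) for (t, u, o) in r_a if counts[o] >= threshold}


def approximate_positive(r_a):
    """pi_(t,o)(R_a): every (tag, object) pair some user applied, taken as in R_p."""
    return {(t, o) for (t, _u, o) in r_a}


counts = postcount(R_a)
kept = filter_by_postcount(R_a, threshold=2)  # the slides use 100
approx_R_p = approximate_positive(kept)
```

Only url1 survives the filter here, so approx_R_p holds the two (tag, object) pairs for that URL.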

11 Probability of Adding “New” Tags
Figure: Average new tags (in T_100) versus number of posts

12 Comparison between Popular Tags
Table: The top/bottom tags in T_100 to be added after the 100th bookmark. The top 15 tags are relatively ambiguous and personal.

13 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

14 Features for SVM
Page text features
–Bag of words
Anchor text
–Bag of words
–Text within 15 words of inlinks to the URL
–Use only URLs with at least 100 inlinks as examples
Surrounding hosts
–Hosts/domains of backlinks
–Hosts/domains of the URL
–Hosts/domains of forward links
For each feature type, the top 1000 features selected by mutual information are used
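Selecting features by mutual information could look like the sketch below. It assumes binary presence/absence features and a binary tag label, which the slides do not spell out. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_present, labels):
    """Mutual information (in bits) between a binary feature and a binary label."""
    n = len(labels)
    joint = Counter(zip(feature_present, labels))
    f_counts = Counter(feature_present)
    l_counts = Counter(labels)
    mi = 0.0
    for (f, l), c in joint.items():
        # p(f,l) * log2( p(f,l) / (p(f) * p(l)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (f_counts[f] * l_counts[l]))
    return mi


def top_k_features(docs, labels, k):
    """docs: list of token sets. Return the k tokens with the highest MI score."""
    vocab = set().union(*docs)
    scores = {w: mutual_information([w in d for d in docs], labels) for w in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


docs = [{"recipe", "flour"}, {"recipe", "oven"}, {"python", "code"}, {"python", "oven"}]
labels = [True, True, False, False]  # does the tag apply to each document?
best = top_k_features(docs, labels, k=2)
```

In this toy set, "recipe" and "python" each determine the label exactly (1 bit), while "oven" appears equally often in both classes and scores 0. With the slides' setting, k would be 1000 per feature type.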

15 Experiment Setup
Binary tag classification by SVM for T_100
–SVMlight and SVMperf with a linear kernel
Data splits
–Full/Full: 11/16 of the positive/negative examples for training and the rest for testing
  –Evaluated by precision-recall break-even point (PRBEP) instead of accuracy
–200/200: randomly select 200 positive and 200 negative examples for training, and the same number for testing
  –Evaluated by accuracy
  –Provided as an imperfect indication of how predictable a tag is due to its “information content” rather than the distribution of examples in the system
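PRBEP is the point on the precision-recall curve where precision equals recall. One common way to compute it, sketched here as an assumption rather than as what SVMperf does internally, is to rank the test examples by classifier score and cut the ranking off after as many examples as there are true positives. At that cutoff precision and recall share the same numerator and denominator.

```python
def prbep(scores, labels):
    """Precision-recall break-even point for one tag.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for a positive example, 0 for a negative one.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PRBEP is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Predict the top n_pos examples as positive: precision == recall == hits / n_pos.
    hits = sum(label for _score, label in ranked[:n_pos])
    return hits / n_pos


example = prbep([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])
```

In the example there are three positives, and two of the three highest-scoring items are positive, so the break-even value is 2/3. Unlike accuracy, this measure is not inflated by the large number of negatives in an unbalanced split such as Full/Full.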

16 Order of Predictability
Predictability = PRBEP (Full/Full) + Prec@10% (Full/Full) + Accuracy (200/200)
Figure: Tags in T_100 in increasing order of predictability from left to right.

17 Discussions
What precision can we get at the PRBEP?
–60% for page text, 58% for anchor text, and 51% for surrounding hosts
–Much better than chance, given that a majority of tags in T_100 occur on fewer than 15% of documents
What precision can we get with low recall?
–90% for all features and 92.5% for page text in Prec@10% (Full/Full)
Which page information is best for predicting tags?
–Page text > anchor text > surrounding hosts

18 What makes a tag predictable? (1/2)
Entropy measure: [formula not reproduced in this transcript]

19 What makes a tag predictable? (2/2)
Figure: Tag popularity is positively correlated with PRBEP in the Full/Full split

20 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

21 Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only 1 or 2 bookmarks
–Recall for single tag queries will be low
The question: given a small number of tags, how much can we expand this set of tags in a high-precision manner?
–Similar to market-basket data mining
  –A large set of items and a large set of baskets, each of which contains a small set of items
  –The goal is to find correlations between sets of items
–The baskets are URLs and the items are tags

22 Association Rules
For a rule X → Y:
Support: the number of baskets containing both X and Y
Confidence: P(Y|X) (How likely is Y given X?)
Interest: P(Y|X) - P(Y) (How much more common are X and Y together than expected by chance?)
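The three measures can be computed directly from a list of baskets. The sketch below follows the definitions on this slide. The function name rule_metrics, the toy tag sets, and the restriction to a single-tag consequent Y are assumptions made for illustration.

```python
def rule_metrics(baskets, x, y):
    """Support, confidence and interest of the rule X -> y.

    baskets: list of tag sets, one per URL.
    x: set of antecedent tags; y: a single consequent tag.
    """
    n = len(baskets)
    n_x = sum(1 for b in baskets if x <= b)              # baskets containing all of X
    n_y = sum(1 for b in baskets if y in b)              # baskets containing y
    n_xy = sum(1 for b in baskets if x <= b and y in b)  # baskets containing both
    confidence = n_xy / n_x if n_x else 0.0              # P(y | X)
    interest = confidence - n_y / n                      # P(y | X) - P(y)
    return {"support": n_xy, "confidence": confidence, "interest": interest}


baskets = [
    {"linux", "opensource"},
    {"linux", "opensource", "ubuntu"},
    {"linux"},
    {"news"},
]
metrics = rule_metrics(baskets, {"linux"}, "opensource")
```

For the rule linux → opensource this gives support 2, confidence 2/3, and interest 2/3 - 1/2 = 1/6. A positive interest means the consequent is more likely given the antecedent than it is overall.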

23 Found Association Rules
Observed relations: type-of, various forms, translations … etc.

24 Found Association Rules
Random sampling of the top 8000 rules of length 3 or less

25 Simulation of Tag Expansion
About 50,000 URLs for training and 10,000 URLs for testing

26 Outline: Introduction; Preliminaries; Dataset; Tag Prediction Using Page Information; Tag Prediction Using Tags; Conclusions

27 Conclusions
Our tag prediction results suggest three insights:
–Many tags on the web do not contribute substantial additional information beyond page text, anchor text, and surrounding hosts.
–The predictability of a tag is negatively correlated with its entropy when our classifiers are given balanced training data. When considering tags in their natural distributions, data sparsity issues tend to dominate.
–Association rules can increase recall on single tag queries. We found association rules linking languages, super/subconcepts, and other relationships.

