Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Stanford University
SIGIR 2008


Outline
Introduction
Preliminaries
Dataset
Tag Prediction Using Page Information
Tag Prediction Using Tags
Conclusions

Introduction
Social tagging allows users to contribute metadata to large and dynamic corpora
Social tag prediction problem
– Given a set of objects and a set of tags, can we predict whether a given tag could/should be applied to a particular object?

Benefits of Predicting Social Tags
At a fundamental level, we gain insights into the "information content" of tags
– If tags are easy to predict from other content, they add little value
At a practical level, a tag predictor can enhance a social tagging site in a variety of ways
– Increase recall of single tag queries/feeds
– Inter-user agreement
– Tag disambiguation
– Bootstrapping
– System suggestion

Preliminaries
A post is a set of triples (t_i, u_j, o_k) indicating that user u_j annotated object o_k with a set of tags
Assuming that each tag either does or does not describe an object, there are three relations:
– R_p = a set of (t, o) pairs where each pair means that tag t positively describes object o
– R_n = a set of (t, o) pairs where each pair means that tag t negatively describes object o
– R_a = a set of (t, u, o) triples where each triple means that user u annotated object o with tag t
T_100 = the 100 most frequent tags

Operators and Examples
Two standard relational algebra operators
– σ_c selects tuples from a relation where a particular condition c holds (WHERE in SQL)
– π_p projects a relation onto a smaller number of attributes (SELECT in SQL)
Example: for a web page o_bagels about a downtown bagel shop and a web page o_pizza about a pizzeria,
R_p = (t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)
R_n = (t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza) …
π_t(σ_{o_bagels}(R_p)) = tags which positively describe o_bagels = (t_bagels, t_shop, t_downtown)
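
The two operators can be tried out on the bagel/pizza example with a small sketch. It assumes relations are represented as Python sets of tuples; the helper names select and project are illustrative and do not come from the slides.

```python
# R_p from the slide's example, as a set of (tag, object) tuples.
R_p = {
    ("bagels", "o_bagels"),
    ("shop", "o_bagels"),
    ("downtown", "o_bagels"),
    ("pizza", "o_pizza"),
    ("pizzeria", "o_pizza"),
}


def select(relation, condition):
    """sigma_c: keep the tuples for which condition(tuple) is true."""
    return {row for row in relation if condition(row)}


def project(relation, *indices):
    """pi_p: keep only the attributes at the given tuple positions."""
    return {tuple(row[i] for i in indices) for row in relation}


# pi_t(sigma_{o_bagels}(R_p)): tags that positively describe o_bagels.
bagel_rows = select(R_p, lambda row: row[1] == "o_bagels")
bagel_tags = {tag for (tag,) in project(bagel_rows, 0)}
print(sorted(bagel_tags))  # ['bagels', 'downtown', 'shop']
```

Because the relations are sets, projection removes duplicate rows automatically, which matches the set semantics of relational algebra.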

Dataset
The base: Stanford Tag Crawl Dataset
– Gathered from del.icio.us
– Consists of 2,549,282 unique URLs with their posts
– Anchor text and link information for each URL
Experimental dataset construction
– Aiming to approximate R_p and R_n
– Assume that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p
  – The reverse is not true
– Filter the dataset by postcount(o_k) = |π_u(σ_{o_k}(R_a))|
  – Assume that as postcount(o_k) increases, R_p is increasingly well approximated by R_a
  – Filtering threshold = 100
– 62,000 URLs in the filtered set
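
The postcount filter can be illustrated with a minimal sketch on toy data. The triples below are made up, the function names are hypothetical, and the sketch assumes the threshold is inclusive (postcount >= threshold), which the slides do not state.

```python
from collections import defaultdict

# Hypothetical toy R_a: (tag, user, object) triples.
R_a = {
    ("python", "u1", "url_a"),
    ("code", "u1", "url_a"),
    ("python", "u2", "url_a"),
    ("news", "u3", "url_b"),
}


def postcount(annotations):
    """Distinct users per object, i.e. |pi_u(sigma_o(R_a))| for every o."""
    users = defaultdict(set)
    for _tag, user, obj in annotations:
        users[obj].add(user)
    return {obj: len(user_set) for obj, user_set in users.items()}


def approximate_rp(annotations, threshold):
    """Treat (tag, object) pairs of well-bookmarked objects as R_p.

    Assumption: an object passes the filter when postcount >= threshold.
    """
    counts = postcount(annotations)
    return {(t, o) for t, _u, o in annotations if counts[o] >= threshold}


counts = postcount(R_a)
# The slides use a threshold of 100 on the real data; 2 suits the toy data.
filtered = approximate_rp(R_a, threshold=2)
print(counts["url_a"], counts["url_b"])  # 2 1
print(sorted(filtered))  # [('code', 'url_a'), ('python', 'url_a')]
```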

Probability of Adding "New" Tags
Figure: Average new tags (in T_100) versus number of posts

Comparison between Popular Tags
Table: The top/bottom tags in T_100 to be added after the 100th bookmark. The top 15 tags are relatively ambiguous and personal.

Features for SVM
Page text features
– Bag of words
Anchor text
– Bag of words
– Text within 15 words of inlinks to the URL
– Use only URLs with at least 100 inlinks as examples
Surrounding hosts
– Hosts/domains of backlinks
– Hosts/domains of the URL
– Hosts/domains of forward links
For each feature type, the top 1000 features selected by mutual information are used
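
Feature selection by mutual information can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: each document is a set of words, each feature is the binary presence of a word, the label says whether the tag applies, and all names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_present, labels):
    """Mutual information (in bits) between a binary feature and a binary label."""
    n = len(labels)
    joint = Counter(zip(feature_present, labels))
    feature_marginal = Counter(feature_present)
    label_marginal = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        p_joint = count / n
        p_indep = (feature_marginal[f] / n) * (label_marginal[l] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def top_k_features(docs, labels, k):
    """Rank vocabulary words by mutual information with the label; keep k."""
    vocab = sorted(set().union(*docs))
    scored = [
        (mutual_information([word in doc for doc in docs], labels), word)
        for word in vocab
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _score, word in scored[:k]]


docs = [{"bagel", "shop"}, {"bagel", "fresh"}, {"pizza", "shop"}, {"pizza", "oven"}]
labels = [True, True, False, False]  # whether a hypothetical tag "bagels" applies
print(top_k_features(docs, labels, k=2))  # ['bagel', 'pizza']
```

Note that "pizza" scores as high as "bagel": mutual information rewards features whose absence predicts the tag just as much as features whose presence does.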

Experiment Setup
Binary tag classification by SVM for T_100
– SVMlight and SVMperf with a linear kernel
Data splits
– Full/Full: 11/16 of the positive and negative examples for training and the rest for testing
  – Evaluated by precision-recall break-even point (PRBEP) instead of accuracy
– 200/200: randomly select 200 positive and 200 negative examples for training and the same number for testing
  – Evaluated by accuracy
  – Provided as an imperfect indication of how predictable a tag is due to its "information content" rather than the distribution of examples in the system
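
For readers unfamiliar with PRBEP, the sketch below computes it from classifier scores. It relies on a standard observation: if the cutoff is placed after as many top-ranked documents as there are positives, precision and recall share the same denominator and are therefore equal. The function name and data are illustrative, and this does not reproduce how SVMperf is used in the experiments.

```python
def prbep(scores, labels):
    """Precision-recall break-even point from scores and 0/1 labels.

    Ranks documents by descending score and cuts the ranking after n_pos
    documents, where n_pos is the number of positives. At that cutoff
    precision = hits / n_pos and recall = hits / n_pos, so they coincide.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PRBEP is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = sum(label for _score, label in ranked[:n_pos])
    return hits / n_pos


# Hypothetical decision values for five test documents and their true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
value = prbep(scores, labels)  # 2 of the top 3 are positive, so 2/3
```

Unlike accuracy, this measure is not inflated by the large number of negative examples, which matters when most tags occur on a small fraction of documents.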

Order of Predictability
Predictability = PRBEP (Full/Full) + Prec@10% (Full/Full) + Accuracy (200/200)
Figure: Tags in T_100 in increasing order of predictability from left to right.

Discussions
What precision can we get at the PRBEP?
– 60% for page text, 58% for anchor text, and 51% for surrounding hosts
– Much better than chance, given that a majority of tags in T_100 occur on fewer than 15% of documents
What precision can we get with low recall?
– 90% for all features and 92.5% for page text in Prec@10% (Full/Full)
Which page information is best for predicting tags?
– Page text > anchor text > surrounding hosts

What makes a tag predictable? (1/2)
Entropy measure:

What makes a tag predictable? (2/2)
Figure: Tag popularity is positively correlated with PRBEP in the Full/Full split

Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only 1 or 2 bookmarks
– Recall for single tag queries will be low
The question: given a small number of tags, how much can we expand this set of tags in a high precision manner?
– Similar to market-basket data mining
  – A large set of items and a large set of baskets, each of which contains a small set of items
  – The goal is to find correlations between sets of items
– The baskets are URLs and the items are tags

Association Rules
For a rule X → Y:
Support: the number of baskets containing both X and Y
Confidence: P(Y|X) (How likely is Y given X?)
Interest: P(Y|X) - P(Y) (How much more common is X & Y than expected by chance?)
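
The three measures can be computed directly from a list of baskets, as in the sketch below. It assumes each basket is the set of tags on one URL and that a rule has a set X on the left and a single tag Y on the right; the function name and the toy baskets are hypothetical.

```python
def rule_measures(baskets, x, y):
    """Support, confidence, and interest of the rule X -> y.

    baskets: list of tag sets (one per URL)
    x: set of antecedent tags; y: a single consequent tag
    """
    x = frozenset(x)
    n = len(baskets)
    n_x = sum(1 for basket in baskets if x <= basket)
    n_y = sum(1 for basket in baskets if y in basket)
    n_xy = sum(1 for basket in baskets if x <= basket and y in basket)
    if n_x == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    support = n_xy                    # baskets containing both X and y
    confidence = n_xy / n_x           # P(y | X)
    interest = confidence - n_y / n   # P(y | X) - P(y)
    return support, confidence, interest


baskets = [
    {"linux", "ubuntu", "opensource"},
    {"linux", "ubuntu"},
    {"linux", "kernel"},
    {"recipes", "food"},
]
support, confidence, interest = rule_measures(baskets, {"ubuntu"}, "linux")
print(support, confidence, interest)  # 2 1.0 0.25
```

Here every URL tagged "ubuntu" is also tagged "linux", so confidence is 1.0, but since "linux" already appears on three of the four URLs, the interest is only 0.25.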

Found Association Rules
Observed relations: type-of, various forms, translations, etc.

Found Association Rules
Random sampling of the top 8000 rules of length 3 or less

Simulation of Tag Expansion
About 50,000 URLs for training and 10,000 URLs for testing
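
One way to picture such a simulation is sketched below. This is an assumption-laden illustration rather than the authors' procedure: rules are taken as already mined from the training URLs, each rule is a hypothetical (antecedent set, consequent tag, confidence) triple, and expansion is a single pass without chaining.

```python
def expand_tags(known_tags, rules, min_confidence):
    """Return tags suggested by rules whose antecedent is within known_tags.

    rules: iterable of (antecedent_set, consequent_tag, confidence) triples.
    Single pass only: newly added tags do not trigger further rules.
    """
    known = set(known_tags)
    return {
        consequent
        for antecedent, consequent, confidence in rules
        if confidence >= min_confidence
        and frozenset(antecedent) <= known
        and consequent not in known
    }


def expansion_precision(added, true_tags):
    """Fraction of added tags that appear in the URL's held-out tag set."""
    if not added:
        return None  # nothing was added, so precision is undefined
    return len(added & set(true_tags)) / len(added)


# Hypothetical rules and one hypothetical test URL.
rules = [
    ({"ubuntu"}, "linux", 0.95),
    ({"linux"}, "opensource", 0.60),
    ({"recipes"}, "food", 0.90),
]
added = expand_tags({"ubuntu"}, rules, min_confidence=0.9)
precision = expansion_precision(added, {"ubuntu", "linux", "kernel"})
print(added, precision)  # {'linux'} 1.0
```

Raising min_confidence trades the number of tags added for precision, which is the trade-off the expansion question is concerned with.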

Conclusions
Our tag prediction results suggest three insights:
– Many tags on the web do not contribute substantial additional information beyond page text, anchor text, and surrounding hosts.
– The predictability of a tag is negatively correlated with its entropy, when our classifiers are given balanced training data. When considering tags in their natural distributions, data sparsity issues tend to dominate.
– Association rules can increase recall on single tag queries. We found association rules linking languages, super/subconcepts, and other relationships.
