Tag-based Social Interest Discovery


1 Tag-based Social Interest Discovery
Xin Li, Lei Guo, Eric Zhao Yahoo! International Social Search

2 Internet Social Networks Are Emerging!
Internet social networks are self-organized by online users: Del.icio.us, Facebook, Flickr, MySpace, YouTube. Users are driven by their interests: they fetch and bookmark contents, create new contents, and share contents. Interest discovery is crucial to a social network: discover the interests of users in different contents, locate users with similar interests, and link people with similar interests to form communities.

3 Important Features of Social Networks
Organize users and contents: cluster users into communities and categorize contents into interesting topics. Provide search functions: given a topic, locate all matching contents and all users interested in the topic; given a user, locate all of his fetched/created contents and the topics of his interests; given a user, locate all other users with similar interests.

4 The Problem: Social Interest Discovery
Questions to answer: (1) How to discover a user’s interests based on his fetched/created contents? (2) How to use individual users’ interests to find interesting topics shared by users? (3) How to use the topics to create interest-based user communities?

5 Existing Solutions and Limitations
User-centric: use the social network graph to discover users with common interests. Problem: online/offline user connections are hard to identify. Object-centric: detect common interests based on the common objects fetched by users. Problem: the discovered interests are object-based, non-descriptive, and implicit. Predefined categorization: not flexible; it cannot catch the most recent popular or hot user interests, and it cannot reflect the various user interest groups, which may keep changing over time.

6 Our approach
Leverage user-generated tags: compute frequent co-occurrences of tag patterns, use the tag patterns as topics of interests, and cluster users and content around the topics to build communities.

7 Overview
Motivation and Problem; Analysis of tags in a social network; ISID system design; Evaluation; Conclusion

8 Tags in Social Networks
User-generated labels for annotating the contents: descriptive, summary, reflecting human judgment. Meta data between users and contents. Widely used in social networks: Del.icio.us, YouTube, Facebook.

9 del.icio.us Social Network
A pioneer social bookmark system. Our data set: a dump for a limited period of time, with 4.3 M public, tagged bookmarks, 0.2 M users, and 1.4 M bookmarked URLs.

10 URL Popularity Follows Power Law
The distribution of URL bookmarking frequency. Most URLs are unpopular.

11 User Activity Follows Heavy-tail
The distribution of user bookmarking frequency. Most users are less active.

12 Tags vs. Keywords
URL: http://ka1fsb.home.att.net/resolve.html
Top tf keywords: domain, name, file, resolver, server, conf, network, nameserver, ip, org, ampr
Top tf-idf keywords: ampr, domain, jnos, nameserver, conf, ka1fsb, resolver, ip, file, name, server
All tags: linux, howto, network, sysadmin, dns
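The difference between the tf and tf-idf rankings on this slide can be illustrated with a small sketch. The slides do not state the exact weighting that was used, so the formula below (raw term count times log(N/df)) is an assumption, and the toy corpus and the function name `top_keywords` are hypothetical.

```python
import math
from collections import Counter

def top_keywords(docs, doc_id, k=5, use_idf=True):
    """Rank the words of docs[doc_id] by tf or by tf-idf.

    Assumed weighting (not taken from the slides):
    tf = raw count in the document, idf = log(N / document frequency).
    """
    tf = Counter(docs[doc_id])
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in docs.values() for word in set(words))

    def score(word):
        idf = math.log(n_docs / df[word]) if use_idf else 1.0
        return tf[word] * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(tf, key=lambda w: (-score(w), w))[:k]

# Hypothetical toy corpus: document id -> list of words.
docs = {
    "a": ["domain", "name", "server", "domain", "ampr"],
    "b": ["domain", "name", "file"],
    "c": ["server", "name", "network"],
}

print(top_keywords(docs, "a", k=2, use_idf=False))
print(top_keywords(docs, "a", k=2, use_idf=True))
```

In this toy corpus the plain tf ranking is led by the most repeated word, while tf-idf promotes the word that occurs in only one document, which mirrors how rare terms such as ampr move to the top of the tf-idf list above.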

13 Tag Vocabulary
[Figures: tag coverage for tf keywords; tag coverage for tf-idf keywords.] User tags missed ≤ 20% of tf keywords for ≥ 98% of docs and ≤ 10% of tf-idf keywords for ≥ 90% of docs. Tags covered the most important keywords, but the total number of unique tags is ~10x smaller than that of keywords.

14 Tag Convergence
The total number of different tags users can use for a given document is limited, no matter how popular the URL is.

15 Tags Capture Concepts of Contents
Nearly 50% of all URLs have a tag match ratio of 1. 70% of all URLs have a tag match ratio > 0.5. Only 10% of the URLs have no matched tags.

16 From Tags to User Interests
Bookmarks reflect user interests. Tags summarize/describe bookmarked contents: they are meta data between users and contents, and they connect users and bookmarked contents. Frequently used tag patterns reflect user interests. The key is the co-occurrences of tags.

17 Overview
Motivation and Problem; Analysis of tags in a social network; ISID system design; Evaluation; Conclusion

18 System Design
Find topics of interests: for a given set of tagged bookmarks, find all topics of interests, i.e., frequent co-occurrences of tags. Clustering: for each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic. Indexing: import the topics and their user and URL clusters into an indexing system for application queries.

19 ISID Architecture
[Architecture diagram. Components: Data Source, Topic Discovery, Clustering, Indexing. Data labels: Posts; Topics, posts; Topics, Clusters.] Posts = (user, content, tags)

20 Topic Discovery
Use association rule algorithms to discover co-occurring tag patterns. These algorithms were invented for identifying frequently bought items in supermarkets, e.g., bread and milk. They use a support number to define the frequency threshold, and they are efficient in finding frequent patterns in a large set of transactions for a given support number (threshold). The rule-building part is not used. One more step: remove pattern A if A is a sub-pattern of some other pattern B and both A and B have the same support number. This removes duplicate clusters.
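A minimal sketch of this step, under stated assumptions: each post is a (user, content, tags) tuple as on the architecture slide, the support of a pattern is taken to be the number of posts that carry all of its tags, and the frequent patterns are found by brute-force enumeration rather than by the association rule algorithm the slides refer to (which they do not name). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_patterns(posts, min_support):
    """Count every tag subset of every post and keep the frequent ones.

    Brute force: exponential in the number of tags per post, so this is
    only suitable for small examples, not for a real data set.
    """
    counts = Counter()
    for _user, _url, tags in posts:
        unique = sorted(set(tags))
        for size in range(1, len(unique) + 1):
            for pattern in combinations(unique, size):
                counts[frozenset(pattern)] += 1
    return {p: s for p, s in counts.items() if s >= min_support}

def drop_redundant_subpatterns(patterns):
    """Remove pattern A if a proper super-pattern B has the same support."""
    return {
        a: support_a
        for a, support_a in patterns.items()
        if not any(a < b and support_a == support_b
                   for b, support_b in patterns.items())
    }

# Hypothetical posts in the (user, content, tags) form.
posts = [
    ("u1", "url1", ["linux", "howto", "dns"]),
    ("u2", "url1", ["linux", "dns"]),
    ("u1", "url2", ["linux", "howto"]),
    ("u3", "url3", ["python"]),
]

patterns = frequent_tag_patterns(posts, min_support=2)
topics = drop_redundant_subpatterns(patterns)
for topic, support in sorted(topics.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(topic), support)
```

With a support threshold of 2, the single tags howto and dns are dropped because they always occur together with linux at the same support, while linux alone is kept because its support (3) is higher than that of any super-pattern.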

21 Clustering
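The transcript preserves no text for this slide. As a rough sketch of the clustering step described under System Design, the code below assigns a post's URL and user to a topic whenever the post carries all of the topic's tags. This per-post reading of the rule is an assumption, and `cluster_posts` and the sample data are hypothetical.

```python
def cluster_posts(posts, topics):
    """Build a URL cluster and a user cluster for each topic.

    A post (user, url, tags) contributes to a topic when the topic's
    tag set is a subset of the post's tags (assumed interpretation).
    """
    clusters = {topic: {"urls": set(), "users": set()} for topic in topics}
    for user, url, tags in posts:
        tag_set = set(tags)
        for topic in topics:
            if topic <= tag_set:
                clusters[topic]["urls"].add(url)
                clusters[topic]["users"].add(user)
    return clusters

# Hypothetical posts and topics (topics are frozensets of tags).
posts = [
    ("u1", "url1", ["linux", "howto", "dns"]),
    ("u2", "url1", ["linux", "dns"]),
    ("u1", "url2", ["linux", "howto"]),
]
topics = [frozenset({"linux"}), frozenset({"linux", "dns"})]

clusters = cluster_posts(posts, topics)
for topic in topics:
    members = clusters[topic]
    print(sorted(topic), sorted(members["urls"]), sorted(members["users"]))
```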

22 Indexing
Find all URLs that contain a topic, i.e., that are tagged with the same set of tags. Find all users interested in a topic. Find all topics containing a tag. Find all topics for a user. Find all topics for a URL. Combinations of the above.
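The queries listed above could be served by a few inverted indexes over the topic clusters. The slides do not describe the actual indexing system, so the in-memory dictionaries below are only an illustrative assumption, and `build_indexes` and the sample clusters are hypothetical.

```python
from collections import defaultdict

def build_indexes(clusters):
    """Build tag -> topics, user -> topics, and URL -> topics lookups."""
    by_tag = defaultdict(set)
    by_user = defaultdict(set)
    by_url = defaultdict(set)
    for topic, members in clusters.items():
        for tag in topic:
            by_tag[tag].add(topic)
        for user in members["users"]:
            by_user[user].add(topic)
        for url in members["urls"]:
            by_url[url].add(topic)
    return by_tag, by_user, by_url

# Hypothetical clusters: topic (frozenset of tags) -> its URLs and users.
clusters = {
    frozenset({"linux"}): {"urls": {"url1", "url2"}, "users": {"u1", "u2"}},
    frozenset({"linux", "dns"}): {"urls": {"url1"}, "users": {"u1", "u2"}},
    frozenset({"python"}): {"urls": {"url3"}, "users": {"u3"}},
}

by_tag, by_user, by_url = build_indexes(clusters)

# Topic -> URLs and topic -> users come directly from the cluster itself.
print(sorted(clusters[frozenset({"linux", "dns"})]["urls"]))
# A combined query: topics of user u1 that contain the tag "dns".
print([sorted(t) for t in by_user["u1"] & by_tag["dns"]])
```

Combined queries reduce to set intersections of the individual lookups, which is one simple way to support the "combinations of the above" case.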

23 Overview
Motivation and Problem; Analysis of tags in a social network; ISID system design; Evaluation; Conclusion

24 Content Similarity of Topic Clusters
Similarity of two documents: inner product of tf-idf document vectors, computed with a keyword-based vector or, for comparison, a tag-based vector. Intra-topic similarity: average cosine similarity of all document pairs within a topic. Inter-topic similarity: similarity of two topics, and the average similarity of one topic to all other topics.
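The measures above can be sketched as follows. Documents are represented as sparse tf-idf vectors (dictionaries from term to weight). Intra-topic similarity follows the slide: the average cosine similarity over all document pairs in a topic. The slide does not spell out how the similarity of two topics is computed, so averaging over all cross-topic document pairs is an assumption, and all function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def intra_topic_similarity(docs):
    """Average cosine similarity over all document pairs in one topic."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0  # fewer than two documents: no pair to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def inter_topic_similarity(docs_a, docs_b):
    """Average cosine similarity over all cross-topic pairs (assumed definition)."""
    pairs = [(a, b) for a in docs_a for b in docs_b]
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical tf-idf vectors for two small topics.
topic_dns = [{"dns": 1.0}, {"dns": 1.0, "resolver": 1.0}]
topic_web = [{"css": 1.0}, {"css": 2.0}]

print(intra_topic_similarity(topic_web))
print(inter_topic_similarity(topic_dns, topic_web))
```

The same functions apply to keyword-based and tag-based vectors alike; only the terms that populate the dictionaries differ.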

25 Inter- and Intra- Topic Similarity
[Figures: keyword-based (tf-idf) similarity; tag-based (tf-idf) similarity.] Intra-topic similarity is significantly higher than inter-topic similarity. Tag co-occurrence can cluster similar content well. Tag-based similarity is quite close to keyword-based similarity.

26 Inter-topic Similarity
Similarity of two topics with different numbers of overlapping tags. [Figures: keyword-based (tf-idf); tag-based (tf-idf).] Inter-topic similarity increases with the number of co-occurring tags. Co-occurrences of tags capture similar contents.

27 User Interest Coverage
90% of users have ≥ 90% of their top 5 tags covered. 87% of users have ≥ 90% of their top 10 tags covered. 90% of users have ≥ 80% of their tags covered. The topics discovered by ISID capture the interests of users.
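The slide does not define when a tag counts as covered. The sketch below assumes that a user's tag is covered if it appears in at least one discovered topic, and it measures the covered fraction of the user's k most frequently used tags. Both the definition and the name `top_tag_coverage` are assumptions made for illustration.

```python
from collections import Counter

def top_tag_coverage(user_tags, topics, k):
    """Fraction of a user's k most used tags that occur in some topic.

    Assumed definition of coverage; the slides do not state one.
    """
    covered_vocabulary = set().union(*topics)
    top_tags = [tag for tag, _count in Counter(user_tags).most_common(k)]
    if not top_tags:
        return 0.0
    covered = sum(1 for tag in top_tags if tag in covered_vocabulary)
    return covered / len(top_tags)

# Hypothetical data: every tag one user has applied, and the topics found.
user_tags = ["linux", "linux", "linux", "dns", "dns", "cooking"]
topics = [frozenset({"linux", "dns"}), frozenset({"python"})]

print(top_tag_coverage(user_tags, topics, k=2))
print(top_tag_coverage(user_tags, topics, k=3))
```

Repeating this per user and counting the users whose coverage meets a threshold would yield statistics of the same shape as the percentages reported on this slide.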

28 Human Reviews
Scores: 1, Highly unrelated; 2, Unrelated; 3, Not sure; 5, Highly related. According to human judgment, ISID indeed groups related URLs into clusters for each topic defined by user tags.

29 Cluster Properties
Cluster size follows a power law, so user interests follow a power law. There exist really hot topics!

30 Cluster Properties
Most topics have fewer than 6 tags. Beyond 6, the number of clusters quickly drops.

31 Overview
Motivation and Problem; Data and Their Properties; ISID system; Evaluation; Conclusion

32 Conclusion
Tags reflect human judgments on contents. Co-occurring tags are effective in representing user interests: they reflect human understanding of different but similar web contents, and the consensus of judgments among users. ISID system: topic discovery, clustering, indexing. Evaluation results are promising.

