«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,


1 «Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc., CA Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

2 Paper Outline: Introduction; Previous work; Data collection and pre-processing; Tag analysis; System architecture; Evaluation; Conclusions and future work

3 Introduction Problem statement: discover common interests shared by users in a social network system (the best-known commercial systems of this kind are http://del.icio.us/, http://www.facebook.com/, http://www.myspace.com/ and http://www.youtube.com/). Two approaches: user-centric (analyzing online user connections) and object-centric (analyzing the objects transferred, also offline). The paper's approach: concentrate on user-defined tags, examining tag-URL pairs.

4 Why study tags: 4 key observations (1) The tag vocabulary is rich and large enough. (2) For each URL, the number of unique tags associated with it is smaller than the number of keywords in the referenced web page. (3) The same URL may carry different tags; the tag and keyword vectors are, however, quite similar. (4) Tags carry the variation of human judgement and can therefore help identify social interests concisely and at a finer granularity.

5 Previously … User-centric approach: relations form online (e.g. through blogging) and are difficult (non-trivial) to extract. Object-centric approach: locates common objects that different users share through the network, but the objects are non-descriptive and implicit to users. Tagging techniques have already been used in social networks and blogs (often under the name "collaborative tagging"). Tag frequency in such networks has also been shown to obey a power law. The novel idea here is to analyze the co-occurrence of multiple tags instead of single ones.

6 Data collection/pre-processing Source: a partial dump of del.icio.us database activity. All non-HTML and non-English objects were discarded and pages were encoded to UTF-8. Pages were then filtered for stopwords (producing keywords). Tags and keywords were then normalized with the Porter stemming algorithm. Tag vocabulary size: ~300,000. Keyword vocabulary size: ~4,000,000.
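The keyword-extraction steps on this slide can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for the Porter algorithm, which has many more rules.

```python
import re

# Tiny illustrative stopword list (assumption); the list used in the paper
# is not given in the slides.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def crude_stem(word):
    """Strip a few common English suffixes. This is NOT the Porter algorithm,
    only a placeholder for the normalization step."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(page_text):
    """Lowercase, tokenize, drop stopwords, then normalize each token."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

keywords = extract_keywords("Linking to the web: links and linked tags")
print(keywords)
```

The same normalization would be applied to tags, so that a tag and a page keyword with a common stem end up as the same term.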

7 Distribution of data The distribution of tags (Zipfian) is fundamentally different from that of customers in online shopping systems.

8 Tag analysis (1), VSM The table suggests that user-generated tags describe content at a higher level of abstraction than keywords (an initial observation) and are therefore also well suited to representing web page content. 1. The Vector Space Model is used for tf and idf calculation. 2. Each URL is represented by two vectors, one in tag space and the other in keyword space.
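A minimal sketch of the two-vector representation, assuming the common weighting tf × log(N / df); the slides do not say which tf-idf variant the paper uses. The function names and the sample URLs, tags and keywords are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps URL -> list of terms; returns URL -> {term: tf-idf weight}."""
    n_docs = len(docs)
    # Document frequency: the number of URLs in which each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for url, terms in docs.items():
        tf = Counter(terms)
        # A term that occurs for every URL gets weight log(1) = 0.
        vectors[url] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical data: tags and page keywords for three URLs.
tags = {"u1": ["python", "code", "python"], "u2": ["python", "news"], "u3": ["news", "media"]}
keywords = {"u1": ["python", "function", "code"], "u2": ["python", "release"], "u3": ["media", "press"]}

tag_vecs = tfidf_vectors(tags)          # one vector per URL in tag space
keyword_vecs = tfidf_vectors(keywords)  # one vector per URL in keyword space
print(cosine(tag_vecs["u1"], tag_vecs["u2"]))
```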

9 Tag analysis (2), statistical estimators The tag vocabulary covers up to 90% of URL keywords (satisfactory). In the opposite direction, tag matching by URL is almost complete. The total number of tags that users generate for a given page is limited, no matter how popular the page is. When multiple tags are used together, they define a topic of interest. This topic corresponds to a virtual community of users (who may have no physical or online connection in the real world).

10 Proposed software architecture Post stream: p = (user, URL, tags), with (user, URL) as the key
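One way to picture the post stream is as a sequence of records stored under the (user, URL) key. The sketch below assumes that a later post with the same key replaces the earlier one; the slides do not state how repeated posts are handled, and the names `Post` and `collect_posts` are hypothetical.

```python
from typing import FrozenSet, NamedTuple

class Post(NamedTuple):
    user: str
    url: str
    tags: FrozenSet[str]

def collect_posts(stream):
    """Store each post under its (user, URL) key.
    Assumption: a later post with the same key replaces the earlier one."""
    posts = {}
    for post in stream:
        posts[(post.user, post.url)] = post
    return posts

stream = [
    Post("alice", "http://example.org/a", frozenset({"python", "code"})),
    Post("bob", "http://example.org/a", frozenset({"python"})),
    Post("alice", "http://example.org/a", frozenset({"python", "tutorial"})),
]
posts = collect_posts(stream)
print(len(posts))
```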

11 Topic Discovery (1) Problem: find a set of frequent tag patterns within a given set of posts (well studied in other domains, e.g. supermarket transactions). Solution: classical association rule learning algorithms (e.g. Apriori). Another approach: probabilistic learning with the EM algorithm (A. Plangprasopchok, K. Lerman, AAAI 2007).
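As an illustration of the frequent-pattern step, here is a small, unoptimized Apriori-style sketch over tag sets. It is not the paper's implementation; the function name, the sample posts and the support threshold are assumptions chosen for the example.

```python
from itertools import combinations

def frequent_tag_sets(posts, min_support):
    """posts: list of tag sets. Returns {frozenset of tags: support count}
    for every tag set contained in at least min_support posts."""
    posts = [frozenset(p) for p in posts]
    # Level 1: count single tags.
    counts = {}
    for p in posts:
        for tag in p:
            key = frozenset([tag])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-sets that contain exactly k tags.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count how many posts contain each surviving candidate.
        counts = {c: sum(1 for p in posts if c <= p) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

posts = [{"python", "code"}, {"python", "code", "web"}, {"python", "web"}, {"news"}]
topics = frequent_tag_sets(posts, min_support=2)
print(topics)
```

With a support threshold of 2, the pair {code, web} is dropped because it appears in only one post, and no three-tag set survives the prune step.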

12 Clustering (naive approach) (2) Step 6 is computationally intensive. A prefix tree implementation over the merged topics can reduce complexity

13 Indexing (3) Kinds of queries executed by the system: (1) for a given topic, list a) all URLs that contain this topic and b) all users that are interested in this topic; (2) for given tags, list all topics containing the tags; (3) for a given URL, list all topics this URL belongs to; (4) for a given URL and topic, list all appropriate users.
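The query types above map naturally onto a handful of in-memory inverted indexes. The sketch below assumes that a post belongs to a topic when it carries every tag of that topic; this rule, the function names and the sample data are illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict

def build_indexes(posts, topics):
    """posts: iterable of (user, url, tags); topics: iterable of tag sets.
    Assumption: a post matches a topic when it has all of the topic's tags."""
    topics = [frozenset(t) for t in topics]
    idx = {
        "topic_urls": defaultdict(set),       # topic -> URLs
        "topic_users": defaultdict(set),      # topic -> users
        "tag_topics": defaultdict(set),       # tag -> topics containing it
        "url_topics": defaultdict(set),       # URL -> topics
        "url_topic_users": defaultdict(set),  # (URL, topic) -> users
    }
    for topic in topics:
        for tag in topic:
            idx["tag_topics"][tag].add(topic)
    for user, url, tags in posts:
        tags = frozenset(tags)
        for topic in topics:
            if topic <= tags:
                idx["topic_urls"][topic].add(url)
                idx["topic_users"][topic].add(user)
                idx["url_topics"][url].add(topic)
                idx["url_topic_users"][(url, topic)].add(user)
    return idx

def topics_with_tags(idx, tags):
    """All topics that contain every one of the given tags."""
    sets = [idx["tag_topics"].get(t, set()) for t in tags]
    return set.intersection(*sets) if sets else set()

code_topic = frozenset({"python", "code"})
web_topic = frozenset({"python", "web"})
posts = [
    ("alice", "http://a.example", {"python", "code"}),
    ("bob", "http://a.example", {"python", "code", "web"}),
    ("carol", "http://b.example", {"python", "web"}),
]
idx = build_indexes(posts, [code_topic, web_topic])
print(idx["topic_users"][code_topic])
```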

14 Evaluation (1) Metrics: compare intra-topic with inter-topic similarity (cosine) to see how well the clusters are formed. Tag-based topic clustering and similarity computation is simple, accurate and computationally cost-effective, because the dimension of the term vector space is significantly reduced. Topic clustering is also accurate because it is based on multiple co-occurring tags.
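One plausible reading of the intra- versus inter-topic comparison is to average pairwise cosine similarities within a cluster and across two clusters. The slides do not give the exact definition used in the paper, so the averaging scheme, the function names and the sample vectors below are assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def intra_similarity(cluster):
    """Mean cosine similarity over all pairs of vectors inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def inter_similarity(cluster_a, cluster_b):
    """Mean cosine similarity over all pairs drawn from two different clusters."""
    if not cluster_a or not cluster_b:
        return 0.0
    total = sum(cosine(u, v) for u in cluster_a for v in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Hypothetical URL vectors grouped into two topic clusters.
programming = [{"python": 1.0, "code": 1.0}, {"python": 1.0, "code": 2.0}]
media = [{"news": 1.0, "media": 1.0}]

print(intra_similarity(programming), inter_similarity(programming, media))
```

Well-formed clusters should show a high intra-topic value and a low inter-topic value, as in this toy example where the two clusters share no terms.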

15 Evaluation (2) The discovered topics capture almost 90% of users' interests. To evaluate the quality of the URL clusters, a review by 4 human editors was conducted. Cluster sizes follow a power-law distribution (a few hot topics on the Internet attract a large number of users). Each topic usually contains no more than 5 tags.

16 Conclusions The paper justifies the use of tags as more appropriate for representing user interest. No information on the online or offline social connections among users was necessary. The paper provides an inside view of document semantics (by comparing tags and keywords). The paper presents extensive computational (statistical) and graphical results, and can easily be characterized as a complete report.

17 Any questions? Thank you for your attention!

