«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,

Presentation transcript:

«Tag-based Social Interest Discovery»
Proceedings of the 17th International World Wide Web Conference (WWW2008)
Xin Li, Lei Guo, Yihong Zhao, Yahoo! Inc., CA
Paper presentation: Konstantinos Zacharis, Dept. of Comp. & Comm. Engineering, UTH

Paper Outline
– Introduction
– Previous work
– Data collection and pre-processing
– Tag analysis
– System architecture
– Evaluation
– Conclusions and future work

Introduction
– Problem statement: discover common interests shared by users in a social network system [1]
– Two approaches: user-centric (analyzing online user connections) and object-centric (analyzing the objects transferred, also offline)
– Paper's approach: concentrate on user-defined tags, examining (tag, URL) pairs
[1] The best-known commercial systems of this kind are:

Why study tags: 4 key observations
– The tag vocabulary is rich and large enough
– For each URL, the number of unique tags associated with it is smaller than the number of keywords in the referenced web page
– The same URL may receive different tags; the tag and keyword vectors are nevertheless quite similar
– Tags carry the variation of human judgement and can therefore help identify social interests concisely and at a finer granularity

Previously …
– User-centric approach: relations form online (e.g. through blogging) and are non-trivial to extract
– Object-centric approach: locate common objects that different users share through the network; however, objects are non-descriptive and implicit to users
– Tagging techniques have already been used in social networks and blogs (often under the name "collaborative tagging"). Tag frequency in such networks has also been shown to obey a power law
– The novel idea here is to analyze the co-occurrence of multiple tags instead of single tags

Data collection/pre-processing
– Partial dump of del.icio.us database activity
– All non-HTML and non-English objects discarded; pages encoded to UTF-8
– Pages then filtered for stopwords (producing keywords)
– Tags and keywords then normalized with the Porter stemming algorithm
– Tag vocabulary size: ~300,000
– Keyword vocabulary size: ~4,000,000

Distribution of data
– The distribution of tags (Zipfian) differs fundamentally from that of customers in online shopping systems

Tag analysis (1), VSM
– The table shows intuitively that user-generated tags give a higher-level abstraction of the content (initial observation) and are therefore also well suited to representing web page content
1. The Vector Space Model is used for the tf and idf calculation
2. Each URL is represented by two vectors, one in tag space and the other in keyword space
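
The two-vector representation can be illustrated with a small sketch. The slides do not state which tf-idf weighting the paper uses, so the variant below (raw term count times log(N/df)) is an assumption, and the URLs, tags and keywords are made-up illustration data.

```python
import math
from collections import Counter

def tf_idf_vectors(terms_by_url):
    """Build one sparse tf-idf vector (dict term -> weight) per URL.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    n_docs = len(terms_by_url)
    doc_freq = Counter()
    for terms in terms_by_url.values():
        doc_freq.update(set(terms))  # count each term once per URL
    vectors = {}
    for url, terms in terms_by_url.items():
        tf = Counter(terms)
        vectors[url] = {t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical data: tags assigned by users and keywords extracted from pages.
tags_by_url = {
    "url_a": ["python", "programming", "python"],
    "url_b": ["python", "tutorial"],
    "url_c": ["recipes", "cooking"],
}
keywords_by_url = {
    "url_a": ["python", "language", "programming", "code", "python"],
    "url_b": ["python", "tutorial", "beginner", "code"],
    "url_c": ["recipes", "cooking", "kitchen", "food"],
}

tag_vectors = tf_idf_vectors(tags_by_url)          # URL in tag space
keyword_vectors = tf_idf_vectors(keywords_by_url)  # URL in keyword space
```

With vectors in both spaces, the similarity of two URLs can be computed in either space with `cosine` and the results compared.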

Tag analysis (2), statistical estimators
– Tag vocabulary coverage is up to 90% of URL keywords (satisfactory)
– In the opposite direction, tag matching by URL is almost complete
– The total number of tags that users generate for a given page is limited, no matter how popular the page is
– When multiple tags are used together, they define a topic of interest. This topic corresponds to a virtual community of users (they may have no physical or online connection in the real world)

Proposed software architecture
– Post stream: p = (user, URL, tags), where (user, URL) is the key
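
A post record and its key can be sketched as follows. The names `Post` and `index_posts` are hypothetical, and the rule that a later post with the same (user, URL) key replaces the earlier one is an assumption, since the slides only say that the pair is the key.

```python
from typing import FrozenSet, NamedTuple

class Post(NamedTuple):
    """One element of the post stream: p = (user, URL, tags)."""
    user: str
    url: str
    tags: FrozenSet[str]

def index_posts(stream):
    """Store posts under their (user, URL) key.

    Assumption: a later post with the same key overwrites the earlier one.
    """
    posts = {}
    for post in stream:
        posts[(post.user, post.url)] = post
    return posts

# Hypothetical stream: alice re-tags the same URL, bob tags another one.
stream = [
    Post("alice", "url_a", frozenset({"python"})),
    Post("bob", "url_b", frozenset({"cooking", "recipes"})),
    Post("alice", "url_a", frozenset({"python", "programming"})),
]
posts = index_posts(stream)
```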

Topic Discovery (1)
– Problem: find the set of frequent tag patterns within a given set of posts (well studied in other domains, e.g. supermarket transactions)
– Solution: classical association rule learning algorithms (e.g. Apriori)
– Another approach: probabilistic learning with the EM algorithm (A. Plangprasopchok, K. Lerman - AAAI 2007)
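
To make the frequent-pattern step concrete, here is a minimal Apriori-style sketch over tag sets. It is a textbook level-wise search written for illustration, not the paper's implementation; the function name, the support threshold and the example posts are all assumptions.

```python
from itertools import combinations

def frequent_tag_sets(posts, min_support):
    """Return {tag set: number of posts containing it} for all tag sets
    that occur in at least min_support posts (level-wise Apriori search)."""
    posts = [frozenset(p) for p in posts]

    # Level 1: count single tags.
    counts = {}
    for post in posts:
        for tag in post:
            key = frozenset([tag])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-sets into k-set candidates and prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count support of the surviving candidates.
        counts = {c: sum(1 for post in posts if c <= post) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical posts, each reduced to its set of tags.
example_posts = [
    {"python", "programming"},
    {"python", "programming", "tutorial"},
    {"python", "tutorial"},
    {"cooking", "recipes"},
]
topics = frequent_tag_sets(example_posts, min_support=2)
```

In this toy run the pairs {python, programming} and {python, tutorial} each reach the threshold, while {programming, tutorial} occurs only once and is dropped, so no three-tag pattern is generated.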

Clustering (naive approach) (2)
– Step 6 is computationally intensive
– A prefix tree implementation over the merged topics can reduce complexity

Indexing (3)
Kinds of queries executed by the system:
– For a given topic, a) list all URLs that contain this topic and b) list all users that are interested in this topic
– For given tags, list all topics containing the tags
– For a given URL, list all topics this URL belongs to
– For a given URL and topic, list all appropriate users
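
The four query types above could be served by a handful of lookup tables. The sketch below is one possible layout, not the system's actual index; the class `TopicIndex`, its method names and the sample data are hypothetical.

```python
from collections import defaultdict

class TopicIndex:
    """In-memory lookup tables for the four query types (illustrative only)."""

    def __init__(self):
        self.urls_by_topic = defaultdict(set)
        self.users_by_topic = defaultdict(set)
        self.topics_by_tag = defaultdict(set)
        self.topics_by_url = defaultdict(set)
        self.users_by_url_topic = defaultdict(set)

    def add(self, topic, user, url):
        """Record that `user` saved `url` with all tags of `topic`."""
        topic = frozenset(topic)
        self.urls_by_topic[topic].add(url)
        self.users_by_topic[topic].add(user)
        self.topics_by_url[url].add(topic)
        self.users_by_url_topic[(url, topic)].add(user)
        for tag in topic:
            self.topics_by_tag[tag].add(topic)

    def urls_for(self, topic):
        return set(self.urls_by_topic.get(frozenset(topic), set()))

    def users_for(self, topic):
        return set(self.users_by_topic.get(frozenset(topic), set()))

    def topics_containing(self, tags):
        """Topics that contain every one of the given tags."""
        tags = list(tags)
        if not tags:
            return set()
        return set.intersection(*(self.topics_by_tag.get(t, set()) for t in tags))

    def topics_of(self, url):
        return set(self.topics_by_url.get(url, set()))

    def users_for_url_topic(self, url, topic):
        return set(self.users_by_url_topic.get((url, frozenset(topic)), set()))

# Hypothetical content.
index = TopicIndex()
index.add({"python", "programming"}, "alice", "url_a")
index.add({"python", "programming"}, "bob", "url_b")
index.add({"python", "tutorial"}, "carol", "url_b")
```

Keeping one table per query type trades memory for constant-time lookups; a production system would likely persist these tables rather than hold them in memory.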

Evaluation (1)
– Metric: compare intra-topic with inter-topic similarity (cosine) to see how well the clusters are formed
– Tag-based topic clustering and similarity computation is simple, accurate and computationally cost-effective, because the dimension of the term vector space is significantly reduced
– Topic clustering is also accurate because it is based on multiple co-occurring tags
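
The intra- versus inter-topic comparison can be sketched as below. The slides do not say how the similarities are aggregated, so averaging over all pairs is an assumption, and the clusters and vector weights are made-up illustration data.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def intra_similarity(cluster):
    """Average cosine over all pairs of URL vectors inside one topic cluster."""
    return mean(cosine(u, v) for u, v in combinations(cluster, 2))

def inter_similarity(cluster_a, cluster_b):
    """Average cosine over all pairs drawn from two different topic clusters."""
    return mean(cosine(u, v) for u, v in product(cluster_a, cluster_b))

# Hypothetical clusters: each URL is a sparse term vector.
python_topic = [{"python": 1.0, "code": 1.0}, {"python": 1.0, "tutorial": 1.0}]
cooking_topic = [{"recipes": 1.0, "cooking": 1.0}, {"cooking": 1.0, "food": 1.0}]

intra = intra_similarity(python_topic)
inter = inter_similarity(python_topic, cooking_topic)
```

Well-formed clusters should show a clearly higher intra-topic value than inter-topic value, as in this toy case where the two topics share no terms.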

Evaluation (2)
– The topics discovered capture almost 90% of users' interests
– To evaluate the quality of the URL clusters, a review by 4 human editors was conducted
– Cluster sizes follow a power-law distribution (a few hot topics on the internet capture a large number of users)
– Each topic usually contains no more than 5 tags

Conclusions
– The paper justifies the use of tags as more appropriate for representing user interest
– No information on the online or offline social connections among users was necessary
– The paper provides an inside view of document semantics (by comparing tags and keywords)
– The paper demonstrates extensive computational (statistical) and graphical properties, and can easily be characterized as a complete report

Any questions? Thank you for your attention!