Neighborhood-based Tag Prediction


Neighborhood-based Tag Prediction
Adriana Budura (adriana.budura@epfl.ch)
Joint work with: Sebastian Michel, Philippe Cudré-Mauroux, Karl Aberer
Adriana Budura, “Neighborhood-based Tag Prediction”, ESWC’09

Outline
- Motivation
- Principles of Tag Propagation
- Scoring Model
- Top-k Tag Inference
- Experimental Results
- Conclusions

Motivation
- Tagging portals (« Web 2.0 »): users attach keywords (tags) to resources, e.g. flickr, del.icio.us, citeulike, …
- Tags are unstructured textual information that reflects the meaning of resources for users -> a powerful tool to improve search
- BUT: we need many tags, and users are lazy
- Therefore: Automatic Tag Inference

Neighborhood-based Tag Prediction
- IDEA: copy tags from other resources
- Semantically related resources -> related tags
- How to discover semantically similar resources?
  - Resources are connected via links (e.g., HTML, citations)
  - The neighborhood of a resource captures its context (e.g., citations in "Related Work")
  - Propagate tags along the edges of the graph
- How relevant is a tag found in the neighborhood?

Computational Model
3 concepts:
- Documents: the resources for which we infer tags; uniquely identifiable. In our scenario: scientific publications, Web pages
- Tags: keywords attached to the resources
- Document neighborhoods: documents connected by users -> graph

How relevant is a tag found in the neighborhood?
- Tag Distance: the neighborhood defines the context (far away -> less related)
- Tag Occurrence: there must be enough support in the neighborhood
- Tag Co-Occurrence: some tags are more likely to occur together
- Document-Document Similarity: similar documents are likely to share the same tags

Principles of Tag Propagation
[Figure: example citation graph of publications around d_init. The documents carry tags such as TopK, distributed, IR, ranking, PageRank and P2P, and the graph is annotated with the four principles: Tag Occurrence, Doc-Doc Similarity, Tag Co-Occurrence, Tag Distance]


(1) Tag Co-Occurrence
- How relevant is a tag t for d_init, based on the tags already assigned to d_init?
- Measured as the conditional probability of t given a tag already assigned to d_init
- d_init can have more than one initial tag => we aggregate over the set of tags T(d_init)
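
The slide's formula did not survive in the transcript, so the following is only a minimal sketch of one way to estimate such a conditional probability from tag counts. The function names, the toy data, and in particular the choice of the mean as the aggregate over T(d_init) are assumptions made for illustration, not the authors' definition.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probabilities(tagged_docs):
    """Estimate P(t | t') as (#docs tagged with both) / (#docs tagged with t')."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_docs:
        unique = set(tags)
        tag_count.update(unique)
        # Ordered pairs (t, t'): t co-occurs with the given tag t'.
        pair_count.update(permutations(unique, 2))
    return {(t, given): n / tag_count[given]
            for (t, given), n in pair_count.items()}


def cooccurrence_score(t, initial_tags, cond_prob):
    """Aggregate P(t | t') over the initial tags (mean: an assumed choice)."""
    if not initial_tags:
        return 0.0
    total = sum(cond_prob.get((t, given), 0.0) for given in initial_tags)
    return total / len(initial_tags)


# Hypothetical toy corpus: each set holds the tags of one document.
docs = [
    {"p2p", "search"},
    {"p2p", "tagging"},
    {"search", "tagging"},
    {"p2p", "search", "top-k"},
]
probs = cooccurrence_probabilities(docs)
score = cooccurrence_score("top-k", {"p2p", "search"}, probs)
```

Here `probs[("search", "p2p")]` is the estimated probability of seeing "search" on a document that already carries "p2p".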

(2) Doc-Doc Similarity
- How relevant is a tag t (coming from a document d) for d_init, based on the similarity between d and d_init?
- Vector space model
- For documents that are several hops away, we aggregate
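
As an illustration of the vector space model mentioned here, the sketch below computes cosine similarity over raw term-frequency vectors. The term weighting and the aggregation over several hops are not given in the transcript; multiplying the similarities along a path is an assumption, as are all names used.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under raw term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def path_similarity(path_texts):
    """Aggregate over a multi-hop path as a product of per-hop similarities.

    The product is an assumed aggregation, not taken from the slides.
    """
    sim = 1.0
    for a, b in zip(path_texts, path_texts[1:]):
        sim *= cosine_similarity(a, b)
    return sim


one_hop = cosine_similarity("tag prediction", "tag inference")
two_hops = path_similarity(["tag prediction", "tag inference", "tag inference"])
```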

(3) Tag Distance / (4) Tag Occurrence
Tag Distance
- The distance between the document d_init and a document d with tag t -> shortest path
Tag Occurrence
- What is the popularity (support) of a tag in the neighborhood?
- Expressed as a sum over all scores for a tag t
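
A minimal sketch of both notions on an unweighted document graph: breadth-first search yields the shortest hop distance to a document carrying a tag, and a simple count of such documents stands in for the tag's support. The adjacency-dict representation, the hop limit `max_hops`, and the plain count (instead of a sum of scores) are assumptions for illustration.

```python
from collections import deque


def hop_distances(graph, start, max_hops):
    """Breadth-first search: hop distance from start to every document within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the neighborhood radius
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def tag_distance(graph, doc_tags, start, tag, max_hops):
    """Shortest hop distance from start to a document carrying tag, or None."""
    dist = hop_distances(graph, start, max_hops)
    hits = [d for doc, d in dist.items()
            if doc != start and tag in doc_tags.get(doc, ())]
    return min(hits) if hits else None


def tag_support(graph, doc_tags, start, tag, max_hops):
    """Number of neighborhood documents carrying tag (a simple support count)."""
    dist = hop_distances(graph, start, max_hops)
    return sum(1 for doc in dist
               if doc != start and tag in doc_tags.get(doc, ()))


# Hypothetical toy graph: d_init links to a and b, and a links to c.
graph = {"d_init": ["a", "b"], "a": ["c"]}
doc_tags = {"a": {"p2p"}, "b": {"ir"}, "c": {"ir", "ranking"}}
```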

Putting it All Together
Combined scoring function:
- Sum of partial scores for each occurrence of a tag t in the neighborhood of d_init
[Formula not preserved; its parts are labeled: Tag Occurrence; Doc-Doc Similarity, Tag Distance; Tag Co-Occurrence]
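
The combined formula itself is not recoverable from the transcript. The sketch below only illustrates the stated structure, a sum of partial scores over all occurrences of a tag, under an assumed partial score of co-occurrence times similarity divided by hop distance. That product form and all names and numbers are hypothetical.

```python
def combined_score(occurrences, cooccurrence):
    """Sum partial scores over all occurrences of one tag in the neighborhood.

    occurrences: list of (similarity, distance) pairs, one per neighboring
    document that carries the tag; distance is the hop count (>= 1).
    cooccurrence: co-occurrence relevance of the tag for d_init.
    The partial score cooccurrence * similarity / distance is an assumption.
    """
    return sum(cooccurrence * similarity / distance
               for similarity, distance in occurrences)


def rank_tags(tag_occurrences, cooccurrence):
    """Score every candidate tag and sort by descending combined score."""
    scores = {tag: combined_score(occ, cooccurrence.get(tag, 0.0))
              for tag, occ in tag_occurrences.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: "p2p" occurs twice in the neighborhood, "ranking" once.
tag_occurrences = {
    "p2p": [(0.8, 1), (0.5, 2)],
    "ranking": [(0.9, 1)],
}
cooccurrence = {"p2p": 0.5, "ranking": 0.25}
ranking = rank_tags(tag_occurrences, cooccurrence)
```

Summing over occurrences is what lets a tag with broad support in the neighborhood outrank a tag seen only once on a very similar document.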


Inferring tags for a document
- Traverse the graph of documents and gather tags for the initial document
- Do not visit the whole neighborhood -> need smart graph traversal
- The scoring model can compute a score for "every" tag -> the top-k tags are enough
- ... when should we stop?

Graph Traversal
- Precomputed: tags + scores for each document
- Select the next document based on the doc-doc similarity
[Figure: D_init with its neighbors Doc 1 and Doc 2, each with a precomputed tag list]
  Doc 1: P2P, 0.3; Tag, 0.28; Social, 0.25; Paper, 0.2; 2009, 0.1
  Doc 2: Social, 0.4; Search, 0.33; Budura, 0.25; Tag, 0.2; Paper, 0.2

Top-K Graph Traversal
- List of all neighbors sorted by doc-doc sim
- Select the best document (Doc x) and mark Doc x as visited
[Figure: the tag lists of the visited documents are merged into the top-k list of D_init]
  Visited list: P2P, 0.3; Tag, 0.28; Social, 0.25; Paper, 0.2; 2009, 0.1
  Visited list: Social, 0.4; Search, 0.33; Budura, 0.25; Tag, 0.2; Paper, 0.2
  D_init top-k: Social, 0.65; Paper, 0.4; Tag, 0.48; P2P, 0.3; ....
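
The traversal step can be sketched as a best-first visit over the neighbors, merging each visited document's precomputed tag scores into a running list for D_init. The tag scores below are the example values from the slide; the similarity values, the assignment of the two lists to "Doc 1" and "Doc 2", and the restriction to direct neighbors are assumptions made to keep the sketch short.

```python
import heapq
from collections import defaultdict


def best_first_tag_merge(neighbors, max_visits):
    """Visit neighbors in order of decreasing similarity and merge their tag scores.

    neighbors: {doc: (similarity_to_d_init, {tag: precomputed_score})}
    Returns the visit order and the merged tags sorted by descending score.
    """
    # Max-heap on similarity via negated keys.
    heap = [(-sim, doc) for doc, (sim, _) in neighbors.items()]
    heapq.heapify(heap)
    merged = defaultdict(float)
    visited = []
    while heap and len(visited) < max_visits:
        _, doc = heapq.heappop(heap)
        visited.append(doc)
        for tag, score in neighbors[doc][1].items():
            merged[tag] += score
    ranking = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return visited, ranking


neighbors = {
    # Similarities 0.9 and 0.7 are hypothetical; tag scores are from the slide.
    "Doc 1": (0.9, {"P2P": 0.3, "Tag": 0.28, "Social": 0.25, "Paper": 0.2, "2009": 0.1}),
    "Doc 2": (0.7, {"Social": 0.4, "Search": 0.33, "Budura": 0.25, "Tag": 0.2, "Paper": 0.2}),
}
visited, ranking = best_first_tag_merge(neighbors, max_visits=2)
```

After both visits the merged scores match the slide's list for D_init (Social 0.65, Tag 0.48, Paper 0.4, P2P 0.3), here sorted by score.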

Top-k Tag Inference
Fagin et al.: NRA Algorithm
- For each candidate tag:
  - worst_score = actual score
  - best_score = worst_score + best_to_come_score
- Prune a tag when best_score < score of the tag currently at rank k
- Stop when k tags have been seen && no candidate tags are left
- Unknown final "score" mass for each tag: consider ONLY m occurrences for each tag
[Figure: score intervals from w (worst) to b (best) for the tag at top-k position k, a candidate tag, and an expelled tag; the best-to-come part is labeled "(m-m') *"]
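
The pruning rule can be illustrated with a small NRA-style loop. This is a sketch under assumptions, not the authors' implementation: partial scores are assumed to arrive in non-increasing order, and best_to_come_score is taken as (m - m') times the current partial score, where m' is the number of occurrences of the tag seen so far. That reading of the "(m-m') *" label, and all names and data, are hypothetical.

```python
def nra_top_k(stream, k, m):
    """NRA-style top-k over a stream of (tag, partial_score) pairs.

    stream must be sorted by non-increasing partial_score, so no later
    score can exceed the current one. At most m occurrences count per tag.
    Returns the top-k tags by worst_score and the number of entries read.
    """
    worst = {}  # tag -> sum of partial scores seen so far (worst_score)
    seen = {}   # tag -> number of counted occurrences (m')
    steps = 0
    for tag, score in stream:
        steps += 1
        if seen.get(tag, 0) >= m:
            continue  # consider only m occurrences for each tag
        worst[tag] = worst.get(tag, 0.0) + score
        seen[tag] = seen.get(tag, 0) + 1
        ranked = sorted(worst, key=worst.get, reverse=True)
        if len(ranked) < k:
            continue
        threshold = worst[ranked[k - 1]]  # score of the tag at rank k
        # A tag stays a candidate while its best_score reaches the threshold.
        candidates = [t for t in ranked[k:]
                      if worst[t] + (m - seen[t]) * score >= threshold]
        unseen_best = m * score  # best_score of a tag not seen at all yet
        if not candidates and unseen_best < threshold:
            break
    ranked = sorted(worst, key=worst.get, reverse=True)
    return ranked[:k], steps


# Hypothetical stream of partial scores, sorted in descending order.
stream = [
    ("social", 0.4), ("search", 0.33), ("p2p", 0.3), ("tag", 0.28),
    ("social", 0.25), ("tag", 0.2), ("paper", 0.2), ("paper", 0.2),
    ("2009", 0.1),
]
top, steps = nra_top_k(stream, k=1, m=2)
```

In this toy run the loop stops before the stream is exhausted, because after the second "social" entry no other tag's best_score can still reach the score at rank k.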


Experimental Setup
Datasets:
- del.icio.us (120K bookmarks)
- CiteULike/CiteSeer (2200 crawled pdfs)
Measures of interest:
- Precision (user study)
- Relative precision (computed based on already assigned tags)
- Cost (number of visited neighbors)
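
The two precision measures can be sketched with a single helper. The exact definitions used in the experiments are not given in the transcript, so treating relative precision as precision against the tags already assigned to the document is an assumption, and the example tags are made up.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the first k predicted tags that appear in the relevant set."""
    top = predicted[:k]
    if not top:
        return 0.0
    return sum(1 for tag in top if tag in relevant) / len(top)


predicted = ["social", "tag", "paper", "p2p"]

# Precision: the relevant set comes from human judgments (user study).
judged_relevant = {"social", "paper", "p2p"}
precision = precision_at_k(predicted, judged_relevant, k=3)

# Relative precision (assumed reading): the relevant set is the tags
# users had already assigned to the document.
already_assigned = {"tag", "web"}
relative_precision = precision_at_k(predicted, already_assigned, k=3)
```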

Experimental Results: CiteULike
- 30 initial documents
- Manual precision evaluation (user study)
Table columns: m | k | Precision | Neighbors
Values as extracted (row structure not preserved): 3 0.73 41 5 0.65 93 7 0.55 74 0.7 124 153 0.57 247 0.72 243 257 356

Experimental Results: Del.icio.us
- 120 initial documents
- Relative precision evaluation
Table columns: m | k | Precision | Neighbors
Values as extracted (row structure not preserved): 3 0.5 1.65 5 0.42 7 0.36 1.67

Conclusions
- Tag inference over edges of resource graphs
- 4 principles of tag propagation
- Scoring model
- Top-k tag inference with modest access to the resource graph