+ Efficient network aware search in collaborative tagging. Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich. Presented by: Ashish Chawla.

Similar presentations
Google News Personalization: Scalable Online Collaborative Filtering

Finding the Sites with Best Accessibilities to Amenities Qianlu Lin, Chuan Xiao, Muhammad Aamir Cheema and Wei Wang University of New South Wales, Australia.
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Group Recommendation: Semantics and Efficiency
Albert Gatt Corpora and Statistical Methods Lecture 13.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Best-Effort Top-k Query Processing Under Budgetary Constraints
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.
Efficient Network Aware Search in Collaborative Tagging Sites… Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich PRESENTED BY,
Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han University of.
Top-k Query Evaluation with Probabilistic Guarantees By Martin Theobald, Gerald Weikum, Ralf Schenkel.
On the Topologies Formed by Selfish Peers Thomas Moscibroda Stefan Schmid Roger Wattenhofer IPTPS 2006 Santa Barbara, California, USA.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
Rank Aggregation. Rank Aggregation: Settings Multiple items – Web-pages, cars, apartments,…. Multiple scores for each item – By different reviewers, users,
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
Top- K Query Evaluation with Probabilistic Guarantees Martin Theobald, Gerhard Weikum, Ralf Schenkel Presenter: Avinandan Sengupta.
EFFICIENT COMPUTATION OF DIVERSE QUERY RESULTS Presenting: Karina Koifman Course : DB Seminar.
CS246 Ranked Queries. Junghoo "John" Cho (UCLA Computer Science)2 Traditional Database Query (Dept = “CS”) & (GPA > 3.5) Boolean semantics Clear boundary.
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
Link Recommendation In P2P Social Networks Yusuf Aytaş, Hakan Ferhatosmanoğlu, Özgür Ulusoy Bilkent University, Ankara, Turkey.
Crowd-Augmented Social Aware Search Soudip Roy Chowdhury & Bogdan Cautis.
By : Garima Indurkhya Jay Parikh Shraddha Herlekar Vikrant Naik.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Network Aware Resource Allocation in Distributed Clouds.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Ranking in DB Laks V.S. Lakshmanan Dept. of CS, UBC.
Search A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014.
A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU),
“Artificial Intelligence” in my research Seung-won Hwang Department of CSE POSTECH.
The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search Evica Milchevski , Avishek Anand ★ and Sebastian Michel.
Efficient Processing of Top-k Spatial Preference Queries
A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger.
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Combining Fuzzy Information: An Overview Ronald Fagin.
Presented by Suresh Barukula 2011csz  Top-k query processing means finding k- objects, that have highest overall grades.  A query in multimedia.
QoS Supported Clustered Query Processing in Large Collaboration of Heterogeneous Sensor Networks Debraj De and Lifeng Sang Ohio State University Workshop.
NRA Top k query processing using Non Random Access Only sequential access Only sequential accessAlgorithm 1) 1) scan index lists in parallel; 2) 2) consider.
+ User-induced Links in Collaborative Tagging Systems Ching-man Au Yeung, Nicholas Gibbins, Nigel Shadbolt CIKM’09 Speaker: Nonhlanhla Shongwe 18 January.
Static Process Scheduling
Client Assignment in Content Dissemination Networks for Dynamic Data Shetal Shah Krithi Ramamritham Indian Institute of Technology Bombay Chinya Ravishankar.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Efficient.
Top-k Query Processing Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta.
Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.
Fast Indexes and Algorithms For Set Similarity Selection Queries M. Hadjieleftheriou A.Chandel N. Koudas D. Srivastava.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
1 VLDB, Background What is important for the user.
Efficient Top-k Querying over Social-Tagging Networks Ralf Schenkel, Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Xavier Parreira,
Cohesive Subgraph Computation over Large Graphs
Neighborhood - based Tag Prediction
Optimizing Parallel Algorithms for All Pairs Similarity Search
Seung-won Hwang, Kevin Chen-Chuan Chang
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang Cornell University
Artificial Intelligence Problem solving by searching CSC 361
Rank Aggregation.
Laks V.S. Lakshmanan Dept. of CS, UBC
Xu Zhou Kenli Li Yantao Zhou Keqin Li
D. ZeinalipourYazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, V
INF 141: Information Retrieval
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

+ Efficient network aware search in collaborative tagging. Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich. Presented by: Ashish Chawla, CSE 6339, Spring 2009.

+ Overview
Opportunity: explore keyword search in a context where query results are determined by the opinions of the network of taggers related to a seeker.
Incorporate social behavior into the processing of search queries: network-aware search, where results are determined by the opinion of the seeker's network.
Existing top-k algorithms are too space-intensive here, because scores depend on the seeker's network.
Investigate clustering seekers based on the behavior of their networks.
del.icio.us datasets were used for the experiments.

+ Introduction
What is network-aware search? Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook.
Users contribute content, annotate items (photos, videos, URLs, …) with tags, and form social networks (friends/family, interest-based); they need help discovering relevant content.
What is the relevance of an item?

+ What is Network-Aware Search?

+ Claims
Define network-aware search.
Adapt top-k algorithms to network-aware search, using score upper-bounds and the EXACT strategy.
Refine score upper-bounds based on the user's network and tagging behavior.

+ Data Model
Example tuples: Tagged(Roger, i1, music), Tagged(Roger, i3, music), Tagged(Roger, i5, sports), …, Tagged(Hugo, i1, music), Tagged(Hugo, i22, music), …, Tagged(Minnie, i2, sports), …, Tagged(Linda, i2, football), Tagged(Linda, i28, news), …
Relations: Tagged(user u, item i, tag t) and Link(user u, user v), where Link(u1, v1) is a directed edge.
Taggers = π_u(Tagged); Seekers = π_u(Link).
Network(u) = { v | Link(u, v) }: for a seeker u1 ∈ Seekers, Network(u1) is the set of neighbors of u1.
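A minimal sketch, assuming Python, of how these relations could be represented; the tuples are the toy examples from the slide, and the Link edges are hypothetical:

```python
from collections import defaultdict

# Tagged(user, item, tag) tuples from the slide's toy example.
tagged = [
    ("Roger", "i1", "music"), ("Roger", "i3", "music"), ("Roger", "i5", "sports"),
    ("Hugo", "i1", "music"), ("Hugo", "i22", "music"),
    ("Minnie", "i2", "sports"),
    ("Linda", "i2", "football"), ("Linda", "i28", "news"),
]
# Link(user, user) directed edges -- hypothetical, for illustration only.
link = [("Jane", "Roger"), ("Jane", "Hugo"), ("Ann", "Linda")]

taggers = {u for (u, _, _) in tagged}   # Taggers = pi_u(Tagged)
seekers = {u for (u, _) in link}        # Seekers = pi_u(Link)

network = defaultdict(set)              # Network(u) = { v | Link(u, v) }
for u, v in link:
    network[u].add(v)
```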

+ What are Scores?
A query is a set of tags Q = {t1, t2, …, tn}, for example: fashion, www, sports, artificial intelligence.
Score per tag, for a seeker u, a tag t, and an item i: score(i, u, t) = f(|Network(u) ∩ {v | Tagged(v, i, t)}|).
Overall score of the query: score(i, u, Q) = g(score(i, u, t1), score(i, u, t2), …, score(i, u, tn)).
f and g are monotone; here f = COUNT and g = SUM.
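A small sketch of these formulas with f = COUNT and g = SUM, reusing the tagged and network structures from the previous sketch (the function names are assumptions, not the paper's code):

```python
def score_per_tag(item, seeker, tag):
    # score(i, u, t) = |Network(u) ∩ {v | Tagged(v, i, t)}|  (f = COUNT)
    taggers_of_item = {v for (v, i, t) in tagged if i == item and t == tag}
    return len(network[seeker] & taggers_of_item)

def score(item, seeker, query_tags):
    # score(i, u, Q) = SUM of the per-tag scores  (g = SUM)
    return sum(score_per_tag(item, seeker, t) for t in query_tags)
```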

+ Problem Statement
Given a user query Q = t1 … tn and a number k, we want to efficiently determine the top-k items, i.e., the k items with the highest overall score.

+ Standard Top-k Processing
Q = {t1, t2, …, tn}; one inverted list per tag, IL1, IL2, …, ILn, each sorted on score.
score(i) = g(score(i, IL1), score(i, IL2), …, score(i, ILn)).
Intuition: high-scoring items are close to the top of most lists.
Fagin-style processing, NRA (no random access): access all lists sequentially in parallel, maintain a heap sorted on partial scores, and stop when the partial score of the k-th item exceeds the best-case score of unseen/incomplete items.

+ NRA
List 1: item78 0.5, item83 0.4, item17 0.3, item21 0.2, item91 0.1, item44 0.1
List 2: item38 0.6, item14 0.6, item5 0.6, item83 0.5, item21 0.3
List 3: item17 0.7, item61 0.3, item81 0.2, item65 0.1, item10 0.1
Candidates (worst score, best score): … [0.9, 2.1], item17 [0.6, 2.1], item25 [0.6, 2.1]
Min top-2 score: 0.6. Threshold (max score of unseen tuples): 2.1.
Pruning candidates: min top-2 < best score of candidate. Stopping condition: threshold < min top-2?

+ NRA
Candidates (worst score, best score): item17 [1.3, 1.8], item83 [0.9, 2.0], item25 [0.6, 1.9], item38 [0.6, 1.8], item78 [0.5, 1.8]
Min top-2 score: 0.9. Threshold (max score of unseen tuples): 1.8.
Pruning candidates: min top-2 < best score of candidate. Stopping condition: threshold < min top-2?

+ NRA
Candidates (worst score, best score): item83 [1.3, 1.9], item17 [1.3, 1.9], item25 [0.6, 1.5], item78 [0.5, 1.4]
Min top-2 score: 1.3. Threshold (max score of unseen tuples): 1.3.
Pruning candidates: min top-2 < best score of candidate. Stopping condition: threshold < min top-2? No more new items can get into the top-2, but extra candidates are left in the queue.

+ NRA
Candidates (worst score, best score): item83 [1.3, 1.9], item25 [0.6, 1.4]
Min top-2 score: 1.3. Threshold (max score of unseen tuples): 1.1.
Pruning candidates: min top-2 < best score of candidate. Stopping condition: threshold < min top-2? No more new items can get into the top-2, but extra candidates are left in the queue.

+ NRA
Min top-2 score: 1.6. Threshold (max score of unseen tuples): 0.8.
Pruning candidates: min top-2 < best score of candidate. The threshold is now below the min top-2 score, so the algorithm stops.

+ NRA
NRA performs only sorted accesses (SA), no random accesses.
A random access (RA) looks up the actual (final) score of an item and is often very useful.
Problems with NRA: high bookkeeping overhead, and for high values of k the gain in access cost is not significant.
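A minimal runnable sketch of NRA over sorted (item, score) lists with SUM aggregation, in the spirit of the walkthrough above; this illustrates the generic algorithm, not code from the paper:

```python
import heapq

def nra(lists, k):
    """NRA: sorted access only, over inverted lists of (item, score) pairs,
    each sorted by descending score; the overall score is the SUM."""
    worst = {}     # partial (worst-case) score of each seen item
    seen_in = {}   # list indices in which each item has been seen
    last = [lst[0][1] for lst in lists]  # score at the current cursor per list
    # (when a list is exhausted, its last score is kept as the cursor bound)

    for depth in range(max(len(lst) for lst in lists)):
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                worst[item] = worst.get(item, 0.0) + s
                seen_in.setdefault(item, set()).add(j)
                last[j] = s

        top = heapq.nlargest(k, worst.items(), key=lambda kv: kv[1])
        if len(top) < k:
            continue
        min_topk = top[-1][1]
        topk_items = {item for item, _ in top}

        def best(item):
            # worst score plus the cursor scores of lists where item is unseen
            return worst[item] + sum(last[j] for j in range(len(lists))
                                     if j not in seen_in[item])

        # stop when neither unseen items nor incomplete candidates can
        # still beat the current top-k
        if sum(last) <= min_topk and all(best(i) <= min_topk
                                         for i in worst if i not in topk_items):
            return top
    return heapq.nlargest(k, worst.items(), key=lambda kv: kv[1])
```

For example, nra([list1, list2, list3], k=2) returns the two items with the highest total score along with their scores.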

+ TA
Three inverted lists, List 1, List 2, List 3, each sorted by score; an item's full score is aggregated from its per-list scores (a1, a2, a3).

+ TA
TA algorithm, round 1: read one item from every list by sorted access, then fetch its missing per-list scores by random access.
Min top-2 score: 1.6. Maximum score for unseen items: 2.1.
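A minimal sketch of TA under the same assumptions (sorted (item, score) lists, SUM aggregation); random access is simulated here with per-list dictionaries:

```python
import heapq

def ta(lists, k):
    """TA: sorted access in parallel, plus random access to resolve the
    full score of each newly seen item immediately."""
    lookup = [dict(lst) for lst in lists]  # random access: item -> score
    final = {}                             # fully resolved scores
    for depth in range(max(len(lst) for lst in lists)):
        # threshold: aggregate of the scores at the current cursor positions
        threshold = sum(lst[min(depth, len(lst) - 1)][1] for lst in lists)
        for lst in lists:
            if depth < len(lst):
                item = lst[depth][0]
                if item not in final:
                    final[item] = sum(d.get(item, 0.0) for d in lookup)
        top = heapq.nlargest(k, final.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # the k-th final score already beats any unseen item
    return heapq.nlargest(k, final.items(), key=lambda kv: kv[1])
```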

+ Computing Exact Scores: Naïve
The typical approach maintains a single inverted list per (seeker, tag), with items ordered by score; for example, separate lists for seeker Jane and seeker Ann under tag = photos and tag = music.
+ Can use standard top-k algorithms. -- High space overhead.

+ Computing Score Upper-Bounds
A space-saving strategy: maintain entries of the form (item, itemTaggers), where itemTaggers are all the taggers who tagged the item with the tag. Every item is stored at most once.
The question now: what score do we store with each entry? We store the maximum score the item can have across all possible seekers. This is the Global Upper-Bound strategy.
Limitation: the time needed to dynamically compute exact scores at query time.
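A sketch of building one Global Upper-Bound list for a tag, reusing the tagged, network, and seekers structures from the data-model sketch (the function name is an assumption):

```python
def global_upper_bound_list(tag):
    """One entry per item: (item, itemTaggers, upper-bound score), where the
    upper-bound is the maximum score over all possible seekers."""
    items = {i for (_, i, t) in tagged if t == tag}
    entries = []
    for item in items:
        item_taggers = {v for (v, i, t) in tagged if i == item and t == tag}
        ub = max((len(network[u] & item_taggers) for u in seekers), default=0)
        entries.append((item, item_taggers, ub))
    entries.sort(key=lambda e: e[2], reverse=True)  # list sorted by upper-bound
    return entries
```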

+ Score Upper-Bounds
Global Upper-Bound (GUB): one list per tag, shared by all seekers; each entry holds the item (e.g., i1 … i9 for tag = music), its taggers (Miguel, Kath, Sam, Peter, Jane, Mary, …), and the item's upper-bound score.
+ Low space overhead. -- Item upper-bounds, and even the list order(!), may differ from EXACT for most users. -- Time to dynamically compute exact scores at query time.
How do we do top-k processing with score upper-bounds?

+ Top-k with Score Upper-Bounds
gNRA, "generalized no random access": access all lists sequentially in parallel; maintain a heap with partial exact scores; stop when the partial exact score of the k-th item exceeds the highest possible score of unseen/incomplete items, computed using the current list upper-bounds.
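A sketch of how gNRA's stopping test differs from plain NRA: the heap holds exact partial scores computed for the seeker on the fly, while the threshold comes from the stored upper-bounds (all names here are assumptions):

```python
def gnra_should_stop(kth_exact_score, cursor_upper_bounds, candidate_best_scores):
    """Stop when the k-th exact partial score beats everything the upper-bound
    lists still allow: unseen items (sum of the per-list cursor upper-bounds)
    and the best-case scores of incomplete candidates."""
    unseen_best = sum(cursor_upper_bounds)
    return (unseen_best <= kth_exact_score and
            all(b <= kth_exact_score for b in candidate_best_scores))
```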

+ gNRA – NRA Generalization

+ gTA – TA Generalization

+ Performance of Global Upper-Bound (GUB) and Exact
Space overhead is measured as the total number of entries in all inverted lists; query processing time as the number of cursor moves.
Space (IL entries): GUB 74K vs. Exact 63M; Exact serves as the baseline for both the space and time comparisons.

+ Clustering and Query Processing
We want to reduce the distance between the score upper-bound and the exact score: the greater the distance, the more processing may be required.
Core idea: cluster users into groups and compute an upper-bound per group.
Intuition: group users whose behavior is similar.

+ Clustering Seekers
Cluster the seekers based on similarity in their scores (the score of an item depends on the network).
Form an inverted list IL(t,C) for every tag t and cluster C, the score of an item being the maximum score over all seekers in the cluster.
Query processing for Q = t1 … tn and seeker u: first find the cluster C(u), then perform aggregation over the collection of lists IL(t1,C(u)), …, IL(tn,C(u)).
Global Upper-Bound (GUB) is the special case where all seekers fall into the same cluster.

+ Clustering Seekers
Assign each seeker to a cluster and compute one inverted list per cluster, with ub(i, t, C) = max over u ∈ C of |Network(u) ∩ {v | Tagged(v, i, t)}|.
+ Tighter bounds; the item order is usually closer to the EXACT order than in Global Upper-Bound. -- Space overhead is still high (a trade-off).
Example: a Global Upper-Bound list over items chanel, puma, gucci, adidas, diesel, versace, nike, prada (taggers Miguel, Kath, Sam, Peter, Jane, Mary, Chris, …) splits into cluster lists, e.g., C1 (seekers Bob & Alice): gucci, versace, chanel, prada, puma; C2 (seekers Sam & Miguel): puma, adidas, diesel, nike, gucci.
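A sketch of the per-cluster list construction; cluster_seekers (the set of seekers assigned to cluster C) is an assumed input:

```python
def cluster_upper_bound_list(tag, cluster_seekers):
    """ub(i, t, C) = max over seekers u in C of |Network(u) ∩ taggers(i, t)|."""
    items = {i for (_, i, t) in tagged if t == tag}
    entries = []
    for item in items:
        item_taggers = {v for (v, i, t) in tagged if i == item and t == tag}
        ub = max((len(network[u] & item_taggers) for u in cluster_seekers),
                 default=0)
        entries.append((item, item_taggers, ub))
    entries.sort(key=lambda e: e[2], reverse=True)
    return entries
```

Global Upper-Bound is then the degenerate case cluster_upper_bound_list(tag, seekers), with all seekers in one cluster.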

+ How do we cluster seekers?
Finding a clustering that minimizes the worst-case or average computation time of the top-k algorithms is NP-hard; the proofs are by reduction from the independent task scheduling problem and the minimum sum-of-squares problem.
The authors present heuristics that use a form of Normalized Discounted Cumulative Gain (NDCG), a measure of the quality of a clustered list for a given seeker and keyword. The metric compares the ideal (exact score) order in the inverted lists with the actual (score upper-bound) order.

+ NDCG - Example
i | docID | log2(i) | ranking | ranking / log2(i) | ideal ranking | ideal / log2(i)
1 | D | N/A | 3 | 3.00 | 3 | 3.00
2 | D | 1.00 | 2 | 2.00 | 3 | 3.00
3 | D | 1.58 | 3 | 1.89 | 2 | 1.26
4 | D | 2.00 | 0 | 0.00 | 2 | 1.00
5 | D | 2.32 | 1 | 0.43 | 1 | 0.43
6 | D | 2.58 | 2 | 0.77 | 0 | 0.00
Cumulative Gain (CG) sums the gains; Discounted CG (DCG) = 8.10; Ideal DCG = 8.69; Normalized DCG (NDCG) = 8.10 / 8.69 ≈ 0.93.
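A small sketch of the computation; the gain vectors below reproduce the slide's totals (8.10 and 8.69), and the log2 discount is skipped at rank 1, as in the table:

```python
import math

def dcg(gains):
    """DCG with no discount at rank 1 and a log2(i) discount afterwards."""
    return gains[0] + sum(g / math.log2(i)
                          for i, g in enumerate(gains[1:], start=2))

def ndcg(ranked_gains, ideal_gains):
    """Compare the actual (upper-bound) order against the ideal (exact) order."""
    return dcg(ranked_gains) / dcg(ideal_gains)

print(ndcg([3, 2, 3, 0, 1, 2], [3, 3, 2, 2, 1, 0]))  # 8.10 / 8.69 ≈ 0.93
```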

+ Clustering Taggers
For each tag t we partition the taggers into separate clusters and form one inverted list per cluster; an item i in the list for cluster C gets the score max over u ∈ Seekers of |Network(u) ∩ C ∩ {v1 | Tagged(v1, i, t)}|.
How do we cluster taggers? Build a graph whose nodes are the taggers, with an edge between nodes v1 and v2 iff |Items(v1, t) ∩ Items(v2, t)| ≥ threshold.
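A sketch of the tagger-graph construction just described; Items(v, t) is derived from the Tagged relation, the threshold is a parameter, and the resulting edge list could then be handed to any off-the-shelf graph partitioner:

```python
from itertools import combinations

def items_of(tagger, tag):
    """Items(v, t): the items that tagger v has tagged with t."""
    return {i for (v, i, t2) in tagged if v == tagger and t2 == tag}

def tagger_graph_edges(tag, threshold=1):
    """Edge (v1, v2) iff |Items(v1, t) ∩ Items(v2, t)| >= threshold."""
    nodes = {v for (v, _, t2) in tagged if t2 == tag}
    return [(v1, v2) for v1, v2 in combinations(sorted(nodes), 2)
            if len(items_of(v1, tag) & items_of(v2, tag)) >= threshold]
```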

+ Clustering Seekers: Metrics
Space: Global Upper-Bound has the lowest overhead; ASC and NCT achieve an order-of-magnitude improvement in space overhead over Exact.
Time: both gNRA and gTA outperform Global Upper-Bound. ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases except one for gNRA. The inverted lists are shorter, and the score upper-bound order is similar to the exact score order for many users.
Average % improvement over Global Upper-Bound: Normalized Cut (NCT) 38-72%; Ratio Association (ASC) 67-87%.

+ Clustering Seekers
Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users.

+ Clustering Taggers
Space: the overhead is significantly lower than that of Exact and of Cluster-Seekers.
Time: in the best case, all taggers relevant to a seeker reside in a single cluster; in the worst case, all taggers reside in separate clusters.
Idea: cluster taggers based on overlap in tagging; assign each tagger to a cluster and compute cluster upper-bounds ub(i, t, C) = max over u ∈ Seekers of |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|.

+ Clustering Taggers

+ Conclusion and Next Steps
Cluster-Taggers worked best for seekers whose network fell into at most 3 * #tags clusters; for the others, query execution time degraded due to the number of inverted lists that had to be processed.
For the seekers it suits, Cluster-Taggers outperformed Cluster-Seekers in all cases, and it outperforms Global Upper-Bound in all cases.
The work extends traditional top-k algorithms and achieves a balance between time and space consumption.

+ Questions? Thank you!