Identifying the Most Influential Data Objects with Reverse Top-k Queries By Akrivi Vlachou 1, Christos Doulkeridis 1, Kjetil Nørvag 1 and Yannis Kotidis.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Skyline Charuka Silva. Outline Charuka Silva, Skyline2  Motivation  Skyline Definition  Applications  Skyline Query  Similar Interesting Problem.
On the Effect of Trajectory Compression in Spatio-temporal Querying Elias Frentzos, and Yannis Theodoridis Data Management Group, University of Piraeus.
Ken C. K. Lee, Baihua Zheng, Huajing Li, Wang-Chien Lee VLDB 07 Approaching the Skyline in Z Order 1.
Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted.
MARKETING 1.03 Practice Questions.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
1 Top-K Algorithms: Concepts and Applications by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of Cyprus Department.
Spatio-temporal Databases
13/04/20151 SPARK: Top- k Keyword Query in Relational Database Wei Wang University of New South Wales Australia.
VLDB 2011 Pohang University of Science and Technology (POSTECH) Republic of Korea Jongwuk Lee, Seung-won Hwang VLDB 2011.
1 DynaMat A Dynamic View Management System for Data Warehouses Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
1 A FAIR ASSIGNMENT FOR MULTIPLE PREFERENCE QUERIES Leong Hou U, Nikos Mamoulis, Kyriakos Mouratidis Gruppo 10: Paolo Barboni, Tommaso Campanella, Simone.
 Introduction  Views  Related Work  Preliminaries  Problems Discussed  Algorithm LPTA  View Selection Problem  Experimental Results.
School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
Introduction to Management Science
July 29HDMS'08 Caching Dynamic Skyline Queries D. Sacharidis 1, P. Bouros 1, T. Sellis 1,2 1 National Technical University of Athens 2 Institute for Management.
 Motivation  Reverse Queries  From Reverse to Inverse  Inverse Queries  Formal Definition  Applications  Framework  Experiments  Future Extensions.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
Efficient Skyline Querying with Variable User Preferences on Nominal Attributes Raymond Chi-Wing Wong 1, Ada Wai-Chee Fu 2, Jian Pei 3, Yip Sing Ho 2,
1 Continuous k-dominant Skyline Query Processing Presented by Prasad Sriram Nilu Thakur.
LSDS-IR’08, October 30, Peer-to-Peer Similarity Search over Widely Distributed Document Collections Christos Doulkeridis 1, Kjetil Nørvåg 2, Michalis.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
A Unified Approach for Computing Top-k Pairs in Multidimensional Space Presented By: Muhammad Aamir Cheema 1 Joint work with Xuemin Lin 1, Haixun Wang.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
E.G.M. PetrakisDimensionality Reduction1  Given N vectors in n dims, find the k most important axes to project them  k is user defined (k < n)  Applications:
Chapter 1.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
SUBSKY: Efficient Computation of Skylines in Subspaces Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei Conference: ICDE 2006 Presenter: Kamiru Superviosr:
1 Progressive Computation of Constrained Subspace Skyline Queries Evangelos Dellis 1 Akrivi Vlachou 1 Ilya Vladimirskiy 1 Bernhard Seeger 1 Yannis Theodoridis.
Chapter 9 - Multicriteria Decision Making 1 Chapter 9 Multicriteria Decision Making Introduction to Management Science 8th Edition by Bernard W. Taylor.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU),
Efficient Progressive Processing of Skyline Queries in Peer-to-Peer Systems INFOSCALE’06.
Efficient Computation of Reverse Skyline Queries VLDB 2007.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Data Management+ Laboratory Dynamic Skylines Considering Range Queries Speaker: Adam Adviser: Yuling Hsueh 16th International Conference, DASFAA 2011 Wen-Chi.
Efficient Processing of Top-k Spatial Preference Queries
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
9/2/2005VLDB 2005, Trondheim, Norway1 On Computing Top-t Most Influential Spatial Sites Tian Xia, Donghui Zhang, Evangelos Kanoulas, Yang Du Northeastern.
The university of Hong Kong Department of Computer Science Continuous Monitoring of Top-k Queries over Sliding Windows Authors: Kyriakos Mouratidis, Spiridon.
S KYLINE Q UERY P ROCESSING OVER J OINS. Akrivi Vlachou1, Christos Doulkeridis1, Neoklis Polyzotis SIGMOD 2011.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
What does the Cloud mean for Data Management: Challenges and Opportunities Akrivi Vlachou Norwegian University of Science and Technology (NTNU), Trondheim,
The σ-neighborhood skyline queries Chen, Yi-Chung; LEE, Chiang. The σ-neighborhood skyline queries. Information Sciences, 2015, 322: 張天彥 2015/12/05.
Information Technology Selecting Representative Objects Considering Coverage and Diversity Shenlu Wang 1, Muhammad Aamir Cheema 2, Ying Zhang 3, Xuemin.
Efficient Computation of Combinatorial Skyline Queries Author: Yu-Chi Chung, I-Fang Su, and Chiang Lee Source: Information Systems, 38(2013), pp
Taxonomy Caching: A Scalable Low- Cost Mechanism for Indexing Remote Contents in Peer-to-Peer Systems Kjetil Nørvåg Norwegian University of Science and.
On Top-n Reverse Top-k Queries: Variants, Algorithms, and Applications 陳良弼 Arbee L.P. Chen National Chengchi University 9/21/2012 at NCHU.
1 Finding Competitive Price Yu Peng (Hong Kong University of Science and Technology) Raymond Chi-Wing Wong (Hong Kong University of Science and Technology)
Finding skyline on the fly HKU CS DB Seminar 21 July 2004 Speaker: Eric Lo.
Optimal Aggregation Algorithms for Middleware By Ronald Fagin, Amnon Lotem, and Moni Naor.
Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.
Answering Why-not Questions on Top-K Queries Andy He and Eric Lo The Hong Kong Polytechnic University.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Today’s Material Sorting: Definitions Basic Sorting Algorithms
HKU CSIS DB Seminar Skyline Queries HKU CSIS DB Seminar 9 April 2003 Speaker: Eric Lo.
Parallel Computation of Skyline Queries COSC6490A Fall 2007 Slawomir Kmiec.
Computer Science and Engineering Jianye Yang 1, Ying Zhang 2, Wenjie Zhang 1, Xuemin Lin 1 Influence based Cost Optimization on User Preference 1 The University.
A Unified Framework for Efficiently Processing Ranking Related Queries
Abolfazl Asudeh Azade Nazi Nan Zhang Gautam DaS
Objective 1.05 Review Questions
Preference Query Evaluation Over Expensive Attributes
Probabilistic Data Management
A Restaurant Recommendation System Based on Range and Skyline Queries
The Skyline Query in Databases Which Objects are the Most Important?
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

Identifying the Most Influential Data Objects with Reverse Top-k Queries By Akrivi Vlachou 1, Christos Doulkeridis 1, Kjetil Nørvag 1 and Yannis Kotidis 2 1 Norwegian University of Science and Technology (NTNU), Trondheim, Norway 2 Athens University of Economics and Business (AUEB), Athens, Greece Proceedings of the VLDB Endowment, Vol. 3, No. 1 Presented by Kalpesh Kagresha (UBIT: kkagresh Person# )

Outline Introduction Preliminaries Problem Statement Skyband Based Algorithm Branch and Bound Algorithm Influence Score Variants Experimental Evaluation

Introduction Top-k Queries – Retrieve k most interesting objects based on individual user preference – e.g. SELECT hotels.name FROM hotels ORDER BY (hotels.Dist_beach+hotels.Dist_conf) STOP AT 3 This is a Top-3 query. Influence of Product – Most appealing / popular product in the market Identifying the most influential objects from given database products is important for market analysis and decision making – Optimizing Direct Marketing: Mobile Phone Company – Real Estate Market

4 From the perspective of manufacturers:  it is important that a product is returned in the highest ranked positions for as many user preferences as possible  estimate the impact of a product compared to their competitors products  advertise a product to potential customers Reversing the Top-k Query sales representative customer Which customers would be interested?

Influence Score Cardinality of reverse top-k result set Top-m influential products – Query selects m products with highest influence score Problem with existing techniques – Requires computing a reverse top-k query for each product in the database – Expensive for databases of moderate size

Preliminaries Data space D – N dimensions {d 1, d 2,…,d n } Set of S database objects on D with cardinality |S| Database object represented as point p ϵ S such that p = {p[1],…, p[n]} where p[i] is a value on dimension d i Assume that values p[i] are numerical non-negative scores and smaller score values are preferable. Each d i has associated query-dependent weight w[i] indicating the d i 's relative importance for query. Aggregated Score f w (p) is defined as weighted sum of individual scores i.e. f w (p) = ∑ n i=1 w[i] * p[i] where w[i]>=0 and Ǝj such that w[j] > 0 and ∑ n i=1 w[i]=1

Top-k Queries Query w = [0.75, 0.25] Rank of point p = No. of points in half space defined by the perpendicular hyper-plane containing origin P2 is top-1 object for given query

Reverse Top-k Queries Two weight vectors w1 and w2 Reverse top-3 query is posed with query object p4 RTOP 3 (p4) => w1 and not w2

Problem Statement Influence Score – Given a dataset S and set of weighting vectors W, the definition of influence score only requires setting a single value k that determines the scope of reverse top-k queries for identifying most influential data objects

Top-m Most Influential Data Objects Most Influential Object

Algorithm for Indentifying Most Influential Data Objects We will study following three algorithms  Naive Method  Skyband-based algorithm  Branch-and-bound algorithm

Naive Method  Issues a reverse top-k query for each data object to find influence score f I k (p) and then rank the objects.  Not efficient in practice as it requires computing reverse top-k query for each database object.  Also processing of each reverse top-k query requires several top-k query evaluations.  More efficient algorithm using skyband set.

What is Skyline? The Skyline of a set of objects (records) comprises all records that are not dominated by any other record. A record x dominates another record y if x is as good as y in all attributes and strictly better in at least one attribute. Domination examples: p1 dominates p10 because 1<8 and 9=9 p2 dominates p4 because 2<3 and 6<7 p3 dominates p5 because 3<4 and 1<3 p8 dominates p9 because 7<8 and 1<4 The Skyline set: {p 1, p 2, p 3, p 8 } These objects are not dominated by any other object (1,9) (8,9) (2,6) (3,7) (4,3) (3,1) (7,1) (8,4)

Skyband Based Algorithm Priority Queue Reverse Top-k Query Queue sorted by influence score Retrieve Top-m objects

Performance of Skyband Based Algorithm  For high values of m, size of skyband set increases resulting in large no. of reverse top-k queries.  SB is not incremental, it needs to process reverse top-k query for all objects in mSB(S) before reporting any result.  Cannot compute m+1 most influential data object without computing (m+1)SB(S)  Need for incremental algorithm without processing all the objects in skyband set.

Branch and Bound Algorithm(BB)  Computes upper bound for the influence score of candidate objects based on already processed objects.  Prioritize processing of promising objects and postpone objects that are not likely to be the next most influential objects. (minimizes reverse top-k evaluations)  BB returns influential objects incrementally when score of current object is higher than upper bound of all candidate objects.  BB employees result sharing to avoid costly re-computations and boost performance of influence score evaluations.

Upper Bound of Influence Scores Dynamic Skyline set – Point p i dynamically dominates p i ’ based on point q, if d j ϵ D: |p i [j]-q[j]|<=|p’ i [j]-q[j]| and on at least one dimension d j ϵ D: |pi[j]-q[j]|<|p’ i [j]-q[j]|. The set of points that are not dominated based on q forms dynamic skyline set

Constrained Dynamic Skyline Set CDS(q): Applying dynamic skyline query on the objects enclosed in the window query defined by q and origin of data space For P c ={p1, p2,…p5} CDS(q)={p3, p4, p5}

Property for Upper Bound  Given a point q and a non-empty set of constrained dynamic skyline points of CDS(q)={p i }, it holds that RTOP k (q) ⊆ ∩ pi ϵ CDS(q) RTOP k (p i )  The upper bound U 1 (q) of q’s influence score (f k I (q) <= U 1 (q) U 1 (q) = | ∩ pi ϵ CDS(q) RTOP k (p i )|  A point p ϵ P – CDS(q) can not refine the upper bound

Algorithm for Upper Bound computeBound() CDS(q)={p3, p4, p5} Upper Bound |q w | = 1

BB Algorithm Assumption  Dataset S: indexed by multidimensional indexing structure such as R-tree. A multidimensional bounding rectangle (MBR) e i in R-tree is represented by lower left corner l i & upper right corner u i Influence of an MBR e i is influence of lower left corner  f k I (e i ) = f k I (l i )

MBR is inserted in Q with influence score=|W| Influence score is already computed c is most influential object c is a MBR Reverse top-k computation To facilitate more accurate score for estimation of other entries

Contents of Priority Queue

Analysis of BB algorithm Correctness – BB algorithm always returns correct result set Efficiency – BB minimizes the number of RTOP k evaluations – Evaluates queries only if upper bound cannot be improved by any point in dataset

Influence Score Variants Threshold-based influential objects  Retrieval of objects with influence score higher than a user specified threshold.  E.g. Product is considered important if result set of reverse top-k query includes 10% of all customers.  SB cannot support these queries while BB is easily adapted to such queries.

Influence Score Variants Profit-aware influence score  Each customer(w i ) associated with profit pr i E.g. No. of orders or total amount of cost of orders  Influence score is defined as  SB and BB both support profit-aware influence queries.

Experimental Setup SB, BB and Naive algorithms implemented in Java 3Ghz Dual Core AMD processor with 2GB RAM R-tree with buffer size of 100 blocks with block size 4KB Real and Synthetic datasets with uniform(UN), correlated(CO) and anti-correlated(AC) collections. Two real datasets used – NBA dataset with dimensional tuples representing player’s performance per year. – HOUSE dataset with dimensional tuples representing percentage of an American family’s annual income spent on expenditures. Metrics used are total execution time, No. of I/Os, No. of top-k and reverse top-k evaluations.

Experimental Evaluation Comparative performance of all algorithms for UN dataset and varying dimensionality (d)

Experimental Evaluation BB vs SB for real data By varying dimensionality d, cardinality of S, W, k and m, it is observed that BB outperforms SB in all metrics.

Conclusion Reverse top-k query returns set of customers that find product appealing i.e. top-k results. Influence of product is cardinality of reverse top-k query result. Two algorithms were presented – SB: Restricts candidate set of objects based on skyband set of data objects. – BB: Retrieve results incrementally with minimum no. of reverse top-k evaluations. Variations of query for most influential objects

Questions?