Da Yan and Wilfred Ng The Hong Kong University of Science and Technology.

Presentation transcript:

Da Yan and Wilfred Ng The Hong Kong University of Science and Technology

Outline: Background; Probabilistic Data Model; Related Work; U-Popk Semantics; U-Popk Algorithm; Experiments; Conclusion

Background
Uncertain data are inherent in many real-world applications, e.g. sensor or RFID readings.
Top-k queries return the k most promising probabilistic tuples in terms of some user-specified ranking function.
Top-k queries are useful for analyzing uncertain data, but cannot be answered by traditional methods designed for deterministic data.

Background
Challenge of defining top-k queries on uncertain data: the interplay between score and probability.
  Score: the value of the ranking function on the tuple's attributes.
  Occurrence probability: the probability that a tuple occurs.
Challenge of processing top-k queries on uncertain data: an exponential number of possible worlds.

Probabilistic Data Model
Tuple-level probabilistic model: each tuple is associated with its occurrence probability.
Attribute-level probabilistic model: each tuple has one uncertain attribute whose value is described by a probability density function (pdf).
Our focus: the tuple-level probabilistic model.

Probabilistic Data Model
Running example: a speeding detection system needs to determine the top-2 fastest cars, given the following car speed readings detected by different radars in a sampling moment. Speed is the ranking function; Confidence is the tuple occurrence probability.

Tuple  Radar Location  Car Make  Plate No.
t1     L1              Honda     X
t2     L2              Toyota    Y
t3     L3              Mazda     W
t4     L4              Nissan    L
t5     L5              Mazda     W
t6     L6              Toyota    Y
[Speed and Confidence columns: values not preserved in the transcript]

Probabilistic Data Model
Running example (same readings $t_1, \ldots, t_6$): $t_1$ occurs with probability $\Pr(t_1) = 0.4$ and does not occur with probability $1 - \Pr(t_1) = 0.6$.

Probabilistic Data Model
$t_2$ and $t_6$ describe the same car (Toyota, plate Y), as do $t_3$ and $t_5$ (Mazda, plate W).
A car cannot have two different speeds in a sampling moment, so $t_2$ and $t_6$ cannot co-occur.
Exclusion rules: $(t_2 \oplus t_6)$, $(t_3 \oplus t_5)$.

Probabilistic Data Model
Possible world semantics: under the exclusion rules $(t_2 \oplus t_6)$ and $(t_3 \oplus t_5)$, the readings give eight possible worlds.
$PW_1 = \{t_1, t_2, t_4, t_5\}$
$PW_2 = \{t_1, t_2, t_3, t_4\}$
$PW_3 = \{t_1, t_4, t_5, t_6\}$
$PW_4 = \{t_1, t_3, t_4, t_6\}$
$PW_5 = \{t_2, t_4, t_5\}$
$PW_6 = \{t_2, t_3, t_4\}$
$PW_7 = \{t_4, t_5, t_6\}$
$PW_8 = \{t_3, t_4, t_6\}$
[Probability column: values not preserved in the transcript]
For example:
$\Pr(PW_1) = \Pr(t_1) \times \Pr(t_2) \times \Pr(t_4) \times \Pr(t_5)$
$\Pr(PW_5) = [1 - \Pr(t_1)] \times \Pr(t_2) \times \Pr(t_4) \times \Pr(t_5)$
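The possible worlds and their probabilities can be reproduced with a short enumeration. This is a minimal sketch, not the authors' code. The occurrence probabilities are assumptions: only Pr(t1) = 0.4 is stated in the transcript, and the remaining values are chosen to be consistent with the figures quoted later in the talk (0.42, 0.36, 3.88) and with exactly eight possible worlds. The names `prob`, `rules` and `possible_worlds` are illustrative.

```python
from itertools import product

# Assumed occurrence probabilities, keyed by tuple number (1 means t1).
prob = {1: 0.4, 2: 0.7, 3: 0.6, 4: 1.0, 5: 0.4, 6: 0.3}
# Each rule lists mutually exclusive tuples; t1 and t4 are singleton rules.
rules = [(1,), (2, 6), (3, 5), (4,)]


def possible_worlds(prob, rules):
    """Yield (world, probability) pairs; a world is a frozenset of tuple numbers."""
    options_per_rule = []
    for rule in rules:
        # Either exactly one tuple of the rule occurs ...
        options = [(t, prob[t]) for t in rule]
        # ... or none does, with the leftover probability mass.
        none_prob = 1.0 - sum(prob[t] for t in rule)
        if none_prob > 1e-12:
            options.append((None, none_prob))
        options_per_rule.append(options)
    # Rules are independent of each other, so probabilities multiply.
    for choice in product(*options_per_rule):
        world = frozenset(t for t, _ in choice if t is not None)
        p = 1.0
        for _, q in choice:
            p *= q
        yield world, p


for world, p in possible_worlds(prob, rules):
    print(sorted(world), round(p, 3))
```

Under these assumed probabilities the enumeration yields eight worlds whose probabilities sum to 1, matching the list of PW1 to PW8.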

Related Work
U-Topk, U-kRanks [Soliman et al. ICDE 07]
Global-Topk [Zhang et al. DBRank 08]
PT-k [Hua et al. SIGMOD 08]
ExpectedRank [Cormode et al. ICDE 09]
Parameterized Ranking Functions (PRF) [VLDB 09]
Other semantics:
  Typical answers [Ge et al. SIGMOD 09]
  Sliding window [Jin et al. VLDB 08]
  Distributed ExpectedRank [Li et al. SIGMOD 09]
  Top-(k, l), p-Rank Topk, Top-(p, l) [Hua et al. VLDBJ 11]

Related Work
Let us focus on ExpectedRank and consider top-2 queries.
ExpectedRank returns the k tuples whose expected ranks across all possible worlds are the highest, i.e. whose expected rank positions are numerically smallest.
If a tuple does not appear in a possible world with m tuples, it is defined to be ranked in the (m+1)-th position; no justification is given for this choice.

Related Work
ExpectedRank: consider the rank of $t_5$. Its rank in each of the eight possible worlds $PW_1, \ldots, PW_8$ is multiplied by that world's probability, and the products are summed:
$\text{Exp-Rank}(t_5) = \sum_{i=1}^{8} \Pr(PW_i) \times \text{rank}_{PW_i}(t_5) = 3.88$

Related Work
ExpectedRank: the remaining expected ranks are computed in a similar manner.
Exp-Rank($t_1$) = 2.8
Exp-Rank($t_2$) = 2.3
Exp-Rank($t_3$) = 3.02
Exp-Rank($t_4$) = 2.7
Exp-Rank($t_5$) = 3.88
Exp-Rank($t_6$) = 4.1
The two highest-ranked tuples, i.e. those with the smallest expected ranks, are $t_2$ (2.3) and $t_4$ (2.7).
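The expected ranks can be checked by brute force over the possible worlds. The sketch below is illustrative only. It assumes occurrence probabilities that are not all stated in the transcript (only Pr(t1) = 0.4 is), assumes that tuple numbers follow descending score (t1 is the fastest reading), and uses 1-based ranks with an absent tuple placed at position m+1. The function name `expected_ranks` is hypothetical.

```python
from itertools import product

# Assumed occurrence probabilities, keyed by tuple number (1 means t1).
prob = {1: 0.4, 2: 0.7, 3: 0.6, 4: 1.0, 5: 0.4, 6: 0.3}
# Exclusion rules; t1 and t4 are singleton rules.
rules = [(1,), (2, 6), (3, 5), (4,)]


def expected_ranks(prob, rules):
    """Expected rank of every tuple, by enumerating all possible worlds."""
    per_rule = []
    for rule in rules:
        options = [(t, prob[t]) for t in rule]
        none_prob = 1.0 - sum(prob[t] for t in rule)
        if none_prob > 1e-12:
            options.append((None, none_prob))  # no tuple of the rule occurs
        per_rule.append(options)

    exp_rank = {t: 0.0 for t in prob}
    for choice in product(*per_rule):
        # Ascending tuple number equals descending score (assumption).
        world = sorted(t for t, _ in choice if t is not None)
        world_prob = 1.0
        for _, q in choice:
            world_prob *= q
        for t in prob:
            if t in world:
                rank = world.index(t) + 1   # 1-based position in this world
            else:
                rank = len(world) + 1       # absent tuple: position m+1
            exp_rank[t] += world_prob * rank
    return exp_rank


ranks = expected_ranks(prob, rules)
print({t: round(r, 2) for t, r in ranks.items()})
# Top-2 under ExpectedRank: the two smallest expected ranks.
print(sorted(ranks, key=ranks.get)[:2])
```

With these assumed inputs the sketch gives 2.8, 2.3, 3.02, 2.7, 3.88 and 4.1 for t1 to t6, so the top-2 answer is t2 followed by t4.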

Related Work
High processing cost: U-Topk, U-kRanks, PT-k, Global-Topk.
Ranking quality:
  ExpectedRank promotes low-score tuples to the top.
  ExpectedRank assigns rank (m+1) to an absent tuple t in a possible world having m tuples.
Extra user effort:
  PRF: parameters other than k.
  Typical answers: a choice among the answers.

U-Popk Semantics
We propose a new semantics, U-Popk, with:
  Short response time
  High ranking quality
  No extra user effort (except for the parameter k)

U-Popk Semantics
Top-1 robustness: when k = 1, any top-k query semantics for probabilistic tuples should return the tuple with the maximum probability of being ranked top-1 (denoted $\Pr_1$).
Top-1 robustness holds for U-Topk, U-kRanks, PT-k, Global-Topk, etc.
ExpectedRank violates top-1 robustness.

U-Popk Semantics
Top-stability: the top-(i+1)-th tuple should be the top-1 tuple after the removal of the top-i tuples.
U-Popk: tuples are picked in order from a relation according to top-stability until k tuples are picked.
The top-1 tuple is defined according to top-1 robustness.

U-Popk Semantics
U-Popk, first pick:
$\Pr_1(t_1) = p_1 = 0.4$
$\Pr_1(t_2) = (1 - p_1)\,p_2 = 0.42$
Stop, since $(1 - p_1)(1 - p_2) = 0.18 < \Pr_1(t_2)$; $t_2$ is picked first.

U-Popk Semantics
U-Popk, second pick (after $t_2$ is removed):
$\Pr_1(t_1) = p_1 = 0.4$
$\Pr_1(t_3) = (1 - p_1)\,p_3 = 0.36$
Stop, since $(1 - p_1)(1 - p_3) = 0.24 < \Pr_1(t_1)$; $t_1$ is picked second.
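The two picks can be replayed with a direct, unoptimized reading of the U-Popk definition: compute Pr1 for every remaining tuple, take the maximum, remove it, and repeat. This is a sketch under stated assumptions rather than the authors' algorithm. It covers only independent tuples, and the values p2 = 0.7 and p3 = 0.6 are inferred from the figures 0.42 and 0.36 quoted in the talk. The function names are hypothetical.

```python
def pr1_independent(probs):
    """Pr_1 of each tuple; probs are listed in descending order of score."""
    result, accum = [], 1.0
    for p in probs:
        result.append(accum * p)   # all higher-scored tuples absent, this one present
        accum *= 1.0 - p
    return result


def u_popk_naive(probs, k):
    """Indices (into probs) of the k tuples picked by U-Popk, in pick order."""
    remaining = list(range(len(probs)))
    picked = []
    while remaining and len(picked) < k:
        pr1 = pr1_independent([probs[i] for i in remaining])
        best = max(range(len(remaining)), key=lambda j: pr1[j])
        picked.append(remaining.pop(best))   # top-stability: remove and repeat
    return picked


probs = [0.4, 0.7, 0.6]          # p1 (given), p2 and p3 (inferred)
print(u_popk_naive(probs, 2))    # index 1 is t2, index 0 is t1
```

The sketch recomputes every Pr1 from scratch in each round, which is the cost the scan-based algorithm in the talk is designed to avoid.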

U-Popk Algorithm
Algorithm for independent tuples:
Tuples are sorted in descending order of score.
$\Pr_1(t_i) = (1 - p_1)(1 - p_2) \cdots (1 - p_{i-1})\,p_i$
Define $accum_i = (1 - p_1)(1 - p_2) \cdots (1 - p_{i-1})$, so that $accum_1 = 1$ and $accum_{i+1} = accum_i \cdot (1 - p_i)$.
Then $\Pr_1(t_i) = accum_i \cdot p_i$.

U-Popk Algorithm
Algorithm for independent tuples:
Find the top-1 tuple by scanning the sorted tuples.
Maintain accum and the maximum $\Pr_1$ found so far.
Stopping criterion: accum ≤ the maximum $\Pr_1$ found so far.
This is because, once $t_i$ has been processed, any succeeding tuple $t_j$ ($j > i$) satisfies
$\Pr_1(t_j) = (1 - p_1)(1 - p_2) \cdots (1 - p_i) \cdots (1 - p_{j-1})\,p_j \le (1 - p_1)(1 - p_2) \cdots (1 - p_i) = accum \le$ the maximum $\Pr_1$ found so far.
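The scan and its stopping criterion can be sketched as follows. This is an illustrative sketch for independent tuples only. The function name `top1_scan` and its return shape are assumptions, and the six probabilities used in the usage line are assumed values for the running example, treated here as independent even though the example has exclusion rules.

```python
def top1_scan(probs):
    """Scan tuples in descending score order and stop early.

    Returns (index of the top-1 tuple, number of tuples scanned).
    """
    accum = 1.0                     # probability that no scanned tuple occurs
    best_idx, best_pr1 = -1, 0.0
    scanned = 0
    for i, p in enumerate(probs):
        pr1 = accum * p             # Pr_1(t_i) = accum_i * p_i
        if pr1 > best_pr1:
            best_idx, best_pr1 = i, pr1
        accum *= 1.0 - p            # accum_{i+1} = accum_i * (1 - p_i)
        scanned += 1
        if accum <= best_pr1:       # no later tuple can exceed best_pr1
            break
    return best_idx, scanned


print(top1_scan([0.4, 0.7, 0.6, 1.0, 0.4, 0.3]))
```

On these inputs the scan stops after two tuples, because accum has dropped to 0.18 while the best Pr1 found is 0.42.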

U-Popk Algorithm
Algorithm for independent tuples:
During the scan, before processing each tuple $t_i$, record the tuple with the maximum current $\Pr_1$ as $t_i.max$.
After the top-1 tuple $t_i$ is found and removed, adjust the tuple probabilities ($\Pr_1$ values), where $t_j$ is the last tuple scanned:
  Reuse the probabilities of $t_1$ to $t_{i-1}$.
  Divide the probabilities of $t_{i+1}$ to $t_j$ by $(1 - p_i)$.
  Choose the tuple with the maximum current $\Pr_1$ from $\{t_i.max, t_{i+1}, \ldots, t_j\}$.
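The adjustment rule can be checked numerically: after removing a tuple, the Pr1 values before it are unchanged and those after it are divided by (1 - p_i). The sketch below compares that shortcut with a full recomputation. It is illustrative only, assumes independent tuples and p_i < 1, and applies the rule to every later tuple rather than only the scanned ones. The fourth probability and the function names are hypothetical.

```python
import math


def pr1_independent(probs):
    """Pr_1 of each tuple; probs are listed in descending order of score."""
    result, accum = [], 1.0
    for p in probs:
        result.append(accum * p)
        accum *= 1.0 - p
    return result


def adjust_after_removal(pr1, probs, i):
    """Pr_1 values after removing tuple i, derived from the old values."""
    unchanged = pr1[:i]                                   # tuples before t_i
    rescaled = [v / (1.0 - probs[i]) for v in pr1[i + 1:]]  # tuples after t_i
    return unchanged + rescaled


probs = [0.4, 0.7, 0.6, 0.5]      # last value is hypothetical
old = pr1_independent(probs)
adjusted = adjust_after_removal(old, probs, 1)           # remove index 1
recomputed = pr1_independent(probs[:1] + probs[2:])
print(all(math.isclose(a, b) for a, b in zip(adjusted, recomputed)))
```

The shortcut matters because it lets the next pick start from values already computed during the previous scan.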

U-Popk Algorithm
Algorithm for tuples with exclusion rules:
Each tuple is involved in an exclusion rule $t_{i_1} \oplus t_{i_2} \oplus \cdots \oplus t_{i_m}$, where $t_{i_1}, t_{i_2}, \ldots, t_{i_m}$ are in descending order of score.
Let $t_{j_1}, t_{j_2}, \ldots, t_{j_l}$ be the tuples before $t_i$ that are in the same exclusion rule as $t_i$.
$accum_{i+1} = accum_i \cdot (1 - p_{j_1} - p_{j_2} - \cdots - p_{j_l} - p_i) / (1 - p_{j_1} - p_{j_2} - \cdots - p_{j_l})$
$\Pr_1(t_i) = accum_i \cdot p_i / (1 - p_{j_1} - p_{j_2} - \cdots - p_{j_l})$
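The two update formulas translate into a single pass over the sorted tuples. The following sketch is illustrative, not the authors' implementation. The function name `pr1_with_rules`, the `rule_of` labelling and the rule names are assumptions, and the six probabilities are assumed values for the running example (only Pr(t1) = 0.4 is stated in the transcript).

```python
def pr1_with_rules(probs, rule_of):
    """Pr_1 of each tuple when tuples in the same rule are mutually exclusive.

    probs:   occurrence probabilities in descending order of score
    rule_of: rule label of each tuple (independent tuples get unique labels)
    """
    accum = 1.0
    used = {}      # rule label -> total probability of its tuples scanned so far
    result = []
    for p, rule in zip(probs, rule_of):
        before = used.get(rule, 0.0)
        factor = 1.0 - before            # the rule's current factor inside accum
        if factor <= 0.0:
            result.append(0.0)           # an earlier tuple of this rule surely occurs
            continue
        result.append(accum * p / factor)
        # Replace the rule's old factor in accum by its new, smaller factor.
        accum *= max(0.0, factor - p) / factor
        used[rule] = before + p
    return result


probs = [0.4, 0.7, 0.6, 1.0, 0.4, 0.3]                 # assumed p1..p6
rule_of = ["r1", "r2", "r3", "r4", "r3", "r2"]         # t2/t6 and t3/t5 share rules
print([round(v, 3) for v in pr1_with_rules(probs, rule_of)])
```

A smaller hypothetical case shows the division at work: with probabilities 0.3, 0.5, 0.4 and the first and third tuples in one rule, the third tuple gets 0.35 × 0.4 / 0.7 = 0.2, which equals the probability that it occurs and the second tuple does not.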

U-Popk Algorithm
Algorithm for tuples with exclusion rules, stopping criterion:
As the scan goes on, a rule's factor in accum can only go down.
Keep track of the current factors of the rules.
Organize the rule factors in a MinHeap, so that the factor with the minimum value ($factor_{min}$) can be retrieved in O(1) time.
A rule is inserted into the MinHeap when its first tuple is scanned.
The position of a rule in the MinHeap is adjusted when a new tuple in it is scanned (because its factor changes).

U-Popk Algorithm
Algorithm for tuples with exclusion rules, stopping criterion:
UpperBound($\Pr_1$) = $accum / factor_{min}$
This is because for any succeeding tuple $t_j$ ($j > i$):
$\Pr_1(t_j) = accum_j \cdot p_j / \{\text{factor of } t_j\text{'s rule}\} \le accum_i \cdot p_j / \{\text{factor of } t_j\text{'s rule}\} \le accum_i \cdot p_j / factor_{min} \le accum_i / factor_{min}$
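A top-1 scan that uses this upper bound might look like the sketch below. It is a simplified illustration, not the authors' implementation. Python's `heapq` has no operation for repositioning an entry, so instead of adjusting a rule's position the sketch pushes the rule's new factor and leaves the old one in the heap; because a rule's factor only decreases, a stale entry is never smaller than the current minimum. The function name, the rule labels and the probabilities (other than Pr(t1) = 0.4) are assumptions.

```python
import heapq


def top1_scan_with_rules(probs, rule_of):
    """Return (index of the top-1 tuple, number of tuples scanned)."""
    accum = 1.0
    used = {}        # rule label -> probability mass of its scanned tuples
    heap = []        # rule factors; heap[0] is factor_min
    best_idx, best_pr1, scanned = -1, 0.0, 0
    for i, (p, rule) in enumerate(zip(probs, rule_of)):
        before = used.get(rule, 0.0)
        factor = 1.0 - before                  # rule's factor before t_i
        pr1 = accum * p / factor
        if pr1 > best_pr1:
            best_idx, best_pr1 = i, pr1
        new_factor = max(0.0, factor - p)      # rule's factor after t_i
        accum *= new_factor / factor
        used[rule] = before + p
        scanned += 1
        if accum <= 0.0:
            break    # some scanned tuple surely occurs; later Pr_1 are all 0
        heapq.heappush(heap, new_factor)
        if accum / heap[0] <= best_pr1:        # UpperBound(Pr_1) check
            break
    return best_idx, scanned


probs = [0.4, 0.7, 0.6, 1.0, 0.4, 0.3]                 # assumed p1..p6
rule_of = ["r1", "r2", "r3", "r4", "r3", "r2"]
print(top1_scan_with_rules(probs, rule_of))
```

On these inputs the scan picks t2 after reading three tuples: the bound is 0.6 after t2, which still exceeds 0.42, and falls to 0.24 after t3.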

U-Popk Algorithm
Algorithm for tuples with exclusion rules, tuple $\Pr_1$ adjustment (after the removal of the top-1 tuple):
$t_{i_1}, t_{i_2}, \ldots, t_{i_l}$ are in $t_{i_2}$'s rule.
Segment-by-segment adjustment.
Delete $t_{i_2}$ from its rule (the rule's factor increases, so adjust it in the MinHeap).
Delete the rule from the MinHeap if no tuple remains.

Experiments
Comparison of ranking results on the International Ice Patrol (IIP) Iceberg Sightings Database.
Score: number of drifted days.
Occurrence probability: confidence level according to the source of sighting.
Neutral approach (p = 0.5)
Optimistic approach (p = 0)

Experiments
Efficiency of query processing on synthetic datasets (|D| = 100,000).
ExpectedRank is orders of magnitude faster than the others.

Conclusion
We propose U-Popk, a new semantics for top-k queries on uncertain data, based on top-1 robustness and top-stability.
U-Popk has the following strengths:
  Short response time, good scalability
  High ranking quality
  Easy to use, no extra user effort

Thank you!