
Ming Hua, Jian Pei (Simon Fraser University)
Wenjie Zhang, Xuemin Lin (The University of New South Wales & NICTA)
SIGMOD 2008
Presented by: Mahashweta Das (University of Texas at Arlington)

• Uncertainty in data arises from incompleteness of data, limitations of equipment, delay or loss in data transfer, etc.
• Multiple sensors for the same location: two sensors in the same location may detect the presence of a suspect at (approximately) the same time.
• The uncertain data in Table 1 summarizes the set of all possible worlds.

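• Top-k query: the top-2 records with the longest duration in the uncertain data
• Generation rules: rule 1: R1; rule 2: R2, R3; rule 3: R4; rule 4: R5, R6
• Example possible-world probabilities: 0.3 × 0.4 × 0.8 × 1.0 and (1 − 0.3) × (…) × 1.0 × 0.2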

• If the user-specified probability threshold is p = 0.35, {R2, R3, R5} should be returned.
• This is the problem of the probabilistic threshold top-k query.
• Return the records that each have a probability of at least p of being in the top-k lists of the possible worlds.
◦ Example: the top-2 probability of R4 is 0.202, which is below p, so R4 is not returned.
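The definition above can be checked by brute force on the running example. The sketch below enumerates every possible world and sums the probability of the worlds in which a record is among the top 2. The membership probabilities and generation rules are the ones that appear in the slides; the ranking order `RANKING` (longest duration first) is an assumption, since the durations are not in the transcript. It was chosen because it reproduces the values the slides quote. All function names are illustrative.

```python
from itertools import product

# Membership probabilities as they appear in the slides' worked example.
PROB = {"R1": 0.3, "R2": 0.4, "R3": 0.5, "R4": 1.0, "R5": 0.8, "R6": 0.2}
# Generation rules: at most one tuple of each rule appears in a possible world.
RULES = [["R1"], ["R2", "R3"], ["R4"], ["R5", "R6"]]
# ASSUMPTION: ranking by duration, longest first (not stated in the transcript).
RANKING = ["R1", "R2", "R5", "R3", "R4", "R6"]


def possible_worlds():
    """Yield (set_of_tuples, probability) for every possible world."""
    options = []
    for rule in RULES:
        choices = [(t, PROB[t]) for t in rule]
        none_prob = 1.0 - sum(PROB[t] for t in rule)
        if none_prob > 1e-12:  # the rule may contribute no tuple at all
            choices.append((None, none_prob))
        options.append(choices)
    for combo in product(*options):
        world = {t for t, _ in combo if t is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def topk_probabilities(k):
    """Top-k probability of every tuple, summed over all possible worlds."""
    result = {t: 0.0 for t in RANKING}
    for world, prob in possible_worlds():
        ranked = [t for t in RANKING if t in world]
        for t in ranked[:k]:
            result[t] += prob
    return result


def pt_k(k, p):
    """Tuples whose top-k probability is at least the threshold p."""
    probs = topk_probabilities(k)
    return [t for t in RANKING if probs[t] >= p]
```

Under these assumptions, `pt_k(2, 0.35)` contains exactly R2, R3 and R5, matching the answer on the slide. The enumeration is exponential in the number of rules, which is the cost the algorithms on the following slides avoid.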

• Semantics of probabilistic top-k queries (PT-k queries)
• Solution
◦ The query returns the tuples that have a high probability of appearing in the top-k list, regardless of their exact rank
• Efficient computation of probabilistic top-k queries
• The number of possible worlds is huge, so the naïve method is expensive
• Existing techniques require materializing all possible states based on the tuples seen so far
• Solution
◦ Exact algorithm
◦ Sampling algorithm
◦ Poisson approximation based algorithm

• Dominant Set Property
◦ For a tuple t ∈ P(T), the dominant set of t is the subset of tuples in P(T) that are ranked higher than t
• S_t = {t′ | t′ ∈ P(T) ∧ t′ ≺_f t}, where P(T) is the set of tuples satisfying the query predicate
• Using the Dominant Set Property, all algorithms scan tuples in ranking order and derive the top-k probability of a tuple from the tuples preceding it in the ranking order

• Basic case: tuples are independent, and each tuple is involved in one rule
• Scan tuples in P(T) in ranking order
• Derive the top-k probability of a tuple from the tuples preceding it in the ranking order
◦ Pr(t_i, j): probability that tuple t_i is ranked at the j-th position in possible worlds
◦ Pr^k(t_i): top-k probability of tuple t_i
◦ Pr(S_{t_i}, j): probability that j tuples in S_{t_i} appear in possible worlds
• Theorem:
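The formula of the theorem did not survive in the transcript. A possible reconstruction is given here under one stated assumption: S_{t_i} is read as the first i tuples t_1, ..., t_i in ranking order, which is how the worked top-2 example on the next slide appears to use it. It is the standard Poisson binomial recurrence and may differ in indexing from the authors' statement.

```latex
\begin{align*}
\Pr^{k}(t_i) &= \Pr(t_i)\sum_{j=0}^{k-1}\Pr(S_{t_{i-1}}, j),\\
\Pr(S_{t_i}, 0) &= \Pr(S_{t_{i-1}}, 0)\,\bigl(1-\Pr(t_i)\bigr),\\
\Pr(S_{t_i}, j) &= \Pr(t_i)\,\Pr(S_{t_{i-1}}, j-1)
                 + \bigl(1-\Pr(t_i)\bigr)\,\Pr(S_{t_{i-1}}, j),
\end{align*}
```

with the base case Pr(S_{t_0}, 0) = 1 and Pr(S_{t_0}, j) = 0 for j > 0, where S_{t_0} is the empty set.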

• To compute the top-2 probability of each tuple:
• For t_j (1 ≤ j ≤ 2), Pr^2(t_j) = Pr(t_j)
◦ Pr^2(R1) = 0.3, Pr^2(R2) = 0.4
• For t_j (3 ≤ j ≤ 6):
◦ Pr^2(R3) = Pr(R3) · [Pr(S_R2, 0) + Pr(S_R2, 1)]
◦ = Pr(R3) · [Pr(S_R2, 0) + Pr(R2) · Pr(S_R1, 0) + (1 − Pr(R2)) · Pr(S_R1, 1)] by the Poisson binomial recurrence
◦ = ...
◦ = 0.5 × 0.76 = 0.38
◦ Similarly, Pr^2(R4) = 0.202, Pr^2(R5) = 0.704, Pr^2(R6) = 0.014
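The numbers in this example can be reproduced with a short dynamic program. The following is a minimal sketch, not the authors' implementation. It assumes the ranking order `RANKING` (the durations are not in the transcript) and handles generation rules with a simplified form of rule-tuple compression: tuples of the scanned tuple's own rule are dropped from its dominant set, and the higher-ranked tuples of any other rule are merged into one pseudo-tuple whose probability is their sum. The helper names are hypothetical.

```python
# Membership probabilities and rules as they appear in the slides.
PROB = {"R1": 0.3, "R2": 0.4, "R3": 0.5, "R4": 1.0, "R5": 0.8, "R6": 0.2}
RULES = [["R1"], ["R2", "R3"], ["R4"], ["R5", "R6"]]
# ASSUMPTION: ranking by duration, longest first (not stated in the transcript).
RANKING = ["R1", "R2", "R5", "R3", "R4", "R6"]
RULE_OF = {t: i for i, rule in enumerate(RULES) for t in rule}


def at_most(probs, m):
    """Probability that at most m of the independent events in probs occur.

    dist[j] holds the probability that exactly j events occurred so far;
    each new event updates it with the Poisson binomial recurrence.
    """
    dist = [1.0] + [0.0] * m
    for p in probs:
        for j in range(m, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
    return sum(dist)


def topk_probability(t, k):
    """Top-k probability of tuple t from its compressed dominant set."""
    merged = {}
    for s in RANKING[:RANKING.index(t)]:
        if RULE_OF[s] != RULE_OF[t]:  # same-rule tuples cannot co-occur with t
            merged[RULE_OF[s]] = merged.get(RULE_OF[s], 0.0) + PROB[s]
    # t is in the top k when it appears and at most k-1 dominating events occur.
    return PROB[t] * at_most(list(merged.values()), k - 1)
```

For R3, the compressed dominant set holds R1 (0.3) and R5 (0.8), so the chance that at most one of them appears is 1 − 0.3 × 0.8 = 0.76, which gives the 0.5 × 0.76 = 0.38 on the slide.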

• Random variable X_t = 1 if t is ranked in the top-k list, and 0 otherwise
• Pr^k(t) = E[X_t]
◦ Draw a set of samples S of possible worlds
◦ Compute E_S[X_t] as an approximation of E[X_t]
• Generate sample units by URS with replacement (a sample unit is a possible world)
◦ To pick a sample unit, scan the data table once and include each independent tuple t_i with probability Pr(t_i)
• Compute the top-k tuples in the sample unit
◦ For each tuple t in the top-k list, X_t = 1; for all other tuples, X_t = 0
• Repeat the above steps to obtain the sample S and compute E_S[X_t]
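The sampling steps above can be sketched as follows. This is an illustrative version only: the ranking order is assumed (it is not in the transcript), the way a tuple is drawn from a multi-tuple rule is one straightforward choice rather than the authors' procedure, and the sample size and seed are arbitrary.

```python
import random

# Membership probabilities and rules as they appear in the slides.
PROB = {"R1": 0.3, "R2": 0.4, "R3": 0.5, "R4": 1.0, "R5": 0.8, "R6": 0.2}
RULES = [["R1"], ["R2", "R3"], ["R4"], ["R5", "R6"]]
# ASSUMPTION: ranking by duration, longest first (not stated in the transcript).
RANKING = ["R1", "R2", "R5", "R3", "R4", "R6"]


def sample_world(rng):
    """Draw one possible world: each rule contributes at most one tuple."""
    world = set()
    for rule in RULES:
        u = rng.random()
        acc = 0.0
        for t in rule:
            acc += PROB[t]
            if u < acc:  # tuple t is the one selected for this rule
                world.add(t)
                break
    return world


def estimate_topk_probabilities(k, n_samples, seed=0):
    """Estimate E[X_t] for every tuple as the fraction of samples with X_t = 1."""
    rng = random.Random(seed)
    hits = {t: 0 for t in RANKING}
    for _ in range(n_samples):
        world = sample_world(rng)
        ranked = [t for t in RANKING if t in world]
        for t in ranked[:k]:
            hits[t] += 1
    return {t: hits[t] / n_samples for t in RANKING}
```

With a few tens of thousands of samples, the estimates should land close to the exact top-2 probabilities; the error shrinks roughly with the square root of the sample size.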

• Scan tuples in P(T) in ranking order
• When a tuple t is scanned, if t is an independent tuple, Pr^k(t) = [formula not preserved]
• When a tuple t is scanned, if t belongs to a generation rule R, Pr^k(t) = [formula not preserved]
◦ F(k−1; μ) and F(k−1; μ′) are values of the cumulative distribution function
◦ X follows a Poisson binomial distribution
◦ μ: sum of membership probabilities of scanned tuples
◦ μ′: sum of membership probabilities of scanned tuples in R
• Stop scanning when Pr^k(t) < p under the condition: [condition not preserved]
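The formulas on this slide were lost, so the sketch below is only one plausible reading, not the authors' method. It assumes the number of higher-ranked tuples that appear is approximated by a Poisson variable, so that Pr^k(t) is roughly Pr(t) · F(k−1; mean), where F is the Poisson cumulative distribution function. For a tuple in a rule R, the mean is assumed to be the sum over scanned tuples minus the sum over scanned tuples in R. The ranking order and the names `mu` and `mu_rule` are likewise assumptions.

```python
import math

# Membership probabilities and rules as they appear in the slides.
PROB = {"R1": 0.3, "R2": 0.4, "R3": 0.5, "R4": 1.0, "R5": 0.8, "R6": 0.2}
RULES = [["R1"], ["R2", "R3"], ["R4"], ["R5", "R6"]]
# ASSUMPTION: ranking by duration, longest first (not stated in the transcript).
RANKING = ["R1", "R2", "R5", "R3", "R4", "R6"]
RULE_OF = {t: i for i, rule in enumerate(RULES) for t in rule}


def poisson_cdf(m, mean):
    """F(m; mean): probability that a Poisson(mean) variable is at most m."""
    return sum(math.exp(-mean) * mean ** j / math.factorial(j)
               for j in range(m + 1))


def approx_topk_probability(t, k):
    """Approximate top-k probability of t in one pass over the scanned tuples."""
    scanned = RANKING[:RANKING.index(t)]
    mu = sum(PROB[s] for s in scanned)
    # ASSUMED rule handling: tuples of t's own rule cannot co-occur with t.
    mu_rule = sum(PROB[s] for s in scanned if RULE_OF[s] == RULE_OF[t])
    return PROB[t] * poisson_cdf(k - 1, mu - mu_rule)
```

On a six-tuple table the approximation is rough: for R3 it gives about 0.35 against the exact 0.38. A Poisson approximation of this kind is normally used when there are many tuples with small membership probabilities.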

• Real (uncertain) dataset
◦ International Ice Patrol (IIP) Iceberg Sightings Database
• Synthetic dataset: evaluates performance and quality
◦ 20,000 tuples and 2,000 multi-tuple generation rules, where the number of tuples involved in each multi-tuple generation rule follows a normal distribution
• Related work (ICDE 2007)
◦ U-Top K query: returns a list of k records that has the highest probability of being the top-k list in all possible worlds
◦ U-K Ranks query: returns a list of k records such that the i-th record has the highest probability of being the i-th best record in all possible worlds
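To make the two related-work semantics concrete, the sketch below evaluates both by brute force on the small sensor example from the earlier slides. It follows the definitions in the bullets above and is not the ICDE 2007 algorithms. The ranking order is an assumption, as the durations are not in the transcript, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import product

# Membership probabilities and rules as they appear in the slides.
PROB = {"R1": 0.3, "R2": 0.4, "R3": 0.5, "R4": 1.0, "R5": 0.8, "R6": 0.2}
RULES = [["R1"], ["R2", "R3"], ["R4"], ["R5", "R6"]]
# ASSUMPTION: ranking by duration, longest first (not stated in the transcript).
RANKING = ["R1", "R2", "R5", "R3", "R4", "R6"]


def possible_worlds():
    """Yield (set_of_tuples, probability) for every possible world."""
    options = []
    for rule in RULES:
        choices = [(t, PROB[t]) for t in rule]
        none_prob = 1.0 - sum(PROB[t] for t in rule)
        if none_prob > 1e-12:
            choices.append((None, none_prob))
        options.append(choices)
    for combo in product(*options):
        world = {t for t, _ in combo if t is not None}
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def u_topk(k):
    """The length-k list most likely to be the top-k list of a possible world."""
    list_prob = defaultdict(float)
    for world, prob in possible_worlds():
        ranked = tuple(t for t in RANKING if t in world)[:k]
        list_prob[ranked] += prob
    return max(list_prob, key=list_prob.get)


def u_kranks(k):
    """For each rank i < k, the tuple most likely to occupy that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for world, prob in possible_worlds():
        ranked = [t for t in RANKING if t in world]
        for i, t in enumerate(ranked[:k]):
            rank_prob[i][t] += prob
    return [max(d, key=d.get) for d in rank_prob]
```

Under the assumed ranking, `u_topk(2)` returns the list (R5, R3), and `u_kranks(2)` returns R5 for both ranks. Both answers differ from the PT-k answer {R2, R3, R5} at p = 0.35, and the second shows that the same record can occupy more than one position.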


• A semantically novel formulation of the probabilistic threshold top-k query on uncertain data
• An exact algorithm (with rule-tuple compression and several pruning techniques to improve efficiency), a fast sampling method (with an approximation quality guarantee), and a Poisson approximation based method (that works in linear time)
• Comparison of the proposed framework with the existing U-Top K query and U-K Ranks query

Thank You