UPI: A Primary Index for Uncertain Databases (VLDB 10)

Presentation transcript:

UPI: A Primary Index for Uncertain Databases (VLDB 10)
Hideaki Kimura (BrownU), Samuel Madden (MIT), Stanley B. Zdonik (BrownU)
Speaker: Yinuo Zhang
Supervisor: Dr. Reynold Cheng

Outline
- Introduction
- Uncertain Primary Index (UPI)
- Secondary Index on UPI
- Experiments
- Conclusion and Future Work

Introduction
Query answering over possible world semantics
Table Author:
  Name   Institution^p               Existence
  Alice  Brown: 80%, MIT: 20%        90%
  Bob    MIT: 95%, UCB: 5%           100%
  Carol  Brown: 60%, U. Tokyo: 40%   80%
A possible world:
  Name   Institution
  Alice  Brown
  Bob    MIT
The probability of this world: 90%*80% * 100%*95% * 20% = 13.7% (the final 20% is the probability that Carol does not exist)
Query: SELECT * FROM Author WHERE Institution=MIT
Threshold: confidence QT
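
The arithmetic on this slide can be checked with a short sketch. It assumes that tuples are independent and that a tuple's alternatives are mutually exclusive; the function names are hypothetical and are not taken from the talk.

```python
from itertools import product

# Author table from the slide: name -> (institution distribution, existence probability)
AUTHORS = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def tuple_alternatives(dist, existence):
    """Yield (value, probability) choices for one tuple; None means 'absent'."""
    for value, p in dist.items():
        yield value, existence * p
    if existence < 1.0:
        yield None, 1.0 - existence


def possible_worlds(authors):
    """Enumerate every possible world with its probability (assumes independent tuples)."""
    names = list(authors)
    choices = [list(tuple_alternatives(*authors[n])) for n in names]
    for combo in product(*choices):
        prob = 1.0
        world = {}
        for name, (value, p) in zip(names, combo):
            prob *= p
            if value is not None:
                world[name] = value
        yield world, prob


def confidence(authors, name, institution):
    """Sum the probabilities of all worlds in which `name` has `institution`."""
    return sum(p for w, p in possible_worlds(authors) if w.get(name) == institution)


# The world shown on the slide: Alice at Brown, Bob at MIT, Carol absent.
slide_world = {"Alice": "Brown", "Bob": "MIT"}
slide_world_prob = next(p for w, p in possible_worlds(AUTHORS) if w == slide_world)
print(f"{slide_world_prob:.4f}")
```

Enumerating worlds is exponential in the number of tuples, which is why the talk turns to index structures instead of evaluating queries this way.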

Ex) DBLP with Uncertain Affiliation
DBLP: 1.3M papers and 0.7M authors
Author affiliation complemented via the Google API (Zipfian distribution)
Original record:
  Name          Inst.
  David DeWitt  ?
Search results for q="David DeWitt":
  Rank  URL
  1     Wisc.edu/…
  2     Microsoft.com/…
  3     Columbia.edu/…
Complemented record:
  Name          Institution^p                               Country^p
  David DeWitt  Wisconsin: 40%, MS: 20%, Columbia: 13%, …   US: 100%

Introduction
Achieving an efficient implementation using possible world semantics is difficult.
Probabilistic Inverted Index (PII) [Singh07]: a secondary index
Figure: inverted index (Institution, Pointer) with entries Brown [Alice] 0.3 [Carol] 0.2 [Bob] 0.4 and MIT [Bob] 0.3 [Alice] 0.8 pointing into the heap; following the pointers requires disk seeks.

Introduction
Goal: build a primary index over uncertain attributes
Primary index: sequential read

Challenges on Building PI over Uncertain Data
  Name   Institution^p               Existence
  Alice  Brown: 80%, MIT: 20%        90%
  Bob    MIT: 95%, UCB: 5%           100%
  Carol  Brown: 60%, U. Tokyo: 40%   80%
Option 1: cluster on the most probable value?
  Cluster  Tuples
  Brown    Alice, Carol
  MIT      Bob
  Problem: Alice? (SELECT … WHERE Inst.=MIT does not find her in the MIT cluster)
Option 2: replicate tuples into an inverted index?
  Cluster  Tuples
  Brown    Alice, Carol, …
  MIT      Alice, Bob, …
  Problem: too large for long-tail distributions (e.g., 100 values with 0.1%)

UPI: Heap + Cutoff Index
  Name   Institution^p               Existence
  Alice  Brown: 80%, MIT: 20%        90%
  Bob    MIT: 95%, UCB: 5%           100%
  Carol  Brown: 60%, U. Tokyo: 40%   80%
Heap: sorted by (Inst., Prob)
  Institution      Tuple
  Brown (72%)      Alice
  Brown (48%)      Carol
  MIT (95%)        Bob
  MIT (18%)        Alice
  U. Tokyo (32%)   Carol
Cutoff Index: sorted by (Inst., Prob)
  Institution  TupleID  Pointer
  UCB (5%)     Bob      MIT
Cutoff: entries with less than C probability (the cutoff threshold) are moved out of the heap into the cutoff index, e.g., UCB (5%).
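
The split between heap and cutoff index can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_upi` is a hypothetical name, C = 10% is taken from the next slide, and keeping each tuple's most probable alternative in the heap regardless of C is an assumption made so that every cutoff pointer has a target.

```python
# Author table from the slide: name -> (institution distribution, existence probability)
AUTHORS = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def build_upi(authors, cutoff):
    """Return (heap, cutoff_index), both sorted by (institution, descending probability)."""
    heap = []          # entries: (institution, probability, name)
    cutoff_index = []  # entries: (institution, probability, name, pointer)
    for name, (dist, existence) in authors.items():
        entries = [(inst, existence * p) for inst, p in dist.items()]
        # Assumption: the most probable alternative always stays in the heap.
        first_inst, _ = max(entries, key=lambda e: e[1])
        for inst, prob in entries:
            if prob >= cutoff or inst == first_inst:
                heap.append((inst, prob, name))
            else:
                # Low-probability alternative: keep only a pointer to the
                # cluster where the tuple is physically stored.
                cutoff_index.append((inst, prob, name, first_inst))
    heap.sort(key=lambda e: (e[0], -e[1]))
    cutoff_index.sort(key=lambda e: (e[0], -e[1]))
    return heap, cutoff_index


heap, cutoff_index = build_upi(AUTHORS, cutoff=0.10)
for inst, prob, name in heap:
    print(f"heap    {inst} ({prob:.0%}) {name}")
for inst, prob, name, pointer in cutoff_index:
    print(f"cutoff  {inst} ({prob:.0%}) {name} -> {pointer}")
```

With C = 10%, only Bob's UCB alternative (5%) falls below the threshold, so the cutoff index holds a single entry pointing at the MIT cluster.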

Answering Queries with UPI
Probabilistic Threshold Query (PTQ):
  SELECT * FROM Author WHERE Inst.=UCB
  With: Probability ≥ QT (Query Threshold)
Heap (C=10%):
  Institution  Tuple
  MIT (95%)    Bob
  …
  UCB (90%)    Dan
  UCB (20%)    Emily
Cutoff Index:
  Institution  TupleID  Pointer
  UCB (5%)     Bob      MIT
If QT≥C (e.g., QT=20%): seek to the cluster, then read sequentially
If QT<C (e.g., QT=5%): also follow the cutoff pointers
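
The two cases on this slide can be sketched as below. The data is the slide's example; the function name `ptq` and the way seeks are counted (one seek to reach the cluster, one per cutoff pointer followed) are assumptions for illustration only.

```python
from bisect import bisect_left

# Heap entries from the slide, sorted by (institution, descending probability).
HEAP = [
    ("MIT", 0.95, "Bob"),
    ("UCB", 0.90, "Dan"),
    ("UCB", 0.20, "Emily"),
]
# Cutoff index entries: (institution, probability, tuple id, pointer into the heap).
CUTOFF_INDEX = [("UCB", 0.05, "Bob", "MIT")]
C = 0.10  # cutoff threshold


def ptq(heap, cutoff_index, cutoff, institution, qt):
    """Answer a probabilistic threshold query; return (matching names, seek count)."""
    names = []
    # One seek to the start of the cluster, then a sequential read while
    # the probability stays at or above the query threshold.
    keys = [inst for inst, _, _ in heap]
    i = bisect_left(keys, institution)
    seeks = 1
    while i < len(heap) and heap[i][0] == institution and heap[i][1] >= qt:
        names.append(heap[i][2])
        i += 1
    # Only when QT < C can qualifying entries live in the cutoff index.
    if qt < cutoff:
        for inst, prob, name, _pointer in cutoff_index:
            if inst == institution and prob >= qt:
                names.append(name)
                seeks += 1  # follow the pointer to the tuple's home cluster
    return names, seeks


high_qt = ptq(HEAP, CUTOFF_INDEX, C, "UCB", 0.20)
low_qt = ptq(HEAP, CUTOFF_INDEX, C, "UCB", 0.05)
print(high_qt, low_qt)
```

The QT=20% query touches only the clustered heap, while the QT=5% query pays one extra seek to fetch Bob through the cutoff index.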

Choosing Cutoff Threshold
SELECT * FROM Author WHERE Institution=Ishikawa U
Threshold: confidence QT (QT is given at runtime)
Cutoff threshold C trade-off:
- Lower C: faster, but larger
- Higher C: slower, but smaller

Determining C
Based on value/probability histograms
Histogram (Inst.):
  Inst.  #Keys
  …
  Br*    30,000
  Bs*    31,000
  Bt*    30,500
Histogram (Prob.):
  Prob.     #Keys
  …
  10%-15%   15,000
  15%-25%   28,000
  25%-40%   33,000
Tolerable average query runtime -> #Pointers -> C
Available disk capacity -> UPI size -> C
Query runtime = Cost_fullscan * Selectivity + Cost_seek * #Pointers
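
A rough sketch of how the histogram and the cost formula could be evaluated. The bucket counts come from the slide; the uniform-within-bucket interpolation, the function names, and the unit costs in the example call are assumptions, not values from the talk.

```python
# Probability histogram from the slide: (low, high, #keys)
PROB_HISTOGRAM = [
    (0.10, 0.15, 15_000),
    (0.15, 0.25, 28_000),
    (0.25, 0.40, 33_000),
]


def keys_in_range(hist, lo, hi):
    """Estimate #keys with probability in [lo, hi).

    Assumes keys are spread uniformly inside each bucket.
    """
    total = 0.0
    for b_lo, b_hi, count in hist:
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        total += count * overlap / (b_hi - b_lo)
    return total


def estimated_cost(cost_fullscan, selectivity, cost_seek, n_pointers):
    """Cost formula from the slide: Cost_fullscan * Selectivity + Cost_seek * #Pointers."""
    return cost_fullscan * selectivity + cost_seek * n_pointers


# Entries that a cutoff threshold of C=25% would push out of the heap,
# counting only the buckets listed on the slide.
below_cutoff = keys_in_range(PROB_HISTOGRAM, 0.10, 0.25)

# Hypothetical unit costs in milliseconds, chosen only to exercise the formula.
example_cost = estimated_cost(cost_fullscan=10_000.0, selectivity=0.01,
                              cost_seek=10.0, n_pointers=300)
print(below_cutoff, example_cost)
```

Raising C moves more entries into the cutoff index (smaller heap) but increases the pointer term of the cost for queries with QT<C.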

#Pointers and Query Cost
(replace Ishikawa U with Stanford)
Figure: query cost vs. #pointers, showing saturation. Cost model: logistic function.

Secondary Index on UPI
SELECT * FROM Author WHERE Country=US
  Name   Institution^p               Country^p              Existence
  Alice  Brown: 80%, MIT: 20%        US: 100%               90%
  Bob    MIT: 95%, UCB: 5%           US: 100%               100%
  Carol  Brown: 60%, U. Tokyo: 40%   US: 60%, Japan: 40%    80%
Heap:
  Institution  Tuple
  Brown (72%)  Alice
  …
  MIT (95%)    Bob
  MIT (18%)    Alice
Secondary Index on (Country):
  Country    TupleID  Pointers
  US (100%)  Bob      MIT
  US (90%)   Alice    Brown, MIT
Store multiple pointers
Tailored access
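
One way to picture tailored access: because a secondary index entry stores several pointers per tuple, the reader can choose the pointer that lands in a heap cluster it has to visit anyway. The greedy rule below is an assumption for illustration; the slide does not say how the pointers are chosen, and `tailor` is a hypothetical name.

```python
# Secondary index entries for Country=US from the slide: (tuple id, heap pointers)
US_ENTRIES = [
    ("Bob", ["MIT"]),
    ("Alice", ["Brown", "MIT"]),
]


def tailor(entries):
    """Pick one pointer per tuple so that few distinct heap clusters are visited.

    Greedy heuristic (an assumption): repeatedly choose the cluster that can
    serve the most still-unassigned tuples. Every tuple needs at least one pointer.
    """
    remaining = {name: set(pointers) for name, pointers in entries}
    plan = {}
    while remaining:
        # Count how many unassigned tuples each cluster could serve.
        counts = {}
        for pointers in remaining.values():
            for p in pointers:
                counts[p] = counts.get(p, 0) + 1
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(counts), key=counts.get)
        for name in [n for n, pointers in remaining.items() if best in pointers]:
            plan[name] = best
            del remaining[name]
    return plan


plan = tailor(US_ENTRIES)
clusters_visited = set(plan.values())
print(plan, clusters_visited)
```

In the slide's example, both Bob and Alice can be read from the MIT cluster, so a single heap region is visited instead of two.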

Experiments
Environment:
- C++ & BerkeleyDB 4.7 on Fedora Core 11
- Quad-core, 4GB RAM, 10k RPM SATA HDD
Dataset: DBLP w/ uncertain affiliation
- 700k authors and 1.3M publications
- SwetoDblp, Google API (up to ten institutions per author)
Compared with PII

Query Runtime: PII vs. UPI
Q1: SELECT * FROM Author WHERE Institution=x
PII:
  Step          Elapsed
  Read PII      5 [ms]
  Sort Pointer  30 [µs]
  Read Heap     5,200 [ms]
UPI:
  Step          Elapsed
  Read UPI      47 [ms]
UPI causes far fewer disk seeks

Secondary Index Access
Q2: SELECT Journal, COUNT (*) FROM Publication WHERE Country=x GROUP BY Journal
Without tailoring:
  Step      Elapsed
  Read PII  110 [ms]
  Read UPI  3,200 [ms]
With tailored access:
  Step      Elapsed
  Read PII  110 [ms]
  Tailor    33 [ms]
  Read UPI  500 [ms]

Conclusion and Future Work
UPI:
- Heap + Cutoff Index
- Tailored Secondary Index Access
- Fractured UPI (not presented here)
Future work: applying to other types of queries
- Top-k Query: UPI as Tuple Access Layer

Thanks!

Fractured UPI
Figure: operations SELECT, INSERT, DELETE. The insert buffer (on RAM) is dumped into a new fracture. The main fracture and Fracture 1 each hold a heap file, cutoff index, secondary index, and delete set. A query runs on each fracture independently.
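
A toy sketch of the fracture idea, built only from the components named in the figure (insert buffer, dump, per-fracture delete set, independent querying). The class and method names are hypothetical, heap file and indexes are collapsed into one dictionary per fracture, and tuple ids are assumed never to be reused; the real structure is certainly more involved.

```python
class FracturedStore:
    """Toy model: immutable fractures plus an in-RAM insert buffer."""

    def __init__(self, buffer_limit=2):
        self.fractures = []          # each: {"rows": {...}, "deleted": set()}
        self.insert_buffer = {}      # tuple id -> row, held in RAM
        self.pending_deletes = set() # deletes not yet dumped
        self.buffer_limit = buffer_limit

    def insert(self, tuple_id, row):
        self.insert_buffer[tuple_id] = row
        if len(self.insert_buffer) >= self.buffer_limit:
            self.dump()

    def delete(self, tuple_id):
        if tuple_id in self.insert_buffer:
            del self.insert_buffer[tuple_id]
        else:
            # Existing fractures are never rewritten; remember the delete instead.
            self.pending_deletes.add(tuple_id)

    def dump(self):
        """Write the buffer out as a new fracture together with its delete set."""
        self.fractures.append({
            "rows": dict(self.insert_buffer),
            "deleted": set(self.pending_deletes),
        })
        self.insert_buffer.clear()
        self.pending_deletes.clear()

    def select(self, predicate):
        """Query every fracture independently, then drop deleted tuples."""
        deleted = set(self.pending_deletes)
        for fracture in self.fractures:
            deleted |= fracture["deleted"]
        sources = [f["rows"] for f in self.fractures] + [self.insert_buffer]
        return [(tid, row) for rows in sources for tid, row in rows.items()
                if tid not in deleted and predicate(row)]


store = FracturedStore(buffer_limit=2)
store.insert(1, "MIT")
store.insert(2, "UCB")   # buffer is full, so it is dumped as a fracture
store.insert(3, "MIT")
store.delete(1)          # tuple 1 lives in a fracture: recorded, not rewritten
result = store.select(lambda row: row == "MIT")
print(result)
```

Inserts and deletes never touch existing fractures in this model, which matches the slide's picture of cheap updates at the price of querying more fractures.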

Fractured UPI
                    Insert 10%   Delete 1%
  Unclustered Heap  8 sec        75 sec
  UPI               650 sec      212 sec
  Fractured UPI     4 sec        0.03 sec
Figure labels: Fragmentation; More Fractures

Cutoff Index Cost Model (1)
Selective case (Q1, #Pointers=300)
Figure: real runtime vs. estimated runtime

Cutoff Index Cost Model (2)
Non-selective case (Q1, #Pointers=37000)
Figure: real runtime vs. estimated runtime