UPI: A Primary Index for Uncertain Databases (VLDB 10)

Hideaki Kimura (BrownU), Samuel Madden (MIT), Stanley B. Zdonik (BrownU)
Speaker: Yinuo Zhang
Supervisor: Dr. Reynold Cheng

Outline
- Introduction
- Uncertain Primary Index (UPI)
- Secondary Index on UPI
- Experiments
- Conclusion and Future Work

Introduction
Table Author:
  Name   Institution^p              Existence
  Alice  Brown: 80%, MIT: 20%       90%
  Bob    MIT: 95%, UCB: 5%          100%
  Carol  Brown: 60%, U. Tokyo: 40%  80%
Query: SELECT * FROM Author WHERE Institution=MIT
Threshold: confidence QT
Query answering over possible world semantics.
A possible world:
  Name   Institution
  Alice  Brown
  Bob    MIT
The probability of this world: (90% * 80%) * (100% * 95%) * 20% = 13.7%, where 20% is the probability that Carol does not exist.
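
The 13.7% figure above can be reproduced with a short sketch. It assumes tuples are independent, which the slide's product implies; the names AUTHOR, world_probability and query_confidence are illustrative and not taken from the paper.

```python
# Table Author from the slide: name -> (institution alternatives, existence probability).
AUTHOR = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def world_probability(world):
    """Probability of one possible world, assuming independent tuples.

    `world` maps every author that exists in the world to one institution;
    authors missing from `world` are taken not to exist.
    """
    p = 1.0
    for name, (alternatives, existence) in AUTHOR.items():
        if name in world:
            p *= existence * alternatives[world[name]]
        else:
            p *= 1.0 - existence
    return p


def query_confidence(name, institution):
    """Confidence that `name` appears with `institution` (existence * alternative)."""
    alternatives, existence = AUTHOR[name]
    return existence * alternatives.get(institution, 0.0)


if __name__ == "__main__":
    # The world from the slide: Alice at Brown, Bob at MIT, Carol absent.
    print(round(world_probability({"Alice": "Brown", "Bob": "MIT"}), 4))
    # Per-tuple confidence for WHERE Institution=MIT, to be compared against QT.
    for author in AUTHOR:
        print(author, round(query_confidence(author, "MIT"), 4))
```

The per-tuple confidence is the value a threshold query compares against QT.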

Example: DBLP with Uncertain Affiliation
- DBLP: 1.3M papers and 0.7M authors
- Author affiliation is complemented: DBLP has no institution (Name: David DeWitt, Inst.: ?), so it is looked up through the Google API
- Search results for q=“David DeWitt”:
  Rank  URL
  1     Wisc.edu/…
  2     Microsoft.com/…
  3     Columbia.edu/…
- Zipfian distribution over the results:
  Name          Institution^p                               Country^p
  David DeWitt  Wisconsin: 40%, MS: 20%, Columbia: 13%, …  US: 100%

Introduction
- Achieving an efficient implementation using possible world semantics is difficult.
- Probabilistic Inverted Index (PII) [Singh07] is a secondary index: its entries hold pointers into the heap, so reading the tuples requires disk seeks.
[Figure: PII entries (Institution, Pointer) pointing into the Heap: Brown [Alice] 0.3 [Carol] 0.2 [Bob] 0.4 MIT [Bob] 0.3 [Alice] 0.8; heap access labeled "Disk Seeking"]

Introduction
Goal: build a primary index over uncertain attributes, so that matching tuples can be fetched with a sequential read.

Challenges on Building a Primary Index over Uncertain Data
Table Author:
  Name   Institution^p              Existence
  Alice  Brown: 80%, MIT: 20%       90%
  Bob    MIT: 95%, UCB: 5%          100%
  Carol  Brown: 60%, U. Tokyo: 40%  80%
Option 1: cluster on the most probable value?
  Cluster  Tuples
  Brown    Alice, Carol
  MIT      Bob
  Problem: Alice? (SELECT … WHERE Inst.=MIT does not find Alice in the MIT cluster)
Option 2: replicate tuples into an inverted index?
  Cluster  Tuples
  Brown    Alice, Carol, …
  MIT      Alice, Bob, …
  Problem: too large for long-tail distributions (e.g., 100 values with 0.1%)

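UPI: Heap + Cutoff Index
Table Author:
  Name   Institution^p              Existence
  Alice  Brown: 80%, MIT: 20%       90%
  Bob    MIT: 95%, UCB: 5%          100%
  Carol  Brown: 60%, U. Tokyo: 40%  80%
Heap, sorted by (Inst., Prob):
  Institution     Tuple
  Brown (72%)     Alice
  Brown (48%)     Carol
  MIT (95%)       Bob
  MIT (18%)       Alice
  U. Tokyo (32%)  Carol
Cutoff Index, sorted by (Inst., Prob):
  Institution  TupleID  Pointer
  UCB (5%)     Bob      MIT
Each probability is the existence probability times the alternative's probability (e.g., Alice at Brown: 90% * 80% = 72%). Entries with probability less than C (the cutoff threshold), such as UCB (5%), are cut off from the heap and kept only in the cutoff index, with a pointer to the tuple's heap entry.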
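
A minimal sketch of how the heap and cutoff index on this slide could be derived from Table Author. It assumes C = 10% and that a tuple's most probable alternative always stays in the heap so that cutoff pointers have a target. Both are assumptions made for illustration, and build_upi is a hypothetical name.

```python
# Table Author from the slide: name -> (institution alternatives, existence probability).
AUTHOR = {
    "Alice": ({"Brown": 0.80, "MIT": 0.20}, 0.90),
    "Bob": ({"MIT": 0.95, "UCB": 0.05}, 1.00),
    "Carol": ({"Brown": 0.60, "U. Tokyo": 0.40}, 0.80),
}


def build_upi(table, cutoff):
    """Split (institution, probability, tuple) entries into heap and cutoff index."""
    heap, cutoff_index = [], []
    for name, (alternatives, existence) in table.items():
        # Assumption: the most probable alternative is the pointer target
        # and is therefore never cut off.
        first = max(alternatives, key=alternatives.get)
        for institution, p in alternatives.items():
            prob = round(existence * p, 4)
            if prob >= cutoff or institution == first:
                heap.append((institution, prob, name))
            else:
                cutoff_index.append((institution, prob, name, first))

    def order(entry):
        # Sort by institution, then by descending probability.
        return (entry[0], -entry[1])

    heap.sort(key=order)
    cutoff_index.sort(key=order)
    return heap, cutoff_index


if __name__ == "__main__":
    heap, cutoff_index = build_upi(AUTHOR, cutoff=0.10)
    for entry in heap:
        print("heap  ", entry)
    for entry in cutoff_index:
        print("cutoff", entry)
```

With these inputs the sketch yields the five heap entries and the single cutoff entry (UCB, 5%, Bob, pointer to MIT) shown on the slide.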

Answering Queries with UPI
Probabilistic Threshold Query (PTQ):
  SELECT * FROM Author WHERE Inst.=UCB
  with probability ≥ QT (query threshold)
Heap (C=10%):
  Institution  Tuple
  MIT (95%)    Bob
  …
  UCB (90%)    Dan
  UCB (20%)    Emily
Cutoff Index:
  Institution  TupleID  Pointer
  UCB (5%)     Bob      MIT
- If QT ≥ C (e.g., QT=20%): seek to the first UCB entry in the heap and read sequentially.
- If QT < C (e.g., QT=5%): also follow the cutoff pointers.
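
The two cases above can be sketched as follows. The heap and cutoff index contain only the entries visible on the slide, a binary search over an in-memory list stands in for the disk seek, and the names ptq and follow_pointer are hypothetical.

```python
from bisect import bisect_left

# Heap entries from the slide, sorted by (institution, descending probability).
HEAP = [
    ("MIT", 0.95, "Bob"),
    ("UCB", 0.90, "Dan"),
    ("UCB", 0.20, "Emily"),
]
# Cutoff index entries: (institution, probability, tuple, pointer target in the heap).
CUTOFF_INDEX = [("UCB", 0.05, "Bob", "MIT")]
C = 0.10  # cutoff threshold shown on the slide


def follow_pointer(target_institution, name):
    """Fetch a tuple through a cutoff pointer (one extra random access)."""
    for institution, _prob, tuple_name in HEAP:
        if institution == target_institution and tuple_name == name:
            return tuple_name
    raise KeyError((target_institution, name))


def ptq(institution, qt):
    """Return (tuple, probability) pairs with Inst.=institution and probability >= qt."""
    results = []
    # One seek to the first entry for the institution, then a sequential read.
    keys = [(inst, -prob) for inst, prob, _ in HEAP]
    pos = bisect_left(keys, (institution, -1.0))
    while pos < len(HEAP) and HEAP[pos][0] == institution and HEAP[pos][1] >= qt:
        results.append((HEAP[pos][2], HEAP[pos][1]))
        pos += 1
    # Only thresholds below C can match entries that were cut off.
    if qt < C:
        for inst, prob, name, target in CUTOFF_INDEX:
            if inst == institution and prob >= qt:
                results.append((follow_pointer(target, name), prob))
    return results


if __name__ == "__main__":
    print(ptq("UCB", 0.20))  # sequential read only
    print(ptq("UCB", 0.05))  # sequential read plus cutoff pointers
```

Because entries are sorted by descending probability within an institution, the sequential read can stop at the first entry below QT.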

Choosing Cutoff Threshold
Query: SELECT * FROM Author WHERE Institution=Ishikawa U
Threshold: confidence QT (QT is given at runtime)
[Figure: axis of cutoff threshold C; lower C: faster, but larger; higher C: slower, but smaller]

Determining C
Based on value/probability histograms.
Histogram (Inst.):
  Inst.  #Keys
  …
  Br*    30,000
  Bs*    31,000
  Bt*    30,500
Histogram (Prob.):
  Prob.    #Keys
  …
  10%-15%  15,000
  15%-25%  28,000
  25%-40%  33,000
Constraints:
- Available disk capacity -> UPI size
- Tolerable average query runtime -> query cost
Query cost = Cost_fullscan * Selectivity + Cost_seek * #Pointers
[Figure: plots against C, including #Pointers vs. C]

#Pointers and Query Cost
(Same query, with Ishikawa U replaced by Stanford.)
[Figure: query cost vs. #Pointers, showing saturation; the cost model uses a logistic function]
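
A small sketch of the cost formula on this slide, in its linear form only. The cost constants are made-up placeholders, not measurements from the talk, and the function name is hypothetical.

```python
def estimated_query_cost(cost_fullscan, selectivity, cost_seek, num_pointers):
    """Linear estimate: sequential part plus one seek per cutoff pointer followed."""
    return cost_fullscan * selectivity + cost_seek * num_pointers


if __name__ == "__main__":
    # Hypothetical constants (milliseconds), chosen only to show the trend.
    COST_FULLSCAN_MS = 1000.0
    COST_SEEK_MS = 5.0
    SELECTIVITY = 0.01
    for pointers in (0, 300, 37000):
        cost = estimated_query_cost(COST_FULLSCAN_MS, SELECTIVITY, COST_SEEK_MS, pointers)
        print(pointers, "pointers ->", round(cost, 1), "ms")
```

Raising C moves more entries into the cutoff index, which increases #Pointers for low-threshold queries while shrinking the heap.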

Secondary Index on UPI
Query: SELECT * FROM Author WHERE Country=US
Table Author:
  Name   Institution^p              Country^p             Existence
  Alice  Brown: 80%, MIT: 20%       US: 100%              90%
  Bob    MIT: 95%, UCB: 5%          US: 100%              100%
  Carol  Brown: 60%, U. Tokyo: 40%  US: 60%, Japan: 40%   80%
Heap:
  Institution  Tuple
  Brown (72%)  Alice
  …
  MIT (95%)    Bob
  MIT (18%)    Alice
Secondary Index on (Country):
  Country    TupleID  Pointers
  US (100%)  Bob      MIT
  US (90%)   Alice    Brown, MIT
- Store multiple pointers per entry, one for each heap copy of the tuple
- Tailored access
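
One plausible reading of "Tailored Access" is that, when an index entry has several pointers, the query picks the one that lets it touch the fewest heap regions. The greedy heuristic below is an assumption made for illustration, not necessarily the authors' algorithm, and the name tailor is hypothetical.

```python
from collections import Counter

# Secondary index entries for Country=US from the slide: tuple -> heap regions holding a copy.
US_ENTRIES = {"Bob": ["MIT"], "Alice": ["Brown", "MIT"]}


def tailor(entries):
    """Greedily pick, for each tuple, a heap region shared with as many other tuples as possible.

    Assumes every tuple has at least one pointer.
    """
    remaining = dict(entries)
    plan = {}
    while remaining:
        counts = Counter(region for regions in remaining.values() for region in regions)
        # Most frequently referenced region; ties are broken alphabetically.
        best = max(sorted(counts), key=counts.get)
        for name in [n for n, regions in remaining.items() if best in regions]:
            plan[name] = best
            del remaining[name]
    return plan


if __name__ == "__main__":
    plan = tailor(US_ENTRIES)
    print(plan)
    print("heap regions touched:", len(set(plan.values())))
```

For the two entries on the slide, both tuples can be read from the MIT region, so one heap region is visited instead of two.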

Experiments
Environment:
- C++ & BerkeleyDB 4.7 on Fedora Core 11
- Quad-core, 4GB RAM, 10k RPM SATA HDD
Dataset: DBLP with uncertain affiliation
- 700k authors and 1.3M publications
- SwetoDblp, Google API (up to ten institutions per author)
Compared with PII

Query Runtime: PII vs. UPI
Q1: SELECT * FROM Author WHERE Institution=x
PII:
  Step          Elapsed
  Read PII      5 [ms]
  Sort Pointer  30 [µs]
  Read Heap     5,200 [ms]
UPI:
  Step          Elapsed
  Read UPI      47 [ms]
UPI causes far fewer disk seeks.

Secondary Index Access
Q2: SELECT Journal, COUNT (*) FROM Publication WHERE Country=x GROUP BY Journal
Without the Tailor step:
  Step      Elapsed
  Read PII  110 [ms]
  Read UPI  3,200 [ms]
With the Tailor step:
  Step      Elapsed
  Read PII  110 [ms]
  Tailor    33 [ms]
  Read UPI  500 [ms]

Conclusion and Future Work
- UPI: Heap + Cutoff Index
- Tailored Secondary Index Access
- Fractured UPI (not presented here)
- Future work: applying UPI to other types of queries, e.g., Top-k Query with UPI as the tuple access layer

Thanks!

Fractured UPI
[Figure: INSERT and DELETE go to an Insert Buffer (on RAM), which is dumped as a New Fracture. Each fracture (Main Fracture, Fracture 1, …) has its own Heap File, Cutoff Index, Secondary Index and Delete Set. SELECT queries each fracture independently.]

Fractured UPI
                    Insert 10%  Delete 1%
  Unclustered Heap  8 sec       75 sec
  UPI               650 sec     212 sec
  Fractured UPI     4 sec       0.03 sec
[Slide annotations: Fragmentation; More Fractures]

Cutoff Index Cost Model (1)
Selective case (Q1, #Pointers=300)
[Figure: Real Runtime vs. Estimated Runtime]

Cutoff Index Cost Model (2)
Non-selective case (Q1, #Pointers=37000)
[Figure: Real Runtime vs. Estimated Runtime]
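
A toy, in-memory sketch of the structure named on this slide: an insert buffer that is dumped into new fractures, per-fracture delete sets, and a SELECT that visits every fracture. The slide gives only component names, so the class names, the newest-to-oldest scan and the rule that a delete set hides tuples in older fractures are all assumptions.

```python
class Fracture:
    """Immutable dump of buffered changes; stands in for heap file plus indexes."""

    def __init__(self, rows, deleted):
        self.rows = sorted(rows)        # (institution, tuple_id) pairs
        self.delete_set = set(deleted)  # assumed to hide ids in OLDER fractures


class FracturedUPI:
    def __init__(self):
        self.fractures = []         # oldest first; fractures[0] plays the main fracture
        self.insert_buffer = []     # kept in RAM until dumped
        self.delete_buffer = set()

    def insert(self, tuple_id, institution):
        self.insert_buffer.append((institution, tuple_id))

    def delete(self, tuple_id):
        self.insert_buffer = [r for r in self.insert_buffer if r[1] != tuple_id]
        self.delete_buffer.add(tuple_id)

    def dump(self):
        """Write the buffered changes out as a new fracture."""
        self.fractures.append(Fracture(self.insert_buffer, self.delete_buffer))
        self.insert_buffer, self.delete_buffer = [], set()

    def select(self, institution):
        """Query the buffer and each fracture independently, newest to oldest."""
        hidden = set(self.delete_buffer)
        found = [t for inst, t in self.insert_buffer if inst == institution]
        for fracture in reversed(self.fractures):
            found += [t for inst, t in fracture.rows
                      if inst == institution and t not in hidden]
            hidden |= fracture.delete_set
        return sorted(found)


if __name__ == "__main__":
    upi = FracturedUPI()
    upi.insert("Alice", "MIT")
    upi.insert("Bob", "MIT")
    upi.dump()
    upi.delete("Alice")
    upi.insert("Dan", "MIT")
    print(upi.select("MIT"))
    upi.dump()
    print(upi.select("MIT"))
```

Inserts and deletes only touch the RAM buffer, so no existing fracture is rewritten; the price is that every query has to consult each fracture.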