UPI: A Primary Index for Uncertain Databases (VLDB 10)

UPI: A Primary Index for Uncertain Databases (VLDB 10)
Hideaki Kimura (BrownU) Samuel Madden (MIT) Stanley B. Zdonik (BrownU) Speaker: Yinuo Zhang Supervisor: Dr. Reynold Cheng

Outline Introduction Uncertain Primary Index (UPI)
Secondary Index on UPI Experiments Conclusion and Future Work

Introduction Table Author A Possible World Name Institutionp Existence
Alice Brown: 80%, MIT: 20% 90% Bob MIT: 95%, UCB: 5% 100% Carol Brown: 60%, U. Tokyo: 40% 80% A Possible World Name Institution Alice Brown Bob MIT SELECT * FROM Author WHERE Institution=MIT Threshold: confidence QT Query answering over Possible World Semantics The probability of such world: 90%*80% * 100%*95% * 20% = 13.7%

Ex) DBLP with Uncertain Affiliation
DBLP: 1.3M Papers and 0.7M Authors Complemented Author Affiliation q=“David DeWitt” Rank URL 1 Wisc.edu/… 2 Microsoft.com/… 3 Columbia.edu/… Name Inst. David DeWitt ? Google API Zipfian Distribution Name Institutionp Countryp David DeWitt Wisconsin: 40%, MS: 20%, Columbia: 13%, … US: 100%

Introduction Achieving an efficient implementation using possible world semantics is difficult. Probabilistic Inverted Index [Singh07] – a secondary index Heap Institution Pointer Brown [Alice] 0.3 [Carol] 0.2 [Bob] 0.4 MIT [Bob] 0.3 [Alice] 0.8 Disk Seeking

Over Uncertain Attributes
Introduction Goal Build A Primary Index Over Uncertain Attributes Primary Index Seq. Read

Challenges on Building PI over Uncertain Data
Name Institutionp Existence Alice Brown: 80%, MIT: 20% 90% Bob MIT: 95%, UCB: 5% 100% Carol Brown: 60%, U. Tokyo: 40% 80% Cluster on most probable possible value? Replicate tuples into inverted index? Cluster Tuples Brown Alice, Carol MIT Bob Cluster Tuples Brown Alice, Carol, … MIT Alice, Bob, … Too Large for Long-tail distribution (e.g., 100 values with 0.1%) Alice? SELECT …WHERE Inst.=MIT

UPI: Heap + Cutoff Index
Name Institutionp Existence Alice Brown: 80%, MIT: 20% 90% Bob MIT: 95%, UCB: 5% 100% Carol Brown: 60%, U. Tokyo: 40% 80% Heap: Sorted by (Inst., Prob) Cutoff Index: Sorted by (Inst., Prob) Institution Tuple Brown (72%) Alice Brown (48%) Carol MIT (95%) Bob MIT (18%) UCB (5%) U. Tokyo (32%) Institution TupleID Pointer UCB (5%) Bob MIT Cutoff Entries with Less than C probability (Cutoff Threshold)

Answering Queries with UPI
Probabilistic Threshold Query (PTQ) SELECT * FROM Author WHERE Inst.=UCB With: Probability ≥ QT (Query Threshold) Institution Tuple MIT (95%) Bob … UCB (90%) Dan UCB (20%) Emily C=10% Seek Institution TupleID Pointer UCB (5%) Bob MIT If QT<C (e.g., QT=5%), follow Cutoff pointers If QT≥C (e.g., QT=20%), Sequentially Read

Choosing Cutoff Threshold
Faster, but Larger Slower, but Smaller Cutoff Threshold C SELECT * FROM Author WHERE Institution=Ishikawa U Threshold: confidence QT (QT is given at runtime)

Determining C Based on Value/Probability Histograms Histograms (Inst.)
#Keys … Br* 30,000 Bs* 31,000 Bt* 30,500 Prob. #Keys … 10%-15% 15,000 15%-25% 28,000 25%-40% 33,000 Histograms (Inst.) Tolerable average query runtime Available Disk Capability -> UPI Size = Costfullscan * Selectivity+ Costseek * # Pointers ? #Pointers C C

#Pointers and Query Cost
(replace Ishikawa U with Stanford) Saturation Cost Model Logistic function

Secondary Index on UPI SELECT * FROM Author WHERE Country=US
Name Institutionp Countryp Existence Alice Brown: 80%, MIT: 20% US: 100% 90% Bob MIT: 95%, UCB: 5% 100% Carol Brown: 60%, U. Tokyo: 40% US: 60%, Japan: 40% 80% SELECT * FROM Author WHERE Country=US Secondary Index on (Country) Institution Tuple Brown (72%) Alice … MIT (95%) Bob MIT (18%) Country TulpleID Pointers US (100%) Bob MIT US (90%) Alice Brown, MIT Store Multiple Pointers Tailored Access

Experiments Environments C++ & BerkeleyDB 4.7 on Fedora Core 11
Quad-Core, 4GB RAM, 10k RPM SATA HDD Dataset: DBLP w/ Uncertain Affiliation 700k authors and 1.3M publications SwetoDblp, Google API (institutions up to ten per author) Compared With PII

Query Runtime: PII vs. UPI
Q1: SELECT * FROM Author WHERE Institution=x Elapsed Read PII 5 [ms] Sort Pointer 30 [µs] Read Heap 5,200 [ms] Elapsed Read UPI 47 [ms] UPI Causes Much Fewer Disk Seeks

Secondary Index Access
Q2: SELECT Journal, COUNT (*) FROM Publication WHERE Country=x GROUP BY Journal Elapsed Read PII 110 [ms] Read UPI 3,200 [ms] Elapsed Read PII 110 [ms] Tailor 33 [ms] Read UPI 500 [ms]

Conclusion and Future Work
UPI Heap + Cutoff Index Tailored Secondary Index Access Fractured UPI (not presented here) Applying to other types of queries Top-k Query: UPI as Tuple Access Layer

Thanks!

Fractured UPI New Fracture Main Fracture Fracture 1 Query Dump
Heap File Cutoff Index Delete Set Delete Set 2ndary Index 2ndary Index Query Independently Dump Insert Buffer (On RAM) SELECT INSERT DELETE

Fractured UPI 8 sec 75 sec 650 sec 212 sec 4 sec 0.03 sec Insert 10%
Delete 1% Unclustered Heap 8 sec 75 sec UPI 650 sec 212 sec Fractured UPI 4 sec 0.03 sec Fragmentation More Fractures

Cutoff Index Cost Model (1)
Selective Case (Q1, #Pointers=300) Real Runtime Estimated Runtime

Cutoff Index Cost Model (2)
Non-Selective Case (Q1, #Pointers=37000) Real Runtime Estimated Runtime

UPI: A Primary Index for Uncertain Databases (VLDB 10)

Similar presentations

Presentation on theme: "UPI: A Primary Index for Uncertain Databases (VLDB 10)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

UPI: A Primary Index for Uncertain Databases (VLDB 10)

Similar presentations

Presentation on theme: "UPI: A Primary Index for Uncertain Databases (VLDB 10)"— Presentation transcript:

Similar presentations

About project

Feedback