Interactive Data Exploration Using Semantic Windows
Alexander Kalinin, Ugur Cetintemel, Stan Zdonik


Interactive Data Exploration (IDE)
- Searching for “interesting stuff” within big data
- Exploratory analysis: ad-hoc & repetitive
  - Questions are not well defined
  - “Interesting” can be complex
- Human-in-the-loop operation
  - Fast, online results
  - Query refinement
(Images: “Where’s Waldo?” and “Where’s Horrible Gelatinous Blob?”)

Exploratory Queries: An SDSS example
- Searching for regions of interest
  - “Celestial 3-5° by 5-7° rectangular regions with average brightness > 0.8”
- Shape-based conditions
  - “3-5° by 5-7° regions”
- Content-based conditions
  - “average brightness > 0.8”
- Sloan Digital Sky Survey (SDSS)
- Semantic Windows

“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
1. Divide the data into cells
2. Enumerate all regions
3. Final filtering (> 0.8)
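The three steps above can be sketched in Python to show why the plain enumeration is expensive. This is only an illustration, not the SQL from the talk: the function names, the square cell size, and the toy points are all assumptions.

```python
from collections import defaultdict
from itertools import product

def to_cells(points, step):
    """Step 1: bucket (x, y, value) points into square grid cells (sum, count)."""
    cells = defaultdict(lambda: [0.0, 0])
    for x, y, value in points:
        key = (int(x // step), int(y // step))
        cells[key][0] += value
        cells[key][1] += 1
    return cells

def naive_search(cells, n_x, n_y, widths, heights, threshold):
    """Steps 2 and 3: enumerate every window of an allowed shape, then filter."""
    results = []
    for w, h in product(widths, heights):
        for x0 in range(n_x - w + 1):
            for y0 in range(n_y - h + 1):
                total, count = 0.0, 0
                for cx in range(x0, x0 + w):
                    for cy in range(y0, y0 + h):
                        s, c = cells.get((cx, cy), (0.0, 0))
                        total += s
                        count += c
                # Keep the window only if its average exceeds the threshold.
                if count and total / count > threshold:
                    results.append((x0, y0, w, h))
    return results

# Hypothetical toy data: a 3 x 2 grid with one bright 2 x 2 region on the right.
points = [(0.5, 0.5, 0.1), (1.5, 0.5, 0.9), (2.5, 0.5, 0.9),
          (0.5, 1.5, 0.2), (1.5, 1.5, 0.9), (2.5, 1.5, 0.9)]
cells = to_cells(points, step=1)
found = naive_search(cells, n_x=3, n_y=2, widths=[2], heights=[2], threshold=0.8)
print(found)
```

Every window is fully aggregated before any filtering happens, so the work grows with the number of shapes times the number of positions times the window area.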

No native support for exploratory constructs!
- SQL queries
  - No power set
  - GROUP BY: no overlaps
  - OVER: too restrictive
- Performance problems
  - Large CPU overhead
  - Hard to optimize
  - No interactivity

SQL/SW Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 3 AND
       size(dec) >= 1 AND size(dec) <= 3
(Figure: grid over the ra and dec axes.)

Search Process Outline
1. Dynamically enumerate windows (subject to pruning)
2. Study in order of utility
3. Output the windows satisfying the conditions
Focus is on online results!

Enumerating Windows
- Extension: any dimension, one step
(Figure: windows over cells 1-4, such as 12, 13, 24, 34, and 1234; window 1 is extended to window 12.)
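A minimal sketch of the extension step, assuming a window is stored as one half-open cell range `(lo, hi)` per dimension. Growing in both directions along each dimension is an assumption of this sketch; the slide only states "any dimension, one step". The name `extensions` is hypothetical.

```python
def extensions(window, grid_shape):
    """Yield every window obtained by growing `window` by one cell in one dimension.

    window: tuple of half-open (lo, hi) cell ranges, one per dimension.
    grid_shape: number of cells along each dimension.
    """
    for d, (lo, hi) in enumerate(window):
        if lo > 0:  # grow toward lower coordinates
            yield window[:d] + ((lo - 1, hi),) + window[d + 1:]
        if hi < grid_shape[d]:  # grow toward higher coordinates
            yield window[:d] + ((lo, hi + 1),) + window[d + 1:]

# A single-cell window in the corner of a 2 x 2 grid has two extensions.
corner = ((0, 1), (0, 1))
grown = list(extensions(corner, grid_shape=(2, 2)))
print(grown)
```

Starting from single-cell windows and applying this step repeatedly reaches every rectangular window in the grid, which is what lets the search enumerate windows on demand instead of up front.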

Cost-aware Solver
- Best-first search based on the utility
  - Utility = f(benefit, cost)
- Benefit: how close a window is to satisfying the conditions
  - Computed for the aggregates from content-based conditions
  - A distance between the required value and the estimated value
- Cost: how expensive it is to read a window from disk
  - Measured in cells we have to read
  - Adjustments are made for skewed data
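The slide leaves f unspecified, so the following is only one possible shape for it: a weighted sum of a normalized benefit and a normalized cost. The linear benefit, the `weight` parameter, and all function and parameter names are assumptions made for illustration. The adjustment for skewed data is not modeled.

```python
def benefit(estimate, required, value_range):
    """1.0 if the estimated aggregate already meets `required` (a lower bound),
    decreasing linearly to 0.0 as the estimate falls further below it."""
    if estimate >= required:
        return 1.0
    return max(0.0, 1.0 - (required - estimate) / value_range)

def utility(estimate, required, value_range, cells_to_read, max_cells, weight=0.5):
    """Combine benefit and cost; `weight` trades closeness against read cost."""
    cost = cells_to_read / max_cells  # fraction of cells still to be read from disk
    return (weight * benefit(estimate, required, value_range)
            + (1.0 - weight) * (1.0 - cost))

# A promising window that is already cached beats a poor one that needs many reads.
cheap_and_close = utility(0.9, 0.8, 1.0, cells_to_read=0, max_cells=10)
costly_and_far = utility(0.3, 0.8, 1.0, cells_to_read=10, max_cells=10)
print(cheap_and_close > costly_and_far)
```

With `weight` near 1 the search chases promising windows regardless of I/O. With `weight` near 0 it prefers whatever is cheapest to read.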

Best-first Search
Priority Queue (utility-ordered), window: utility
- Before expanding window 3: 3: 1, 2: 0.98, 1: 0.80, 4: 0.79
- After expanding window 3: 34: 1, 2: 0.98, 13: 0.85, 1: 0.80, 4: 0.79

Best-first Search
Priority Queue (utility-ordered), window: utility
- Before expanding window 34: 34: 1, 2: 0.98, 13: 0.85, 1: 0.80, 4: 0.79
- After expanding window 34: 2: 0.98, 13: 0.85, 1: 0.80, 4: 0.79, 1234: 0.80
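The queue-driven loop in the two slides above can be sketched with Python's `heapq`. This is a toy under stated assumptions, not the system's solver: the grid is one-dimensional, the utility is simply the window average (no cost term), and every name is hypothetical.

```python
import heapq

def best_first(start, utility, expand, is_result):
    """Yield result windows in the order a utility-ordered search reaches them."""
    # heapq is a min-heap, so utilities are negated to pop the best window first.
    heap = [(-utility(w), w) for w in start]
    heapq.heapify(heap)
    seen = set(start)
    while heap:
        _, window = heapq.heappop(heap)
        if is_result(window):
            yield window  # emitted immediately, i.e. an online result
        for nxt in expand(window):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-utility(nxt), nxt))

# Toy 1-D grid: a window is a half-open cell range (lo, hi).
values = [0.2, 0.9, 0.9, 0.1]

def avg(window):
    lo, hi = window
    return sum(values[lo:hi]) / (hi - lo)

def expand(window):
    lo, hi = window
    if lo > 0:
        yield (lo - 1, hi)
    if hi < len(values):
        yield (lo, hi + 1)

def is_result(window):
    # Shape condition (exactly two cells) plus content condition (average > 0.8).
    return window[1] - window[0] == 2 and avg(window) > 0.8

start = [(i, i + 1) for i in range(len(values))]
results = list(best_first(start, utility=avg, expand=expand, is_result=is_result))
print(results)
```

Because `best_first` is a generator, a caller can stop consuming it as soon as enough windows have been shown to the user.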

Optimizations
- Cost and benefit are estimated by sampling
  - Uniform: sample the whole search space
  - Stratified: sample each cell uniformly
- Aggregate values are cached in a cell cache
  - Dynamic utility updates
  - Avoiding re-reads of the same cells
- Constraint-based pruning during the search
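A rough sketch of how stratified sampling and a cell cache could work together when estimating a window's average. It assumes per-cell tuple counts are known and that the cache maps a cell to its exact `(sum, count)` once the cell has been read. Both the data layout and the function names are illustrative assumptions.

```python
import random

def stratified_sample(cell_data, per_cell, seed=0):
    """Sample each (non-empty) cell uniformly; keep (sample mean, true count)."""
    rng = random.Random(seed)
    summary = {}
    for cell, cell_values in cell_data.items():
        sample = rng.sample(cell_values, min(per_cell, len(cell_values)))
        summary[cell] = (sum(sample) / len(sample), len(cell_values))
    return summary

def estimate_window_avg(window_cells, summary, cache):
    """Estimate avg() over a window, preferring exact cached cell aggregates."""
    total, count = 0.0, 0
    for cell in window_cells:
        if cell in cache:  # exact (sum, count) from an earlier read
            cell_sum, cell_count = cache[cell]
        else:  # fall back to the sample-based estimate
            mean, cell_count = summary[cell]
            cell_sum = mean * cell_count
        total += cell_sum
        count += cell_count
    return total / count

# Hypothetical data: two cells with constant values, so the sample means are exact.
cell_data = {(0, 0): [1.0] * 100, (0, 1): [3.0] * 100}
summary = stratified_sample(cell_data, per_cell=10)
estimate = estimate_window_avg([(0, 0), (0, 1)], summary, cache={})
print(estimate)
```

As cells are read and added to the cache, re-running the estimate for queued windows gives the kind of dynamic utility update the slide mentions.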

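Pruning
- Shape-based conditions: shape is ? x 2
- Size > 1
(Figure: windows over cells 1-4, such as 12, 13, 24, 34, and 1234, with the windows that violate the conditions pruned.)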
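A small sketch of shape-based pruning, assuming windows only ever grow during the search: once a window exceeds the maximum size in some dimension, no extension of it can satisfy the shape condition, so it can be dropped. The function names and the `(lo, hi)` range representation are assumptions.

```python
def can_still_satisfy(window, max_size):
    """False once the window is too large in any dimension; extensions only grow it."""
    return all(hi - lo <= limit for (lo, hi), limit in zip(window, max_size))

def satisfies_shape(window, min_size, max_size):
    """True if every dimension's size lies within its [min, max] bounds."""
    return all(low <= hi - lo <= high
               for (lo, hi), low, high in zip(window, min_size, max_size))

# Shape bounds taken from the SQL/SW example: size(ra) = 3, 1 <= size(dec) <= 3.
min_size, max_size = (3, 1), (3, 3)
print(satisfies_shape(((0, 3), (0, 1)), min_size, max_size))  # valid shape
print(can_still_satisfy(((0, 4), (0, 1)), max_size))          # too wide: prune
print(can_still_satisfy(((0, 2), (0, 1)), max_size))          # too narrow, may still grow
```

A window that is still too small is kept, because a later extension may bring it into range.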

Prefetching
- Problem: small reads
  - Help online results
  - Hurt total performance
- Window-locality vs. disk-locality
  - Poor disk page utilization
  - Thrashing: reading the same pages multiple times
- Prefetching: read a neighborhood with every window
  - Fewer, larger reads
  - Better disk page utilization
(Figure: order in which cells 1-4 are read, with no prefetching vs. with prefetching.)

Adaptive Prefetching
- How much to prefetch?
  - Large reads might hurt online results
- Progress-driven scheme:
  - Finding new results? Prefetch a small amount
  - No new results? Increase the prefetch exponentially
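The progress-driven scheme can be sketched as a single update rule. Resetting to a base size on a new result, doubling otherwise, and capping the size are all assumptions of this sketch; the slide only says "small amount" and "exponentially". The name `next_prefetch_size` and its parameters are hypothetical.

```python
def next_prefetch_size(current, found_new_result, base=1, factor=2, max_size=64):
    """Return the prefetch size (in cells) to use for the next read."""
    if found_new_result:
        return base  # results are flowing: keep reads small and latency low
    return min(current * factor, max_size)  # no progress: read more per request

# Three reads without results, then a result, then another miss.
size = 1
sizes = []
for found in [False, False, False, True, False]:
    size = next_prefetch_size(size, found)
    sizes.append(size)
print(sizes)
```

The cap keeps a long dry spell from turning into a single read so large that the next result is badly delayed.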

Online vs. Total Performance Results
- 35GB data set (part of the SDSS)
- 4GB total memory (1GB shared buffer)
- First results in 10-20 seconds

Distributed Semantic Windows Architecture
- Coordinator
  - Starts workers
  - Collects results
- Data Overlap
  - Windows belong to multiple partitions
  - Workers exchange cells
- Asynchronous communication
  - Workers request data
  - No blocking
  - Small overhead
(Figure: a Coordinator and three Workers; each Worker contains a Query Executor, Window Processor, Data Manager, Functions/Estimations, and Cell Data, on top of a DBMS.)

Data Overlap in Distributed Search

Other Experiments (from the paper)
- Data layout: window-locality vs. disk-locality
  - Hilbert ordering
  - Index-based clustering
  - Sorting by an axis
- Controlling the aggressiveness of prefetching
  - Users can control the size of prefetching
  - Smaller result delays vs. total completion time
- Sampling
  - Stratified vs. uniform

Related Work
- OLAP cubes
  - Grid-based aggregation, no exploration
- Online Aggregation (Hellerstein et al.)
  - Approximation, exact result at the end
- Online skylines (Rundensteiner et al.)
  - Careful input/output space analysis to determine candidates
  - Difficult for Semantic Windows: dimensional vs. measurement attributes
- Big data systems (SciBORQ, BlinkDB, etc.)
  - Approximate query answering via sampling

Conclusion and Future Work
- New data exploration framework: Semantic Windows
  - Cost-aware solver
  - Adaptive prefetching to address data layout issues
  - Distributed computation
- What is next?
  - Constraint Programming (CP) can perform exploration
  - DBMS can store and manage data
  - CP + DBMS = Searchlight

Questions? Supported by:
