MDL Summarization with Holes. Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng. University of British Columbia, Canada.

Presentation transcript:

1 MDL Summarization with Holes. Shaofeng Bu, Laks V.S. Lakshmanan, Raymond T. Ng. University of British Columbia, Canada.

VLDB 05, Shaofeng Bu, UBC. 2 Introduction. Multi-dimensional OLAP queries typically produce data-intensive answers. The question is often how to express the large answer set of cells that satisfy the OLAP query conditions. Simple enumeration is accurate but not necessarily the most intuitive. Summaries are not necessarily 100% accurate but can be more intuitive and informative, and summarized answers are easier to understand.

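3 OLAP Data Cube Example. [Figure: a 2-D data cube. The location axis has the cities New York, Vancouver, Edmonton, San Jose, San Francisco, Chicago, Minneapolis, Boston, Summit, and Albany as leaves, grouped under northwest, midwest, and northeast. The clothes axis has items such as jackets, tops, jeans, blouses, skirts, dress pants, ties, and dress skirts as leaves, grouped under categories such as women's, men's, and formal wear.] Each dimension is associated with a hierarchical tree.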

4 OLAP Data Cube Example. [Figure: the location x clothes data cube with its two axis trees.] Data cell: a pair (c1, c2) where c1 and c2 are leaf nodes in the axis trees, e.g. (Vancouver, ties). Data region: a pair (x1, y1) of nodes in the axis trees, describing all data cells covered by those nodes, e.g. (Vancouver, ties), (Vancouver, women's), (northwest, women's).

5 OLAP Data Cube Example. [Figure: the location x clothes data cube with some cells shaded blue.] Blue cells are the cells that satisfy the query conditions. How do we find a summary of the blue cells in a data cube?

6 MDL Summarization. MDL stands for Minimum Description Length. Regions are used to cover the blue cells. The length of an MDL description is the number of regions and cells it includes. The MDL problem is to find the description with the minimum length.

7 An Example of MDL Summarization. [Figure: the location x clothes data cube with the blue cells covered by regions R1 through R9.]

8 A Motivating Example: A New Case. [Figure: the location x clothes data cube after some cells are no longer blue. Regions R1, R3, and R9 are marked with '?'; the figure also shows R2, R4 through R8, and R10 through R13.] MDL summarization now needs 10 regions and 8 single blue cells, for a total length of 18.

9 Can we do better? Yes. We present a new compression approach, MDL with Holes: identify regions with blue cells even if they also contain non-blue cells, and express the included blue cells as regions minus the covered non-blue cells. These non-blue cells are called holes.

10 A Motivating Example: MDL with Holes. [Figure: the location x clothes data cube with regions R1 through R9.] R1 is replaced by R1 - (Vancouver, skirts), R3 by R3 - (Vancouver, skirts), and R9 by R9 - (Boston, ties) - (New York, dress skirts). R1 and R3 combine as R1 + R3 - (Vancouver, skirts). The other 6 regions are kept. MDL with Holes: length = 6 + 3 + 3 = 12. MDL approach: length is 18.

11 Problem Statement. MDL with Holes (MDLH) is the problem of finding a description with holes that has the minimum length or, equivalently, the maximum benefit. In practice, we can drill down on regions to get additional details.

12 Definitions: Length & Benefit. Given a set B of data cells (blue cells), an MDLH description for B is D = S - H, where S is a set of data regions, H is a set of data cells (also called 'holes'), and D covers exactly the data cells in B.
Length: the total number of regions and cells in the description, |D| = |S| + |H|.
Benefit: how much shorter the MDLH summary is than the enumeration of B, Benefit(D) = |B| - |D|.
Example 1: B1 = {a, b, c}, D1 = s - d, |D1| = 2, Benefit(D1) = |B1| - |D1| = 1.
Example 2: B2 = {e, g}, D2 = t - f - h, |D2| = 3, Benefit(D2) = |B2| - |D2| = -1.
[Figure: a hierarchy over leaves a through h with internal nodes s, t, and x.]
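The two definitions can be checked mechanically. The Python sketch below is illustrative only: the function names and the `LEAVES_OF` mapping are assumptions, with `s` taken to cover {a, b, c, d} and `t` to cover {e, f, g, h}, as the two examples on the slide imply.

```python
# Assumed hierarchy, inferred from the two examples on the slide
# (the slide does not list the leaves of s and t explicitly).
LEAVES_OF = {"s": {"a", "b", "c", "d"}, "t": {"e", "f", "g", "h"}}


def covered_cells(regions, holes):
    """Cells described by D = S - H: the union of the regions minus the holes."""
    cells = set()
    for region in regions:
        # A name missing from LEAVES_OF is treated as a leaf covering itself.
        cells |= LEAVES_OF.get(region, {region})
    return cells - set(holes)


def length(regions, holes):
    """|D| = |S| + |H|."""
    return len(regions) + len(holes)


def benefit(blue, regions, holes):
    """Benefit(D) = |B| - |D|, defined only if D covers exactly B."""
    if covered_cells(regions, holes) != set(blue):
        raise ValueError("description does not cover exactly the blue cells")
    return len(blue) - length(regions, holes)


print(benefit({"a", "b", "c"}, ["s"], ["d"]))    # 1
print(benefit({"e", "g"}, ["t"], ["f", "h"]))    # -1
```

A negative benefit, as in the second example, means plain enumeration of the blue cells is shorter than the description with holes.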

13 Related Work.
The Generalized MDL Approach for Summarization, Laks V.S. Lakshmanan, Raymond T. Ng et al., VLDB 2002: reduces description length by allowing non-blue cells to be covered by the regions, so the regions are not pure.
Concise Descriptions of Subsets of Structured Sets, Alberto O. Mendelzon & Ken Q. Pu, PODS 2003: allows Cartesian products to be formed. The setting is not purely hierarchical, so the NP-completeness result is less surprising. What about the purely hierarchical case?
Intelligent Rollups in Multidimensional OLAP Data, Gayatri Sathe and Sunita Sarawagi, VLDB 2001: only reports consistent generalizations, i.e. a tuple can be generalized along a set of dimensions only if it can be generalized along all subsets of those dimensions.

14 Outline. Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.

15 1-D Case: MDLH is Tractable. [Figure: a hierarchy over leaves a through r with internal nodes s, t, u, v, w, x, y, and z.]
For 'x': D1 = x - d - f - j, Benefit(D1) = 7 - 4 = 3; D2 = (s - d) + e + (u - j), Benefit(D2) = 7 - 5 = 2.
For 'y': D3 = y - m - p - q - r, Benefit(D3) = 4 - 5 = -1; D4 = (v - m) + o, Benefit(D4) = 4 - 3 = 1.
For 'z': D5 = z - d - f - j - m - p - q - r, Benefit(D5) = 11 - 8 = 3; D6 = (x - d - f - j) + (v - m) + o, Benefit(D6) = 11 - 7 = 4.
MDLH is tractable in the 1-D case: the optimal MDLH description, which has the maximum benefit, can be generated in polynomial time.
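The numbers on this slide can be reproduced with a simple bottom-up pass over the tree. The Python sketch below is one plausible reading, not the authors' published algorithm: each node is described either as a single region minus its non-blue leaves, or as the concatenation of its children's best descriptions, whichever is shorter. The tree shape and the blue set are assumptions reconstructed from the slide's arithmetic (for example, x - d - f - j with 7 blue cells implies x covers leaves a through j).

```python
# Assumed hierarchy and blue cells, reconstructed from the slide's examples.
CHILDREN = {
    "z": ["x", "y"],
    "x": ["s", "t", "u"],
    "y": ["v", "w"],
    "s": list("abcd"),
    "t": list("ef"),
    "u": list("ghij"),
    "v": list("klmn"),
    "w": list("opqr"),
}
BLUE = set("abceghiklno")  # 11 blue leaves


def best_length(node, children=CHILDREN, blue=BLUE):
    """Return (blue_count, non_blue_count, minimum description length)
    for the subtree rooted at node."""
    kids = children.get(node)
    if not kids:  # leaf: a blue cell costs 1, a non-blue cell costs nothing
        return (1, 0, 1) if node in blue else (0, 1, 0)
    n_blue = n_non_blue = split_length = 0
    for kid in kids:
        b, h, best = best_length(kid, children, blue)
        n_blue += b
        n_non_blue += h
        split_length += best
    if n_blue == 0:  # nothing to describe under this node
        return 0, n_non_blue, 0
    whole_length = 1 + n_non_blue  # the node as one region, minus its holes
    return n_blue, n_non_blue, min(whole_length, split_length)


for name in ("x", "y", "z"):
    n_blue, _, best = best_length(name)
    print(name, "length", best, "benefit", n_blue - best)
```

Under these assumptions the pass visits each node once, and it returns lengths 4, 3, and 7 for x, y, and z, matching D1, D4, and D6 on the slide.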

16 Outline. Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.

2-D Case: Optimality is not Preserved Any More. [Figure: a 2-D grid with rows a through g (root i) and columns 1 through 7 (root 8).]
Rows (length, benefit): (c,8), (d,8), (e,8): 4, 0. (f,8), (g,8): 3, 2. (a,8), (b,8): 5, -2.
Columns (length, benefit): (i,1): 3, 2. (i,5): 5, -2. (i,2), (i,3), (i,4) and (i,6), (i,7): values not recoverable.
Optimal solution: {(c,8) + (d,8) + (e,8) + (i,2) + (i,3) + (i,4)} - {(c,2) + (c,3) + (c,4) + (d,2) + (d,3) + (d,4) + (e,2) + (e,3) + (e,4)} + (f,1) + (g,1) + (f,6) + (g,7). Length = 19, Benefit = 9.

18 MDLH is NP-Hard in the 2-D Case. It is NP-hard to find the optimal MDLH description in a 2-D data cube. The proof is not trivial; details are in the paper. Reduction strategy: Clique is reduced to Maximum Induced Subgraph in a Complete Edge-Weighted (CEW) Bipartite Graph, which is in turn reduced to MDL with Holes.

19 Outline. Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.

20 Heuristics for MDLH.
Greedy: at each step, choose the row or column with the most benefit.
Dynamic Programming: a bottom-up method that builds the description of a region from the descriptions of its child regions.
Quadratic Programming: uses a quadratic function to represent the benefit of a 2-D data cube.

21 Example for Comparison of the Heuristics. [Figure: a 2-D grid with rows a through e.] The optimal description for this example: (e,1) - (a,1) + (e,2) - (b,2) + (e,3) - (b,3) + (d,4) + (b,5) + (e,6) + (e,8) + (a,11) - (a,8). Length = 12. Benefit = 20 - 12 = 8.

22 Heuristics: A Greedy Heuristic. [Figure: the example grid with rows a through e.]
Candidate regions (length, benefit, holes):
(e,6): values not recoverable
(d,10): 2, 2, holes (d,5)
(e,1): 2, 1, holes (a,1)
(e,2): 2, 1, holes (b,2)
(e,3): 2, 1, holes (b,3)
(a,11): 2, 1, holes (a,8)
(e,8): 2, 1, holes (a,8)
(c,10): 3, 0, holes (c,4), (c,5)
Description by Greedy: (e,6) + (a,11) + (e,8) - (a,8) + (d,10) - (d,5) + (a,2) + (a,3) + (b,1) + (b,5) + (c,1) + (c,2) + (c,3). The length is 13 and the benefit is 7.
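The greedy idea can be sketched on a simplified setting. The Python code below is an assumption-laden illustration, not the authors' implementation: it uses a flat grid with no axis hierarchies, so the only candidate regions are whole rows and whole columns. The grid `GRID`, the function name `greedy_mdlh`, and the stopping rule (stop when no candidate has positive marginal benefit) are all hypothetical choices.

```python
# Hypothetical 3 x 4 grid: 1 marks a blue cell, 0 a non-blue cell.
GRID = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]


def greedy_mdlh(grid):
    """Repeatedly pick the row or column with the largest marginal benefit."""
    n_rows, n_cols = len(grid), len(grid[0])
    regions = {("row", r): {(r, c) for c in range(n_cols)} for r in range(n_rows)}
    regions.update(
        {("col", c): {(r, c) for r in range(n_rows)} for c in range(n_cols)}
    )
    blue = {(r, c) for r in range(n_rows) for c in range(n_cols) if grid[r][c]}
    chosen, covered, holes = [], set(), set()

    def gain(name):
        cells = regions[name]
        new_blue = len((cells & blue) - covered)   # blue cells no longer enumerated
        new_holes = len((cells - blue) - holes)    # extra holes this region adds
        return new_blue - 1 - new_holes            # the region itself costs 1

    while True:
        candidates = [name for name in regions if name not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best)
        covered |= regions[best] & blue
        holes |= regions[best] - blue

    singles = blue - covered                       # leftover blue cells, listed one by one
    total_length = len(chosen) + len(holes) + len(singles)
    return chosen, holes, singles, len(blue) - total_length


chosen, holes, singles, total_benefit = greedy_mdlh(GRID)
print(chosen, holes, singles, total_benefit)
```

On this grid the sketch selects row 1 and then row 0, leaving one hole and one single cell, for a benefit of 4. Ties are broken by iteration order (rows before columns), which is one reason a greedy choice can lock out a better combination.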

23 Greedy: Why is it not optimal? [Figure: the description from Greedy next to the optimal description, on the example grid with rows a through e.] Selecting one row or column may reduce the benefit still obtainable from the others, so the total benefit can end up lower.

24 Heuristics: Dynamic Programming. [Figure and tables: the example grid with rows a through e, a table L of region lengths, and a table S of selections whose entries are 't1', 't2', or 'g'; the individual table values are not recoverable.]
L: the length of a region. S: the selection of rows and columns.
(a,10) is described as (a,2) + (a,3), so L(a,10) = 2 and S(a,10) = 't2'.
(e,4) is described as (d,4), so L(e,4) = 1 and S(e,4) = 't1'.
(d,10) is described as (d,10) - (d,5), so L(d,10) = 2 and S(d,10) = 'g'.

25 Heuristics: Dynamic Programming (2). [Figure and table: the example grid with rows a through e and the selection table S; the individual table values are not recoverable.]
D(x1, x2) denotes the description for region (x1, x2).
S(e,12) = 't2': D(e,12) = D(e,10) + D(e,11).
S(e,10) = 't2': D(e,10) = D(e,1) + D(e,2) + D(e,3) + D(e,4) + D(e,5).
S(e,11) = 't2': D(e,11) = D(e,6) + D(e,7) + D(e,8) + D(e,9).
Column descriptions: (e,1) - (a,1); (e,2) - (b,2); (e,3) - (b,3); (d,4); (b,5); (e,6); (a,7); (e,8) - (a,8); (a,9).
Generated description: (e,1) - (a,1) + (e,2) - (b,2) + (e,3) - (b,3) + (d,4) + (b,5) + (e,6) + (a,7) + (e,8) - (a,8) + (a,9). The length is 13 and the benefit is 7.

26 Dynamic Programming: Why is it not optimal? [Figure: the description by Dynamic Programming next to the optimal description, on the example grid with rows a through e.] It misses combinations of rows and columns.

27 Heuristics: Quadratic Programming. Use binary variables to represent rows and columns: for a variable v, v = 1 means the corresponding row or column is selected and v = 0 means it is not. Define f = -Benefit(D), so maximizing the benefit is the same as minimizing f. For the previous example, quadratic programming generates the optimal description, but optimality is not guaranteed in general.
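To see why f is quadratic, consider a simplified flat grid in which the only regions are whole rows and whole columns (an assumption; the slides do not give the objective explicitly). With binary variables r_i and c_j, cell (i, j) is covered exactly when r_i + c_j - r_i * c_j = 1, and the product term is what makes f quadratic. The Python sketch below evaluates this assumed objective on a hypothetical grid and minimizes it by plain enumeration, which stands in for a real quadratic programming solver and only works for tiny inputs.

```python
from itertools import product

# Hypothetical 3 x 4 grid: 1 marks a blue cell, 0 a non-blue cell.
GRID = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]


def neg_benefit(grid, rows, cols):
    """f = -Benefit for a 0/1 selection of whole rows and whole columns.

    Each selected region costs 1, each covered non-blue cell is a hole
    costing 1, and each covered blue cell saves 1 (it is no longer listed).
    """
    f = sum(rows) + sum(cols)
    for i, row in enumerate(grid):
        for j, is_blue in enumerate(row):
            covered = rows[i] + cols[j] - rows[i] * cols[j]  # quadratic term
            f += -covered if is_blue else covered
    return f


def minimize_by_enumeration(grid):
    """Try every 0/1 assignment; a stand-in for a QP solver on small grids."""
    n_rows, n_cols = len(grid), len(grid[0])
    best = None
    for bits in product((0, 1), repeat=n_rows + n_cols):
        rows, cols = bits[:n_rows], bits[n_rows:]
        value = neg_benefit(grid, rows, cols)
        if best is None or value < best[0]:
            best = (value, rows, cols)
    return best


value, rows, cols = minimize_by_enumeration(GRID)
print("minimum f:", value, "rows:", rows, "cols:", cols)
```

Selecting nothing gives f = 0, which corresponds to enumerating every blue cell, so any assignment with f < 0 is a strict improvement over enumeration.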

28 Outline. Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.

29 Experiments. We ran a set of experiments on the TPC-H benchmark data set and compared the three MDLH heuristics with MDL and GMDL.

30 Experimental Results: Comparison of All Methods. Compression ratio: MDLH-Quadratic generates the most concise descriptions and serves as a yardstick of quality; MDLH-Dynamic is a very close second.

31 Experimental Results: Compression Ratio The more children per parent node, the greater the benefit

32 Experimental Results: Summary. Running time and scalability: MDLH-Greedy is the fastest; MDLH-Dynamic runs slower than MDLH-Greedy but is still scalable with respect to the number of cells.

33 Outline. Introduction to MDL with Holes (a motivating example); 1-D Case: MDLH is Tractable; 2-D Case: MDLH is NP-Hard; Heuristics (a greedy heuristic, dynamic programming, quadratic programming); Experimental Results; Summarization on Holes: An Extension; Conclusions & Contributions.

34 Extension: Summarization on Holes. [Figure: a 2-D grid with rows a through e and column nodes including 10 and 11.] As the blue density becomes high, a large part of the MDLH description is made up of holes. Can we further reduce the total length by summarizing the holes?
MDLH description: (a,11) - {(a,6) + (a,8) + (a,9)} + (d,11) - {(d,6) + (d,7) + (d,8)} + (b,6) + (c,8). Total length is 10.
Summarization on holes: (a,6) + (a,8) + (a,9) = (a,10) - (a,7), and (d,6) + (d,7) + (d,8) = (d,10) - (d,9).
After summarization on holes: (a,11) - {(a,10) - (a,7)} + (d,11) - {(d,10) - (d,9)} + (b,6) + (c,8). Total length is 8.

35 Conclusions & Contributions. We present a new method, MDLH, to compress the answers of OLAP queries. We present a bottom-up algorithm for the 1-D cube. We proved the NP-hardness of the MDLH problem. We provided three heuristics for MDLH: greedy, dynamic programming, and quadratic programming. We extended the approach with summarization on holes to further reduce the total length. We ran a set of experiments on the TPC-H benchmark data to compare the heuristics.

36 Ongoing Work. Based on the summarization of blue cells and the summarization of holes, build a visualization tool with MDLH summarization that returns summarized answers to users' queries and provides a drill-down operation for browsing details on blue cells and on holes. Design a k-approximation algorithm for MDLH: what is the best quality we can guarantee?