A compression-boosting transform for 2D data. Qiaofeng Yang, Stefano Lonardi. University of California, Riverside.

Haifa Stringology Research Workshop 2005

Problem. Given a matrix A over a binary alphabet, find an invertible transform T such that T(A) is more "compressible" than A. Idea: try a reordering of the rows and columns of A.

Applications. Lossless compression of binary matrices (i.e., binary digital images); evaluation of the randomness of a graph (represented by its adjacency matrix).

Compression-boosting transform. Design an invertible transform that improves compression. The transform must expose the "patterns" in the matrix. [Storer and Helfgott '97] find uniform blocks to compress images losslessly.

Redundancy in binary matrices. Look for uniform submatrices: a uniform submatrix is a submatrix induced by a subset of rows and a subset of columns (i.e., not necessarily contiguous) composed solely of a single symbol (either 0 or 1).
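
As a toy illustration (our own, not from the slides; names are illustrative), a uniformity check over a non-contiguous row/column selection:

```python
def is_uniform(A, rows, cols):
    """Return True if the submatrix of A induced by the given
    row and column subsets contains a single symbol."""
    symbols = {A[r][c] for r in rows for c in cols}
    return len(symbols) == 1

A = [
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
# The selected rows {0, 2} and columns {0, 2} need not be contiguous.
print(is_uniform(A, [0, 2], [0, 2]))  # all 1s -> True
print(is_uniform(A, [0, 1], [0, 1]))  # mixed symbols -> False
```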

Outline of the direct transform

Direct transform. Find the largest uniform submatrix in A. Reorder the rows and the columns so that the uniform submatrix moves to the upper-left corner. Recursively apply the transform to the rest of the matrix. Stop when the partition produces a matrix smaller than r x c (r and c are predetermined thresholds).

Does it help? Maybe... Each uniform submatrix can be represented by one bit and a list of its rows/columns. If a matrix can be decomposed into a small number of uniform submatrices, compression should improve.
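
A rough bit count (our own back-of-envelope accounting, not the paper's actual on-disk format) shows when this representation pays off:

```python
import math

def naive_cost_bits(u, v):
    """Bits to store a u-by-v binary submatrix verbatim."""
    return u * v

def transformed_cost_bits(u, v, n, m):
    """Crude upper bound for the transformed representation: one symbol
    bit plus u row indices and v column indices of an n-by-m matrix.
    Illustrative only; no entropy coding is accounted for."""
    return 1 + u * math.ceil(math.log2(n)) + v * math.ceil(math.log2(m))

# A 64x64 uniform block inside a 256x256 matrix:
print(naive_cost_bits(64, 64))                   # 4096 bits verbatim
print(transformed_cost_bits(64, 64, 256, 256))   # 1 + 64*8 + 64*8 = 1025 bits
```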

Does it help? Maybe not... Typically, to get good compression of 2D data (e.g., digital images) one has to exploit dependencies between adjacent rows and columns. The reordering can "break" these local dependencies and could therefore affect compression negatively.

Complexity of the direct transform. The computational cost depends on the complexity of finding the largest uniform submatrix. This problem is a special case of the biclustering problem.

Some related work: Hartigan '72; Aggarwal et al., SIGMOD '99; Cheng & Church, ISMB '00; Wang et al., SIGMOD '02; Ben-Dor et al., RECOMB '02; Tanay et al., ISMB '02; Procopiuc et al., SIGMOD '02; Murali & Kasif, PSB '03; Sheng et al., ECCB '03; Mishra et al., COLT '03; Lonardi et al., CPM '04.

Problem definition. LARGEST UNIFORM SUBMATRIX. Instance: a binary matrix A. Question: find a row selection R and a column selection C such that A(R, C) is uniform and |R||C| is maximized. This problem is also called Maximum Edge Biclique, and is computationally hard.
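
To make the problem concrete, here is an exhaustive solver for tiny instances (our illustration; since the problem is hard, this exponential-time search is of course not what the talk proposes, which is why the randomized search below is used instead):

```python
from itertools import combinations

def largest_uniform_submatrix(A):
    """Brute force over all row subsets; for each row subset and each
    symbol s, take every column that reads s on all selected rows.
    Exponential in the number of rows: for illustration only."""
    n, m = len(A), len(A[0])
    best_area, best = 0, ([], [])
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for s in (0, 1):
                cols = [c for c in range(m) if all(A[r][c] == s for r in rows)]
                if k * len(cols) > best_area:
                    best_area, best = k * len(cols), (list(rows), cols)
    return best

A = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
rows, cols = largest_uniform_submatrix(A)
print(rows, cols)  # rows [0, 1, 3], columns [0, 1, 3]: a 3x3 block of 1s
```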

Randomized search

Randomized search (step 1). Select a random subset S of size k uniformly from the set of columns {1, 2, ..., m}.

Randomized search (step 2). For each subset U of S, check whether the string 1^|U| (resp., 0^|U|) appears at least r times, and record the rows R_1 (resp., R_0) in which it occurs.

Randomized search (step 3). Given R_1 (resp., R_0), select the set of columns C on which those rows read all 1s (resp., 0s). Check whether the resulting submatrix meets the size thresholds r and c.

Randomized search (step 4). Save the solutions (R_1, C) and (R_0, C), and repeat steps 1 to 4 for t iterations.
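
The four steps can be sketched roughly as follows (our simplification: this version projects on U = S directly instead of enumerating every subset U of S as step 2 prescribes, so it is a weaker but shorter illustration; parameter names follow the slides):

```python
import random

def randomized_search(A, k, r, c, t, rng=random):
    """Sketch of the randomized search for a large uniform submatrix."""
    n, m = len(A), len(A[0])
    best_area, best = 0, None
    for _ in range(t):
        S = rng.sample(range(m), k)           # step 1: random column subset
        for s in (0, 1):                      # the 1^|U| and 0^|U| cases
            # step 2 (simplified, U = S): rows reading symbol s on all of S
            R = [i for i in range(n) if all(A[i][j] == s for j in S)]
            if len(R) < r:
                continue
            # step 3: columns on which every row of R reads s
            C = [j for j in range(m) if all(A[i][j] == s for i in R)]
            # step 4: keep the best solution seen so far
            if len(C) >= c and len(R) * len(C) > best_area:
                best_area, best = len(R) * len(C), (R, C, s)
    return best

# A planted 4x4 block of 1s in an 8x8 matrix of 0s:
A = [[1 if i < 4 and j < 4 else 0 for j in range(8)] for i in range(8)]
best = randomized_search(A, k=2, r=2, c=2, t=50, rng=random.Random(0))
R, C, s = best
print(len(R) * len(C), s)  # finds the all-zero region of area 32
```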

Parameters. Projection size k, column threshold c, row threshold r, number of iterations t. See [Lonardi '04] for details on how to choose k, r and c.

Selecting the number of iterations t. We can miss a solution in two cases: (i) S completely misses C*; (ii) S overlaps C*, but the string 11...1 (resp., 00...0) selected by the algorithm also appears in a row outside R*.

Selecting the number of iterations t. The probability of missing the solution in one iteration can be bounded analytically; see [Lonardi '04] for the bound.

Selecting the number of iterations t. Choose t so that the probability of missing the solution in all t iterations is smaller than ε. For the experiments, we fixed the number of iterations to t=1,000 and t=10,000.
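
If the per-iteration success probability p is known, the smallest adequate t follows from requiring (1-p)^t <= ε. A small helper (ours, purely illustrative):

```python
import math

def iterations_needed(p_success, eps):
    """Smallest t such that (1 - p_success)**t <= eps, i.e. the
    probability of missing the solution in every one of t independent
    iterations is at most eps."""
    return math.ceil(math.log(eps) / math.log(1.0 - p_success))

# e.g. a 1% per-iteration success rate and a 10^-3 failure budget:
print(iterations_needed(0.01, 1e-3))
```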

Recursive decomposition. Input: a binary matrix A, row/column thresholds r and c, number of iterations t (or error ε). Run LARGEST UNIFORM SUBMATRIX on A. Reorder A and decompose it into four smaller matrices U, a, b, c, where U is the uniform submatrix. Should we recursively apply the transform to (a, b, c), (a+c, b) or (a, b+c)?

Outline of the direct transform

Refined transform

Is (a+c, b) better than (a, b+c)? Let A_1, A_2, B_1, B_2 be the areas of the uniform submatrices found in a+c, b, a, and b+c, respectively. Choose decomposition (a+c, b) if: SUM: A_1 + A_2 > B_1 + B_2; MAX: max{A_1, A_2} > max{B_1, B_2}; INDIV: A_1 > B_2.
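
The three selection rules can be written down directly (a sketch; the function and argument names are ours):

```python
def choose_decomposition(A1, A2, B1, B2, rule="SUM"):
    """Pick (a+c, b) or (a, b+c) from the areas A1, A2 (uniform
    submatrices found in a+c and b) and B1, B2 (found in a and b+c)."""
    if rule == "SUM":
        take_ac_b = A1 + A2 > B1 + B2
    elif rule == "MAX":
        take_ac_b = max(A1, A2) > max(B1, B2)
    elif rule == "INDIV":
        take_ac_b = A1 > B2
    else:
        raise ValueError("unknown rule: " + rule)
    return "(a+c, b)" if take_ac_b else "(a, b+c)"

print(choose_decomposition(10, 5, 3, 8, "SUM"))   # 15 > 11 -> (a+c, b)
print(choose_decomposition(10, 5, 3, 8, "INDIV")) # 10 > 8  -> (a+c, b)
```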

Implementation. Each uniform submatrix can be represented by one bit. Non-decomposable matrices are saved in row-major order. The content of both types of matrices is saved in a file called strings.

Implementation. Row and column indices for uniform and non-decomposable matrices are saved in a file called index. For each set of row/column indices, the first index is saved as-is, while the others are saved as differences between adjacent indices.
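
The delta encoding of an index list and its inverse can be sketched as follows (our illustration of the scheme just described):

```python
def delta_encode(indices):
    """First index as-is, the rest as gaps to the previous index
    (smaller numbers tend to compress better)."""
    return [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the gaps."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

rows = [3, 7, 8, 15]
print(delta_encode(rows))                         # [3, 4, 1, 7]
print(delta_decode(delta_encode(rows)) == rows)   # True
```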

Implementation. The numbers of rows and columns of uniform and non-decomposable matrices are saved in a file called length. The files strings, index and length allow one to invert the transform. Note that the complexity of the inverse transform is linear in the size of the matrix.

Experiments on synthetic data. We generated datasets composed of four random 256x256 binary matrices. In each matrix of a dataset we embedded 1, 2, 3 or 4 uniform submatrices of size 64x64. For each matrix we compared the compressed size before and after the transform.

Experiments on synthetic data. Matrix i is a random 256x256 binary matrix containing i uniform 64x64 submatrices. Parameters: r=c=10, t=10,000.

Experiments on synthetic data. Failure to recover all the embedded submatrices is typically due to the recursive partitioning: if the partitioning happens to split an embedded submatrix, there is no hope of recovering it.

Experiments on compressing images

Compressing images

Experiments on images. Parameters: r=c=60, t=10,000.

Experiments on images. How does the performance depend on the partitioning strategy, the row and column thresholds, and the number of iterations? The next graphs refer to the image bird.

Size as a function of decomposition strategy

Average area of uniform submatrices

Number of uniform submatrices

Proportion of uniform submatrices

Final compressed size

Findings. We proposed a transform to boost the compression of 2D binary data. Computing the direct transform exactly is hard, but a randomized algorithm can be used. The inverse transform is very fast. The transform boosts compression.

Randomness of a graph. Problem: determine whether a graph G is "random" or not (and to what degree). Idea: use Kolmogorov complexity, i.e., use the compressibility of the adjacency matrix of G to bound the Kolmogorov complexity of G.

The general problem. Biclustering is the problem of finding a partition of the vectors and a subset of the dimensions such that the projections of the vectors in each cluster along those dimensions are close to one another. The problem requires clustering the vectors and the dimensions simultaneously, hence the name "biclustering".

Applications of biclustering. Collaborative filtering and recommender systems; finding web communities; discovering association rules in databases; gene expression analysis; ...

Pseudo-code