Data Compression by Quantization
Edward J. Wegman
Center for Computational Statistics
George Mason University
Outline
- Acknowledgements
- Complexity
- Sampling Versus Binning
- Some Quantization Theory
- Recommendations for Quantization
Acknowledgements
- This is joint work with Nkem-Amin (Martin) Khumbah.
- This work was funded by the Army Research Office.
Complexity: The Huber/Wegman Taxonomy of Data Set Sizes
Descriptor       Data Set Size in Bytes   Storage Mode
Tiny             $10^{2}$                 Piece of Paper
Small            $10^{4}$                 A Few Pieces of Paper
Medium           $10^{6}$                 A Floppy Disk
Large            $10^{8}$                 Hard Disk
Huge             $10^{10}$                Multiple Hard Disks, e.g. RAID Storage
Massive          $10^{12}$                Robotic Magnetic Tape Storage Silos
Super Massive    $10^{15}$                Distributed Archives
Complexity: Algorithmic Complexity
Complexity       Task
$O(r)$           Plot a scatterplot
$O(n)$           Calculate means, variances, kernel density estimates
$O(n \log n)$    Calculate fast Fourier transforms
$O(nc)$          Calculate the singular value decomposition of an $r \times c$ matrix; solve a multiple linear regression
$O(n^2)$         Solve most clustering algorithms
$O(a^n)$         Detect multivariate outliers
Motivation
- Massive data sets can make many algorithms computationally infeasible, e.g. those of order $O(n^2)$ and higher.
- We must reduce the effective number of cases in order to:
  - Reduce computational complexity
  - Reduce data transfer requirements
  - Enhance visualization capabilities
Data Sampling
- Database sampling
  - Exhaustive search may not be practically feasible because of the size of the databases.
  - KDD systems must be able to assist in selecting the appropriate parts of the databases to be examined.
  - For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases).
  - Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.
Data Compression
- Squishing, Squashing, Thinning, Binning
  - Squishing = number of cases reduced
    - Sampling = Thinning
    - Quantization = Binning
  - Squashing = number of dimensions (variables) reduced
  - Depending on the goal, either sampling or quantization may be preferable.
Data Quantization: Thinning vs Binning
- People's first thought about massive data is usually statistical subsampling.
- Quantization is engineering's success story.
- Binning is the statistician's quantization.
Data Quantization
- Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.
- Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels.
- Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3.
- For a terabyte data set, use $10^6$ bins.
Data Quantization
- Binning, but at microresolution
- Conventions
  - $d$ = dimension
  - $k$ = number of bins
  - $n$ = sample size
  - Typically $k \ll n$
Data Quantization
- Choose $E[W \mid Q = y_j]$ = mean of the observations in the $j$th bin = $y_j$.
- In other words, $E[W \mid Q] = Q$.
- The quantizer is then self-consistent.
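Data Quantization
- $E[W] = E[Q]$
- If $\hat{\theta}$ is a linear unbiased estimator, then so is $E[\hat{\theta} \mid Q]$.
- If $h$ is a convex function, then $E[h(Q)] \le E[h(W)]$.
  - In particular, $E[Q^2] \le E[W^2]$ and $\operatorname{var}(Q) \le \operatorname{var}(W)$.
- $E[Q(Q - W)] = 0$
- $\operatorname{cov}(W - Q) = \operatorname{cov}(W) - \operatorname{cov}(Q)$
- $E[(W - P)^2] \ge E[(W - Q)^2]$, where $P$ is any other quantizer.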
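
The self-consistency properties can be checked numerically. The following is a minimal sketch, assuming equal-width bins on simulated Gaussian data; the function name `self_consistent_quantize` and the choice of 50 bins are illustrative and not from the slides. Each observation is replaced by the mean of its bin, and the sample analogues of $E[W] = E[Q]$, $E[Q(Q - W)] = 0$ and $\operatorname{var}(W) - \operatorname{var}(Q) = E[(W - Q)^2]$ are then computed.

```python
import random
from collections import defaultdict


def self_consistent_quantize(w, a, b, k):
    """Replace each value in w by the mean of the observations in its bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    idx = []
    for x in w:
        # Equal-width bin index; the right endpoint b goes into the last bin.
        j = min(int(k * (x - a) / (b - a)), k - 1)
        idx.append(j)
        sums[j] += x
        counts[j] += 1
    reps = {j: sums[j] / counts[j] for j in counts}  # y_j = bin mean
    return [reps[j] for j in idx]


def mean(v):
    return sum(v) / len(v)


def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # simulated data (assumption)
q = self_consistent_quantize(w, min(w), max(w), k=50)

mean_gap = abs(mean(w) - mean(q))                            # sample E[W] - E[Q]
cross = mean([qi * (qi - wi) for qi, wi in zip(q, w)])       # sample E[Q(Q - W)]
distortion = mean([(wi - qi) ** 2 for wi, qi in zip(w, q)])  # sample E[(W - Q)^2]

print(mean_gap, cross, var(w) - var(q), distortion)
```

With bin means as representors, the first two quantities vanish up to floating-point rounding, and the variance lost to quantization equals the distortion.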
Distortion due to Quantization
- Distortion is the error due to quantization.
- In simple terms, it is $E[(W - Q)^2]$.
- Distortion is minimized when the quantization regions $S_j$ are as close as possible to (hyper-)spheres.
Geometry-based Quantization
- Need space-filling tessellations
- Need congruent tiles
- Need tiles that are as spherical as possible
Geometry-based Quantization
- In one dimension
  - The only polytope is a straight line segment (which is also bounded by a one-dimensional sphere).
- In two dimensions
  - The only qualifying polytopes are equilateral triangles, squares and hexagons (the regular polygons that tile the plane).
Geometry-based Quantization
[Figure: hexagonal prism; 24-cell with cuboctahedron envelope]
Geometry-based Quantization
- Using $10^6$ bins is computationally and visually feasible.
- Fast binning: for data in the range $[a, b]$ and for $k$ bins, $j = \mathrm{fixed}[k (x_i - a)/(b - a)]$ gives the index of the bin for $x_i$ in one dimension.
- Computational complexity is $4n + 1 = O(n)$.
- Memory requirements drop to $3k$ (location of bin + number of items in bin + representor of bin), i.e. storage complexity is $3k$.
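
The fast-binning rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it reads `fixed` as truncation to an integer, uses 0-based bin indices, puts the right endpoint $b$ into the last bin, and takes the bin mean as the representor. The function name `fast_bin` is hypothetical.

```python
def fast_bin(xs, a, b, k):
    """Bin values in [a, b] into k equal-width bins in a single pass.

    Returns the three length-k arrays behind the 3k storage figure:
    bin locations (centers), counts, and representors (bin means).
    """
    counts = [0] * k
    sums = [0.0] * k
    scale = k / (b - a)  # computed once, outside the loop
    for x in xs:
        j = int(scale * (x - a))  # truncation: assumed meaning of "fixed"
        if j == k:                # x == b falls into the last bin
            j = k - 1
        counts[j] += 1
        sums[j] += x
    width = (b - a) / k
    locations = [a + (j + 0.5) * width for j in range(k)]
    # Empty bins have no representor.
    representors = [sums[j] / counts[j] if counts[j] else None for j in range(k)]
    return locations, counts, representors


locations, counts, representors = fast_bin(
    [0.0, 0.1, 0.26, 0.5, 0.99, 1.0], a=0.0, b=1.0, k=4
)
print(counts)
```

Each observation costs a subtraction, a multiplication, a truncation and an increment, which is in line with the linear operation count quoted on the slide.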
Geometry-based Quantization
- In two dimensions
  - Each hexagon is indexed by 3 parameters.
  - Computational complexity is 3 times the 1-D complexity, i.e. $12n + 3 = O(n)$.
  - Complexity for squares is 2 times the 1-D complexity.
  - The ratio is 3/2.
  - Storage complexity is still $3k$.
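
The slide does not say which three parameters index a hexagon. One common choice, used here purely as an assumption, is cube coordinates $(i, j, k)$ with $i + j + k = 0$, obtained by rounding fractional coordinates and repairing the component with the largest rounding error. The sketch below uses pointy-top hexagons with `size` equal to the center-to-corner distance; `hex_index` and `hex_center` are illustrative names.

```python
import math
import random
from collections import Counter


def hex_index(x, y, size):
    """Cube coordinates (i, j, k), with i + j + k == 0, of the hexagon containing (x, y)."""
    # Fractional cube coordinates for a pointy-top hexagonal grid.
    ci = (math.sqrt(3) / 3 * x - y / 3) / size
    ck = (2 / 3 * y) / size
    cj = -ci - ck
    ri, rj, rk = round(ci), round(cj), round(ck)
    di, dj, dk = abs(ri - ci), abs(rj - cj), abs(rk - ck)
    # Recompute the coordinate with the largest rounding error so the sum stays zero.
    if di > dj and di > dk:
        ri = -rj - rk
    elif dj > dk:
        rj = -ri - rk
    else:
        rk = -ri - rj
    return ri, rj, rk


def hex_center(i, j, k, size):
    """Center of the hexagon with cube coordinates (i, j, k)."""
    return size * math.sqrt(3) * (i + k / 2), size * 1.5 * k


random.seed(1)
pts = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]
hex_counts = Counter(hex_index(x, y, 0.5) for x, y in pts)  # items per hexagon
print(len(hex_counts), "occupied hexagons")
```

Per point the work is a constant number of arithmetic operations, so the pass remains $O(n)$, with a larger constant than for squares.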
Geometry-based Quantization
- In 3 dimensions
  - For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
  - Computational complexity is $28n + 7 = O(n)$.
  - Computational complexity for a cube is $12n + 3$.
  - The ratio is 7/3.
  - Storage complexity is still $3k$.
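
The operation counts quoted for the different tiles are consistent with treating each tiling as $m$ one-dimensional binnings at $4n + 1$ operations each, with one index per parameter or per pair of parallel faces. That reading is an inference from the numbers on the slides, not something they state; under it the counts work out as follows.

```latex
\begin{align*}
\text{line segment } (m = 1) &: \quad 4n + 1 \\
\text{square } (m = 2) &: \quad 2(4n + 1) = 8n + 2 \\
\text{hexagon } (m = 3) &: \quad 3(4n + 1) = 12n + 3 \\
\text{cube } (m = 3) &: \quad 3(4n + 1) = 12n + 3 \\
\text{truncated octahedron } (m = 3 + 4 = 7) &: \quad 7(4n + 1) = 28n + 7
\end{align*}
```

The ratios of 3/2 (hexagon to square) and 7/3 (truncated octahedron to cube) then follow directly from the values of $m$.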
Quantization Strategies
- Optimally, for purposes of minimizing distortion, use the roundest polytope in $d$ dimensions.
  - Computational complexity is always $O(n)$.
  - Storage complexity is $3k$.
  - The number of tiles grows exponentially with dimension, the so-called curse of dimensionality.
  - Higher-dimensional geometry is poorly known.
  - Computational complexity grows faster than for the hypercube.
Quantization Strategies
- For purposes of simplicity, always use hypercubes or $d$-dimensional simplices.
  - Computational complexity is always $O(n)$.
  - Methods for data-adaptive tiling are available.
  - Storage complexity is $3k$.
  - The number of tiles grows exponentially with dimension.
  - Both polytopes depart from spherical shape rapidly as $d$ increases.
  - The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.
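
A hypercube tiling is the one-dimensional fast-binning rule applied to each coordinate. The sketch below is a minimal version under stated assumptions: `m` equal-width bins per axis, the bin mean as representor, and a sparse dictionary so that only occupied cells are stored. The name `datacube_bin` is hypothetical.

```python
from collections import defaultdict


def datacube_bin(points, lows, highs, m):
    """Sparse d-dimensional hypercube binning with m bins per axis.

    Returns {cell index tuple: (count, mean vector)} for occupied cells only.
    """
    d = len(lows)
    scales = [m / (highs[t] - lows[t]) for t in range(d)]
    counts = defaultdict(int)
    sums = {}
    for p in points:
        # One truncation per coordinate; the upper edge goes into the last bin.
        cell = tuple(
            min(int(scales[t] * (p[t] - lows[t])), m - 1) for t in range(d)
        )
        counts[cell] += 1
        s = sums.setdefault(cell, [0.0] * d)
        for t in range(d):
            s[t] += p[t]
    return {c: (n, [v / n for v in sums[c]]) for c, n in counts.items()}


points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.0, 0.0)]
cube = datacube_bin(points, lows=(0.0, 0.0), highs=(1.0, 1.0), m=2)
print(len(cube), "occupied cells out of", 2 ** 2)
```

The full grid has $m^d$ cells, which is the exponential growth noted above; the number of occupied cells can never exceed $n$, which is why a sparse structure is used in this sketch.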
Quantization Strategies
- Conclusions on geometric quantization
  - The geometric approach is good up to 4 or 5 dimensions.
  - Adaptive tilings may improve the rate at which the number of tiles grows, but probably destroy the spherical structure.
  - Good for large $n$, but weaker for large $d$.
Quantization Strategies
- Alternate strategy
  - Form bins via clustering
    - Known in the electrical engineering literature as vector quantization.
    - Distance-based clustering is $O(n^2)$, which implies poor performance for large $n$.
    - Not terribly dependent on the dimension $d$.
    - Clusters may be very out of round, not even convex.
  - Conclusion
    - The cluster approach may work for large $d$, but fails for large $n$.
    - Not particularly applicable to "massive" data mining.
Quantization Strategies
- Third strategy
  - Density-based clustering
    - Density estimation with kernel estimators is $O(n)$.
    - Use the modes $m$ of the density estimate to form clusters.
    - Put $x_i$ in the cluster of mode $m$ if $m$ is the mode closest to $x_i$.
    - This procedure is distance based, but with complexity $O(kn)$, not $O(n^2)$.
    - Normal mixture densities may be an alternative approach.
    - Roundness may be a problem.
  - But quantization based on density-based clustering offers promise for both large $d$ and large $n$.
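
The nearest-mode assignment step can be sketched as follows. The modes are taken as given here (in practice they would come from a kernel density estimate, which this sketch does not compute), Euclidean distance is an assumption, and `assign_to_modes` is a hypothetical name. Each of the $n$ points is compared against each of the $k$ modes, which gives the $O(kn)$ cost.

```python
import math


def assign_to_modes(points, modes):
    """Assign each point to its nearest mode using k * n distance evaluations.

    Returns (labels, counts, representors), where each representor is the
    mean of the points assigned to that mode (None for an empty cluster).
    """
    k, d = len(modes), len(modes[0])
    counts = [0] * k
    sums = [[0.0] * d for _ in range(k)]
    labels = []
    for p in points:
        # Index of the closest mode (Euclidean distance, an assumption).
        j = min(range(k), key=lambda c: math.dist(p, modes[c]))
        labels.append(j)
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    representors = [
        [v / counts[j] for v in sums[j]] if counts[j] else None for j in range(k)
    ]
    return labels, counts, representors


modes = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical modes for illustration
points = [(1.0, 0.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0), (4.0, 4.0)]
labels, counts, representors = assign_to_modes(points, modes)
print(labels, counts)
```

Nothing in this step constrains the shape of the resulting regions, which is where the roundness concern noted on the slide comes from.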
Data Quantization
- Binning does not lose fine structure in the tails, as sampling might.
- Roundoff analysis applies.
- At this scale of binning, discretization is not likely to be much less accurate than the recorded data.
- Discretization: a finite number of bins implies discrete variables, which are more compatible with categorical data.
Data Quantization
- Analysis on a finite subset of the integers has theoretical advantages.
  - Analysis is less delicate.
    - Different forms of convergence are equivalent.
  - Analysis is often more natural, since the data is already quantized or categorical.
  - Graphical analysis of numerical data is not much changed, since $10^6$ pixels is at the limit of the HVS (human visual system).