Data Compression by Quantization Edward J. Wegman Center for Computational Statistics George Mason University.

1 Data Compression by Quantization Edward J. Wegman Center for Computational Statistics George Mason University

2 Outline
- Acknowledgements
- Complexity
- Sampling Versus Binning
- Some Quantization Theory
- Recommendations for Quantization

3 Acknowledgements
- This is joint work with Nkem-Amin (Martin) Khumbah
- This work was funded by the Army Research Office

4 Complexity
The Huber/Wegman Taxonomy of Data Set Sizes

Descriptor      Size in Bytes   Storage Mode
Tiny            10^2            Piece of paper
Small           10^4            A few pieces of paper
Medium          10^6            A floppy disk
Large           10^8            Hard disk
Huge            10^10           Multiple hard disks, e.g. RAID storage
Massive         10^12           Robotic magnetic tape storage silos
Super Massive   10^15           Distributed archives

5 Complexity
Algorithmic Complexity

O(r)        Plot a scatterplot
O(n)        Calculate means, variances, kernel density estimates
O(n log n)  Calculate fast Fourier transforms
O(nc)       Calculate the singular value decomposition of an r x c matrix; solve a multiple linear regression
O(n^2)      Solve most clustering algorithms
O(a^n)      Detect multivariate outliers

6 Complexity

7 Motivation
- Massive data sets can make many algorithms computationally infeasible, e.g. O(n^2) and higher
- Must reduce the effective number of cases:
  - Reduce computational complexity
  - Reduce data transfer requirements
  - Enhance visualization capabilities

8 Data Sampling
- Database sampling
  - Exhaustive search may not be practically feasible because of the database's size
  - KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
  - For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
  - Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS; sampling 5% of the database can be more expensive than a sequential full scan of the data

9 Data Compression
- Squishing, squashing, thinning, binning:
  - Squishing = reducing the number of cases
    - Sampling = thinning
    - Quantization = binning
  - Squashing = reducing the number of dimensions (variables)
  - Depending on the goal, one of sampling or quantization may be preferable

10 Data Quantization: Thinning vs. Binning
- People's first thought about massive data is usually statistical subsampling
- Quantization is engineering's success story
- Binning is the statistician's quantization

11 Data Quantization
- Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels
- Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
- Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3
- For a terabyte data set, use 10^6 bins

12 Data Quantization
- Binning, but at microresolution
- Conventions:
  - d = dimension
  - k = # of bins
  - n = sample size
  - Typically k << n

13 Data Quantization
- Choose y_j = E[W | Q = y_j] = the mean of the observations in the j-th bin
- In other words, E[W | Q] = Q
- Such a quantizer is self-consistent

14 Data Quantization
Properties of a self-consistent quantizer:
- E[W] = E[Q]
- If θ̂ is a linear unbiased estimator, then so is E[θ̂ | Q]
- If h is a convex function, then E[h(Q)] ≤ E[h(W)]; in particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W)
- E[Q(Q - W)] = 0
- cov(W - Q) = cov(W) - cov(Q)
- E[(W - P)^2] ≥ E[(W - Q)^2], where P is any other quantizer
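These identities are easy to check numerically. A minimal sketch, assuming equal-width bins with the bin mean as representor (the data, bin count, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)      # raw observations W
k = 256                          # number of bins
a, b = w.min(), w.max()

# Equal-width binning; the endpoint x == b is clipped into bin k-1.
j = np.minimum((k * (w - a) / (b - a)).astype(int), k - 1)

# Self-consistent representor: the mean of the observations in each bin
# (empty bins get a placeholder that is never indexed).
reps = np.array([w[j == i].mean() if np.any(j == i) else 0.0
                 for i in range(k)])
q = reps[j]                      # quantized data Q = E[W | Q]

assert np.isclose(q.mean(), w.mean())      # E[Q] = E[W]
assert q.var() <= w.var()                  # var(Q) <= var(W)
assert abs(np.mean(q * (q - w))) < 1e-8    # E[Q(Q - W)] = 0
```

The first and third assertions hold exactly (up to floating point) for any bin-mean quantizer, which is what self-consistency delivers.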

15 Data Quantization

16 Distortion due to Quantization
- Distortion is the error due to quantization
- In simple terms, the distortion is E[(W - Q)^2]
- Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere

17 Geometry-based Quantization
- Need space-filling tessellations
- Need congruent tiles
- Need tiles as spherical as possible

18 Geometry-based Quantization
- In one dimension:
  - The only polytope is a straight line segment (also bounded by a one-dimensional sphere)
- In two dimensions:
  - The only polytopes are equilateral triangles, squares, and hexagons

19 Geometry-based Quantization
- In 3 dimensions:
  - Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron
- In 4 dimensions:
  - 4-simplex, hypercube, 24-cell

Truncated octahedron tessellation

20 Geometry-based Quantization
Dimensionless Second Moment for 3-D Polytopes

Tetrahedron *           .1040042…
Cube *                  .0833333…
Octahedron              .0825482…
Hexagonal Prism *       .0812227…
Rhombic Dodecahedron *  .0787451…
Truncated Octahedron *  .0785433…
Dodecahedron            .0781285…
Icosahedron             .0778185…
Sphere                  .0769670
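The table's entries can be sanity-checked by Monte Carlo. For a region S with centroid c, volume V, and dimension d, the dimensionless second moment is G(S) = (1/d) E‖X − c‖² / V^(2/d) for X uniform on S; the sketch below (not part of the original deck) reproduces the cube's value:

```python
import numpy as np

rng = np.random.default_rng(2)
# One million points uniform in a unit cube centered at the origin.
pts = rng.random((1_000_000, 3)) - 0.5

# G = (1/d) * E||X - c||^2 / V^(2/d); for the unit cube V = 1 and d = 3,
# so G reduces to E||X||^2 / 3.
G = np.sum(pts ** 2, axis=1).mean() / 3

# G should be within Monte Carlo error of 1/12 = .0833333..., the
# table's entry for the cube.
```

The same recipe applied to a ball gives ≈ .0769670, the sphere's entry, which is why the sphere sits at the bottom of the table as the (unattainable) lower bound.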

21 Geometry-based Quantization
Tetrahedron, Cube, Octahedron, Icosahedron, Dodecahedron, Truncated Octahedron

22 Geometry-based Quantization Rhombic Dodecahedron

23 Geometry-based Quantization
Hexagonal Prism; 24-Cell with Cuboctahedron Envelope

24 Geometry-based Quantization
- Using 10^6 bins is computationally and visually feasible
- Fast binning: for data in the range [a, b] and k bins, j = floor[k(x_i - a)/(b - a)] gives the index of the bin for x_i in one dimension
- Computational complexity is 4n + 1 = O(n)
- Memory requirements drop to 3k: location of bin + # of items in bin + representor of bin, i.e. storage complexity is 3k
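The one-dimensional binning rule above can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def bin_index(x, a, b, k):
    """Index of the bin containing x, for k equal-width bins on [a, b].

    One subtraction, one multiplication, one division, and one floor
    per point -- roughly 4 operations per case, i.e. O(n) over n points.
    """
    j = np.floor(k * (np.asarray(x) - a) / (b - a)).astype(int)
    return np.clip(j, 0, k - 1)   # x == b would otherwise fall in bin k

print(bin_index([0.0, 2.5, 9.99, 10.0], a=0.0, b=10.0, k=10))
# -> [0 2 9 9]
```

No comparisons or searches are needed, which is what keeps the constant so small compared with generic histogram routines.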

25 Geometry-based Quantization
- In two dimensions:
  - Each hexagon is indexed by 3 parameters
  - Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n)
  - Complexity for squares is 2 times the 1-D complexity
  - The ratio is 3/2
  - Storage complexity is still 3k

26 Geometry-based Quantization
- In 3 dimensions:
  - For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides
  - Computational complexity is 28n + 7 = O(n)
  - Computational complexity for a cube is 12n + 3
  - The ratio is 7/3
  - Storage complexity is still 3k

27 Quantization Strategies
- Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions:
  - Complexity is always O(n)
  - Storage complexity is 3k
  - # of tiles grows exponentially with dimension, the so-called curse of dimensionality
  - Higher-dimensional geometry is poorly known
  - Computational complexity grows faster than for the hypercube

28 Quantization Strategies
- For purposes of simplicity, always use the hypercube or d-dimensional simplices:
  - Computational complexity is always O(n)
  - Methods for data-adaptive tiling are available
  - Storage complexity is 3k
  - # of tiles grows exponentially with dimension
  - Both polytopes depart from spherical shape rapidly as d increases
  - The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature

29 Quantization Strategies
- Conclusions on geometric quantization:
  - The geometric approach is good to 4 or 5 dimensions
  - Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure
  - Good for large n, but weaker for large d

30 Quantization Strategies
- Alternate strategy: form bins via clustering
  - Known in the electrical engineering literature as vector quantization
  - Distance-based clustering is O(n^2), which implies poor performance for large n
  - Not terribly dependent on the dimension, d
  - Clusters may be very out of round, not even convex
- Conclusion:
  - The cluster approach may work for large d, but fails for large n
  - Not particularly applicable to "massive" data mining

31 Quantization Strategies
- Third strategy: density-based clustering
  - Density estimation with kernel estimators is O(n)
  - Uses the modes m_j to form clusters
  - Put x_i in cluster j if it is closest to mode m_j
  - This procedure is distance based, but with complexity O(kn), not O(n^2)
  - Normal mixture densities may be an alternative approach
  - Roundness may be a problem
- But quantization based on density-based clustering offers promise for both large d and large n
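A one-dimensional sketch of this strategy (the mixture, bandwidth, and grid size below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two-component mixture: density-based clustering should find two modes.
x = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

# Gaussian kernel density estimate evaluated on a grid.
grid = np.linspace(x.min(), x.max(), 200)
h = 0.3   # bandwidth (illustrative)
dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).mean(axis=1)

# Modes = local maxima of the estimated density on the grid;
# with these settings the estimate is bimodal, so typically two modes.
is_mode = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
modes = grid[1:-1][is_mode]

# Assign each point to its nearest mode: O(kn) with k modes, not O(n^2).
labels = np.argmin(np.abs(x[:, None] - modes[None, :]), axis=1)
```

The nearest-mode assignment is where the O(kn) claim comes from: each of the n points is compared against only the k modes, not against the other n - 1 points.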

32 Data Quantization
- Binning does not lose fine structure in the tails, as sampling might
- Roundoff analysis applies
- At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data
- Discretization: a finite number of bins implies discrete variables, more compatible with categorical data

33 Data Quantization
- Analysis on a finite subset of the integers has theoretical advantages:
  - Analysis is less delicate: different forms of convergence are equivalent
  - Analysis is often more natural, since the data are already quantized or categorical
  - Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)
