Published by Catherine Franklin. Modified over 9 years ago.
1
Statistical Data Mining - 3
Edward J. Wegman
A Short Course for Interface ’01
2
Visual Data Mining
3
Outline of Lecture
- Visual Complexity
- Description of Basic Techniques
  - Parallel Coordinates
  - Grand Tour
  - Saturation Brushing
- Illustrations of Basic Techniques
  - Rapid Data Editing, Density Estimation (Pollen Data)
  - Inverse Regression, Tree Structured Decision Rules (Bank Data)
  - Classification & Clustering (SALAD Data & Artificial Nose)
  - Structural Inference (PRIM 7 Data)
  - Data Mining (BLS Cereal Scanner Data)
  - Cluster Trees (Oronsay Sand Particle Size Data)
4
Visual Complexity
Scenarios
- Typical high resolution workstations: 1280x1024 = 1.31x10^6 pixels
- Realistic using Wegman, immersion, 4:5 aspect ratio: 2333x1866 = 4.35x10^6 pixels
- Very optimistic using 1 minute arc, immersion, 4:5 aspect ratio: 8400x6720 = 5.65x10^7 pixels
- Wildly optimistic using Maar(2), immersion, 4:5 aspect ratio: 17,284x13,828 = 2.39x10^8 pixels
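The pixel budgets in these scenarios are plain width-times-height products, so they are easy to recompute. The following is a minimal sketch; the scenario labels and function names are illustrative and not taken from the slides. Evaluated exactly, the third product is 56,448,000, marginally below the slide's rounded 5.65x10^7.

```python
# Display scenarios from the slide: (label, width, height) in pixels.
# Labels are shorthand chosen for this sketch.
SCENARIOS = [
    ("typical workstation", 1280, 1024),
    ("realistic (Wegman)", 2333, 1866),
    ("very optimistic (1 minute of arc)", 8400, 6720),
    ("wildly optimistic (Maar)", 17284, 13828),
]


def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height


def aspect_ratio(width, height):
    """Height divided by width; 0.8 corresponds to the 4:5 ratio on the slide."""
    return height / width


for label, w, h in SCENARIOS:
    print(f"{label}: {w} x {h} = {pixel_count(w, h):.3e} pixels, "
          f"height/width = {aspect_ratio(w, h):.3f}")
```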
5
Visual Complexity
- Visualization for Data Mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations. This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer. This also roughly corresponds to the number of foveal cones in the eye.
6
Methodologies for Visual Data Mining
- Parallel Coordinates
  - Effective Method for High Dimensional Data
  - High Dimensions = Multiple Attributes
- Grand Tour
  - Generalized Rotation in High Dimensions
  - In Depth Study of High Dimensional Data
- Saturation Brushing
  - Effective Method for Large Data Sets
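As an illustration of the parallel coordinates idea, each d-dimensional observation can be turned into a polyline with one vertex per attribute axis. The sketch below assumes min-max scaling of every attribute to a common [0, 1] axis height; that scaling choice, the function names, and the sample data are all hypothetical and not taken from the course.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` to [0, 1] so all axes share one height."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.5  # constant column: centre it
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def to_polylines(rows):
    """Map each observation to a list of (axis_index, height) vertices."""
    return [list(enumerate(scaled_row)) for scaled_row in min_max_scale(rows)]


# Hypothetical data: 3 observations with 4 attributes each.
data = [
    [1.0, 200.0, 0.5, 10.0],
    [2.0, 100.0, 0.7, 30.0],
    [3.0, 150.0, 0.9, 20.0],
]
for line in to_polylines(data):
    print(line)
```

Drawing each vertex list as a connected line across equally spaced vertical axes gives the parallel coordinate display; adding an attribute only adds one more axis, which is why the method extends to many dimensions.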
7
Visual Data Mining Techniques
- Multidimensional Data Visualization
  - Scatterplot matrix
  - Parallel coordinate plots
  - 3-D stereoscopic scatterplots
  - Grand tour on all plot devices
  - Density plots
  - Linked views
  - Saturation brushing
  - Pruning and cropping
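Saturation brushing is commonly described as drawing each observation in a heavily desaturated color and letting overplotted pixels add up, so that dense regions appear saturated while isolated points stay faint. The sketch below works under that reading; the grid size, the `alpha` increment, and the function name are assumptions made for illustration.

```python
def saturation_brush(points, width, height, alpha=0.05):
    """Accumulate a faint intensity per point; overplotted pixels approach 1.0.

    `points` holds (x, y) pairs assumed to be already scaled to [0, 1].
    `alpha` is a hypothetical per-point intensity increment.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        # Additive blending, clipped at full saturation.
        image[row][col] = min(1.0, image[row][col] + alpha)
    return image


# Hypothetical data: a dense cluster and a single isolated point.
dense = [(0.5, 0.5)] * 40
sparse = [(0.1, 0.1)]
image = saturation_brush(dense + sparse, width=4, height=4, alpha=0.05)
print(image[2][2], image[0][0])
```

With millions of observations, a small `alpha` keeps individual points nearly invisible, so the picture behaves like a density estimate instead of a solid blob of overplotting.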
8
Crystal Vision
12
Data Editing and Density Estimation
- Pollen Data
  - 3848 points
  - 5 dimensions
13
Pollen Data
19
Inverse Regression and Tree Structured Decision Rules with Financial Data
- Bank Demographic Data in 8 Dimensions with 12,000+ points
20
Inverse Regression and Tree Structured Decision Rules with Financial Data
23
Classification and Clustering Using SALAD Data
- Chemical Agent Detection Data in 13 Dimensions with 10,000+ points
24
Classification and Clustering Using SALAD Data
26
Artificial Dog Nose
- 19 dimensional time series in 2 spectral bands
  - 60 time steps for 300 chemical species
27
Artificial Dog Nose Time series in two spectral bands for same chemical species
28
Artificial Dog Nose Phase loop
29
Artificial Dog Nose Orthogonal components
30
Artificial Dog Nose
After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
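One way to read the starred variables is as coordinates of the data after the grand tour's generalized rotation, which keeps the axes orthogonal. The sketch below builds such a rotation as a sequence of plane rotations whose angles advance with time. This is a simplified illustration under that assumption, not the algorithm used to produce the slides; the speed schedule in `make_rates` and the sample points are hypothetical.

```python
import math
from itertools import combinations


def plane_rotate(point, i, j, angle):
    """Rotate `point` by `angle` radians in the plane spanned by axes i and j."""
    c, s = math.cos(angle), math.sin(angle)
    rotated = list(point)
    rotated[i] = c * point[i] - s * point[j]
    rotated[j] = s * point[i] + c * point[j]
    return rotated


def make_rates(dim):
    """Illustrative schedule: one distinct angular speed per coordinate plane."""
    pairs = combinations(range(dim), 2)
    return {pair: math.sqrt(k + 2) for k, pair in enumerate(pairs)}


def grand_tour_frame(points, t, rates):
    """Apply the rotation for time t to every point, giving the x* coordinates."""
    frame = []
    for p in points:
        q = list(p)
        for (i, j), rate in rates.items():
            q = plane_rotate(q, i, j, rate * t)
        frame.append(q)
    return frame


# Hypothetical 4-dimensional observations.
points = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 1.0]]
rates = make_rates(4)
for t in (0.0, 0.1, 0.2):
    print(t, grand_tour_frame(points, t, rates)[0])
```

Because every step is a rotation, distances and angles between observations are preserved; only the viewing direction changes, so a separation seen in some rotated variables reflects structure in the data and not distortion.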
31
Artificial Dog Nose Four chemical species, target highlighted in red
32
Artificial Dog Nose
Target species separated by x1*, x3*, x5*, x6*, x11*, x15*
33
PRIM-7
- 7 dimensional high energy physics data
- 500 data points
- pi-meson proton interaction
34
Structural Inference Using PRIM 7 Data
35
Structural Inference Using PRIM 7 Data
36
Structural Inference Using PRIM 7 Data
37
Structural Inference Using PRIM 7 Data
38
Structural Inference Using PRIM 7 Data
39
Scanner Data for Breakfast Cereals
- 5.5 gigabytes of scanner data in relational database
  - Price, sales volume, promotion, store, chain, PSU, UPC
  - Work done at BLS
- Phase 1 – Basic Data Analysis – Single Month
- Phase 2 – Price Relative Effects – 1 Year
- Phase 3 – Churning Effects – 5 Years
40
Scanner Data for Breakfast Cereals Promotion has huge impact on sales volume
41
Scanner Data for Breakfast Cereals Stores not randomized
42
Scanner Data for Breakfast Cereals Aggressive promotion pays
43
Scanner Data for Breakfast Cereals
45
Phase 2
46
Scanner Data for Breakfast Cereals
47
Outliers belong to same chain
48
Scanner Data for Breakfast Cereals Promotion both years
49
Scanner Data for Breakfast Cereals Range of items with no promotion
50
Scanner Data for Breakfast Cereals One chain ceased promotions
51
Scanner Data for Breakfast Cereals Phase 3
52
Scanner Data for Breakfast Cereals Churning comes from both new items and new stores
53
Scanner Data for Breakfast Cereals
Churning Effects: Red: PR=0, Blue: PR>0, Green: PR=infinity
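The color coding can be sketched as a three-way classification of the price relative (PR). The code assumes PR is the ratio of the current-period value to the previous-period value, with a value of 0 standing for an item absent in that period, so PR=0 marks a discontinued item and PR=infinity a new one. That interpretation and the function names are assumptions, not stated on the slide.

```python
import math


def price_relative(previous, current):
    """Ratio current/previous; 0 is assumed to encode 'item absent' in a period."""
    if previous == 0 and current == 0:
        return None          # absent in both periods: no relative defined
    if previous == 0:
        return math.inf      # item appears only in the current period
    return current / previous


def churn_colour(pr):
    """Map a price relative to the slide's colors (red, blue, green)."""
    if pr is None:
        return None
    if pr == 0:
        return "red"         # PR = 0
    if math.isinf(pr):
        return "green"       # PR = infinity
    return "blue"            # finite PR > 0


# Hypothetical (previous, current) pairs for three items.
for prev, cur in [(2.0, 0.0), (2.0, 3.0), (0.0, 3.0)]:
    pr = price_relative(prev, cur)
    print(prev, cur, pr, churn_colour(pr))
```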
54
Scanner Data for Breakfast Cereals New items tend to have higher prices
55
Scanner Data for Breakfast Cereals Many discontinued items have high expenditures
56
Scanner Data for Breakfast Cereals Effect of item churning
57
Scanner Data for Breakfast Cereals Removing Store Birth-Death Effects
58
Scanner Data for Breakfast Cereals Outlier due to price coding error
59
Scanner Data for Breakfast Cereals Effects of Cereal Types
60
Scanner Data for Breakfast Cereals Quantity Effects
61
Sands of Time Data
- 300 Samples of Sand Data from Oronsay Island in the Scotch Hebrides
62
Sands of Time - Objective
“The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.”
Flenley and Olbricht, 1993
63
Sands of Time - Objective
- Cluster samples of modern sand into “beach-like” or “dune-like” sand.
- Classify archeological sand samples as to whether they are beach sand or dune sand.
64
Sands of Time – Parametric Analysis
- Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
- Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
- Models have 2 to 4 parameters; theory developed, practice problematic.
65
Sands of Time - Graphical Analysis
- Multidimensional Parallel Coordinate Display Combined with Grand Tour.
- BRUSH-TOUR strategy
  - Clusters recognized by gaps in any horizontal axis.
  - Brush existing clusters with colors.
  - Execute grand tour until new clusters appear, brush again.
  - Continue until clusters are exhausted.
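In the BRUSH-TOUR strategy the analyst spots gaps by eye. As a rough automated analogue of the "gap in any axis" step, the sketch below scans each axis for a gap between sorted values that exceeds a threshold and labels observations by which side of the gap they fall on. The `min_gap` threshold, the function names, and the sample data are assumptions for illustration; the slides describe a visual, interactive procedure.

```python
def split_at_largest_gap(values, min_gap):
    """Return (threshold, labels) if the widest gap is at least min_gap, else None."""
    ordered = sorted(values)
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return None
    width, low, high = max(gaps)
    if width < min_gap:
        return None
    threshold = (low + high) / 2
    # Label 0 below the gap, 1 above it (the "brush" colors).
    return threshold, [0 if v < threshold else 1 for v in values]


def find_gap_axis(rows, min_gap):
    """Scan axes in order; return (axis, threshold, labels) for the first gap found."""
    for axis, column in enumerate(zip(*rows)):
        result = split_at_largest_gap(column, min_gap)
        if result is not None:
            return (axis,) + result
    return None


# Hypothetical observations: no gap on axis 0, a clear gap on axis 1.
rows = [
    [0.10, 0.50],
    [0.20, 0.52],
    [0.15, 0.90],
    [0.12, 0.95],
]
print(find_gap_axis(rows, min_gap=0.3))
```

When no axis shows a gap, the strategy calls for rotating the data with the grand tour and checking the rotated axes again, repeating until no further clusters separate.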
66
Mining the Sands of Time
73
Sands of Time - Conclusions
- Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
- Data at small and at large particle dimensions is too quantized to be used effectively.
- The visual based BRUSH-TOUR strategy is extremely effective at clustering.
74
Sands of Time - Conclusions Continued
- Midden sands are neither modern beach sands nor modern dune sands.
- Midden sands are more similar to modern dune sands.
- This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.