Statistical Data Mining - 3 Edward J. Wegman A Short Course for Interface ‘01.


Visual Data Mining

Outline of Lecture
- Visual Complexity
- Description of Basic Techniques
  - Parallel Coordinates
  - Grand Tour
  - Saturation Brushing
- Illustrations of Basic Techniques
  - Rapid Data Editing, Density Estimation (Pollen Data)
  - Inverse Regression, Tree Structured Decision Rules (Bank Data)
  - Classification & Clustering (SALAD Data & Artificial Nose)
  - Structural Inference (PRIM 7 Data)
  - Data Mining (BLS Cereal Scanner Data)
  - Cluster Trees (Oronsay Sand Particle Size Data)

Visual Complexity Scenarios
- Typical high resolution workstation: 1280 x 1024 = 1.31 x 10^6 pixels
- Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333 x 1866 = 4.35 x 10^6 pixels
- Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400 x 6720 = 5.64 x 10^7 pixels
- Wildly optimistic, using Maar(2), immersion, 4:5 aspect ratio: 17,284 x 13,828 = 2.39 x 10^8 pixels
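The pixel counts in these scenarios are simply width times height. A minimal Python sketch of that arithmetic follows; the short labels and the `pixel_count` helper are illustrative names, not part of the original slides.

```python
# Screen resolutions quoted on the slide: (label, width, height) in pixels.
scenarios = [
    ("typical workstation", 1280, 1024),
    ("realistic (Wegman)", 2333, 1866),
    ("very optimistic (1 minute of arc)", 8400, 6720),
    ("wildly optimistic (Maar(2))", 17284, 13828),
]

def pixel_count(width, height):
    """Total number of addressable pixels on a width x height display."""
    return width * height

for label, w, h in scenarios:
    # Scientific notation makes the orders of magnitude easy to compare.
    print(f"{label}: {w} x {h} = {pixel_count(w, h):.2e} pixels")
```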

Visual Complexity
- Visualization for data mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations. This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer. It also roughly corresponds to the number of foveal cones in the eye.

Methodologies for Visual Data Mining
- Parallel Coordinates
  - Effective Method for High Dimensional Data
  - High Dimensions = Multiple Attributes
- Grand Tour
  - Generalized Rotation in High Dimensions
  - In Depth Study of High Dimensional Data
- Saturation Brushing
  - Effective Method for Large Data Sets
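To make the parallel coordinates idea concrete: each attribute gets its own vertical axis, and each observation becomes a polyline crossing every axis at its value. The sketch below is a minimal illustration in plain Python. It assumes min-max scaling of each axis to [0, 1]; the function name and toy data are hypothetical, and this is not the implementation used in the course software.

```python
def to_parallel_coordinates(rows):
    """Map each d-dimensional row to a polyline of (axis_index, scaled_value).

    Each attribute is min-max scaled to [0, 1] so all axes share one height.
    """
    dims = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(dims)]
    highs = [max(r[j] for r in rows) for j in range(dims)]
    polylines = []
    for r in rows:
        line = []
        for j in range(dims):
            span = highs[j] - lows[j]
            # A constant attribute has no spread; draw it at mid-height.
            y = 0.5 if span == 0 else (r[j] - lows[j]) / span
            line.append((j, y))
        polylines.append(line)
    return polylines

# Hypothetical toy data: three observations with four attributes each.
data = [[1.0, 10.0, 5.0, 2.0],
        [2.0, 30.0, 5.0, 4.0],
        [3.0, 20.0, 5.0, 8.0]]
lines = to_parallel_coordinates(data)
```

Each entry of `lines` lists the vertices of one polyline, which a plotting layer would connect from the first axis to the last.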

Visual Data Mining Techniques
- Multidimensional Data Visualization
  - Scatterplot matrix
  - Parallel coordinate plots
  - 3-D stereoscopic scatterplots
  - Grand tour on all plot devices
  - Density plots
  - Linked views
  - Saturation brushing
  - Pruning and cropping
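The grand tour listed above can be thought of as a smoothly changing rotation of the data in high dimensions. The following sketch, in plain Python, rotates points in every coordinate plane at a different angular speed. It is a simplified illustration only: the choice of speeds (square roots of primes), the function names, and the toy points are assumptions, and the slides do not specify how the tour is actually generated.

```python
import math
from itertools import combinations

def rotate_in_plane(point, i, j, theta):
    """Rotate one point by angle theta in the plane of coordinates i and j."""
    out = list(point)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * point[i] - s * point[j]
    out[j] = s * point[i] + c * point[j]
    return out

def tour_frame(points, t, speeds):
    """Rotate every point in each coordinate plane by angle speed * t.

    `speeds` maps each pair (i, j) to an angular speed; giving the pairs
    different speeds makes the view drift through many orientations.
    """
    frame = []
    for p in points:
        q = list(p)
        for (i, j), w in speeds.items():
            q = rotate_in_plane(q, i, j, w * t)
        frame.append(q)
    return frame

dims = 3
# Hypothetical speeds: square roots of primes, chosen to avoid simple ratios.
primes = [2, 3, 5]
pairs = combinations(range(dims), 2)
speeds = {pair: math.sqrt(p) for pair, p in zip(pairs, primes)}
points = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
frame = tour_frame(points, t=0.7, speeds=speeds)
```

Stepping `t` forward and redrawing any of the plots above (scatterplot, parallel coordinates) from the rotated coordinates gives an animated tour. Because every step is a rotation, distances between points are preserved.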

Crystal Vision

Data Editing and Density Estimation
- Pollen Data
  - 3848 points
  - 5 dimensions

Pollen Data

Inverse Regression and Tree Structured Decision Rules with Financial Data
- Bank Demographic Data in 8 Dimensions with 12,000+ points

Inverse Regression and Tree Structured Decision Rules with Financial Data

Classification and Clustering Using SALAD Data
- Chemical Agent Detection Data in 13 Dimensions with 10,000+ points

Classification and Clustering Using SALAD Data

Artificial Dog Nose
- 19 dimensional time series in 2 spectral bands
  - 60 time steps for 300 chemical species

Artificial Dog Nose: Time series in two spectral bands for same chemical species

Artificial Dog Nose: Phase loop

Artificial Dog Nose: Orthogonal components

Artificial Dog Nose: After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands

Artificial Dog Nose: Four chemical species, target highlighted in red

Artificial Dog Nose: Target species separated by x1*, x3*, x5*, x6*, x11*, x15*

PRIM-7
- 7 dimensional high energy physics data
- 500 data points
- pi-meson proton interaction

Structural Inference Using PRIM 7 Data

Scanner Data for Breakfast Cereals
- 5.5 gigabytes of scanner data in relational database
  - Price, sales volume, promotion, store, chain, PSU, UPC
  - Work done at BLS
- Phase 1 – Basic Data Analysis – Single Month
- Phase 2 – Price Relative Effects – 1 Year
- Phase 3 – Churning Effects – 5 Years

Scanner Data for Breakfast Cereals: Promotion has huge impact on sales volume

Scanner Data for Breakfast Cereals: Stores not randomized

Scanner Data for Breakfast Cereals: Aggressive promotion pays

Scanner Data for Breakfast Cereals: Phase 2

Scanner Data for Breakfast Cereals: Outliers belong to same chain

Scanner Data for Breakfast Cereals: Promotion both years

Scanner Data for Breakfast Cereals: Range of items with no promotion

Scanner Data for Breakfast Cereals: One chain ceased promotions

Scanner Data for Breakfast Cereals: Phase 3

Scanner Data for Breakfast Cereals: Churning comes from both new items and new stores

Scanner Data for Breakfast Cereals: Churning Effects. Red: PR = 0, Blue: PR > 0, Green: PR = infinity
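The color coding above can be sketched in Python. The slides do not define PR, so the code assumes it is the price relative: later-period price divided by base-period price. Under that assumption PR = 0 marks an item with no later price (discontinued) and PR = infinity marks an item with no base price (new). The function names and the use of `None` for a missing price are hypothetical.

```python
import math

def price_relative(base_price, later_price):
    """Ratio later/base; None means the item was absent in that period."""
    if base_price is None and later_price is None:
        raise ValueError("item absent in both periods")
    if base_price is None:
        return math.inf   # assumed: new item, no base-period price
    if later_price is None:
        return 0.0        # assumed: discontinued item, no later price
    return later_price / base_price

def churn_color(pr):
    """Color coding from the slide: red PR=0, blue PR>0, green PR=infinity."""
    if pr == 0:
        return "red"
    if math.isinf(pr):
        return "green"
    return "blue"

# Hypothetical items: (base-period price, later-period price).
items = {"continuing": (2.0, 3.0), "discontinued": (2.0, None), "new": (None, 3.0)}
colors = {name: churn_color(price_relative(b, l)) for name, (b, l) in items.items()}
```

Brushing points with these three colors separates continuing items from the churned ones, which is the distinction the Phase 3 slides rely on.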

Scanner Data for Breakfast Cereals: New items tend to have higher prices

Scanner Data for Breakfast Cereals: Many discontinued items have high expenditures

Scanner Data for Breakfast Cereals: Effect of item churning

Scanner Data for Breakfast Cereals: Removing Store Birth-Death Effects

Scanner Data for Breakfast Cereals: Outlier due to price coding error

Scanner Data for Breakfast Cereals: Effects of Cereal Types

Scanner Data for Breakfast Cereals: Quantity Effects

Sands of Time Data
- 300 Samples of Sand Data from Oronsay Island in the Scottish Hebrides

Sands of Time - Objective
“The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.” (Flenley and Olbricht, 1993)

Sands of Time - Objective
- Cluster samples of modern sand into “beach-like” or “dune-like” sand.
- Classify archeological sand samples as to whether they are beach sand or dune sand.

Sands of Time – Parametric Analysis
- Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
- Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
- Models have 2 to 4 parameters; theory developed, practice problematic.

Sands of Time - Graphical Analysis
- Multidimensional Parallel Coordinate Display Combined with Grand Tour.
- BRUSH-TOUR strategy
  - Clusters recognized by gaps in any horizontal axis.
  - Brush existing clusters with colors.
  - Execute grand tour until new clusters appear, brush again.
  - Continue until clusters are exhausted.
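The BRUSH-TOUR strategy is carried out visually and interactively. As a rough illustration of its first two steps only, the sketch below scans each axis for a gap and labels ("brushes") the rows on either side of it. The `min_gap` threshold, the function name, and the toy data are assumptions; the grand tour rotation between passes is not shown.

```python
def brush_by_gaps(rows, min_gap):
    """Find the first axis showing a gap wider than min_gap.

    Returns (axis, labels), where labels[k] is the cluster index of rows[k]
    along that axis, or None if no axis shows a gap in the current view.
    """
    dims = len(rows[0])
    for axis in range(dims):
        # Visit rows in increasing order of their value on this axis.
        order = sorted(range(len(rows)), key=lambda k: rows[k][axis])
        labels = [0] * len(rows)
        label = 0
        for prev, cur in zip(order, order[1:]):
            # A jump larger than min_gap starts a new cluster.
            if rows[cur][axis] - rows[prev][axis] > min_gap:
                label += 1
            labels[cur] = label
        if label > 0:
            return axis, labels
    return None

# Hypothetical toy data: no gap on axis 0, one clear gap on axis 1.
rows = [[0.10, 0.50], [0.20, 0.52], [0.15, 0.90], [0.18, 0.95]]
result = brush_by_gaps(rows, min_gap=0.2)
```

In a full pass, a `None` result would correspond to running the grand tour to a new orientation and testing again, stopping when rotated views no longer reveal new gaps.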

Mining the Sands of Time

Sands of Time - Conclusions
- Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
- Data at small and at large particle dimensions is too quantized to be used effectively.
- The visual based BRUSH-TOUR strategy is extremely effective at clustering.

Sands of Time - Conclusions Continued
- Midden sands are neither modern beach sands nor modern dune sands.
- Midden sands are more similar to modern dune sands.
- This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.
