Statistical Data Mining - 3 Edward J. Wegman A Short Course for Interface ‘01.

1 Statistical Data Mining - 3 Edward J. Wegman A Short Course for Interface ‘01

2 Visual Data Mining

3 Outline of Lecture
- Visual Complexity
- Description of Basic Techniques
  - Parallel Coordinates
  - Grand Tour
  - Saturation Brushing
- Illustrations of Basic Techniques
  - Rapid Data Editing, Density Estimation (Pollen Data)
  - Inverse Regression, Tree Structured Decision Rules (Bank Data)
  - Classification & Clustering (SALAD Data & Artificial Nose)
  - Structural Inference (PRIM 7 Data)
  - Data Mining (BLS Cereal Scanner Data)
  - Cluster Trees (Oronsay Sand Particle Size Data)

4 Visual Complexity Scenarios
- Typical high-resolution workstation: 1280 x 1024 = 1.31 x 10^6 pixels
- Realistic, using Wegman, immersion, 4:5 aspect ratio: 2333 x 1866 = 4.35 x 10^6 pixels
- Very optimistic, using 1 minute of arc, immersion, 4:5 aspect ratio: 8400 x 6720 = 5.64 x 10^7 pixels
- Wildly optimistic, using Maar(2), immersion, 4:5 aspect ratio: 17,284 x 13,828 = 2.39 x 10^8 pixels
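The pixel counts on this slide are plain width-times-height products. A minimal sketch that recomputes them from the stated resolutions follows; the function names `pixel_count` and `sci` are illustrative and not part of the course material.

```python
# Display scenarios from the slide: (label, width, height) in pixels.
SCENARIOS = [
    ("Typical high-resolution workstation", 1280, 1024),
    ("Realistic (Wegman), immersion, 4:5", 2333, 1866),
    ("Very optimistic (1 minute of arc), immersion, 4:5", 8400, 6720),
    ("Wildly optimistic (Maar(2)), immersion, 4:5", 17284, 13828),
]


def pixel_count(width, height):
    """Total number of pixels on a width x height display."""
    return width * height


def sci(n):
    """Format an integer as 'mantissa x 10^exponent' with two decimals."""
    mantissa, exponent = f"{n:.2e}".split("e")
    return f"{mantissa} x 10^{int(exponent)}"


for label, w, h in SCENARIOS:
    print(f"{label}: {w} x {h} = {sci(pixel_count(w, h))} pixels")
```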

5 Visual Complexity
- Visualization for data mining can realistically hope to deal with somewhere on the order of 10^6 to 10^7 observations. This coincides with the approximate limits for interactive computing of O(n^2) algorithms and for data transfer. It also roughly corresponds to the number of foveal cones in the eye.

6 Methodologies for Visual Data Mining
- Parallel Coordinates
  - Effective Method for High Dimensional Data
  - High Dimensions = Multiple Attributes
- Grand Tour
  - Generalized Rotation in High Dimensions
  - In Depth Study of High Dimensional Data
- Saturation Brushing
  - Effective Method for Large Data Sets
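To make the parallel coordinates idea concrete, here is a minimal sketch of the underlying mapping: each attribute gets its own axis, and an observation becomes a polyline with one vertex per axis. The function name `parallel_coordinates` and the min-max rescaling are assumptions for illustration; the slides do not say how the axes are scaled, and the orientation of the axes is a display choice left out here.

```python
def parallel_coordinates(rows):
    """Map each d-dimensional row to a polyline with one vertex per axis.

    Each vertex is (axis index, position along that axis). Every attribute
    is rescaled to [0, 1] using its own minimum and maximum so that all
    axes share the same length.
    """
    dims = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(dims)]
    highs = [max(r[j] for r in rows) for j in range(dims)]
    polylines = []
    for r in rows:
        line = []
        for j in range(dims):
            span = highs[j] - lows[j]
            # A constant attribute has no spread; park it mid-axis.
            pos = (r[j] - lows[j]) / span if span else 0.5
            line.append((j, pos))
        polylines.append(line)
    return polylines


# Three observations with three attributes (made-up numbers).
data = [(1.0, 10.0, 5.0), (2.0, 30.0, 5.0), (3.0, 20.0, 5.0)]
for line in parallel_coordinates(data):
    print(line)
```

Because every observation is a polyline across all axes, adding an attribute only adds one more axis, which is why the display scales to many dimensions.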

7 Visual Data Mining Techniques
- Multidimensional Data Visualization
  - Scatterplot matrix
  - Parallel coordinate plots
  - 3-D stereoscopic scatterplots
  - Grand tour on all plot devices
  - Density plots
  - Linked views
  - Saturation brushing
  - Pruning and cropping
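Saturation brushing is listed here without detail. The sketch below shows one common reading of the idea: every observation adds a small amount of colour saturation to the pixel it lands on, so heavily overplotted regions become saturated while isolated points stay faint. The function name `saturation_brush`, the grid representation, and the `increment` value are assumptions for illustration, not a description of any particular software.

```python
def saturation_brush(points, width, height, increment=0.05):
    """Accumulate a small saturation increment per observation.

    points: (x, y) pairs already scaled to the unit square.
    Returns a height x width grid of saturations, clipped at 1.0.
    """
    grid = [[0.0] * width for _ in range(height)]
    for x, y in points:
        # Clamp so that x == 1.0 or y == 1.0 still falls on the last pixel.
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        grid[row][col] = min(1.0, grid[row][col] + increment)
    return grid


# Three overplotted observations and one isolated observation.
points = [(0.1, 0.1)] * 3 + [(0.9, 0.9)]
for row in saturation_brush(points, width=4, height=4, increment=0.25):
    print(row)
```

The resulting grid behaves like a crude density estimate, which is what makes the technique useful when there are more observations than pixels.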

8 Crystal Vision

9

10

11

12 Data Editing and Density Estimation
- Pollen Data
  - 3848 points
  - 5 dimensions

13 Pollen Data

14

15

16

17

18

19 Inverse Regression and Tree Structured Decision Rules with Financial Data
- Bank Demographic Data in 8 Dimensions with 12,000+ points

20 Inverse Regression and Tree Structured Decision Rules with Financial Data

21

22

23 Classification and Clustering Using SALAD Data
- Chemical Agent Detection Data in 13 Dimensions with 10,000+ points

24 Classification and Clustering Using SALAD Data

25

26 Artificial Dog Nose
- 19-dimensional time series in 2 spectral bands
  - 60 time steps for 300 chemical species

27 Artificial Dog Nose: Time series in two spectral bands for the same chemical species

28 Artificial Dog Nose: Phase loop

29 Artificial Dog Nose: Orthogonal components

30 Artificial Dog Nose: After grand tour, orthogonal variables x2*, x9*, x15*, x16*, x18* separate the two spectral bands
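The grand tour is described earlier in the deck as a generalized rotation in high dimensions. A simplified sketch of one way to generate such rotations follows: at each time step the data are rotated in every coordinate plane, with each plane turning at its own rate. The names `rotate_plane` and `tour_step` and the specific angular rates are hypothetical choices for illustration; the slides do not specify how the tour is parameterized.

```python
import math
from itertools import combinations


def rotate_plane(point, i, j, angle):
    """Rotate one point by `angle` in the plane of coordinates i and j."""
    c, s = math.cos(angle), math.sin(angle)
    out = list(point)
    out[i] = c * point[i] - s * point[j]
    out[j] = s * point[i] + c * point[j]
    return out


def tour_step(points, t, rates):
    """Apply a rotation in every coordinate plane at time t."""
    planes = list(combinations(range(len(points[0])), 2))
    rotated = [list(p) for p in points]
    for (i, j), rate in zip(planes, rates):
        rotated = [rotate_plane(p, i, j, rate * t) for p in rotated]
    return rotated


# Hypothetical rates: square roots of distinct primes, chosen so that the
# plane angles do not fall into a short repeating cycle.
rates = [math.sqrt(p) for p in (2, 3, 5, 7, 11, 13)]  # one per plane in 4-D
points = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
for t in (0.0, 0.1, 0.2):
    frame = tour_step(points, t, rates)
    print([[round(v, 3) for v in p] for p in frame])
```

Each frame is a rigid rotation of the data, so distances between observations are preserved; the rotated coordinates play the role of the starred variables (x2*, x9*, and so on) named on the slide.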

31 Artificial Dog Nose: Four chemical species, target highlighted in red

32 Artificial Dog Nose: Target species separated by x1*, x3*, x5*, x6*, x11*, x15*

33 PRIM-7
- 7-dimensional high energy physics data
- 500 data points
- pi-meson proton interaction

34 Structural Inference Using PRIM 7 Data

35 Structural Inference Using PRIM 7 Data

36 Structural Inference Using PRIM 7 Data

37 Structural Inference Using PRIM 7 Data

38 Structural Inference Using PRIM 7 Data

39 Scanner Data for Breakfast Cereals
- 5.5 gigabytes of scanner data in a relational database
  - Price, sales volume, promotion, store, chain, PSU, UPC
  - Work done at BLS
- Phase 1 – Basic Data Analysis – Single Month
- Phase 2 – Price Relative Effects – 1 Year
- Phase 3 – Churning Effects – 5 Years

40 Scanner Data for Breakfast Cereals: Promotion has huge impact on sales volume

41 Scanner Data for Breakfast Cereals: Stores not randomized

42 Scanner Data for Breakfast Cereals: Aggressive promotion pays

43 Scanner Data for Breakfast Cereals

44

45 Phase 2

46 Scanner Data for Breakfast Cereals

47 Outliers belong to same chain

48 Scanner Data for Breakfast Cereals: Promotion both years

49 Scanner Data for Breakfast Cereals: Range of items with no promotion

50 Scanner Data for Breakfast Cereals: One chain ceased promotions

51 Scanner Data for Breakfast Cereals: Phase 3

52 Scanner Data for Breakfast Cereals: Churning comes from both new items and new stores

53 Scanner Data for Breakfast Cereals: Churning Effects. Red: PR = 0, Blue: PR > 0, Green: PR = infinity
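The colour coding on this slide can be sketched in code under a stated assumption: that PR is the price relative (current price divided by previous price), so PR = 0 marks an item with no current price and PR = infinity marks an item with no previous price. The slides do not define PR explicitly, so this reading, along with the function names `price_relative` and `churn_colour`, is hypothetical.

```python
import math


def price_relative(previous_price, current_price):
    """Ratio of current to previous price; a missing price counts as 0.

    This convention is an assumption, not taken from the slides.
    """
    prev = previous_price or 0.0
    curr = current_price or 0.0
    if prev == 0.0:
        # No previous price: new item (infinite ratio) or no data at all.
        return math.inf if curr > 0.0 else math.nan
    return curr / prev


def churn_colour(pr):
    """Colour used on the slide for a given price relative."""
    if pr == 0.0:
        return "red"      # PR = 0
    if math.isinf(pr):
        return "green"    # PR = infinity
    if pr > 0.0:
        return "blue"     # finite PR > 0
    return None           # undefined (item absent in both periods)


# Made-up prices: (previous, current); None means the item was not observed.
for prev, curr in [(2.0, None), (2.0, 2.5), (None, 3.0)]:
    pr = price_relative(prev, curr)
    print(prev, curr, pr, churn_colour(pr))
```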

54 Scanner Data for Breakfast Cereals: New items tend to have higher prices

55 Scanner Data for Breakfast Cereals: Many discontinued items have high expenditures

56 Scanner Data for Breakfast Cereals: Effect of item churning

57 Scanner Data for Breakfast Cereals: Removing Store Birth-Death Effects

58 Scanner Data for Breakfast Cereals: Outlier due to price coding error

59 Scanner Data for Breakfast Cereals: Effects of Cereal Types

60 Scanner Data for Breakfast Cereals: Quantity Effects

61 Sands of Time Data
- 300 samples of sand data from Oronsay Island in the Scottish Hebrides

62 Sands of Time - Objective
“The mesolithic shell middens on the island of Oronsay are one of the most important archeological sites in Britain. It is of considerable interest to determine their position with respect to the mesolithic coastline. If the sand below the midden were beach sand and the sand from the upper layers dune sand, this would indicate a seaward shift of the beach-dune interface.” (Flenley and Olbricht, 1993)

63 Sands of Time - Objective
- Cluster samples of modern sand into “beach-like” or “dune-like” sand.
- Classify archeological sand samples as to whether they are beach sand or dune sand.

64 Sands of Time – Parametric Analysis
- Historical strategy is to fit parametric distributions and compare modern and archeological sands based on parameters.
- Weibull, 1933; lognormal (breakage models), log-hyperbolic, log-skew-Laplace, 1937, Barndorff-Nielsen, 1977.
- Models have 2 to 4 parameters; theory developed, practice problematic.

65 Sands of Time - Graphical Analysis
- Multidimensional Parallel Coordinate Display Combined with Grand Tour.
- BRUSH-TOUR strategy
  - Clusters recognized by gaps in any horizontal axis.
  - Brush existing clusters with colors.
  - Execute grand tour until new clusters appear, brush again.
  - Continue until clusters are exhausted.
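The BRUSH-TOUR steps above are carried out visually by an analyst. As a rough automated stand-in, the sketch below splits clusters wherever an axis shows a gap wider than a threshold, rotates the data, and brushes again. The gap threshold `min_gap`, the fixed list of rotation steps (in place of touring until clusters are exhausted), and all function names are assumptions for illustration only.

```python
import math


def split_at_gaps(values, min_gap):
    """Group positions of `values` into runs separated by gaps > min_gap."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    groups, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > min_gap:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups


def brush(points, labels, min_gap):
    """Split every existing cluster at gaps along any axis; return new labels."""
    labels = list(labels)
    next_label = max(labels) + 1
    for axis in range(len(points[0])):
        for label in sorted(set(labels)):
            members = [k for k, l in enumerate(labels) if l == label]
            groups = split_at_gaps([points[k][axis] for k in members], min_gap)
            for group in groups[1:]:  # the first run keeps its colour
                for pos in group:
                    labels[members[pos]] = next_label
                next_label += 1
    return labels


def rotate(points, i, j, angle):
    """Rotate all points by `angle` in the plane of coordinates i and j."""
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for p in points:
        q = list(p)
        q[i] = c * p[i] - s * p[j]
        q[j] = s * p[i] + c * p[j]
        rotated.append(q)
    return rotated


def brush_tour(points, min_gap, steps):
    """Brush, then alternate rotation steps (i, j, angle) with brushing."""
    labels = brush(points, [0] * len(points), min_gap)
    view = [list(p) for p in points]
    for i, j, angle in steps:
        view = rotate(view, i, j, angle)
        labels = brush(view, labels, min_gap)
    return labels


# Two made-up diagonal bands that overlap on both original axes.
band_a = [(float(k), float(k)) for k in range(5)]
band_b = [(float(k + 2), float(k - 1)) for k in range(5)]
labels = brush_tour(band_a + band_b, min_gap=1.5, steps=[(0, 1, math.pi / 4)])
print(labels)
```

In this toy example neither original axis shows a gap, so the first brushing pass finds a single cluster; the two bands receive different labels only after the 45 degree rotation exposes the gap between them.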

66 Mining the Sands of Time

67

68

69

70

71

72

73 Sands of Time - Conclusions
- Sands from the CC site and the CNG site have considerably different particle size distributions and cannot be effectively aggregated.
- Data at small and at large particle dimensions is too quantized to be used effectively.
- The visual based BRUSH-TOUR strategy is extremely effective at clustering.

74 Sands of Time - Conclusions Continued
- Midden sands are neither modern beach sands nor modern dune sands.
- Midden sands are more similar to modern dune sands.
- This result does not support the seaward-shift-of-the-beach-dune-interface hypothesis, but suggests the middens were always in the dunes.

