Visualizing and Exploring Data


1 Visualizing and Exploring Data
Summary statistics for data (mean, median, mode, quartile, variance, skewness)
Distribution of values for single variables (histogram, smoothing, box and whisker plot)
Relationships between pairs of variables (scatterplot, contour plot)
Relationships between multiple variables (scatterplot matrix, trellis plotting, star icons, parallel coordinates)
Projection pursuit methods (principal component analysis)
Parallel coordinates plots

2 Summarizing data
Mean: $\mu = \frac{1}{n} \sum_i x_i$
Median: the value that has an equal number of data points above and below it.
Quartile: the first quartile is the value that is greater than a quarter of the data points.
Variance: $\sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2$
Skewness: measures whether or not a distribution has a single long tail, $\sum_i (x_i - \mu)^3 / \left( \sum_i (x_i - \mu)^2 \right)^{3/2}$
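The formulas on this slide can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the sample data are illustrative assumptions, and the skewness is computed exactly as written above (without any additional scaling by the sample size).

```python
import statistics


def summarize(xs):
    """Return mean, median, variance and skewness as defined on the slide.

    `summarize` is a hypothetical helper name, not part of the original deck.
    Assumes xs is non-empty and not all values are equal.
    """
    n = len(xs)
    mean = sum(xs) / n
    # Sums of squared and cubed deviations from the mean.
    dev2 = sum((x - mean) ** 2 for x in xs)
    dev3 = sum((x - mean) ** 3 for x in xs)
    return {
        "mean": mean,
        "median": statistics.median(xs),
        "variance": dev2 / n,            # population variance, 1/n
        "skewness": dev3 / dev2 ** 1.5,  # positive for a long right tail
    }


sample = [1, 2, 3, 4, 10]  # illustrative data with one large value
stats = summarize(sample)
```

A positive skewness here reflects the single large value pulling the right tail out; the median stays at the middle observation while the mean is dragged upward.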

3 Histogram (Microsoft Excel)

4 Smoothing estimates
The contribution of a data point $x_i$ to the estimate at some point $x^*$ depends on $K((x^* - x_i)/h)$.
$K()$ … kernel function with $\int K(t)\,dt = 1$, e.g. the normal (Gaussian) density
$h$ … bandwidth
The estimated density at point $x^*$ is $\hat{f}(x^*) = \frac{1}{nh} \sum_i K((x^* - x_i)/h)$
Example (Xgobi koule.txt -var2 )
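A kernel estimate of this kind might be sketched as follows, assuming a standard Gaussian kernel. The names `gaussian_kernel` and `kde`, the data, and the bandwidth are illustrative assumptions. The sketch normalizes by n·h, so that the estimate integrates to one whenever the kernel itself integrates to one.

```python
import math


def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kde(x_star, data, h):
    """Estimated density at x_star from the points in data with bandwidth h.

    Hypothetical helper: each data point contributes K((x_star - x_i) / h).
    """
    n = len(data)
    total = sum(gaussian_kernel((x_star - xi) / h) for xi in data)
    return total / (n * h)


data = [0.0, 1.0, 5.0]  # illustrative sample
h = 0.5                 # illustrative bandwidth

# Crude Riemann sum over a wide interval to check the normalization.
step = 0.01
area = sum(kde(-10.0 + i * step, data, h) * step for i in range(2500))
```

A larger `h` spreads each point's contribution more widely and gives a smoother curve; a smaller `h` follows the individual points more closely.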

5 Box and whisker plots
Upper and lower boundaries of each box represent the first and third quartiles.
The horizontal line within each box represents the median.
The whiskers extend 1.5 times the interquartile range from the end of each box.
All data points outside the whiskers are plotted individually.
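The quantities behind a box and whisker plot can be computed without drawing anything. The sketch below uses `statistics.quantiles` from the Python standard library; the function name `box_plot_stats`, the choice of the "inclusive" quantile method, and the sample data are assumptions made for illustration (other quartile conventions give slightly different box boundaries).

```python
import statistics


def box_plot_stats(xs):
    """Quartiles, whisker limits and individually plotted points.

    Hypothetical helper; the whisker limits sit 1.5 interquartile ranges
    beyond the ends of the box, as described on the slide.
    """
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outside = [x for x in xs if x < lower or x > upper]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_limits": (lower, upper),
        "outside": outside,  # points drawn one by one
    }


box = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

In this illustrative sample only the value 50 falls outside the whisker limits, so it would be the single point plotted individually.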

6 Scatterplot
Two variables at a time, one point for each data record.
Example (Xgobi koule.txt )
Scatterplots can reveal anomalies and shortcomings in data. Example: changes in the measured weight of children in summer and winter periods.
Problems:
1. With many points, the plot may turn into a solid black rectangle.
2. Overprinting can conceal the strength of a correlation. One solution is the contour plot, with contour lines like those on a topographic map.
3. Only two dimensions can be shown.

7 More than two variables
Scatterplot matrix: multivariate data are projected into two-dimensional plots (in each plot, all other variables are ignored). Example: Crystal Vision pollen.data
Trellis plot: a series of scatterplots conditioned on levels of one or more other variables.
Brushing: lets the user highlight corresponding points across plots.
Star icons: different directions from the origin correspond to different variables. The lengths correspond to the magnitudes.

8 Interactive graphics
Rotating the directions of projection in search of structure: random rotations, manual rotations. Example (Xgobi koule )
Projection pursuit methods: letting the computer search for interesting directions using a criterion. Example (Xgobi krychle )
A special case is principal component analysis.

9 Principal component analysis
Assumption: the data lie in a two-dimensional linear subspace spanned by linear combinations of the measured variables.
Criterion for an interesting direction: a plane for which the sum of squared distances between the data points and their projections onto this plane is minimized.
Solution in polynomial time: the plane is spanned by the linear combination of variables that has maximum sample variance and the linear combination that has maximum variance subject to being uncorrelated with the first linear combination.

10 Principal component analysis
$X$ … $n \times p$ data matrix, rows are data cases
$a$ … $p \times 1$ column vector of projection weights
$a^T x$ … projection of a vector $x$
$Xa$ … projected values of all data vectors
$\sigma_a^2 = (Xa)^T (Xa) = a^T V a$ … variance along $a$, where $V = X^T X$ (for mean-centered $X$)
Maximize the variance under the normalization constraint $a^T a = 1$, i.e. $\max_a \; a^T V a - \lambda (a^T a - 1)$
This reduces to the eigenvalue form $(V - \lambda I) a = 0$
The first principal component $a$ is the eigenvector associated with the largest eigenvalue. The second principal component is the eigenvector associated with the second largest eigenvalue, etc.
Scree plot … amount of variance explained by each consecutive eigenvalue
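The eigenvalue formulation can be tried on a small data set. The sketch below is one possible way to do it in pure Python: it centers the data, forms $V = X^T X$, and finds the leading eigenvectors by power iteration with deflation. Power iteration is a stand-in chosen here to avoid external libraries, not a method named on the slides, and all function names and the sample data are illustrative assumptions.

```python
import math


def center(X):
    """Subtract the column means so each variable has mean zero."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def scatter_matrix(X):
    """V = X^T X for an already centered data matrix X."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]


def mat_vec(V, a):
    return [sum(v * x for v, x in zip(row, a)) for row in V]


def top_eigenpair(V, iters=500):
    """Largest eigenvalue and unit eigenvector of symmetric V (power iteration).

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(V)
    a = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = mat_vec(V, a)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        a = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(a, mat_vec(V, a)))  # a^T V a
    return lam, a


def principal_components(X, k=2):
    """First k (eigenvalue, weight vector) pairs, largest variance first."""
    V = scatter_matrix(center(X))
    p = len(V)
    result = []
    for _ in range(k):
        lam, a = top_eigenpair(V)
        result.append((lam, a))
        # Deflation: remove the component just found before searching again.
        V = [[V[i][j] - lam * a[i] * a[j] for j in range(p)] for i in range(p)]
    return result


# Illustrative data lying close to the line y = 2x.
X = [[1, 2], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.8]]
(lam1, a1), (lam2, a2) = principal_components(X)
```

Listing the eigenvalues in decreasing order (`lam1`, `lam2`, ...) gives the values that a scree plot would display.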

11 Example (Huba et al. 1981)
Data on 1684 students in LA showing consumption of 13 legal and illegal psychoactive substances.
The weights of the first principal component were: cigarettes 0.278, beer 0.286, wine 0.265, spirits 0.318, cocaine 0.208, tranquilizers 0.293, medications 0.176, heroin 0.202, marijuana 0.339, hashish 0.329, inhalants 0.276, hallucinogens 0.248, amphetamines 0.329.
This component is a measure of how often students use psychoactive substances, regardless of which substance they use.
The weights of the second principal component were: 0.280, 0.396, 0.392, 0.325, -0.288, -0.259, -0.189, -0.315, 0.163, -0.050, -0.169, -0.329, -0.232.
It gives positive weights to legal substances and negative weights to illegal ones. Once overall substance use is controlled for, the major difference lies in legal versus illegal use.

12 General Multidimensional Scaling
A crumpled piece of paper is two-dimensional, but principal component analysis would fail on it.
The goal of scaling methods: preserving distances in a lower-dimensional space.
Methods differ in:
the distances that are to be preserved … $\delta_{jk}$
the distances they map to … $d_{jk}$
how the calculations are performed

13 General Multidimensional Scaling
The most common distance measure is the Euclidean metric.
A common score function is the stress: $\left( \sum_j \sum_k (\delta_{jk}^2 - d_{jk}^2)^2 / \sum_j \sum_k d_{jk}^2 \right)^{1/2}$
The methods may start from distances between data vectors (metric scaling) or from a rank order or a monotonic relationship (non-metric scaling).
The methods can be iterative: 1) regression of distances and 2) minimization of the stress.
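To make the score function concrete, the sketch below evaluates the stress for a given low-dimensional configuration, following the formula on this slide term by term with Euclidean distances. It only scores a configuration and does not perform the iterative minimization. The function names and the toy points are illustrative assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[euclidean(p, q) for q in points] for p in points]


def stress(delta, d):
    """Stress between original distances delta and mapped distances d.

    Hypothetical helper implementing the slide's formula as written;
    assumes at least two mapped points are distinct.
    """
    n = len(delta)
    num = sum((delta[j][k] ** 2 - d[j][k] ** 2) ** 2
              for j in range(n) for k in range(n))
    den = sum(d[j][k] ** 2 for j in range(n) for k in range(n))
    return math.sqrt(num / den)


def drop_last(points):
    """A naive 3-D to 2-D mapping: discard the last coordinate."""
    return [p[:-1] for p in points]


flat = [(0, 0, 0), (3, 4, 0), (6, 8, 0)]   # already lies in a plane
bent = [(0, 0, 0), (0, 0, 1), (1, 0, 0)]   # loses information when flattened

stress_flat = stress(pairwise(flat), pairwise(drop_last(flat)))
stress_bent = stress(pairwise(bent), pairwise(drop_last(bent)))
```

The first configuration keeps every distance when flattened, so its stress is zero; the second collapses two points onto each other and is penalized accordingly.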

14 Parallel coordinates plots
Variables are drawn as parallel axes.
Each data case is a piecewise linear plot connecting the values of the case.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85, 664-675.
Example (Crystal Vision krychle.data )

