Multidimensional data processing. Multivariate data consist of several variables for each observation. Actually, serious data is always multivariate.

1 Multidimensional data processing

2 Multivariate data consist of several variables for each observation. In practice, serious data are always multivariate. Some variables are usually left out to simplify collection and processing, but removing variables before data analysis leads to information loss, and information that was never recorded cannot be recovered. One of the most common tasks is clustering or classification.

3 Classification:
  target classes are known
  properties of target classes are usually unknown
  goal: find rules that separate the observed data into the target classes
Clustering:
  target classes are unknown
  goal: find observations with common properties, which may (or may not) represent classes in the real world
  a difficult situation
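
The contrast between the two tasks on slide 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the presentation: the one-dimensional sample values, the labels "small" and "large", and the helper names fit_threshold and two_means are all hypothetical, and the midpoint rule and 2-means grouping are just one simple choice for each task.

```python
def fit_threshold(values, labels):
    """Classification: labels are known, so we learn a rule that separates them.

    Assumes exactly two distinct labels. The rule is a threshold halfway
    between the two class means (an illustrative choice, not the only one).
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    means = {}
    for c in classes:
        members = [v for v, lab in zip(values, labels) if lab == c]
        means[c] = sum(members) / len(members)
    low, high = sorted(classes, key=lambda c: means[c])
    threshold = (means[low] + means[high]) / 2
    return lambda v: high if v > threshold else low


def two_means(values, iterations=20):
    """Clustering: no labels; group values around two centers (1-D k-means, k=2)."""
    c0, c1 = min(values), max(values)
    g0, g1 = list(values), []
    for _ in range(iterations):
        g0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        g1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not g1:  # all values identical: only one group exists
            break
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1


# Hypothetical measurements; the labels are used only by the classifier.
values = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

rule = fit_threshold(values, labels)   # needs the known classes
groups = two_means(values)             # never sees the labels
```

The classifier returns a named class for a new value, while the clustering step only returns anonymous groups whose meaning still has to be interpreted.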

6 We are trying to extract information from data (measurements, observations, surveys).
Data preparation:
  data adjustment: removal of invalid or incomplete observations/measurements
  normalization? Best handled when collecting the data
Extracting information:
  we know what we are looking for: testing a hypothesis
  we are trying to discover something new: data exploration
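
The data adjustment step on slide 6 (removing invalid or incomplete observations) might look like the following minimal sketch. The sample rows and the helper names is_valid and drop_incomplete are hypothetical, and treating None and NaN as "missing" is an assumption; real data sets may encode missing values differently.

```python
import math


def is_valid(value):
    """A value is usable if it is present and is not NaN."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def drop_incomplete(rows):
    """Keep only the observations in which every variable has a usable value."""
    return [row for row in rows if all(is_valid(v) for v in row)]


# Hypothetical observations: (height_cm, weight_kg)
raw = [(180.0, 75.0), (165.0, None), (float("nan"), 60.0), (172.0, 68.0)]
clean = drop_incomplete(raw)
```

Dropping whole observations is the simplest policy; it also discards the valid values in those rows, which is one form of the information loss mentioned on slide 2.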

7 Preliminary analysis of the data:
  gives a better understanding of its characteristics
  allows us to select the right tools for preprocessing or analysis
  wrong tools may yield invalid information or hide important patterns
Also known as Exploratory Data Analysis (EDA):
  a different approach: a shift in mindset is required
  concentrates on the larger view
  in use since 1977
  also known as visual data mining

8 Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962

10 Steps:
  maximize insight into a data set
  uncover underlying structure
  extract important variables
  detect outliers and anomalies
  test underlying assumptions
  develop minimalistic models
  determine optimal property settings
EDA relies heavily on graphics, because numbers are very abstract.

11 Characteristics:
  N = 11
  Mean of X = 9.0
  Mean of Y = 7.5
  Intercept = 3
  Slope = 0.5
  Residual standard deviation = 1.237
  Correlation = 0.816
Have we realized something important?

      X       Y
  10.00    8.04
   8.00    6.95
  13.00    7.58
   9.00    8.81
  11.00    8.33
  14.00    9.96
   6.00    7.24
   4.00    4.26
  12.00   10.84
   7.00    4.82
   5.00    5.68
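
The characteristics on slide 11 can be recomputed from the listed (X, Y) pairs, which match the first data set of Anscombe's quartet. The sketch below uses ordinary least squares and assumes the residual standard deviation is computed with n - 2 degrees of freedom; the function name summarize is hypothetical.

```python
import math

# (X, Y) pairs exactly as listed on slide 11.
data = [
    (10.0, 8.04), (8.0, 6.95), (13.0, 7.58), (9.0, 8.81),
    (11.0, 8.33), (14.0, 9.96), (6.0, 7.24), (4.0, 4.26),
    (12.0, 10.84), (7.0, 4.82), (5.0, 5.68),
]


def summarize(pairs):
    """Means, least-squares line, residual standard deviation, correlation."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in pairs)
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": intercept,
        "residual_sd": math.sqrt(sse / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }


stats = summarize(data)
for name, value in stats.items():
    print(f"{name:12s} {value:.3f}")
```

The numbers alone say little about the shape of the data, which is the point of the slide's question: a plot is needed to see the structure.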

13 Run-sequence plot
  similar to a line chart in Excel
  reveals shifts in variation, shifts in location, and outliers
Histogram
  shows center, spread, skew, multimodality, and outliers
  very useful: know how to create it!
  nice presentations (e.g. word cloud, tag cloud)
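
Since the slide stresses knowing how to create a histogram, here is a minimal sketch of the binning step with equal-width bins. The sample values, the bin count, and the names histogram and render are hypothetical; how many bins to use is a judgment call that this sketch does not address.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max].

    Returns (edges, counts). The last bin includes the maximum value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def render(edges, counts):
    """Draw the histogram as rows of '#' characters, one row per bin."""
    lines = []
    for i, count in enumerate(counts):
        lines.append(f"[{edges[i]:6.2f}, {edges[i + 1]:6.2f}) {'#' * count}")
    return "\n".join(lines)


# Hypothetical sample with one value far from the rest.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(sample, 4)
print(render(edges, counts))
```

In the rendered bars the bulk of the data sits in the first two bins and the isolated value 9 appears alone in the last bin, which is how a histogram exposes center, spread, skew, and outliers.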

14 Check whether the data set is random or not
  random data should have no observable structure
  lag = fixed time displacement; it can be arbitrary, but the most common is 1
  patterns to observe: weak autocorrelation, strong autocorrelation, sinusoidal model, outliers
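
The idea of a fixed lag can be sketched as follows: a lag plot draws each value against the value one lag later, and the lag-k autocorrelation summarizes how strongly those pairs move together. The helper names lag_pairs and autocorrelation and the two sample series are hypothetical; the autocorrelation uses the common definition that divides by the overall sum of squared deviations.

```python
def lag_pairs(series, lag=1):
    """Pairs (x[t], x[t + lag]): the points a lag plot would draw."""
    return list(zip(series[:-lag], series[lag:]))


def autocorrelation(series, lag=1):
    """Lag-k sample autocorrelation of a series."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical series: a steady trend and a strictly alternating signal.
trend = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(lag_pairs(trend, 1)[:3])
print(autocorrelation(trend, 1))
print(autocorrelation(alternating, 1))
```

Values near zero suggest randomness at that lag, while values far from zero in either direction indicate structure that a lag plot would show as a visible pattern.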

18 One dimension: piece of cake (pie)
Two dimensions: still easy, Cartesian coordinate system
Three dimensions: still doable in a Cartesian system
Four and more dimensions: only Chuck Norris can do that in a Cartesian system
  other types of visualization are required
  some may be useful only for some types of data

20 Understanding the data is very important
  good visualization can help us understand the information it contains
  results need to be presented to other people
  sanity check, intuition: people capture patterns that automated methods miss
Some options:
  bubble chart (3-dimensional scatter plot)
  scatter plot array
  star plot, Radviz, Polyviz
  parallel coordinates

21 Bubble chart, also called a 3-dimensional scatter plot
  2 data dimensions: graph X and Y
  3rd dimension: point size
  optional 4th dimension: point color
Advantages:
  allows us to uncover clusters and variable dependencies
  easy to understand
Disadvantages:
  different combinations need to be tried
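
The mapping described on slide 21 can be sketched without any plotting library: two variables become coordinates, a third is rescaled to a marker area, and an optional fourth is rescaled to a color value in [0, 1]. The sample rows, the area range of 20 to 400, and the names scale and bubble_chart_points are assumptions made for illustration only.

```python
def scale(values, out_min, out_max):
    """Linearly rescale values so that min -> out_min and max -> out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; use the middle value.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) / (hi - lo) * (out_max - out_min) for v in values]


def bubble_chart_points(rows, min_area=20.0, max_area=400.0):
    """Turn rows of (x, y, size_value, color_value) into drawable points.

    Returns tuples (x, y, marker_area, color_fraction).
    """
    xs, ys, third, fourth = zip(*rows)
    areas = scale(third, min_area, max_area)
    shades = scale(fourth, 0.0, 1.0)
    return list(zip(xs, ys, areas, shades))


# Hypothetical four-variable observations.
rows = [(1.0, 2.0, 10.0, 0.0), (2.0, 1.0, 20.0, 5.0), (3.0, 4.0, 30.0, 10.0)]
points = bubble_chart_points(rows)
```

The resulting tuples could be handed to a scatter-plot routine that accepts per-point sizes and colors. Choosing which variables go to X, Y, size, and color is the "different combinations need to be tried" part.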

23 Scatter plot array: an extension of the common scatter plot
  2-dimensional array of scatter plots
  each combination of variables is drawn (twice)
  variable descriptions are on the diagonal
  easy to create
  messy
  dependencies between more than two variables are still hidden
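
The layout of a scatter plot array can be sketched as a grid of panel descriptions. The variable names are the four axis labels from slide 24; the function name scatter_matrix_layout and the tuple format of each cell are hypothetical, and the actual drawing is left out.

```python
from itertools import combinations


def scatter_matrix_layout(variables):
    """Return a k x k grid describing each panel of a scatter plot array.

    Diagonal cells hold the variable description; every other cell is a
    scatter plot with the column variable on X and the row variable on Y.
    """
    grid = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(("label", row_var))
            else:
                row.append(("scatter", col_var, row_var))
        grid.append(row)
    return grid


variables = ["sepal length", "sepal width", "petal length", "petal width"]
grid = scatter_matrix_layout(variables)

panels = [cell for row in grid for cell in row if cell[0] == "scatter"]
unique_pairs = list(combinations(variables, 2))
print(len(panels), len(unique_pairs))
```

With four variables there are 12 scatter panels for 6 distinct pairs, so each combination appears twice (once with the axes swapped), and every panel still shows only two variables at a time.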

24 [Figure: scatter plot array with panels labeled Sepal length, Sepal width, Petal length, Petal width]

25 Axes radiate from a central point
Star plot
  values of a data point are connected to form a polygon
  can display only a small number of points
  order of variables may be important
Radviz
  values of a data point act as spring stiffness
  values are normalized into a common interval
  the object is placed at the equilibrium of all forces
  order of variables becomes very important
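
The Radviz placement can be sketched from the spring description on the slide: if each variable pulls the point toward its anchor with a stiffness equal to the normalized value, the equilibrium is the stiffness-weighted average of the anchor positions. This sketch assumes anchors spaced evenly on the unit circle and min-max normalization into [0, 1]; the slide does not state the interval, and the names normalize_columns and radviz_point are hypothetical.

```python
import math


def normalize_columns(rows):
    """Min-max scale every variable (column) into [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.5 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def radviz_point(weights):
    """Equilibrium position of one observation.

    Anchor i sits at angle 2*pi*i/n on the unit circle, so the order of
    the weights decides which variable pulls in which direction.
    """
    n = len(weights)
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls at all: leave the point in the center
    x = sum(w * math.cos(2 * math.pi * i / n) for i, w in enumerate(weights)) / total
    y = sum(w * math.sin(2 * math.pi * i / n) for i, w in enumerate(weights)) / total
    return (x, y)


# The same two non-zero values, assigned to different anchors.
adjacent = radviz_point([1.0, 1.0, 0.0, 0.0])
opposite = radviz_point([1.0, 0.0, 1.0, 0.0])
print(adjacent, opposite)
```

With the two large values on neighboring anchors the point lands between them, while on opposite anchors the pulls cancel and the point falls in the center, which illustrates why the order of variables matters so much.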

26 Iris-virginica

27 Iris-versicolor

28 Iris-setosa

31 Polyviz: similar principle to Radviz
  data points are not attracted to a single point
  data points are attracted to an axis
  the circle becomes a polygon → Polyviz
  order of variables is less important
  polygon edges become very important
  candidates for classification rules
  different combinations of variables
  exact position of a point is displayed: no information loss

36 Advantages:
  determine correlation between variables, both positive and negative
  determine partial correlations: only some values of one variable are correlated with some values of another variable (very important)
Disadvantages:
  dependent on variable ordering
  not that useful without interactive software
  may be hard to understand for newbies
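
Assuming this slide evaluates parallel coordinates (the last technique listed on slide 20), the construction can be sketched as follows: each variable gets its own vertical axis scaled to [0, 1], and each observation becomes a polyline across the axes. Between two neighboring axes, mostly parallel lines suggest a positive correlation and many crossings suggest a negative one. The sample rows and the names to_polylines and count_crossings are hypothetical.

```python
def to_polylines(rows):
    """Scale each variable to [0, 1] and return one polyline per observation.

    Each polyline is a list of (axis_index, scaled_value) vertices.
    """
    scaled_columns = []
    for column in zip(*rows):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.5 for v in column])
    return [list(enumerate(values)) for values in zip(*scaled_columns)]


def count_crossings(polylines, axis):
    """Count pairs of polylines that cross between axis and axis + 1."""
    crossings = 0
    for i in range(len(polylines)):
        for j in range(i + 1, len(polylines)):
            a0, a1 = polylines[i][axis][1], polylines[i][axis + 1][1]
            b0, b1 = polylines[j][axis][1], polylines[j][axis + 1][1]
            # The segments cross if their vertical order flips between the axes.
            if (a0 - b0) * (a1 - b1) < 0:
                crossings += 1
    return crossings


# Hypothetical data: the second variable rises with the first, the third falls.
rows = [(1, 10, 30), (2, 20, 20), (3, 30, 10)]
lines = to_polylines(rows)
print(count_crossings(lines, 0), count_crossings(lines, 1))
```

Only neighboring axes are compared directly, so the relationship between the first and third variable is visible only after reordering the axes. This is the dependence on variable ordering listed among the disadvantages, and the reason interactive reordering helps.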

37 Exploratory data analysis: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
Have a look at the graphical techniques: http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
Orange Canvas, open-source data mining: http://orange.biolab.si/
  interface similar to IBM Clementine (SPSS Modeler)
  widget documentation: http://orange.biolab.si/doc/widgets/
Sample data:
  http://archive.ics.uci.edu/ml/index.html
  http://www-958.ibm.com/software/data/cognos/manyeyes/

