Exploratory Data Analysis. Dr. Lutz Hamel, Dr. Joan Peckham, Venkat Surapaneni.


1 Exploratory Data Analysis
Dr. Lutz Hamel, Dr. Joan Peckham, Venkat Surapaneni

2 Classical Data Analysis
Data → Model → Analysis

3 Exploratory Data Analysis (EDA)
Data → Analysis → Model

4 EDA (4/21/03)
EDA is an approach/philosophy that employs a variety of techniques (mostly graphical) to:
1. Maximize insight into a data set
2. Uncover underlying structure
3. Detect outliers and anomalies

5 Data analysis
-- Data preparation
-- Exploring, displaying and examining data
-- Data mining: extract patterns and build predictive models

6 Web-based EDA Tool
Data preparation
-- Missing value treatment
-- Data transformation
Data visualization
-- Histograms
-- Box-whisker plots
Clustering
-- K-means clustering

7 Architecture
Client: interface components
Input (client to web server): graphs, replace missing values, data transformation, clustering, ...
Output (web server to client): output or new dataset after computations
Server: database, computation components

8 Data Preparation

9 Replacing Missing Values
Numeric values:

Position    Original sample   Position 11 missing   Preserve mean   Preserve std. dev.
1           0.0886            0.0886                0.0886          0.0886
2           0.0684            0.0684                0.0684          0.0684
3           0.3515            0.3515                0.3515          0.3515
4           0.9874            0.9874                0.9874          0.9874
5           0.4713            0.4713                0.4713          0.4713
6           0.6115            0.6115                0.6115          0.6115
7           0.2573            0.2573                0.2573          0.2573
8           0.2914            0.2914                0.2914          0.2914
9           0.1662            0.1662                0.1662          0.1662
10          0.4400            0.4400                0.4400          0.4400
11          0.6939            ??????                0.3731          0.6622

Mean        0.4023            0.3731                0.3731          0.3994
Std. dev.   0.2753            0.2753                0.2612          0.2753

Size of error in estimate                           0.3208          0.0317
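The two replacement strategies on this slide can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function names are hypothetical, and the closed form used for the "preserve std. dev." case (mean plus s times sqrt((n + 1) / n)) is an assumption that happens to land close to the slide's 0.6622. Because the slide's values are rounded to four decimals, recomputed statistics can differ from the slide in the last digits.

```python
from math import sqrt
from statistics import mean, stdev

# First ten values of the example sample; position 11 is treated as missing.
observed = [0.0886, 0.0684, 0.3515, 0.9874, 0.4713,
            0.6115, 0.2573, 0.2914, 0.1662, 0.4400]
true_value = 0.6939  # original value at position 11


def fill_preserving_mean(values):
    """Replacement that leaves the sample mean unchanged."""
    return mean(values)


def fill_preserving_stdev(values):
    """Replacement that leaves the sample standard deviation unchanged.

    Adding x to n values with mean m raises the sum of squared deviations
    by (x - m)^2 * n / (n + 1). Requiring an unchanged sample variance
    gives x = m + s * sqrt((n + 1) / n). Taking the '+' root is an
    arbitrary choice; the '-' root preserves the deviation equally well.
    """
    n = len(values)
    return mean(values) + stdev(values) * sqrt((n + 1) / n)


mean_fill = fill_preserving_mean(observed)
sd_fill = fill_preserving_stdev(observed)

for label, fill in [("preserve mean", mean_fill),
                    ("preserve std. dev.", sd_fill)]:
    completed = observed + [fill]
    print(f"{label}: fill={fill:.4f} mean={mean(completed):.4f} "
          f"sd={stdev(completed):.4f} error={abs(true_value - fill):.4f}")
```

Each strategy keeps one statistic fixed and lets the other drift, which is the trade-off the table shows.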

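10 Replacing Missing Values
Nonnumeric or alpha values:
-- We have an alpha variable ‘Z’
-- It takes as values 20 a’s, 50 b’s and 30 c’s
-- Lay the values out on a scale from 1 to 100: a covers 1 to 20, b covers 21 to 70, c covers 71 to 100
-- Pick a number between 1 and 100 at random and use the value whose range contains it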
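A minimal sketch of this random pick, assuming the ranges are cumulative counts (a up to 20, b up to 70, c up to 100). The names `value_for` and `draw_replacement` are hypothetical, and the fixed seed is there only to make the run repeatable.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Observed counts of the alpha variable Z, taken from the slide.
counts = {"a": 20, "b": 50, "c": 30}

labels = list(counts)
# Cumulative upper bound of each range: a -> 20, b -> 70, c -> 100.
upper_bounds = list(accumulate(counts.values()))


def value_for(number):
    """Map a number in 1..total to the value whose range contains it."""
    return labels[bisect_left(upper_bounds, number)]


def draw_replacement(rng):
    """Draw one replacement in proportion to the observed counts."""
    return value_for(rng.randint(1, upper_bounds[-1]))


rng = random.Random(0)  # fixed seed, for a repeatable illustration only
print([draw_replacement(rng) for _ in range(10)])
```

Drawing this way fills gaps with each value in roughly the proportion in which it already occurs, so the replacement does not distort the variable's distribution much.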

11 Data Transformation
Converting alpha variables into numeric variables:

Data set containing alpha variable ‘Z’ before transforming
S.No   X    Y    Z
1      10   11   a
2      12   23   b
3      11   14   a

Data set containing pseudo variables ‘a’ and ‘b’ after transforming
S.No   X    Y    a   b
1      10   11   1   0
2      12   23   0   1
3      11   14   1   0

Option to add or delete columns
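The transformation above can be sketched in a few lines. This is an illustration only: `to_pseudo_variables` is a hypothetical helper, and rows are assumed to be plain dictionaries keyed by column name.

```python
def to_pseudo_variables(rows, column):
    """Replace one alpha column with a 0/1 pseudo variable per distinct value."""
    # Distinct values of the column, in order of first appearance.
    levels = list(dict.fromkeys(row[column] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the one being transformed.
        new_row = {key: val for key, val in row.items() if key != column}
        for level in levels:
            new_row[level] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"S.No": 1, "X": 10, "Y": 11, "Z": "a"},
    {"S.No": 2, "X": 12, "Y": 23, "Z": "b"},
    {"S.No": 3, "X": 11, "Y": 14, "Z": "a"},
]

encoded = to_pseudo_variables(rows, "Z")
for row in encoded:
    print(row)
```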

12 Data Visualization

13 Histogram
A graph of a frequency distribution
Summarizes the distribution of a data set:
-- locates the center of the distribution
-- shows the spread of the data
-- shows the overall shape of the distribution
Example dataset
[65,82,87,94,96,91,75,69,67,98,85,100,89,77,46,76,70,54,92,70,85,88,74,82,90,87,78,89,70,96,79,83,83,94,88,93,59,80,84,72]

14 Frequency distribution
Class       Frequency
41 – 50     1
51 – 60     2
61 – 70     6
71 – 80     8
81 – 90     14
91 – 100    9
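A short sketch of how the class frequencies can be tallied and drawn as a text histogram. The data list holds the 40 values that the frequency table accounts for, which is an assumption about the intended dataset. The helper `class_bounds` and its defaults (classes of width 10 starting at 41) are hypothetical and chosen to match the classes above.

```python
from collections import Counter

data = [65, 82, 87, 94, 96, 91, 75, 69, 67, 98,
        85, 100, 89, 77, 46, 76, 70, 54, 92, 70,
        85, 88, 74, 82, 90, 87, 78, 89, 70, 96,
        79, 83, 83, 94, 88, 93, 59, 80, 84, 72]


def class_bounds(value, start=41, width=10):
    """Return (low, high) of the class containing value: 41-50, 51-60, ..."""
    low = start + ((value - start) // width) * width
    return (low, low + width - 1)


frequencies = Counter(class_bounds(v) for v in data)

# One row per class, with a bar of asterisks as a crude histogram.
for (low, high), count in sorted(frequencies.items()):
    print(f"{low:3d} - {high:3d}  {count:2d}  {'*' * count}")
```

The bars make the shape visible at a glance: the bulk of the values sits in the 81 to 90 class, with a longer tail toward the low end.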

15 Histograms

16 Box Whisker Plot
Provides a visual image of a distribution’s:
-- Location
-- Symmetry
-- Outliers
Most useful when comparing two or more variables

17 Box Whisker plot components
A five-point summary of one-dimensional data
[Figure labels: upper quartile, median, lower quartile, IQR, upper adjacent value, lower adjacent value, outlier]

18 Example
Dataset: [35,47,48,50,51,53,54,70,75]
Median: 51
1st quartile: 48
3rd quartile: 54
IQR: 54 – 48 = 6
Max. whisker length: 1.5 * 6 = 9
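The figures in this example can be reproduced with Python's standard library, as a sketch. The `method="inclusive"` quartile rule is an assumption: it is the rule that yields the slide's 48 and 54 for this dataset, and other quartile conventions give slightly different values. The fences, adjacent values, and outliers follow the usual 1.5 * IQR convention.

```python
from statistics import median, quantiles

data = sorted([35, 47, 48, 50, 51, 53, 54, 70, 75])

# Quartiles; "inclusive" is assumed because it matches the slide's numbers.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
max_whisker = 1.5 * iqr

# Points beyond the fences are drawn individually as outliers.
lower_fence = q1 - max_whisker
upper_fence = q3 + max_whisker

inside = [x for x in data if lower_fence <= x <= upper_fence]
lower_adjacent, upper_adjacent = min(inside), max(inside)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("median:", median(data))
print("quartiles:", q1, q3, "IQR:", iqr, "max whisker:", max_whisker)
print("adjacent values:", lower_adjacent, upper_adjacent)
print("outliers:", outliers)
```

With a maximum whisker length of 9, the whiskers stop at the adjacent values 47 and 54, and 35, 70 and 75 fall outside the fences.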

19 Example
[Figure: box-whisker plot of the example dataset on an axis from 30 to 80 in steps of 5]

20 Observations
[Figure panels: symmetric, right skewed, left skewed, small spread]

21 Clustering
A technique for grouping similar objects
Clustering starts with an undifferentiated group
Similarities are computed through Euclidean distances
Mutually exclusive clusters are selected to
-- maximize within-cluster similarity
-- maximize between-cluster differences
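A minimal sketch of k-means with Euclidean distances, the clustering method listed for the tool. It is not the tool's implementation: the function name, the sample points, and the choice of starting centroids are hypothetical, and the starting centroids are passed in explicitly so that the run is repeatable.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def k_means(points, centroids, max_iter=100):
    """Lloyd's k-means from given starting centroids (max_iter must be >= 1)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid,
        # so every point belongs to exactly one cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

Results depend on the starting centroids, so in practice the procedure is often run from several starting points and the tightest clustering is kept.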

23 What do we see after all this?
Prepare the data for effective analysis
-- Histograms: center and shape of the distribution
-- Box-whisker plots: spread, symmetry and outliers of the distribution
-- Clustering: identifies related data points in the dataset

24 Now the data is ready for modeling

