# Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 3a, February 4, 2014, SAGE 3101: Preliminary Analysis, Interpretation, Detailed Analysis, Assessment


Contents
- Preliminary data analysis (PDA)
- Interpreting what you get back (the stats, the plots)
- Detailed analyses/fitting – a start
- How to assess/intercompare

Preliminary Data Analysis
- Relates to the sample vs. population (for Big Data) discussion last week
- Also called Exploratory Data Analysis (EDA)
  - "EDA is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe will be there" (John Tukey)
- Distribution analysis and comparison, visual analysis, model testing, i.e. pretty much the things you did last Friday!
- Thus we are going to review those results

Patterns and Relationships
- Stepping from elementary/distribution analysis to algorithm-based analysis
- I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines; non-parametric models
- Relations: associations between/among populations
- Outcome: a model and an evaluation of its fitness for purpose

Models
- Assumptions are often made when considering models, e.g. that a model is representative of the population even though it is so often derived from a sample; this should be starting to make sense (a bit)
- Two key topics:
  - N=all and the open world assumption
  - Model of the thing of interest versus model of the data (data model; structural form)
- "All models are wrong but some are useful" (generally attributed to the statistician George Box)

Conceptual, logical and physical models
- These terms are usually applied to a database. However, our models will be mathematical, statistical, or a combination.
- The concept of the model comes from the hypothesis
- The implementation of the physical model comes from the data ;-)

Art or science?
- The model incorporates the hypothesis, and the hypothesis determines the model's form
- Thus, it is as much art as science, because it depends both on your world view and on what the data is telling you (or not)
- We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

Exploring the distribution

    > summary(EPI) # stats
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      32.10   48.60   59.20   58.37   67.60   93.50      68
    > boxplot(EPI)
    > fivenum(EPI,na.rm=TRUE)
    [1] 32.1 48.6 59.2 67.6 93.5

Tukey: min, lower hinge, median, upper hinge, max
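To make the difference between `summary()` and `fivenum()` concrete, here is a minimal sketch on synthetic data. The vector `x` is a made-up stand-in, not the course's EPI column, so the numbers will not match the slide.

```r
# Synthetic stand-in for EPI (assumption: NOT the course data)
set.seed(42)
x <- c(round(rnorm(160, mean = 58, sd = 12), 1), rep(NA, 8))

s <- summary(x)                  # min, quartiles, mean, max, plus the NA count
f <- fivenum(x, na.rm = TRUE)    # Tukey: min, lower hinge, median, upper hinge, max
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)

print(s)
print(f)
# Hinges and quartiles are computed differently and can disagree slightly
print(rbind(hinges = f[c(2, 4)], quartiles = q))
```

The hinges are medians of the lower and upper halves of the sorted data, whereas `quantile()` interpolates, which is why the two rows printed last need not agree exactly.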

Stem and leaf plot

    > stem(EPI) # like a histogram

      The decimal point is 1 digit(s) to the right of the |

      3 | 234
      3 | 66889
      4 | 00011112222223344444
      4 | 5555677788888999
      5 | 0000111111111244444
      5 | 55666677778888999999
      6 | 000001111111222333344444
      6 | 5555666666677778888889999999
      7 | 000111233333334
      7 | 5567888
      8 | 11
      8 | 669
      9 | 4

Note: the scale of the stem is 10… watch carefully.

Histogram

    > hist(EPI) # defaults

Distributions
- Shape
- Character
- Parameter(s)
- Which one fits?

    > hist(EPI, seq(30., 95., 1.0), prob=TRUE)
    > lines(density(EPI,na.rm=TRUE,bw=1.))
    > rug(EPI)

or

    > lines(density(EPI,na.rm=TRUE,bw="SJ"))

    > hist(EPI, seq(30., 95., 1.0), prob=TRUE)
    > lines(density(EPI,na.rm=TRUE,bw="SJ"))
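The histogram-plus-density overlay can be tried without the course data. The sketch below uses a synthetic two-group sample (an assumption, chosen only to produce a bimodal shape) and compares a fixed bandwidth of 1 with the Sheather-Jones selector. Note that `bw` takes the quoted string `"SJ"`; an unquoted `SJ` would be looked up as an object.

```r
# Synthetic two-group sample standing in for EPI (assumption: NOT the course data)
set.seed(1)
x <- c(rnorm(60, mean = 44, sd = 5), rnorm(140, mean = 63, sd = 6))
x <- x[x >= 30 & x <= 95]        # hist() stops if data fall outside the breaks

breaks <- seq(30, 95, 1)
h  <- hist(x, breaks = breaks, plot = FALSE)
d1 <- density(x, bw = 1)         # fixed bandwidth
d2 <- density(x, bw = "SJ")      # Sheather-Jones bandwidth selector

# On the density scale the bar areas sum to 1
print(sum(h$density * diff(h$breaks)))
print(c(fixed = d1$bw, SJ = d2$bw))

# Draw only in an interactive session
if (interactive()) {
  plot(h, freq = FALSE, main = "Histogram with kernel density estimates")
  lines(d1)
  lines(d2, lty = 2)
  rug(x)
}
```

A very small bandwidth follows the bin-to-bin noise, while a data-driven choice such as `"SJ"` usually gives a smoother curve; comparing the two printed bandwidths shows how different they are for a given sample.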

Why are histograms so unsatisfying?

    > xn<-seq(30,95,1)
    > qn<-dnorm(xn,mean=63,sd=5,log=FALSE)
    > lines(xn,qn)
    > lines(xn,.4*qn)
    > ln<-dnorm(xn,mean=44,sd=5,log=FALSE)
    > lines(xn,.26*ln)
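The scaled curves above are drawn by eye. If the aim is a two-component description of the histogram, one way to keep the overlay a proper density is to give the components weights that sum to 1. A minimal sketch, where the weight `w` is a hypothetical value and not a fit to the EPI data:

```r
xn <- seq(30, 95, 1)
w  <- 0.3    # hypothetical weight of the lower component (not estimated)

# Two-component normal mixture using the means and sd from the slide
mix <- w * dnorm(xn, mean = 44, sd = 5) +
  (1 - w) * dnorm(xn, mean = 63, sd = 5)

# On a 1-unit grid the sum approximates the integral, so it should be near 1
print(sum(mix))

# Overlay on an existing density-scale histogram with: lines(xn, mix)
```

Because the weights sum to 1, `mix` can be compared directly with a histogram drawn using `prob=TRUE`.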

ELand ~ EPI !Landlock

    > hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines …

No surface water

    > EPIreg <- EPI_data$EPI[EPI_data$EPI_regions=="Europe"]
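The filters on these slides are plain logical subsetting. The sketch below shows the pattern on a toy data frame. The rows are invented, and the `Landlock` column coded as 0/1 is an assumption about how the course file encodes it.

```r
# Toy data frame (assumption: invented rows, NOT the course file)
EPI_data <- data.frame(
  EPI         = c(70.2, 45.1, NA, 66.8, 52.3),
  EPI_regions = c("Europe", "Sub-Saharan Africa", "Europe", "Europe", "South Asia"),
  Landlock    = c(0, 1, 0, 1, 0)
)

# Keep EPI values for one region
EPIreg <- EPI_data$EPI[EPI_data$EPI_regions == "Europe"]

# Keep EPI values for countries that are not landlocked (!0 is TRUE, !1 is FALSE)
ELand <- EPI_data$EPI[!EPI_data$Landlock]

print(EPIreg)
print(ELand)
```

Missing values survive the subsetting, so later calls still need `na.rm=TRUE` where the function supports it.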

Exploring other distributions

    > summary(DALY) # stats
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       0.00   37.19   60.35   53.94   71.97   91.50      39
    > fivenum(DALY,na.rm=TRUE)
    [1]  0.000 36.955 60.350 72.320 91.500

[Figure panels: EPI, DALY]

Stem and leaf plot

    > stem(DALY)

      The decimal point is 1 digit(s) to the right of the |

      0 | 0000111244
      0 | 567899
      1 | 0234
      1 | 56688
      2 | 000123
      2 | 5667889
      3 | 00001134
      3 | 5678899
      4 | 00011223444
      4 | 555799
      5 | 12223344
      5 | 556667788999999
      6 | 0000011111222233334444
      6 | 6666666677788889999
      7 | 00000000223333444
      7 | 66888999
      8 | 1113333333
      8 | 555557777777777799999
      9 | 22

DALY

    > hist(DALY, seq(0., 99., 1.0), prob=TRUE)
    > lines(density(DALY,na.rm=TRUE,bw=1.))
    > lines(density(DALY,na.rm=TRUE,bw="SJ"))

Beyond histograms

Cumulative distribution function: the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.

    > plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
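The object returned by `ecdf()` is itself a function, which makes the definition easy to check numerically. A minimal sketch on synthetic data (the sample `x` is an assumption, not the EPI column):

```r
# Synthetic sample (assumption: NOT the course data)
set.seed(2)
x <- rnorm(100, mean = 58, sd = 12)

Fn <- ecdf(x)          # step function: Fn(t) is the fraction of x that is <= t

print(Fn(58))          # estimated P(X <= 58)
print(mean(x <= 58))   # the same fraction, computed directly
print(Fn(c(min(x) - 1, max(x))))   # 0 below the data, 1 at the maximum
```

Plotting `Fn` with `do.points=FALSE, verticals=TRUE` just draws this step function as a connected staircase.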

Beyond histograms

Quantile ~ inverse cumulative distribution function: points taken at regular intervals from the CDF, e.g. 2-quantiles = median, 4-quantiles = quartiles.

Quantile-Quantile (versus default = normal distribution):

    > par(pty="s")
    > qqnorm(EPI); qqline(EPI)
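Two small checks make the "inverse CDF" reading and the Q-Q construction concrete. This is a sketch on synthetic data (the sample `x` is an assumption); it reproduces the theoretical axis of `qqnorm()` by hand.

```r
# Synthetic sample (assumption: NOT the course data)
set.seed(3)
x <- rnorm(200, mean = 58, sd = 12)

# The 2-quantile is the median; the 4-quantiles are the quartiles
print(quantile(x, probs = 0.5, names = FALSE))
print(median(x))
print(quantile(x, probs = c(0.25, 0.5, 0.75)))

# For a continuous distribution, the quantile function undoes the CDF
print(qnorm(pnorm(1.3)))

# qqnorm() pairs the sorted data with normal quantiles at ppoints(n)
qq <- qqnorm(x, plot.it = FALSE)
theoretical <- qnorm(ppoints(length(x)))
print(all.equal(sort(qq$x), theoretical))
```

If the sample really is normal, the points (theoretical quantile, sorted value) fall close to the straight line that `qqline()` draws through the quartiles.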

Beyond histograms

Simulated data from a t-distribution (random):

    > x <- rt(250, df = 5)
    > qqnorm(x); qqline(x)

Beyond histograms

Q-Q plot against the generating distribution:

    > x <- seq(30,95,1)
    > qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
    > qqline(x)

DALY (ecdf and qqplot)

Weibull qqplot…

Testing the fits

    > shapiro.test(EPI) # null hypothesis – normal?

        Shapiro-Wilk normality test

    data:  EPI
    W = 0.9866, p-value = 0.1188

Interpretation: W and the p-value. Reject the null hypothesis or not? Here: ~no (p-value > 0.05).

DALY: W = 0.9365, p-value = 1.891e-07 (reject)
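The reject / do-not-reject reading can be written as a small decision rule. The sketch below applies it to two synthetic samples (both are assumptions, not the course data), and the 0.05 threshold is the conventional choice rather than something the test prescribes.

```r
# Two synthetic samples (assumption: NOT the course data)
set.seed(4)
samples <- list(
  normal_like = rnorm(150, mean = 58, sd = 12),
  skewed      = rexp(150, rate = 1 / 20)
)

alpha   <- 0.05                        # conventional significance level
results <- lapply(samples, shapiro.test)

for (name in names(results)) {
  p <- results[[name]]$p.value
  decision <- if (p < alpha) "reject normality" else "do not reject normality"
  cat(sprintf("%-12s W = %.4f, p = %.3g -> %s\n",
              name, results[[name]]$statistic, p, decision))
}
```

Failing to reject does not prove the data are normal; it only means this sample gives no strong evidence against normality.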

Kolmogorov–Smirnov

One-sample or two-sample:

    > ks.test(EPI,seq(30.,95.,1.0))

        Two-sample Kolmogorov-Smirnov test

    data:  EPI and seq(30, 95, 1)
    D = 0.2507, p-value = 0.005451
    alternative hypothesis: two-sided

    Warning message:
    In ks.test(EPI, seq(30, 95, 1)) :
      p-value will be approximate in the presence of ties

D = distance between the ECDF of the sample (blue) and the reference CDF (red) in the one-sample case. The p-value is what matters: do not reject the null hypothesis if p-value > 0.05.
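For reference, the statistic D has a compact definition. The symbols $F_n$, $G_m$ and $F$ are introduced here for illustration and do not appear on the slides: $F_n$ and $G_m$ are the empirical CDFs of samples of size $n$ and $m$, and $F$ is a fully specified reference CDF.

$$
D_n = \sup_x \left| F_n(x) - F(x) \right| \quad \text{(one-sample)}, \qquad
D_{n,m} = \sup_x \left| F_n(x) - G_m(x) \right| \quad \text{(two-sample)}
$$

Because the second argument in the call above is a numeric vector rather than the name of a distribution function, R treats it as a second sample, which is why the output is headed "Two-sample".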

Variability in normal distributions

F-test

$$F = \frac{S_1^2}{S_2^2}$$

where $S_1^2$ and $S_2^2$ are the sample variances. The more this ratio deviates from 1, the stronger the evidence for unequal population variances.

    > var.test(EPI,DALY)

        F test to compare two variances

    data:  EPI and DALY
    F = 0.2393, num df = 162, denom df = 191, p-value < 2.2e-16
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.1781283 0.3226470
    sample estimates:
    ratio of variances
             0.2392948
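The F statistic and its two-sided p-value can be reproduced by hand, which helps when reading the `var.test()` output. The sketch below uses synthetic samples (an assumption, not the course data) whose sizes are chosen to give the same degrees of freedom as above, 162 and 191.

```r
# Synthetic samples (assumption: NOT the course data)
set.seed(5)
a <- rnorm(163, mean = 58, sd = 12)
b <- rnorm(192, mean = 54, sd = 25)

F_manual <- var(a) / var(b)            # ratio of sample variances
df1 <- length(a) - 1
df2 <- length(b) - 1

# Two-sided p-value: twice the smaller tail of the F distribution
tail_low <- pf(F_manual, df1, df2)
p_manual <- 2 * min(tail_low, 1 - tail_low)

vt <- var.test(a, b)
print(c(manual = F_manual, var.test = unname(vt$statistic)))
print(c(manual = p_manual, var.test = vt$p.value))
```

The F-test assumes both samples are normal, so it is worth checking that first (for example with a Q-Q plot or the Shapiro-Wilk test) before leaning on the p-value.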

T-test

Comparing distributions

    > t.test(EPI,DALY)

        Welch Two Sample t-test

    data:  EPI and DALY
    t = 2.1361, df = 286.968, p-value = 0.03352
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     0.3478545 8.5069998
    sample estimates:
    mean of x mean of y
     58.37055  53.94313
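The non-integer degrees of freedom in this output come from the Welch form of the test, which R uses by default because it does not assume equal variances. In standard notation (the symbols are introduced here, with $\bar{x}_i$, $s_i^2$ and $n_i$ the mean, variance and size of sample $i$):

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

Given that the F-test above rejects equal variances for EPI and DALY, the Welch version is the appropriate choice here.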

Comparing distributions

    > boxplot(EPI,DALY)

CDF for EPI and DALY

    > plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
    > plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

    > qqplot(EPI,DALY)

Oops, did we forget?

Goal?
- Find the single most important factor in increasing the EPI in a given region
- The preceding table gives a nested conceptual model
- Examine distributions down to the leaf nodes and build up an EPI model

    > boxplot(ENVHEALTH,ECOSYSTEM)

    > qqplot(ENVHEALTH,ECOSYSTEM)

ENVHEALTH/ECOSYSTEM

    > shapiro.test(ENVHEALTH)

        Shapiro-Wilk normality test

    data:  ENVHEALTH
    W = 0.9161, p-value = 1.083e-08

Reject.

    > shapiro.test(ECOSYSTEM)

        Shapiro-Wilk normality test

    data:  ECOSYSTEM
    W = 0.9813, p-value = 0.02654

~Reject.

Kolmogorov–Smirnov (KS) test

    > ks.test(EPI,DALY)

        Two-sample Kolmogorov-Smirnov test

    data:  EPI and DALY
    D = 0.2331, p-value = 0.0001382
    alternative hypothesis: two-sided

    Warning message:
    In ks.test(EPI, DALY) :
      p-value will be approximate in the presence of ties


How are the software installs going?
- R/Scipy (et al.)/Matlab – getting comfortable?
- Data infrastructure …
- http://hyperpolyglot.org/numerical-analysis (Matlab, R, scipy/numpy) table comparison

Tentative assignments
- Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual)
- Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual)
- Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual)
- Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual)
- Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual)
- Term project. Due ~ week 13. 30% (25% written, 5% oral; individual)

Admin info (keep/print this slide)
- Class: ITWS-4963/ITWS-6965
- Hours: 12:00pm-1:50pm Tuesday/Friday
- Location: SAGE 3101
- Instructor: Peter Fox
- Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
- Contact hours: Monday** 3:00-4:00pm (or by email appt)
- Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
- TA: Lakshmi Chenicheri, chenil@rpi.edu
- Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
  - Schedule, lectures, syllabus, reading, assignments, etc.
