Published by Sergio Frome. Modified over 2 years ago.

1
**CA200 (based on the book by Prof. Jane M. Horgan)**

3. Basics of R – cont.
- Summarising Statistical Data
- Graphical Displays

4. Basic distributions with R

2
**Basics**

    6+7*3/2 #general expression
    [1] 16.5
    x <- 1:4 #integers are assigned to the vector x
    x #print x
    [1] 1 2 3 4
    x2 <- x**2 #square the elements, or x2 <- x^2
    x2
    [1]  1  4  9 16
    X <- … #case sensitive! X and x are different objects
    prod1 <- X*x
    prod1

3
**Getting Help**

- click the Help button on the toolbar
- help()
- help.start()
- demo()
- ?read.table
- help.search("data.entry")
- apropos("boxplot") lists matching names such as "boxplot", "boxplot.default", "boxplot.stats"

4
**Statistics: Measures of Central Tendency**

Typical or central points:
- Mean: sum of all values divided by the number of cases
- Median: middle value; 50% of the data lie below it and 50% above
- Mode: most commonly occurring value, i.e. the value with the highest frequency

5
**Statistics: Measures of Dispersion**

Spread or variation in the data

Standard Deviation (σ): the square root of the average squared deviation from the mean
- measures how the data values differ from the mean
- a small standard deviation implies most values are near the average
- a large standard deviation indicates that values are widely spread above and below the average

6
**Statistics: Measures of Dispersion**

Spread or variation in the data
- Range: lowest and highest values
- Quartiles: divide the data into quarters; the 2nd quartile is the median
- Interquartile Range: the distance between the 1st and 3rd quartiles, which spans the middle 50% of the data
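The measures defined on the last three slides can be tried out on a small vector. The following is a minimal sketch; the vector `marks` and the helper variable `stat_mode` are made up for illustration and are not part of the course data.

```r
# Hypothetical sample, chosen only for illustration
marks <- c(2, 4, 4, 5, 7, 9, 11)

mean(marks)      # sum of the values divided by the number of values
median(marks)    # middle value of the sorted data

# Base R has no function for the statistical mode (mode() reports the
# storage type), so tabulate the values and pick the most frequent one
counts <- table(marks)
stat_mode <- as.numeric(names(counts)[which.max(counts)])

range(marks)                                        # lowest and highest value
quartiles <- quantile(marks, c(0.25, 0.50, 0.75))   # 1st, 2nd and 3rd quartiles
iqr <- IQR(marks)                                   # 3rd quartile minus 1st quartile
```

Note that `quantile()` supports several interpolation rules; the sketch relies on R's default, so other software may report slightly different quartiles for the same data.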

7
**Data Entry**

Entering data from the screen to a vector

Example 1.1:

    downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29, 30, 30, 30, 33, 36, 44, 45, 47, 51)
    mean(downtime)
    [1] 25.04348
    median(downtime)
    [1] 25
    range(downtime)
    [1]  0 51
    sd(downtime)
    [1] 14.27164

8
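**Data Entry – cont.**

Entering data from a file to a data frame

Example 1.2: Examination results: results.txt

The file has one header line with the column names gender, arch1, prog1, arch2 and prog2, followed by one row per student: the gender (m or f) and four marks, with NA wherever a mark is missing.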

9
**Data Entry – cont.**

NA indicates a missing value. There is no mark for arch1 and prog1 in the second record.

    results <- read.table("C:\\results.txt", header = T) #first download the file to the desired location
    results$arch1[5]
    [1] 89

Alternatively:

    attach(results) #allows you to access the columns without the prefix results$
    names(results) #lists the column names
    arch1[5]
    [1] 89
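To experiment with `read.table` without the course file, a small stand-in can be written to a temporary location first. This is only a sketch: the three rows of marks below are invented, and `results_demo`, `path` and `third_mark` are hypothetical names. It also shows `with()`, which is one way to get prefix-free access without the lasting side effects of `attach()`.

```r
# Hypothetical miniature version of results.txt; the marks are made up
path <- tempfile(fileext = ".txt")
writeLines(c("gender arch1 prog1 arch2 prog2",
             "m 99 98 83 94",
             "m NA NA 86 77",
             "f 65 81 72 80"), path)

# header = TRUE: the first line of the file holds the column names
results_demo <- read.table(path, header = TRUE)

str(results_demo)           # column names and types as R read them
results_demo$arch1[3]       # third mark in arch1, using the $ prefix
is.na(results_demo$arch1)   # TRUE flags the missing mark in the second record

# with() evaluates an expression inside the data frame, so no prefix is needed
third_mark <- with(results_demo, arch1[3])
```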

10
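**Data Entry – Missing values**

    mean(arch1)
    [1] NA #no result because some marks are missing

Use na.rm = T (not available, remove) or na.rm = TRUE:

    mean(arch1, na.rm = T)
    [1] …
    mean(prog1, na.rm = T)
    [1] 84.25
    mean(arch2, na.rm = T)
    mean(prog2, na.rm = T)

All columns at once:

    mean(results, na.rm = T) #older versions of R only; in current R use colMeans(results[2:5], na.rm = T)

This returned one mean per column (gender, arch1, prog1, arch2, prog2), with NA for gender because it is not numeric.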
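The effect of `na.rm` is easy to check on a tiny data frame. The sketch below uses invented marks (the data frame `marks` and the variable `col_means` are hypothetical) and uses `colMeans()` for the column-wise means, since calling `mean()` on a whole data frame is not supported in current versions of R.

```r
# Hypothetical marks with missing values (not the course data)
marks <- data.frame(arch1 = c(90, NA, 70, 80),
                    prog1 = c(85, NA, 75, 95))

mean(marks$arch1)                 # NA: a single missing value makes the result missing
mean(marks$arch1, na.rm = TRUE)   # mean of the three marks that are present

sum(is.na(marks$arch1))           # how many marks are missing in this column

# Column-wise means, ignoring missing values
col_means <- colMeans(marks, na.rm = TRUE)
```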

11
**Data Entry – cont.**

- Use "read.table" if the data in the text file are separated by spaces
- Use "read.csv" when the data are separated by commas
- Use "read.csv2" when the data are separated by semicolons (and a comma is used as the decimal mark)
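The difference between `read.csv` and `read.csv2` can be seen by writing the same two records in both conventions. This is an illustrative sketch with invented values; the file names and the objects `from_csv` and `from_csv2` are hypothetical.

```r
# Hypothetical two-row files written to temporary locations
comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")
writeLines(c("gender,arch1", "m,99.5", "f,65"), comma_file)
writeLines(c("gender;arch1", "m;99,5", "f;65"), semi_file)

from_csv  <- read.csv(comma_file)    # fields separated by ",", decimal mark "."
from_csv2 <- read.csv2(semi_file)    # fields separated by ";", decimal mark ","

# Both calls should give the same data frame
all.equal(from_csv, from_csv2)
```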

12
**Data Entry – cont.**

Entering data into a spreadsheet:

    newdata <- data.frame() #creates a new, empty data frame called newdata
    fix(newdata) #opens it in a spreadsheet-style editor so that data can be added

13
**Summary Statistics**

Example 1.1: Downtime:

    summary(downtime)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       0.00   16.00   25.00   25.04   31.50   51.00

Example 1.2: Examination Results:

    summary(results)
     gender      arch1            prog1            arch2            prog2
     f: 4    Min.   : 3.00   Min.   :65.00   Min.   :56.00   Min.   :63.00
     m:22    1st Qu.: …      1st Qu.: …      1st Qu.: …      1st Qu.:77.50
             Median : …      Median :82.50   Median :85.50   Median :84.00
             Mean   : …      Mean   :84.25   Mean   :81.15   Mean   : …
             3rd Qu.: …      3rd Qu.: …      3rd Qu.: …      3rd Qu.:92.50
             Max.   : …      Max.   :98.00   Max.   :96.00   Max.   :97.00
             NA's   : 2.00   NA's   : 2.00

14
**Summary Statistics - cont.**

Example 1.2: Examination Results:

For a separate analysis use:

    mean(results$arch1, na.rm=T) #hint: use attach(results)
    [1] …
    summary(arch1, na.rm=T)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's

15
**Programming in R**

Example 1.3: Write a program to calculate the mean of downtime

Formula for the mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

    x <- sum(downtime) #sum of elements in downtime
    n <- length(downtime) #number of elements in the vector
    mean_downtime <- x/n

or

    mean_downtime <- sum(downtime) / length(downtime)

16
**Programming in R – cont.**

Example 1.4: Write a program to calculate the standard deviation of downtime

    #hint - use the sqrt function
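One possible solution to Example 1.4 is sketched below. It is not taken from the slides: the names `sq_dev` and `sd_downtime` are illustrative, and it assumes the sample standard deviation, which divides by n - 1 (the convention that R's built-in `sd()` uses) rather than by n.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

n <- length(downtime)                     # number of observations
mean_downtime <- sum(downtime) / n        # mean, as in Example 1.3
sq_dev <- (downtime - mean_downtime)^2    # squared deviations from the mean

# Assumption: divide by n - 1 (sample standard deviation), as sd() does
sd_downtime <- sqrt(sum(sq_dev) / (n - 1))

sd_downtime          # should agree with sd(downtime)
```

Dividing by n instead of n - 1 gives the population version of the formula, which is slightly smaller for the same data.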

17
**Graphical displays - Boxplots**

Boxplot – a graphical summary based on the median, quartiles and extreme values

    boxplot(downtime)

- the box represents the interquartile range, which contains the middle 50% of cases
- the whiskers are lines that extend from the box to the highest and lowest values that are not outliers
- the line across the box represents the median
- extreme values (outliers) are cases more than 1.5 box lengths beyond the edge of the box
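The 1.5 box length rule can be checked numerically before drawing anything. The sketch below is illustrative; `lower_fence`, `upper_fence`, `outliers` and `bs` are hypothetical names. It computes the limits by hand and then compares them with `boxplot.stats()`, the function that supplies the numbers `boxplot()` draws.

```r
downtime <- c(0, 1, 2, 12, 12, 14, 18, 21, 21, 23, 24, 25, 28, 29,
              30, 30, 30, 33, 36, 44, 45, 47, 51)

q <- quantile(downtime, c(0.25, 0.75))   # edges of the box (R's default quantile rule)
box_length <- IQR(downtime)              # interquartile range
lower_fence <- q[1] - 1.5 * box_length
upper_fence <- q[2] + 1.5 * box_length

# Points outside the fences would be drawn individually as outliers
outliers <- downtime[downtime < lower_fence | downtime > upper_fence]

# boxplot.stats() returns the values used by boxplot()
bs <- boxplot.stats(downtime)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # outliers according to boxplot()
```

`boxplot.stats()` works with hinges, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes; for this vector the two coincide.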

18
**Graphical displays – Boxplots – cont.**

To improve graphical display use labels: boxplot(downtime, xlab = "downtime", ylab = "minutes")

19
**Graphical displays – Multiple Boxplots**

Multiple boxplots on the same axes, by adding extra arguments to the boxplot function:

    boxplot(results$arch1, results$arch2, xlab = "Architecture, Semesters 1 and 2")

Conclusions:
- marks are lower in sem2
- the range of marks is narrower in sem2

Note the outliers in sem1: atypical values lying more than 1.5 box lengths beyond the edge of the box.

20
**Graphical displays – Multiple Boxplots – cont.**

Displays values per gender:

    boxplot(arch1~gender, xlab = "gender", ylab = "Marks(%)", main = "Architecture Semester 1")

Note the effect of using main = "Architecture Semester 1": it adds a title above the plot.

21
**Par**

Display plots using the par function

    par(mfrow = c(2,2)) #outputs are displayed in a 2x2 array
    boxplot(arch1~gender, main = "Architecture Semester 1")
    boxplot(arch2~gender, main = "Architecture Semester 2")
    boxplot(prog1~gender, main = "Programming Semester 1")
    boxplot(prog2~gender, main = "Programming Semester 2")

To undo the matrix layout, type:

    par(mfrow = c(1,1)) #restores graphics to the full screen

22
**Par – cont.**

Conclusions:
- female students are doing less well in programming in sem1
- the median for female students in prog. sem1 is lower than for male students

23
**Histograms**

A histogram is a graphical display of frequencies in the categories of a variable

    hist(arch1, breaks = 5, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 1")

Note: breaks = 5 asks for about five bins of equal width (R treats the number as a suggestion). The histogram counts the observations that fall within each category or "bin".
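To see exactly which bins R chose and how many observations fell into each, `hist()` can be called with `plot = FALSE`, which returns the binning without drawing. A small sketch follows; the vector `marks` is invented for illustration.

```r
# Hypothetical marks, used only to show how observations are binned
marks <- c(35, 48, 52, 55, 61, 64, 67, 70, 72, 75, 78, 81, 84, 88, 93)

# plot = FALSE returns the binning instead of drawing the histogram
h <- hist(marks, breaks = 5, plot = FALSE)

h$breaks   # bin boundaries chosen by R; breaks = 5 is only a suggestion
h$counts   # number of observations falling in each bin
```

Passing a vector of boundaries instead, for example `breaks = seq(0, 100, by = 20)`, fixes the bins exactly.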

24
**Histograms – cont.**

    hist(arch2, xlab = "Marks(%)", ylab = "Number of students", main = "Architecture Semester 2")

Note: A histogram with default breaks

25
**Using par with histograms**

The par function can be used to show all the subjects in one diagram

    par(mfrow = c(2,2))
    hist(arch1, xlab = "Architecture", main = "Semester 1", ylim = c(0, 35))
    hist(arch2, xlab = "Architecture", main = "Semester 2", ylim = c(0, 35))
    hist(prog1, xlab = "Programming", main = " ", ylim = c(0, 35))
    hist(prog2, xlab = "Programming", main = " ", ylim = c(0, 35))

Note: ylim = c(0, 35) ensures that the y-axis has the same scale in all four plots!

27
**Stem and leaf**

Stem and leaf – a more modern way of displaying data!

Like a histogram, the diagram gives the frequencies of the categories, but it also gives the actual values in each category. The stem usually depicts the 10s and the leaves depict the units.

    stem(downtime, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      0 | 012
      1 | 2248
      2 | 1134589
      3 | 00036
      4 | 457
      5 | 1

28
**Stem and leaf – cont.**

    stem(prog1, scale = 2)

      The decimal point is 1 digit(s) to the right of the |

      6 | 5
      7 | 12
      7 | 66
      8 | …
      8 | 5788
      9 | 012
      9 | 7778

Note: e.g. there are many students with marks between 80% and 85%

29
**Scatter Plots**

To investigate the relationship between variables:

    plot(prog1, prog2, xlab = "Programming, Semester 1", ylab = "Programming, Semester 2")

Note: one variable increases with the other! Students who do well in prog1 tend to do well in prog2.

30
**Pairs**

If more than two variables are involved:

    courses <- results[2:5]
    pairs(courses) #scatter plots for all possible pairs

or

    pairs(results[2:5])

31
Pairs – cont.

32
**Graphical display vs. Summary Statistics**

Importance of graphical display to provide insight into the data!

Anscombe (1973), four data sets. Each data set consists of two variables on which there are 11 observations.

33
**Graphical display vs. Summary Statistics**

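     Data Set 1    Data Set 2    Data Set 3    Data Set 4
      x1     y1     x2     y2     x3     y3     x4     y4
      10   8.04     10   9.14     10   7.46      8   6.58
       8   6.95      8   8.14      8   6.77      8   5.76
      13   7.58     13   8.74     13  12.74      8   7.71
       9   8.81      9   8.77      9   7.11      8   8.84
      11   8.33     11   9.26     11   7.81      8   8.47
      14   9.96     14   8.10     14   8.84      8   7.04
       6   7.24      6   6.13      6   6.08      8   5.25
       4   4.26      4   3.10      4   5.39     19  12.50
      12  10.84     12   9.13     12   8.15      8   5.56
       7   4.82      7   7.26      7   6.42      8   7.91
       5   5.68      5   4.74      5   5.73      8   6.89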

34
**First read the data into separate vectors:**

    x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
    x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
    x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
    y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
    x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
    y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

35
**For convenience, group the data into frames:**

    dataset1 <- data.frame(x1,y1)
    dataset2 <- data.frame(x2,y2)
    dataset3 <- data.frame(x3,y3)
    dataset4 <- data.frame(x4,y4)

36
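**It is usual to obtain summary statistics**

Calculate the mean:

    colMeans(dataset1) #mean(dataset1) worked only in older versions of R
          x1       y1
    9.000000 7.500909
    colMeans(data.frame(x1,x2,x3,x4))
    x1 x2 x3 x4
     9  9  9  9
    colMeans(data.frame(y1,y2,y3,y4))
          y1       y2       y3       y4
    7.500909 7.500909 7.500000 7.500909

Calculate the standard deviation:

    sapply(data.frame(x1,x2,x3,x4), sd) #sd() on a whole data frame worked only in older versions of R
          x1       x2       x3       x4
    3.316625 3.316625 3.316625 3.316625
    sapply(data.frame(y1,y2,y3,y4), sd) #all approximately 2.03

Everything seems the same!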
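The same comparison can be made without typing in the vectors. The sketch below assumes that R's built-in `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`) holds the same values as the four data sets listed earlier; `col_means` and `col_sds` are illustrative names.

```r
# Assumption: the built-in anscombe data frame matches the vectors typed in earlier
data(anscombe)

col_means <- sapply(anscombe, mean)   # mean of every column
col_sds   <- sapply(anscombe, sd)     # standard deviation of every column

round(col_means, 2)   # x means agree with each other, and so do the y means
round(col_sds, 2)     # the same holds for the standard deviations
```

`sapply()` applies the function to each column in turn, which is why it works for `sd` as well as for `mean`.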

37
**But when we plot:**

    par(mfrow = c(2, 2))
    plot(x1,y1, xlim=c(0, 20), ylim=c(0, 13))
    plot(x2,y2, xlim=c(0, 20), ylim=c(0, 13))
    plot(x3,y3, xlim=c(0, 20), ylim=c(0, 13))
    plot(x4,y4, xlim=c(0, 20), ylim=c(0, 13))

38
Note:
- Data set 1 is linear with some scatter
- Data set 2 is quadratic
- Data set 3 has an outlier; without it the data would be linear
- Data set 4 contains x values which are all equal except for one outlier; if it were removed, the data would lie on a vertical line

Everything seems different! Graphical displays are the core of getting insight into, and a feel for, the data!
