Descriptive statistics Petter Mostad
Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when you first get the data. Data presentation: Illustrating for others some conclusion with numbers or graphs based on the data.
Data exploration Understand description of variables Find ranges, typical values, distributions of variables –Is the data OK? Meaningful? Outliers? Errors? How do variables relate to each other? –Is it meaningful? As expected? Can you form new hypotheses?
Data presentation Remove superfluous information Present essential information fairly Present information efficiently Make it possible to understand information quickly and simply
Types of variables Numerical variables –Discrete –Continuous Categorical variables –Nominal values –Ordinal values
Histograms Subdivide continuous data into intervals, and display counts in intervals The decision about the width of the intervals can influence the result a lot ”Ogives”
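To see how much the interval width matters, the same data can be binned twice. The following is a minimal sketch, not a definitive implementation; the helper `histogram_counts` and the ages are invented for illustration.

```python
from collections import Counter


def histogram_counts(values, width, start=0):
    """Count values per interval [start + k*width, start + (k+1)*width)."""
    bin_index = Counter((v - start) // width for v in values)
    return {start + k * width: bin_index[k] for k in sorted(bin_index)}


# Hypothetical ages, invented for illustration only.
ages = [21, 23, 24, 29, 31, 35, 38, 42, 47, 58]

narrow = histogram_counts(ages, width=5)   # many short intervals
wide = histogram_counts(ages, width=20)    # few long intervals

# Print a crude text histogram for each choice of width.
for width, table in ((5, narrow), (20, wide)):
    print(f"interval width {width}")
    for left, count in table.items():
        print(f"  [{left}, {left + width}): {'*' * count}")
```

With width 5 the counts look ragged and leave an empty gap; with width 20 the same data collapse into two bars.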
Bar charts Can show variation between categories Grouped bars can compare variations in different groups Stacked bars can show proportions, or cumulative effects
Example Shows changing proportions of 8 types across 24 groups Groups: coexpressed genes Types: Types of organisms
Cumulative distributions Cumulates the proportions up to each level Can never decrease; goes from 0 to 1 (or 100%)
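A cumulative distribution can be obtained from interval counts by taking running sums and dividing by the total. This is a small sketch under the assumption that the counts are already ordered by level; the counts themselves are made up.

```python
from itertools import accumulate


def cumulative_proportions(counts):
    """Running sum of counts, divided by the total number of observations."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


# Hypothetical counts per level, invented for illustration only.
counts = [3, 1, 1, 2, 1, 1, 0, 1]
cum = cumulative_proportions(counts)
print(cum)
```

The resulting sequence never decreases and ends at 1, as stated above. A level with count 0 simply repeats the previous value.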
Stem-and-leaf diagrams A way to show both the distribution of numbers graphically, and the digits involved Example (figure): Stem-and-leaf plot of age in years. Stem width: 10. Each leaf: 2 case(s). & denotes fractional leaves.
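A stem-and-leaf diagram can be built by splitting each number into a stem (here the tens) and a leaf (the last digit). The sketch below assumes non-negative integers and one case per leaf, which is simpler than the two cases per leaf in the example output; the function name and the ages are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_width=10):
    """Group sorted values by stem; each leaf is the remainder within the stem."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_width].append(v % stem_width)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


# Hypothetical ages, invented for illustration only.
ages = [21, 23, 24, 29, 31, 35, 38, 42, 47, 58]
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

The length of each row shows the distribution, while the digits themselves remain readable.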
Pie charts Illustrate percentages or parts of a whole well, for comparison between the parts. 3D pies, or ”exploded” pies, distort more than they clarify the information
Pareto diagrams Focuses on the most important (frequent) categories. Shows cumulative frequencies when including each category
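The numbers behind a Pareto diagram are the categories sorted from most to least frequent, together with the cumulative share reached when each category is included. A minimal sketch, with a hypothetical helper `pareto_table` and invented data:

```python
from collections import Counter
from itertools import accumulate


def pareto_table(observations):
    """Return (category, count, cumulative proportion), most frequent first."""
    ranked = Counter(observations).most_common()
    total = sum(count for _, count in ranked)
    running = accumulate(count for _, count in ranked)
    return [(cat, count, cum / total)
            for (cat, count), cum in zip(ranked, running)]


# Hypothetical defect categories, invented for illustration only.
defects = ["scratch"] * 5 + ["dent"] * 3 + ["stain"] * 2
table = pareto_table(defects)

for category, count, cum_share in table:
    print(f"{category:8s} {count:3d} {cum_share:6.0%}")
```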
Numerical summary statistics (Arithmetic) mean Median Mode Skewness Outliers Max, min, range
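Arithmetic versus geometric mean Given observations $x_1, x_2, \ldots, x_n$ Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ They correspond to each other when the scale is changed by taking logarithms: the logarithm of the geometric mean is the arithmetic mean of the logarithms $\log x_i$.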
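The link between the two means can be checked numerically: averaging the logarithms and transforming back gives the geometric mean. This is a small sketch using Python's `statistics` module; the observations are invented and must be positive for the geometric mean to be defined.

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical positive observations, invented for illustration only.
x = [1.0, 10.0, 100.0]

arith = fmean(x)                 # arithmetic mean on the original scale
geo = geometric_mean(x)          # geometric mean on the original scale

# Arithmetic mean on the log scale, transformed back with exp.
geo_via_logs = math.exp(fmean([math.log(v) for v in x]))

print(arith, geo, geo_via_logs)
```

For data spread over several orders of magnitude, as here, the arithmetic mean is pulled toward the largest value while the geometric mean sits in the middle of the log scale.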
Measures of variability (Sample) variance (Sample) standard deviation Coefficient of variation
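The three measures can be computed directly from their usual definitions. The sketch below assumes the common conventions: the sample variance divides by n - 1, the standard deviation is its square root, and the coefficient of variation is the standard deviation divided by the mean. The data are invented.

```python
import math

# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((v - mean) ** 2 for v in data) / (n - 1)

# Sample standard deviation: same unit as the data.
std_dev = math.sqrt(variance)

# Coefficient of variation: unit-free, only meaningful for a nonzero mean.
coeff_of_variation = std_dev / mean

print(variance, std_dev, coeff_of_variation)
```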
Percentiles and quartiles The x-th percentile is the number p such that x percent of the data is smaller than p. The first and third quartiles are the 25th and 75th percentiles, respectively. The inter-quartile range is the difference between the third and first quartiles.
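For small data sets the exact value of a percentile depends on the interpolation convention, and software packages differ. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is one convention among several and an assumption here; the data are invented.

```python
from statistics import quantiles

# Hypothetical waiting times, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 gives the three cut points between quarters: Q1, median, Q3.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# n=100 gives 99 cut points; index 89 is the 90th percentile.
p90 = quantiles(data, n=100, method="inclusive")[89]

print(q1, median, q3, iqr, p90)
```

Switching to `method="exclusive"` (the module's default) can give somewhat different quartiles for the same data, which is worth remembering when comparing output from different programs.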
Boxplots ”Box and whisker plots” Sometimes shows min, 1st quartile, median, 3rd quartile, max May instead show some outliers separately
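The numbers drawn in a boxplot can be computed as follows. This is a sketch, not a definitive implementation: it assumes the widely used rule that points more than 1.5 inter-quartile ranges outside the box are shown separately as outliers, which the text above does not specify. The function name and data are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values, whisker_factor=1.5):
    """Quartiles, whisker ends, and separately shown outliers."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr    # assumed 1.5 * IQR rule
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],     # smallest value that is not an outlier
        "whisker_high": inside[-1],   # largest value that is not an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical data with one extreme value, invented for illustration only.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

Setting `whisker_factor` to a very large value reproduces the other variant, where the whiskers simply run from min to max.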
Scatterplots Probably the most useful graphical plot Can show any kind of connection between variables, not only linear Can be done for many pairs at a time (matrix plot), or for triplets (3D plot)
Covariance Given paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ (Sample) covariance: $\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ Positive when variables tend to change in the same direction, negative if opposite direction
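Correlation coefficient Correlation coefficient: $r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$, the sample covariance divided by the product of the sample standard deviations $s_x$ and $s_y$ Always between -1 and 1 If exactly equal to 1, then points are on an increasing line Can be a more illustrative measure than covariance, since it does not depend on the units of measurement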
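Covariance and correlation can be computed by hand from paired observations. The sketch below assumes the usual sample definitions with divisor n - 1; the helper names and the data are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_covariance(x, y):
    """Average product of deviations from the means, with divisor n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # Cov(x, x) is the variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical paired observations, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
r_xy = correlation(x, y)

# Points exactly on an increasing line give correlation 1.
on_line = [2 * v + 1 for v in x]
r_line = correlation(x, on_line)

print(cov_xy, r_xy, r_line)
```

Rescaling `y` (for example from metres to centimetres) multiplies the covariance by the same factor but leaves the correlation unchanged.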
Least squares line fitting We can illustrate a trend in the data by fitting a line
Fitting the line The line is often fitted by minimizing the sum of the squares of the ”errors” (the vertical distances to the line) We will hear much about regression methods later
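For a single explanatory variable, the line minimizing the sum of squared vertical distances has a closed-form solution: the slope is the sum of products of deviations divided by the sum of squared deviations in x, and the line passes through the point of means. A minimal sketch with invented data and hypothetical function names:

```python
def fit_line(x, y):
    """Least squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = my - slope * mx   # the line passes through (mean x, mean y)
    return slope, intercept


def sum_squared_errors(x, y, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


# Hypothetical paired observations, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
best = sum_squared_errors(x, y, slope, intercept)
print(slope, intercept, best)
```

Any other line, for example one with a slightly different slope, gives a larger value of `sum_squared_errors` on the same data.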
Cross tables When items can be classified using two different categorical variables, we can illustrate counts in a cross table. If percentages are computed, they must be either relative to the columns or the rows. In multiway tables, more than two classifying variables are used.
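The difference between row and column percentages is easiest to see on a small table. The sketch below counts pairs of categories and computes both kinds of percentages; the two variables and their values are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (treated?, group), invented for illustration only.
records = ([("yes", "A")] * 3 + [("yes", "B")] * 1
           + [("no", "A")] * 2 + [("no", "B")] * 4)

counts = Counter(records)                        # cell counts
row_totals = Counter(row for row, _ in records)  # totals per first variable
col_totals = Counter(col for _, col in records)  # totals per second variable

# Row percentages: each row sums to 1.
row_pct = {cell: c / row_totals[cell[0]] for cell, c in counts.items()}

# Column percentages: each column sums to 1.
col_pct = {cell: c / col_totals[cell[1]] for cell, c in counts.items()}

for cell in sorted(counts):
    print(cell, counts[cell], f"{row_pct[cell]:.0%}", f"{col_pct[cell]:.0%}")
```

The same cell gets different percentages depending on the direction, so a table should always state which one is shown.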
Early example: Napoleon's Russian campaign
DNA sequence logos Used to show what is conserved, and what varies, at DNA binding sites for some protein Relative height of letters shows which bases are conserved Total height shows degree of conservation
Chernoff faces A way to visualize about 20 parameters in one figure Background: We are good at remembering and comparing faces Features in the face correspond to parameters you want to visualize
Use your own creativity! When exploring data, try to make the kinds of plots that will answer your questions! When presenting data, think about –simplicity –fairness –efficiency –inventiveness