INTRODUCTION
Book list
Level of course
Aims of course
What are multivariate data?
What is multivariate data analysis?
Aims of multivariate data analysis
Why do multivariate data analysis?
Terminology
Types of variables
Geometrical models and concept of similarity (dissimilarity or distance)
Computing
Course topics
LEVEL OF THE COURSE
Approach from practical biological and geological viewpoint, not statistical theory viewpoint. Assume no background in matrix algebra, eigenanalysis, or statistical theory. Emphasis on techniques that are ecologically realistic and useful and that are computationally feasible.
“Truths which can be proved can also be known by faith. The proofs are difficult and can only be understood by the learned; but faith is necessary also to the young, and to those who, from practical preoccupations, have not the leisure to learn. For them, revelation suffices.” Bertrand Russell 1946 The History of Western Philosophy
“It cannot be too strongly emphasised that a long mathematical argument can be fully understood on first reading only when it is very elementary indeed, relative to the reader’s mathematical knowledge. If one wants only the gist of it, he may read such material once only, but otherwise he may expect to read it at least once again. Serious reading of mathematics is best done sitting bolt upright on a hard chair at a desk. Pencil and paper are indispensable.” L Savage 1972 The Foundations of Statistics. BUT: “A journey of a thousand miles begins with a single step” Lao Tsu
STATUS OF MULTIVARIATE NUMERICAL DATA ANALYSIS
Basic mathematics of correlation, regression, analysis of variance, eigenanalysis, randomisation etc. not new, worked out in s. The arithmetic manipulations and calculations involved are so numerous and so time-consuming that it was virtually impossible to work with anything other than the smallest data-sets on a hand calculator or early computer. Development of numerical data analysis closely linked to development of computers. Now possible to do in seconds what would have taken hours, days, even weeks. Increased availability of computer program packages has advantages and disadvantages.
Advantages: fast, painless, simple
Disadvantages: too fast, too easy, too simple
Need to understand a technique well before one can critically evaluate results. Sound interpretation requires a good understanding of the technique.
AIMS
Provide an introductory understanding of the most appropriate methods for the numerical analysis of complex multivariate biological and environmental data. Recent maturation of methods. Provide introduction to what these methods do and do not do. Provide some guidance as to when and when not to use particular methods. Provide an outline of major assumptions, limitations, strengths, and weaknesses of different methods. Indicate to you when to seek expert advice. Encourage numerical thinking (ideas, reasons, potentialities behind the techniques). Not so concerned here with numerical arithmetic (the numerical manipulations involved).
ON THE USES AND METHODS OF STATISTICS
By Professor F. Y. Edgeworth, M. A., D. C. L.
Syllabus for Edgeworth’s 1892 Newmarch Lectures, University College London

I. FIRST PRINCIPLES
The extent of the subject here treated is that which is denoted by two leading definitions of statistics, viz: the study of numerical statements relating to society, and the theory of means. The subject may be divided according as the element of induction is more or less prevalent. First come general directions as to the acquisition of data; e.g., that figures should be accurate, and terms unambiguous. Examples of the violation of these rules; together with other precepts and cautions. Use of relative figures (per head, per cent, &c.). Analysis of the data.
References: Conférences sur la Statistique (Rozier Editeur), 1891; Pidgin, Practical Statistics, 1888; Giffen, International Statistical Comparisons, Economic Journal, June,

II. GRAPHICAL METHODS
The Cartesian system of co-ordinates. Integration and interpolation. Case where several dependent variables (i.e. diseases from different causes) are referred to one independent variable (i.e. the time). The case of one variable dependent on two independent variables is properly represented by a surface; but curves of level and variously coloured planes are more convenient. Methods of expressing variation of a quantity relative to its initial, or average, value. Miscellaneous devices for exhibiting numerical relations to the eye.
References: Marey, La Méthode Graphique, 1885; Favaro, Leçons de Statique Graphique (translated into French by Terrier), Ch. V. with appendix by the translator. Levasseur, La Statistique Graphique, Journal of the Statistical Society, Jubilee vol., 1885; Marshall, The Graphic Method of Statistics, Ibid; Cheysson, Les Cartogrammes à teintes graduées, Journal de la Société de Statistique de Paris, 1887; Scribner’s Statistical Atlas of the United States; Longstaff, Studies in Statistics, 1891.
III. THE DOCTRINE OF AVERAGES
The general idea of a mean comprehends innumerable species, of which the most important are, the Arithmetic Mean, the Median, the Greatest Ordinate (or centre of greatest condensation) and the Geometric Mean. A cross division is between simple and weighted means. Concrete instances of these varieties. Subtle distinction between so-called objective and subjective means. Peculiar prestige attaches to the means of which the constituents are grouped according to the Probability Curve, or law of error. A priori demonstration, and empirical verification, that this form arises under certain conditions.
References: Venn, Logic of Chance, Third Edition, 1888, chap. xviii. and xix.; On….Averages. Journal of the Statistical Society, 1891; Galton, Statistics by inter-comparison, Philosophical Magazine, 1875; Bertillon, Moyenne, Dictionnaire Encyclopédique des Sciences Médicales; Edgeworth, On the Choice of Means, Phil. Mag., 1887, On the empirical proof of the law of error, Ib.,

IV. TYPES AND CORRELATIONS
The ‘mean man’ has for stature, length of cubit, height of knee, &c, the respective means of the statures, lengths, &c., of a greater number of men. Reply to the objection that such a combination of partial means may not form a possible whole. Relation between the deviation of one organ or attribute, e.g. length of cubit, from its mean; as established by Mr. Galton, and illustrated by Mr. H. Dickson. Abridged method of ascertaining the co-efficient which expresses the correlation between three attributes, e.g. stature, length of cubit and height of knee. The formula for the most probable attribute, e.g. stature corresponding to assigned values of two other attributes, e.g. length of cubit and height of knee, may be ascertained either from three simple correlations, between stature and cubit, stature and height of knee, cubit and height of knee; or by observations special to the case of three variables. Correlation between any number of attributes.
References: Quetelet, Anthropométrie; Galton, Family Likeness in Stature, Proceedings of the Royal Society, 1886; Co-relations and their measurements Ibid. 1888; Weldon, Correlated Variations, Ibid, 1892.
V. THE STATISTICAL PART OF INDUCTIVE LOGIC
Passing Insurance and other direct applications of statistics, we come to the investigation of causes. The inductive method to which statistics lends itself, the Method of Agreement, is liable to the fallacy Post hoc propter hoc; of which numerous examples occur. The Method of Concomitant variations is facilitated by the use of parallel curves. The Method of Residues is exemplified when in comparing the death rates of different classes, we make allowance for their different ages; and in similar cases.
References: Mill, Logic; Giffen, Essays on Finance, and Article in June No. of Economic Journal; Humphreys, Value of death rates as a test of Sanitary conditions, Journal of the Statistical Society, 1874, Class Mortality Statistics, Ibid,

VI. THE ELIMINATION OF CHANCE
One case of the Method of Residues, for which there exists a technical apparatus, is where the agency allowed for consists of those “fleeting causes” called chance. The simple method of eliminating chance, described by Mill (Logic, iii, xviii, 4) and the higher method derived from the theory of error. The latter method is particularly applicable where the deviation from the average value of a ratio – e.g. that between male and female births – follows the analogy of the simpler games of chance. In other cases the higher theory affords rather regulative ideas than exact conclusions; in this respect, comparable to the use of the mathematical theory of economics.
References: Westergaard, Grundzüge der Theorie der Statistik, 1891; Duesing, Das geschlechtverhaltniss in Preussen, 1890; Edgeworth, Methods of Statistics, Journal of the Statistical Society, Jubilee vol.,

[The lectures were presented on six consecutive Wednesdays at 5:00 P.M., beginning 11 May 1892, admission free.]
AIMS
At the end of the semester, could my students fully understand all of the statistical methods used in a typical issue of Ecology? Probably not, but they did have the foundation to consider the methods if authors clearly described their approach. Statistics can still mislead students, but students are less apt to see all statistics as lies and more apt to constructively criticise questionable methods. They can dissect any approach by applying the conceptual terms used throughout the semester. Students leave the course believing that statistics does, after all, have relevance, and that it is more accessible than they believed at the beginning of the semester.

At its best, statistical analysis sharpens thinking about data, reveals new patterns, prompts creative thinking, and stimulates productive discussions in multi-disciplinary research groups. For many scientists, these positive possibilities of statistics are over-shadowed by negatives: abstruse assumptions, emphasis of things one can’t do, and convoluted logic based on hypothesis rejection. One colleague’s reaction to this Special Feature (on statistical analysis of ecosystem studies) was that “statistics is the scientific equivalent of a trip to the dentist.” This view is probably widespread. It leads to insufficient awareness of the fact that statistics, like ecology, is a vital, evolving discipline with ever-changing capabilities.
A typical page from a field notebook. This one records observations on the ground vegetation in Populus balsamifera woodland in the flood plain of the Athabasca River, Alberta.
July 18, 1998. Plot 6 (quadrats) (Rt. Bank, c. 300 m S of mouth of Steepbank R., 40 m inland)
Table columns: Species, then quadrats #11 to #20. Final row: NO. OF SPECIES.
Species listed: Equisetum pratense, Rubus pubescens, R. strigosus, Cornus stolonifera, C. canadensis, Rosa acicularis, Galium boreale, Ribes oxyacanthoides, R. triste, Mitella nuda, Mertensia nudicaulis, Aralia nudicaulis, Viburnum edule, Calamagrostis canescens, Populus balsamifera (seedling), Prunus virginiana (seedling), Populus tremuloides (seedling), Actaea rubra, Circaea alpina, Thalictrum venulosum, Matteuccia struthiopteris.
TYPES OF MULTIVARIATE DATA
Field                     Object (n)                Variable (m)
Botany (plant ecology)    Quadrat, Relevé, Plot     Plant species
Archaeology               Sites                     Artefacts
Geology                   Samples                   Particle-size classes
Chemistry                 Stream sediments          Trace elements
Zoology                   Geographical localities   Morphometric characters
Pollen analysis           Sediment samples          Pollen types
Diatom analysis           Sediment samples          Diatom types
Palaeontology             Rock samples              Fossil taxa
...
Features in common – MANY OBJECTS n, MANY VARIABLES m
CAN BE ARRANGED IN DATA MATRIX of SAMPLES or OBJECTS x VARIABLES
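DATA MATRIX
Samples (n samples) form the columns, 1 to n; variables (m vars) form the rows, 1 to m.

$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1n} \\
x_{21} & \cdots & x_{2k} & \cdots & x_{2n} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{in} \\
\vdots &        & \vdots &        & \vdots \\
x_{m1} & \cdots & x_{mk} & \cdots & x_{mn}
\end{pmatrix}
$$

Matrix X with n columns x m rows. n x m matrix. Order (n x m).
$x_{21}$: element in row two, column one.
$x_{ik}$: element in row i, column k (i and k are the subscripts).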
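The row and column subscripts can be made concrete with a small sketch. The variable names, sample names, and counts below are invented purely for illustration, and Python is used only as a convenient notation; the layout (variables as rows, samples as columns) follows the slide.

```python
# Hypothetical toy data matrix: m = 3 variables (rows) x n = 4 samples (columns).
# All names and values are illustrative assumptions, not course data.
variables = ["taxon_1", "taxon_2", "taxon_3"]
samples = ["s1", "s2", "s3", "s4"]

X = [
    [12, 30, 45, 50],  # row 1: variable 1 across the n samples
    [60, 40, 25, 10],  # row 2: variable 2
    [0, 5, 0, 20],     # row 3: variable 3
]


def element(matrix, i, k):
    """Return x_ik (row i, column k) using the 1-based subscripts of the slide."""
    return matrix[i - 1][k - 1]


m = len(X)     # number of rows (variables)
n = len(X[0])  # number of columns (samples)
x_21 = element(X, 2, 1)  # element in row two, column one
```

The helper only translates the 1-based subscripts used in matrix notation to the 0-based indexing of Python lists.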
FEATURES OF MULTIVARIATE DATA
Complex
Show:
- Noise
- Redundancy
- Internal relationships
- Outliers
Some information in the data is only indirectly interpretable
BIOLOGICAL DATA: many species; +/–, quantitative, often %; many zero values; skewed; non-linear responses to environment
ENVIRONMENTAL DATA: fewer variables; +/–, ranks, quantitative; non-normal; linear inter-relationships, often high correlations, some redundancy
STATISTICS AND DATA ANALYSIS
1. Hypothesis testing: ‘confirmatory data analysis’ (CDA).
2. Model building: explanatory, empirical [statistical].
   Pielou (1981) Quart. Rev. Biol. “Models are often displayed with little or no effort to link them with the real world. As a result the whole body of knowledge and theory has grown top-heavy with models... Models are not useless but too much should not be expected of them. Modelling is only a part, and a subordinate part, of research.”
3. Hypothesis generation: ‘exploratory data analysis’ (EDA). Detective work.
CDA & EDA - different aims, philosophies, methods
“We need both exploratory and confirmatory”. J W Tukey 1980
EXPLORATORY DATA ANALYSIS
Real world ‘facts’ -> Observations, Measurements -> Data -> Data analysis -> Patterns -> ‘Information’ -> Hypotheses -> Decisions
CONFIRMATORY DATA ANALYSIS
Hypotheses -> Real world ‘facts’ -> Observations, Measurements -> Data -> Statistical testing -> Hypothesis testing -> Theory
Underlying statistical model (e.g. linear or unimodal response) Exploratory data analysis Biological Data Y Description Confirmatory data analysis Testable ‘null hypothesis’ Additional (e.g. environmental data) X Rejected hypotheses
The Popperian hypothetico-deductive method, after Underwood and others.
H0 = null hypothesis; HA = alternative hypothesis
Diagram boxes: Observation; Theory/Paradigm; Prediction; Scientific H0; Scientific HA; Statistical H0; Statistical HA; Conceptual design of study, choice of format (experimental, non-experimental) and classes of data; Sampling or experimental design; Data collection; Analysis; Evaluate statistical H0, HA; Evaluate prediction; Evaluate scientific H0, HA; Evaluate theory/paradigm.
Arrow labels: induction, deduction.
EXPLORATORY DATA ANALYSIS (EDA) versus CONFIRMATORY DATA ANALYSIS (CDA)
Question
  EDA: How can I optimally describe or explain variation in data set?
  CDA: Can I reject the null hypothesis that the species are unrelated to a particular environmental factor or set of factors?
Sampling
  EDA: Samples can be collected in many ways, including subjective sampling.
  CDA: Samples must be representative of universe of interest – random, stratified random, systematic.
Planning
  EDA: ‘Data-fishing’ permissible, post-hoc analyses, explanations, hypotheses, narrative okay.
  CDA: Analysis must be planned a priori.
P-values
  EDA: P-values only a rough guide.
  CDA: P-values meaningful.
Stepwise techniques
  EDA: Stepwise techniques (e.g. forward selection) useful and valid.
  CDA: Stepwise techniques not strictly valid.
Purpose
  EDA: Main purpose is to find ‘pattern’ or ‘structure’ in nature. Inherently subjective, personal activity. Interpretations not repeatable.
  CDA: Main purpose is to test hypotheses about patterns. Inherently analytical and rigorous. Interpretations repeatable.
A WELL-DESIGNED MODERN ECOLOGICAL STUDY COMBINES BOTH.
1) Two-phase study
- Initial phase is exploratory, perhaps involving subjectively located plots or previous data to generate hypotheses.
- Second phase is confirmatory, collection of new data from defined sampling scheme, planned data analysis.
2) Split-sampling
- Large data set (>100 objects), randomly split into two (75/25) – exploratory set and confirmatory set.
- Generate hypotheses from exploratory set (allow data fishing); test hypotheses with confirmatory set.
- Rarely done in ecology.
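The split-sampling idea can be sketched in a few lines. This is only an illustration under stated assumptions: the plot labels, the function name `split_objects`, and the fixed random seed are all hypothetical, and the 75/25 proportion is the one quoted on the slide.

```python
import random


def split_objects(objects, exploratory_fraction=0.75, seed=0):
    """Randomly split objects into an exploratory and a confirmatory set.

    The seed is fixed only so the illustration is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(objects)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 120 plots (more than 100 objects, as the slide suggests).
plots = [f"plot_{i:03d}" for i in range(1, 121)]
exploratory, confirmatory = split_objects(plots)
```

Hypotheses would be generated from `exploratory` only; `confirmatory` stays untouched until the planned tests are run, so that the tests are not biased by the earlier data fishing.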
Data diving with cross-validation: an investigation of broad-scale gradients in Swedish weed communities. ERIK HALLGREN, MICHAEL W. PALMER and PER MILBERG. Journal of Ecology, 1999, 87,
Flow chart for the sequence of analyses. Solid lines represent the flow of data and dashed lines the flow of analysis.
Flow chart boxes: Full data set; Some previously removed data; Clean data set; Exploratory data set; Combined data set; Confirmatory data set; RESULTS.
Flow chart labels: Remove observations with missing data; Random split; Hypotheses; Ideas for more analysis; Choice of variables; Analyses for display; Hypothesis tests.
EUROPEAN FOOD (From A Survey of Europe Today, The Reader’s Digest Association Ltd.) Percentage of all households with various foods in house at time of questionnaire. Foods by countries. Country
Classification
Dendrogram showing the results of minimum variance agglomerative cluster analysis of the 16 European countries for the 20 food variables listed in the table. Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland
Ordination Correspondence analysis of percentages of households in 16 European countries having each of 20 types of food. Key: Countries: A Austria, B Belgium, CH Switzerland, D West Germany, E Spain, F France, GB Great Britain, I Italy, IRL Ireland, L Luxembourg, N Norway, NL Holland, P Portugal, S Sweden, SF Finland
Minimum spanning tree fitted to the full 15-dimensional correspondence analysis solution superimposed on a rotated plot of countries from previous figure.
Percentages of people employed in nine different industry groups in Europe. (AGR = agriculture, MIN = mining, MAN = manufacturing, PS = power supplies, CON = construction, SER = service industries, FIN = finance, SPS = social and personal services, TC = transport and communications). Source: Euromonitor (1979, pp. 76-7) with the percentage employed in finance in Spain reduced from 14.7 to the more reasonable figure of 8.5
WHY DO MULTIVARIATE DATA ANALYSIS?
1: Data simplification and data reduction - “signal from noise”
2: Detect features that might otherwise escape attention.
3: Hypothesis generation and prediction.
4: Data exploration as aid to further data collection.
5: Communication of results of complex data. Ease of display of complex data.
6: Aids communication and forces us to be explicit. “The more orthodox amongst us should at least reflect that many of the same imperfections are implicit in our own cerebrations and welcome the exposure which numbers bring to the muddle which words may obscure”. D Walker (1972)
7: Tackle problems not otherwise soluble. Hopefully better science.
8: Fun!
“General impressions are never to be trusted. Unfortunately when they are of long standing they become fixed rules of life, and assume a prescriptive right not to be questioned. Consequently those who are not accustomed to original inquiry entertain a hatred and a horror of statistics. They cannot endure the idea of submitting their sacred impressions to cold-blooded verification. But it is the triumph of scientific men to rise superior to their superstitions, to desire tests by which the value of their beliefs may be ascertained, and to feel sufficiently masters of themselves to discard contemptuously whatever may be found untrue.” Francis Galton Quoted from Quotes, Damned Quotes and... compiled by J Bibby Edinburgh: John Bibby (Books)
TYPES OF VARIABLES
1) Numeric, quantitative, continuous variables
2) Nominal and ordinal variables (qualitative multistate)
   Nominal: “disordered multistate” (e.g. red, white, blue)
   Ordinal: “ordered multistate” (e.g. dry, moist, wet)
3) Binary or dichotomous variables, +/– (e.g. male, female)
4) Conditionally present variables
   e.g. 3 species: A, B, C. Only A & B have petals: A pink petals, B white petals.
                  A   B   C
   Pink petals    +   -   -
   White petals   -   +   -
   No petals      -   -   +
   (nominal, disordered)
5) Mixed data – see Lecture 12
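The petal example amounts to recoding one nominal (disordered multistate) variable as a set of presence/absence rows. A minimal sketch of that recoding follows; the function name `indicator_table` and the dictionary layout are assumptions made for illustration, while the species and states are those of the slide.

```python
# Petal state of each species, taken from the example: C has no petals,
# so petal colour is only defined for A and B.
petal_state = {"A": "pink petals", "B": "white petals", "C": "no petals"}
states = ["pink petals", "white petals", "no petals"]


def indicator_table(state_by_object, states):
    """Code a nominal multistate variable as +/- rows, one row per state."""
    objects = sorted(state_by_object)
    return {
        state: ["+" if state_by_object[obj] == state else "-" for obj in objects]
        for state in states
    }


table = indicator_table(petal_state, states)
for state in states:
    print(f"{state:<14}", " ".join(table[state]))
```

Treating "no petals" as a state of its own is one way to keep species C in the analysis without inventing a petal colour for it.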
GEOMETRICAL MODELS
Pollen data - 2 pollen types x 15 samples. Depths are in centimetres, and the units for pollen frequencies may be either in grains counted or percentages. Adam (1970)
Palynological representation Geometrical representation ALTERNATE REPRESENTATIONS OF THE POLLEN DATA In (a) the data are plotted as a standard diagram, and in (b) they are plotted using the geometric model. Units along the axes may be either pollen counts or percentages. Adam (1970)
Geometrical model of a vegetation space containing 52 records (stands). A: A cluster within the cloud of points (stands) occupying vegetation space. B: 3-dimensional abstract vegetation space: each dimension represents an element (e.g. proportion of a certain species) in the analysis (X Y Z axes). A, the results of a classification approach (here attempted after ordination) in which similar individuals are grouped and considered as a single cell or unit. B, the results of an ordination approach in which similar stands nevertheless retain their unique properties and thus no information is lost (X 1 Y 1 Z 1 axes). N. B. Abstract space has no connection with real space from which the records were initially collected.
Concept of Similarity, Dissimilarity, Distance and Proximity
$s_{ij}$ – how similar object i is to object j
Proximity measure: dissimilarity coefficient (DC) or similarity coefficient (SC)
Dissimilarity = Distance
Convert between $s_{ij}$ and $d_{ij}$: $s_{ij} = C - d_{ij}$, where C is a constant
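The conversion s_ij = C - d_ij can be illustrated with a short sketch. Two things here are assumptions and not part of the slide: Euclidean distance is used as the dissimilarity measure, and C is set to the largest observed distance so that the least similar pair gets a similarity of zero. The three objects are invented points.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two objects (an assumed choice of d_ij)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(objects):
    """Square matrix of pairwise distances d_ij."""
    return [[euclidean(a, b) for b in objects] for a in objects]


def to_similarity(d, C=None):
    """Convert distances to similarities with s_ij = C - d_ij.

    If C is not given, the largest distance is used (an assumption).
    """
    if C is None:
        C = max(max(row) for row in d)
    return [[C - d_ij for d_ij in row] for row in d]


# Three hypothetical objects described by two variables each.
objects = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d = distance_matrix(objects)
s = to_similarity(d)
```

With this choice of C, each object has the maximum similarity with itself (d_ii = 0, so s_ii = C), and similarity falls as distance grows.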
COMPUTING
In the 10 practicals, we mainly use R, a free, open-source statistical-computing environment, rather than specific commercial packages such as MINITAB or SYSTAT. Relatively steep learning curve but worth it. Recommend Fox (2002) An R and S-PLUS companion to applied regression (Sage), Crawley (2005) Statistics – An introduction using R (Wiley), Crawley (2007) The R Book (Wiley), Everitt (2005) An R and S-PLUS companion to multivariate analysis (Springer), and Verzani (2005) Using R for introductory statistics (Chapman Hall/CRC) as excellent guides. Will also use specialised software for specific methods (e.g. TWINSPAN, CANOCO and CANODRAW, C2, ZONE, etc.). Computing practicals are an integral and essential part of the course.
COURSE POWERPOINTS
In some of the lectures, some of the slides are rather technical. They are included for the sake of completeness on the topic under discussion. They are for reference only and are marked REF