This unit is being developed to provide guidance for Masters students from the Department of Biological Sciences at Manchester Metropolitan University. It may also be a useful resource for other postgraduate, and final year undergraduate, students.
If you have arrived here from another institution please let us know what you think - in particular any suggestions for improvements/corrections would be useful. I am particularly keen to collaborate with colleagues from other institutions who would like to extend the content.
Other OnLine courses are also available from our Web site.
The unit is studied as a core unit on three of our MSc courses and is available as an option for other students who are registered on the Faculty's network MSc course. This includes the department's other MSc courses.
| Course | Contact |
| Behavioural Ecology | Dr. Matthew Sullivan |
| Biomedical Sciences | Ms. Joyce Overfield |
| Conservation Biology | Dr. Martin Jones |
| Stress Management | Professor Terry Looker |
It is assumed that students will have some understanding of the basics of statistical analyses and interpretation. A set of univariate revision notes is available for reference and updating. Using this background knowledge as a foundation the unit has the following aims.
After completing this unit students should be able to:
Normally it is expected that this unit will be taught as a mixture of lectures / tutorials / terminal classes and independent learning using these web-based tutorials. It should also be possible to undertake the entire unit as a distance-learning module. All students are advised to keep in regular email, telephone or written contact with the course team, particularly if they are having problems. It is assumed that each student will spend 120 hours on this unit, including time taken up by assessments and extra reading. It is expected that the assessments will take at least 30 hours to complete.
The recommended route through the material is to study the methods in the order:
Recommended for purchase
Kinnear, P. R. and Gray, C. D. 2000. SPSS for Windows made simple. Psychology Press, Andover - £14.95 - an excellent and very clear book. (www.psypress.co.uk)
Others
Chatfield, C. and Collins, A. J. 1980. Introduction to multivariate analysis. Science Paperbacks.
Field, A. 2000. Discovering Statistics using SPSS for Windows. Sage Publications, London. - an excellent comprehensive text about a wide range of 'difficult analyses.'
Flury, B. and Riedwyl, H. 1988. Multivariate statistics: a practical approach. Chapman and Hall.
Jongman, R. H. et al. 1995. Data analysis in community and landscape ecology. Pudoc Wageningen.
Legendre, P. and Legendre, L. 1998. Numerical Ecology (2nd English Edition). Elsevier, Amsterdam.
Tabachnick, B. G. and Fidell, L. S. 1996. Using multivariate statistics. 3rd edition. Harper.
The following Web sites contain links to free or shareware software, most of which is relevant to multivariate analyses.
Some general web resources
The makers of STATISTICA (a commercial software package) have a very useful set of notes about many statistical methods, including some that are only briefly covered in this course.
Pierre Legendre's (Université de Montréal) site has links to many useful programs (particularly those involving spatial analyses). Much of this software is written for Apple Mac computers, but there are also some Windows versions.
PopTools is a very versatile Excel add-in from CSIRO. In addition to Mantel tests it also incorporates a range of matrix methods and resampling techniques.
The ADE-4 site is an online multivariate statistical package. You submit your data, it does the analyses and returns your results. You can also download the entire package to run on your own computer.
The R package is a free, open-source 'clone' of the very powerful S-Plus package. Although it is very powerful it is not for the faint-hearted! Using it betrays its Unix heritage. If you wish to find a version for the Mac or PC go to the binaries link. Note that this is a completely different R statistics package from the one distributed from Pierre Legendre's site!
Warren Kovach's MVSP software does most common multivariate analyses, including cluster analysis, PCA and PCO. The Windows version also does CA and CCA. This is shareware software but you can try it before you buy it.
The SISA site has links to lots of biomedically orientated software.
Contact me: Alan Fielding
Biological Sciences, Manchester Metropolitan University, Manchester, M1 5GD, United Kingdom
(telephone 0161 247 1198)
Multivariate Techniques: What are they?
By their very nature many biological systems are inherently multifactorial and we may need to examine many variables in order to understand our particular system. Multivariate statistical techniques have been developed to deal with situations in which you have two or more variables that you wish to analyse simultaneously. They can be placed in two broad categories:
In regression methods the value of a single response variable is assumed to be a function of a set of predictor variables. For example we may wish to predict:
| Response (single variable) | Predictors (>1) |
| number of species | climate, soils, disturbance |
| amount of lead in body tissue | traffic volume and distance from a road |
| bone mineral density in survivors from childhood cancer | age, weight, height, diet |
| probability of death | blood pressure, gender and diet |
In ordination methods the primary aim is one of dimension reduction. Consider a data table that has n rows (e.g. cases or sites) and p columns (variables). A table such as this can be compressed in one or two directions.

Two general categories of methods have been recognised:
The main aim of the geometrical methods is compression of the variables. The reason for this approach is that studies often collect data for many variables. A large number of variables is difficult to process and assimilate. Many of the variables will be correlated. Consequently it may be possible to combine them into a small number of groups (derived variables) which relate to more abstract features, for example increasing human influence or 'lifestyle'. These derived variables can be used as the primary measures in subsequent analyses. A variety of methods have been employed to obtain these new variables, mainly
Clustering and partitioning methods are used to group cases on the basis of their similarity over a range of variables. The main examples of these techniques come under the general heading of cluster analysis. Many clustering algorithms are available; they differ with respect to the method used to measure similarities (or dissimilarities) and the points between which distances are measured. Thus, although clustering algorithms are objective, there is scope for subjectivity in the selection of an algorithm. The most common clustering algorithms are polythetic agglomerative, i.e. a series of increasingly large clusters is formed by the fusion of smaller clusters on the basis of more than one variable. A problem with hierarchical methods is that they are computer-intensive, so large data sets may be difficult to analyse. A less computer-intensive approach is the nonhierarchical k-means, or iterative relocation, algorithm. Each case is initially placed in one of k clusters; cases are then moved between clusters if the move reduces the differences between cases within a cluster.
The Ordination Web site has a lot of useful background about the application of ordination methods to ecological problems.
Multiple Regression (A regression method)
If the dependent variable is continuous, or at least not binary or categorical, multiple regression is usually employed to investigate the nature of the relationship between the response variable and its predictors. Multiple regression attempts to fit, in the simplest case, a plane to the predictors such that the error (as measured by a sum of squares) between observed and predicted y values is minimised. R² (the coefficient of determination) provides a measure of the overall fit between observed and predicted values of y. The resultant relationship is described by an equation such as
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$
If a coefficient $b_i$ is significantly different from 0 it provides a measure of the rate of change in y for each one-unit increase in $x_i$, with the other predictors held constant. The main problems with multiple regression relate to the validity of the assumptions, the interpretation of the b coefficients and the selection of an appropriate model.
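The least-squares fit described above can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available; the function name and the data are hypothetical, and the example values were generated from a known equation so that the recovered coefficients can be checked.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    Returns (coefficients, r_squared); coefficients[0] is the intercept b0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 is estimated along with the slopes.
    design = np.column_stack([np.ones(len(y)), X])
    # lstsq minimises the sum of squared differences between observed and predicted y.
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ coefs
    ss_error = np.sum((y - predicted) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return coefs, 1.0 - ss_error / ss_total

# Hypothetical data generated exactly from y = 2 + 3*x1 - 1*x2 (no noise).
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 + 3 * x1 - x2 for x1, x2 in X]
coefs, r2 = fit_multiple_regression(X, y)
print(np.round(coefs, 3), round(r2, 3))
```

Because the example data contain no error, the fitted coefficients match the generating equation and R² is 1; real data would give a lower R² and coefficients with standard errors, which this sketch does not compute.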
Logistic Regression (A regression method)
If the dependent variable has only two possible values, for example 0 and 1 (this could be any binary variable, such as gender, in which one value (e.g. male) is assigned 0 and the other is assigned a value of 1), methods such as multiple regression become invalid since predicted values of y would not be constrained within the 0 and 1 limits. Discriminant analysis can be used in such circumstances. However discriminant analysis will only produce optimal solutions if its assumptions are supported by the data. An alternative approach is logistic regression. In logistic regression the dependent variable is the probability that an event will occur, hence y is constrained between 0 and 1. The logistic model is written as:
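$$\text{Prob(event)} = \frac{1}{1 + e^{-z}}$$
where $z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$. The logistic equation can be rearranged by converting the probability into a log odds or logit:
$$\log\left[\frac{\text{Prob(event)}}{\text{Prob(no event)}}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$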
This produces a relationship similar to that in multiple regression except that each one-unit change in a predictor is associated with a change in the log odds rather than in the response directly, which complicates the interpretation of the coefficients. Different types of response model can be investigated with this approach; for example, if squared predictors (quadratic terms) are included the response is assumed to be Gaussian rather than sigmoidal. As with multiple regression it is also possible to test a range of models by applying stepwise inclusion or elimination of predictors.
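The relationship between the probability and the log odds can be checked numerically. The short Python sketch below uses the standard logistic function; the coefficient values and function names are hypothetical and are not estimates from any real data set.

```python
import math

def linear_predictor(b, x):
    """z = b0 + b1*x1 + ... + bp*xp, where b[0] is the intercept."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

def probability(b, x):
    """Logistic model: Prob(event) = 1 / (1 + exp(-z)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(b, x)))

def logit(p):
    """Log odds: log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

b = [-1.0, 0.5, 0.25]      # hypothetical b0, b1, b2
x = [2.0, 4.0]             # hypothetical predictor values, so z = 1.0
p = probability(b, x)

# Taking the logit of the probability recovers the linear predictor z.
z_recovered = logit(p)

# Raising x1 by one unit changes the log odds by b1, not the probability by b1.
change_in_log_odds = logit(probability(b, [3.0, 4.0])) - logit(p)
print(p, z_recovered, change_in_log_odds)
```

Fitting the coefficients themselves (usually by maximum likelihood) is a separate step that statistical packages perform; this sketch only shows how fitted coefficients translate into probabilities and log odds.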
Discriminant Analysis (A regression method)
Identifying the features which are responsible for splitting a set of observations into two or more groups, such as nest and non-nest sites, is a common biological problem. If we have information about individual sampling units, obtained from a number of variables, it is reasonable to ask if these variables can be used to define the groups. Discriminant analysis works by combining the variables in such a way that the differences between the predefined groups are maximised. It also provides a classification rule (an equation or discriminant function) that can be used with other cases to predict which group they belong to.
Discriminant analysis can be considered to be a special case of regression analysis where the response variable identifies group membership, for example used and unused habitat. It is possible, at least when there are only two groups, to use normal regression analysis programs to carry out discriminant analysis.
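For two groups, the discriminant function can be sketched as Fisher's linear discriminant. The Python example below is a minimal illustration, assuming NumPy, equal prior probabilities and a common within-group covariance matrix; the nest and non-nest measurements are invented for the example.

```python
import numpy as np

def fisher_discriminant(group_a, group_b):
    """Two-group linear discriminant function.

    Returns (weights, cutoff): a case x is assigned to group A when
    x @ weights > cutoff, otherwise to group B.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    # Pooled within-group covariance matrix.
    pooled = ((len(a) - 1) * np.cov(a, rowvar=False)
              + (len(b) - 1) * np.cov(b, rowvar=False)) / (len(a) + len(b) - 2)
    # Weights that maximise the separation of group means relative to within-group spread.
    weights = np.linalg.solve(pooled, mean_a - mean_b)
    # Midpoint between the projected group means (assumes equal prior probabilities).
    cutoff = weights @ (mean_a + mean_b) / 2.0
    return weights, cutoff

# Hypothetical habitat measurements (two variables per site).
nest = [[5.0, 2.0], [6.0, 3.0], [7.0, 2.5], [6.5, 3.5]]
non_nest = [[2.0, 4.0], [3.0, 5.0], [2.5, 6.0], [3.5, 5.5]]
weights, cutoff = fisher_discriminant(nest, non_nest)

# Classification rule applied to a new, unclassified site.
new_site = np.array([6.0, 3.0])
is_nest = new_site @ weights > cutoff
print(weights, cutoff, is_nest)
```

The weights play the role of the discriminant function coefficients, and the cutoff gives the classification rule that can be applied to other cases.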
Principal Components Analysis (An ordination method)
PCA is a geometrical ordination method which is used to compress a set of variables into a smaller number of derived variables or components. It is used to pick out patterns in the relationships between the variables in such a way that most of the original information can be represented by a reduced number of new variables. A useful metaphor is to think of a photograph. This is a 2-dimensional representation of a 3-dimensional object. As long as an appropriate camera angle is chosen little information about the subject will be lost. Thus, the original 3 dimensions can be compressed into 2 dimensions with little information loss.
As with all statistical techniques there are assumptions about the data. The main one is that the derived components are normally distributed and uncorrelated (orthogonal). If PCA is being used to test statistical hypotheses the assumptions should be valid. The assumptions are less important when PCA is used as a descriptive and exploratory tool. In practice, if the principal components are normally distributed the assumptions may be considered valid. A more useful yardstick by which a PCA should be judged is 'do the results make biological sense?' The results of a bad one do not!
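The compression from three variables to two can be illustrated with a small Python sketch. It assumes NumPy and extracts the components from the correlation matrix (an assumption made here; a covariance matrix could be used instead). The data are hypothetical, with the first two variables deliberately made almost redundant.

```python
import numpy as np

def principal_components(data):
    """PCA via eigen-decomposition of the correlation matrix.

    Returns (scores, explained): the component scores for each case and the
    proportion of the total variance carried by each component, largest first.
    """
    data = np.asarray(data, dtype=float)
    # Standardise each variable so that measurement units do not dominate.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]        # eigh returns ascending order
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = z @ eigenvectors                    # new, uncorrelated variables
    return scores, eigenvalues / eigenvalues.sum()

# Hypothetical data: 6 cases, 3 variables; variables 1 and 2 are strongly correlated.
data = [[2.0, 4.1, 1.0], [3.0, 6.2, 0.5], [4.0, 7.9, 1.5],
        [5.0, 10.1, 0.8], [6.0, 12.0, 1.2], [7.0, 13.8, 0.9]]
scores, explained = principal_components(data)
print(np.round(explained, 3))
```

Because two of the three variables carry nearly the same information, the first two components retain almost all of the variance, which is the numerical counterpart of the photograph metaphor.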
Correspondence Analysis (An ordination method)
CA is an ordination method which aims to simultaneously compress the rows and columns of a data table to achieve a single, simultaneous geometrical representation of objects and variables. CA has been applied widely in plant ecology. In these studies a table of species by sites is subjected to analysis. The table does not contain environmental data such as soil pH, etc. It is hoped that CA will reorganise the table such that environmental gradients can be identified and used to interpret the species distributions.
The CA axes have associated eigenvalues, in the range 0 - 1. Generally only those axes with eigenvalues over 0.5 can be said to show good separation between sites and species. The eigenvalues can be thought of as correlations between row and column scores. An eigenvalue of 1.0 implies that one sample (or group of samples) shares no species with all other samples. A modified analysis called Detrended Correspondence Analysis (DCA) is one of the most widely used ordination techniques. The lengths of the DCA axes are a useful measure of beta diversity.
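The statement that an eigenvalue of 1.0 arises when a group of samples shares no species with the rest can be checked with a small Python sketch. It assumes NumPy and computes a simple (not detrended) CA from the singular value decomposition of the standardised residuals, which is one common way of doing the calculation; the abundance table is invented for the example.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA of a sites-by-species abundance table.

    Returns (eigenvalues, site_scores, species_scores), with axes ordered
    from the largest eigenvalue to the smallest.
    """
    counts = np.asarray(table, dtype=float)
    p = counts / counts.sum()                    # proportions of the grand total
    r = p.sum(axis=1)                            # row (site) masses
    c = p.sum(axis=0)                            # column (species) masses
    # Standardised residuals from the independence model r * c.
    resid = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    u, singular, vt = np.linalg.svd(resid, full_matrices=False)
    eigenvalues = singular ** 2                  # each lies between 0 and 1
    site_scores = u / np.sqrt(r)[:, None]
    species_scores = vt.T / np.sqrt(c)[:, None]
    return eigenvalues, site_scores, species_scores

# Hypothetical table: sites 1-2 and sites 3-4 have no species in common.
table = [[10, 5, 0, 0],
         [8, 6, 0, 0],
         [0, 0, 7, 3],
         [0, 0, 2, 9]]
eigenvalues, site_scores, species_scores = correspondence_analysis(table)
print(np.round(eigenvalues, 3))
```

In this table the first axis has an eigenvalue of 1 and simply separates the two pairs of sites; the remaining axes describe the weaker patterns within each pair.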
Canonical Correspondence Analysis (An ordination method)
The best ordination techniques reduce dimensions by detecting the most 'important' dimensions (or gradients) in a data set and ignoring 'noise' or unexplained variation. Most early ordination techniques were indirect and 'exploratory'. It was your job to work out what environmental gradients had been detected by the analysis. CCA is different because it allows statistical hypotheses to be tested. In many respects it can be considered as a special case of multiple regression where the y variable is an ordination axis representing a particular species or sites assemblage. As with all statistical tests a null hypothesis must be clearly stated and data must be collected in a repeatable manner.
For example, the habitat association problem, when applied to bird communities, is to discover how the bird assemblage responded to its habitat. Data are collected on species composition and the environmental variables at a number of points in space and time. CCA is a method that is able to detect unimodal relationships between the species and habitat variables, for example between species and altitude. It is particularly useful if the number of response variables is large compared to the number of cases. The relationship between the species and the environmental variables can be tested statistically. The effect of a particular environmental variable can be tested after elimination of possible effects of other (environmental) variables by specifying the latter as covariables.
Hierarchical clustering (An ordination method)
Cases are assigned to clusters on the basis of some similarity or distance measure. Clusters are organised into hierarchies, such that a tree of increasingly different clusters is formed.
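The fusion process can be sketched as follows. This Python example assumes NumPy, Euclidean distances and average linkage; these are choices made for the illustration, and other distance measures and linkage rules would give different trees. The coordinates are hypothetical.

```python
import numpy as np

def agglomerate(data):
    """Average-linkage agglomerative clustering on Euclidean distances.

    Returns the fusions as (members_a, members_b, distance) tuples, in the
    order in which they were made.
    """
    points = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(points))]     # every case starts alone
    fusions = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean distance over all pairs of cases drawn from the two clusters.
                d = np.mean([np.linalg.norm(points[a] - points[b])
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        fusions.append((clusters[i], clusters[j], float(d)))
        merged = clusters[i] + clusters[j]
        # Replace the two fused clusters by the single merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return fusions

# Hypothetical cases measured on two variables.
data = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
fusions = agglomerate(data)
for members_a, members_b, distance in fusions:
    print(members_a, members_b, round(distance, 2))
```

Reading the fusions from first to last gives the tree from its tips to its root. Every pass compares all remaining pairs of clusters, which is why hierarchical methods become slow for large data sets.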
k-means clustering (An ordination method)
Cases are clustered into k (a value set by the user) nonhierarchical classes. The method is conceptually simple and iterative, but it is less computer-intensive than hierarchical clustering, so it is useful for large datasets that may be difficult to analyse using a hierarchical method. It can also be used to assign cases to classes if the class characteristics (e.g. class means) are known.
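A minimal Python sketch of the idea is given below. It assumes NumPy and uses the common batch version of the algorithm, in which all cases are reassigned to their nearest class mean and the means are then recalculated; statistical packages may use a different relocation rule or starting configuration. The data, function name and parameters are hypothetical.

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    """Batch k-means: returns (labels, centres) for k classes."""
    points = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct cases chosen at random as the initial class centres.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each case to its nearest centre (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculate each centre as the mean of its cases; keep the old
        # centre if a class happens to be empty.
        new_centres = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):
            break                                  # no further relocation
        centres = new_centres
    return labels, centres

# Hypothetical data: two well-separated groups of three cases.
data = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]
labels, centres = k_means(data, k=2)
print(labels)
print(centres)
```

If the class means are already known, the assignment step alone (nearest centre) can be used to allocate new cases to classes. Which class receives which label number depends on the random starting cases.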