Inferential Statistics

Descriptive statistics deal with the characteristics of available data. Not all statisticians support the use of significance tests. Many suggest that more useful and less misleading information is available from confidence limits. If you wish to read a critique of the use of significance tests, see Nester; you are also strongly advised to read Johnson (1999). A comprehensive debate about significance tests can be found in Harlow et al. (1997).
Significance testing

The methodology is common to all statistical tests; all that differs is the identity of the test statistic. You will need to think clearly and carefully about the following. It is important that you understand the logic underlying significance tests.

Begin by specifying a pair of hypotheses.
For example:

The phrase 'no significant difference' indicates that although a difference may be observed it is not meaningful. This could mean that:
(a) differences have arisen by chance (experimental error) rather than as a consequence of our experimental treatments, or
(b) the effect was so small that it is biologically unimportant.

Begin the analysis by assuming that the Ho is true. This is analogous to the presumption of innocence in a trial. If the Ho is true, what is the biggest difference between the sample means that can occur by chance at a reasonable level of probability? In other words, how different can we expect the means to become simply as a result of chance? If the observed difference is less than this chance amount, we have insufficient evidence for an effect, since the difference could have arisen by chance. In such circumstances we must conclude that there is no significant difference between the two mean values. Again, this is analogous to finding someone innocent when there is insufficient evidence of guilt.

Many people find
the concept of statistical significance confusing. Often it is a matter of phrasing; for example, Michael Wood from the AMS Department, University of Portsmouth, suggested that it may be better if a statistically significant result was described as "surprising if the null hypothesis is true".

Another difficulty is that statistical significance does not automatically imply practical significance. For example, using a large enough sample size we may be able to demonstrate that a particular drug decreases blood pressure by 1%. Although this is a real drop in blood pressure it is of no practical value.

It is impossible
to define an absolute value for the effect of sampling variation, but an effect size can be defined that has a specified probability of occurring. The specified probability is called the level of significance and is symbolised by alpha. Alpha is assigned a value in advance; otherwise our decision could still be subjective. Alpha may be thought of as a definition of 'improbability': anything with a probability less than or equal to alpha is deemed improbable and therefore unlikely to occur by chance. Any event whose probability is greater than alpha is said to be probable.

To understand how we arrive at our value for alpha we must first examine the logic of the analysis. There are 5 steps to perform.
According to the rules of significance testing we establish a strict criterion for rejection of the null hypothesis. If P is less than alpha we can reject Ho and accept Ha. If we reject Ho we are reasonably confident that a real difference exists, i.e. there is evidence 'beyond reasonable doubt' to presume guilt.
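The logic described so far (assume Ho is true, ask how large a difference between sample means chance alone could produce, then compare P with alpha) can be sketched with a permutation test. This is only one possible illustration and is not the method prescribed by the text: the function names `permutation_p_value` and `decide`, the measurements, the number of shuffles and the random seed are all assumptions made up for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Estimate P: the chance, if Ho is true, of a difference between the
    two sample means at least as large as the one observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Under Ho the group labels are arbitrary, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


def decide(p, alpha=0.05):
    """Apply the rejection rule: reject Ho only if P is less than alpha."""
    return "reject Ho" if p < alpha else "fail to reject Ho"


# Hypothetical measurements for two treatments (made up for illustration).
treatment_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2]
treatment_b = [4.2, 4.8, 4.4, 5.0, 4.1, 4.6, 4.9, 4.3]

p = permutation_p_value(treatment_a, treatment_b)
print(f"P = {p:.4f}: {decide(p)}")
```

Shuffling the pooled values mimics a world in which Ho is true, so the proportion of shuffles that match or exceed the observed difference is an estimate of P. Note that alpha is fixed before the data are examined, as the text requires.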
Because we are working with probabilities we can never be certain that we have reached the correct conclusion (as in a court case). If alpha is 0.05 (5%) we will reject Ho if P is less than 5%. Suppose we find that P = 0.04 (4%), i.e. there is a 4% chance of obtaining results at least as extreme as ours if Ho is true. According to our rules we will reject Ho since P < alpha. But an event that has a 4% chance of occurring should occur, on average, 1 in 25 times. Therefore we may have falsely rejected Ho because an event with a 4% probability has occurred. Indeed, whenever Ho is true we expect to make such a mistake 1 in 20 times (0.05 is 5% or 1 in 20).

Your response
to this obvious problem may be to set alpha to some very low value such as 0.001 (0.1%) to overcome the problem of rejecting a true Ho. However, now we have the opposite problem: we may fail to reject Ho even though it is false!

Falsely rejecting a true Ho is called a TYPE I ERROR (finding an innocent person guilty). When Ho is true, the probability of committing a type I error is equal to alpha. Failure to reject a false Ho is called a TYPE II ERROR (finding a guilty person innocent). It is more difficult to calculate the probability of committing a type II error as it depends upon the power of the test.

When we write down a Ho it is either a true statement or a false statement. The purpose of the statistical test is to decide between these two alternatives. The outcome of our statistical test will be one of four possibilities: a true Ho is not rejected (a correct decision), a true Ho is rejected (a type I error), a false Ho is rejected (a correct decision), or a false Ho is not rejected (a type II error).
A compromise value
of alpha is needed which takes account of the chances of making type
I and II errors. For most biological purposes 0.05 (5%) is used. In
certain circumstances it may be possible to attach 'costs' to the
two types of error, for example when a type I error could result in
damage to patients. If it is possible to cost the two errors the value
of alpha can be adjusted, for example we may wish to reduce alpha
to minimise the possibility of damaging patients.

An example

Depending on the outcome of our experiment a rat eradication programme will be started. What are the consequences of type I and type II errors in this situation? If Ho is true the rats have no significant effect on the puffin. A type I error would therefore start an eradication programme that is not needed, whereas a type II error would leave the rats in place even though they are affecting the puffin. Which of these mistakes would you rather make?

Note that your conclusion will not be independent of the sample size. This is because the power of the analysis is related to the sample size. If the sample size is small you may only be able to detect a large effect, whereas a very large sample would detect a very small effect (a real but possibly inconsequential effect). So perhaps we should begin with the question 'what size of effect do you wish to detect?'.

SUMMARY

FINALLY

A
very small p value (e.g. 0.0001) does not signify a large effect; it signifies that the observed data are highly improbable given the null hypothesis. A very small p value can arise when an effect is tiny but the sample sizes are large; conversely, a larger p value can arise when the effect is large but the sample size is small. In other words, the magnitude of p is at least partly dependent on the power of the test.
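The dependence of p on both effect size and sample size can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not the test used in the text: it assumes a two-sided z test comparing two group means, equal group sizes, and a known standard deviation of 1 in each group. The function names `two_sample_z_p_value` and `power`, and the effect sizes and sample sizes chosen, are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def two_sample_z_p_value(observed_diff, n_per_group, sd=1.0):
    """Two-sided p value for an observed difference between two sample
    means, assuming a known standard deviation sd in both groups."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = abs(observed_diff) / standard_error
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def power(true_diff, n_per_group, alpha=0.05, sd=1.0):
    """Probability of rejecting Ho when the true difference between the
    population means is true_diff (same known-sd assumption)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = true_diff / standard_error
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)


# A tiny effect with very large samples versus a large effect with tiny samples.
p_tiny_effect = two_sample_z_p_value(0.05, n_per_group=20_000)
p_large_effect = two_sample_z_p_value(1.0, n_per_group=4)
print(f"tiny effect, n = 20000 per group: p = {p_tiny_effect:.2g}")
print(f"large effect, n = 4 per group:    p = {p_large_effect:.2g}")

# With no true difference the rejection rate equals alpha (type I error rate);
# with a true difference, 1 - power is the type II error rate.
print(f"rejection rate when Ho is true: {power(0.0, 4):.3f}")
print(f"power, large effect, n = 4:     {power(1.0, 4):.3f}")
print(f"power, tiny effect, n = 20000:  {power(0.05, 20_000):.3f}")
```

Under these assumptions the tiny effect yields by far the smaller p value, simply because the large samples shrink the standard error. This is why the size of p on its own says little about the size of the effect.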