Anyone who has heard the statement ‘You cannot just throw data away without an explanation or a reason’ is familiar with the headaches caused by outliers. Outliers are observations that appear to be inconsistent with the remainder of the collected data (Iglewicz, 1993). The term outlier is used collectively for discordant observations and for contaminants. A discordant observation is defined as an observation that appears surprising or discrepant to the investigator (Iglewicz, 1983). A contaminant is defined as an observation from a different distribution than the rest of the data. Contaminants may or may not be noticed by the investigator (Barnett, 1984). Figure 1 shows some examples of outliers.

Possible sources of outliers are recording and measurement errors, an incorrect distributional assumption, unknown data structure, or a novel phenomenon (Iglewicz, 1993). Recording and measurement errors are often the first suspected source of outliers. An incorrect assumption about the data distribution can lead to mislabeling data as outliers. Data that do not fit well into the assumed distribution may fit well into a different distribution, as shown in Figure 2.
Unknown data structure and correlations can cause apparent outliers. A data set could be made up of subsets, which are subject to different mechanisms and should be analyzed independently of each other, as shown in Figure 3.
A data set indicative of a novel phenomenon can often be labeled as an outlier. For example, the measurements indicating the existence of the hole in the ozone layer were initially thought to be outliers and were automatically discarded. This oversight delayed the discovery of the phenomenon by several years (Berthouex, 1994).
There are two ways of managing data outliers. In the laboratory, good record keeping for each experiment is recommended. All data should be recorded together with any possible explanation or additional information. In the data analysis, robust statistical methods are recommended. These methods are minimally affected by outliers and will be introduced in a later part of this report.
The first step in data analysis is to label suspected outliers for further study. Three different methods are available to the investigator for normally distributed data: the z-score method, the modified z-score method, and the boxplot method (Iglewicz, 1993; Barnett, 1984). These techniques are based on robust regression methods. All of the experimental observations are standardized, and the standardized values outside a predetermined bound are labeled as outliers (Rousseeuw, 1987).
In a z-score test, the mean and standard deviation of the entire data set are used to obtain a z-score for each data point, according to the following formula:
$$z_i = \frac{x_i - \bar{x}}{s}$$
where $x_i$ is an observation, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.
A test heuristic states that an observation with a z-score greater than three in absolute value should be labeled as an outlier. This method is not a reliable way of labeling outliers, since both the mean and the standard deviation are affected by the outliers.
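The weakness of the z-score heuristic can be seen in a minimal Python sketch. The function names and the sample values below are made up for illustration and do not come from the primer; the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def z_scores(data):
    """Return the z-score of every observation, using the sample mean
    and the sample standard deviation (n - 1 denominator)."""
    m = mean(data)
    s = stdev(data)
    return [(x - m) / s for x in data]


def label_by_z_score(data, cutoff=3.0):
    """Return the observations whose absolute z-score exceeds the cutoff."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > cutoff]


# Illustrative (made-up) measurements; 9.5 is visibly far from the rest.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2, 9.5]
print(label_by_z_score(sample))  # prints []
```

The suspicious value 9.5 is not labeled, because it inflates both the mean and the standard deviation used to standardize it. With the n - 1 standard deviation, the largest possible absolute z-score in a sample of n points is (n - 1)/sqrt(n), which is about 2.47 for n = 8, so the cutoff of three can never be reached in a sample this small.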
In a modified z-score test, the z-score is determined from outlier-resistant estimators. The median absolute deviation about the median (MAD) is such an estimator.
The MAD is calculated and used in place of the standard deviation in the z-score calculations, as shown below in Figure 4.
The test heuristic states that an observation with a modified z-score greater than three and a half in absolute value should be labeled as an outlier. This is a reliable test, since the parameters used to calculate the modified z-score are minimally affected by the outliers.
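A minimal Python sketch of the modified z-score follows. It assumes the usual Iglewicz and Hoaglin form, M_i = 0.6745 (x_i - median) / MAD; the constant 0.6745 is taken from that formulation rather than from the text above, and the function names and sample values are made up for illustration.

```python
from statistics import median


def modified_z_scores(data):
    """Return modified z-scores based on the median and the MAD.

    Assumed form: M_i = 0.6745 * (x_i - median) / MAD.
    """
    med = median(data)
    mad = median([abs(x - med) for x in data])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in data]


def label_by_modified_z_score(data, cutoff=3.5):
    """Return the observations whose absolute modified z-score exceeds the cutoff."""
    scores = modified_z_scores(data)
    return [x for x, m in zip(data, scores) if abs(m) > cutoff]


# Illustrative (made-up) measurements; 9.5 is visibly far from the rest.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2, 9.5]
print(label_by_modified_z_score(sample))  # prints [9.5]
```

Because the median and the MAD barely move when a single extreme value is present, the value 9.5 receives a very large modified z-score and is labeled, while the remaining points stay well below the cutoff.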
In a boxplot test, the graphical representation of the data is used to label an observation as an outlier, as shown in Figure 5. A box is drawn around the interquartile range. A line inside the box indicates the median value. Error bars are drawn at the 5% and the 95% confidence intervals. Any data outside the error bars are plotted as single points and labeled as possible outliers.
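The quantities behind such a plot can be sketched in Python. This sketch assumes that the error bars correspond to the 5th and 95th percentiles of the data, which is one reading of the description above and may not match Figure 5. The function name, the percentile interpolation method ("inclusive"), and the sample values are illustrative assumptions.

```python
from statistics import quantiles


def boxplot_summary(data):
    """Return the box, the error-bar positions, and the points outside them.

    Assumption: the error bars sit at the 5th and 95th percentiles.
    """
    ordered = sorted(data)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    twentieths = quantiles(ordered, n=20, method="inclusive")
    low, high = twentieths[0], twentieths[-1]  # 5th and 95th percentiles
    flagged = [x for x in ordered if x < low or x > high]
    return {"q1": q1, "median": med, "q3": q3,
            "low": low, "high": high, "flagged": flagged}


# Illustrative (made-up) data set of 21 observations.
sample = [12, 14, 15, 15, 16, 16, 17, 17, 17, 18, 18,
          18, 19, 19, 20, 20, 21, 21, 22, 23, 41]
summary = boxplot_summary(sample)
print(summary["flagged"])  # prints [12, 41]
```

With percentile-based error bars, the most extreme values fall outside the bars by construction, so the flagged points are only candidates for further study, not confirmed outliers.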

Outliers can sometimes be accommodated in the data analysis. This process prevents the outliers from biasing the estimated population parameters. Some ways of accommodating outliers are the use of trimmed means, scale estimators, or confidence intervals. In the calculation of a trimmed mean, a fixed percentage of the data is dropped from each end of the ordered data set, and the mean is calculated for the remaining data. This trimming drops the outliers from the data and will often increase the efficiency of estimating the population mean. In the calculation of scale estimators, the median of the absolute deviations about the sample median (MAD) is used to calculate a measure of variability in the sample. This measure of variability is resistant to outliers and can be used in place of the standard deviation, as was done in the modified z-score test.
The confidence interval can be adjusted using the Winsorized variance to minimize the effect of the outliers. This type of variance utilizes the trimmed mean in place of the population mean.
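The trimmed mean and the MAD described in this section can be sketched in Python as follows. The function names, the integer `percent` parameter, and the sample values are illustrative assumptions. The Winsorized variance is not shown, because its exact definition varies between references.

```python
from statistics import mean, median


def trimmed_mean(data, percent=10):
    """Drop `percent` percent of the ordered data from each end and
    average what remains. `percent` is an integer to avoid rounding surprises."""
    ordered = sorted(data)
    k = (percent * len(ordered)) // 100  # observations dropped per end
    if 2 * k >= len(ordered):
        raise ValueError("trimming would remove every observation")
    return mean(ordered[k:len(ordered) - k])


def mad(data):
    """Median of the absolute deviations about the sample median."""
    med = median(data)
    return median([abs(x - med) for x in data])


# Illustrative (made-up) measurements with one extreme value.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2, 2.1, 2.4, 9.5]
print(mean(sample))              # ordinary mean, pulled up by 9.5 (about 2.95)
print(trimmed_mean(sample, 10))  # 10% trimmed mean (about 2.25)
print(mad(sample))               # MAD (about 0.15)
```

Dropping one observation from each end removes the extreme value, so the trimmed mean sits near the bulk of the data, and the MAD reflects the spread of the typical observations rather than the distance to 9.5.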
There are numerous tests for identifying outliers. Four common outlier tests for normal distributions are the Rosner test, the Dixon test, the Grubbs test, and the boxplot rule. These techniques are based on hypothesis testing rather than regression methods.
Rosner's Test
Rosner’s Test for detecting up to k outliers can be used when the number of data points is 25 or more. This test identifies outliers that are both high and low; it is therefore always two-tailed (Gibbons, 1994). The data are ranked in ascending order and the mean and standard deviation are determined. The procedure entails removing from the data set the observation, x, that is farthest from the mean. Then a test statistic, R, is calculated:
The R statistic is then compared with a critical value (Gilbert, 1987). The null hypothesis, stating that the data fit a normal distribution, is then tested. If R is less than the critical value, the null hypothesis cannot be rejected, and no outliers are indicated. If R is greater than the critical value, the null hypothesis is rejected and the presence of k outliers is accepted. This test can also be used with log-normally distributed data, when the logarithms of the data are used for computation.
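The sequence of R statistics can be sketched in Python. The sketch assumes the commonly cited form R = |x - mean| / s, where the mean and standard deviation are recomputed from the observations still remaining at each step and x is the remaining observation farthest from that mean. Using the sample standard deviation (n - 1 denominator) is an assumption; the convention should match whichever table of critical values is used. The function name and data are illustrative, and no critical values are computed here.

```python
from statistics import mean, stdev


def rosner_statistics(data, k):
    """Return [(R_1, x_1), ..., (R_k, x_k)]: at each step, the statistic for
    the most extreme remaining observation, which is then removed."""
    if k < 1 or len(data) - k < 2:
        raise ValueError("k must leave at least two observations")
    remaining = list(data)
    results = []
    for _ in range(k):
        m = mean(remaining)
        s = stdev(remaining)  # assumed convention: n - 1 denominator
        extreme = max(remaining, key=lambda x: abs(x - m))
        results.append((abs(extreme - m) / s, extreme))
        remaining.remove(extreme)
    return results


# Illustrative (made-up) data set of 25 observations with two high values.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9, 5.2, 5.0, 4.8,
          5.1, 5.0, 4.9, 5.3, 5.0, 5.1, 4.8, 5.2, 5.0, 4.9, 8.9, 9.6]
for r, x in rosner_statistics(sample, k=2):
    print(f"candidate {x}: R = {r:.3f}")
```

Each R value would then be compared with the tabulated critical value for the sample size, the number of suspected outliers, and the chosen significance level.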
Dixon's Test
Dixon’s test is generally used for detecting a small number of outliers (Gibbons, 1994). This test can be used when the sample size is between 3 and 25 observations. The data are ranked in ascending order; then, based on the sample size, the tau statistic for the highest value or the lowest value is computed.
The tau statistic is compared to a critical value at a chosen value of alpha (Gibbons, 1994). If the tau statistic is less than the critical value, the null hypothesis is not rejected, and the conclusion is that no outliers are present. If the tau statistic is greater than the critical value, the null hypothesis is rejected, and the conclusion is that the most extreme value is an outlier. To check for other outliers, the Dixon test can be repeated; however, the power of this test decreases as the number of repetitions increases.
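A Python sketch of the tau statistic is given below. It assumes the commonly tabulated Dixon ratios and sample-size breakpoints (3 to 7, 8 to 10, 11 to 13, and 14 to 25 observations); these breakpoints are not stated in the text above and should be checked against the reference that supplies the critical values. The function name and data are illustrative.

```python
def dixon_tau(data, test_highest=True):
    """Return Dixon's ratio for the highest (or lowest) observation.

    Assumed ratios, written for the highest value of sorted data x[0..n-1]:
      n = 3..7   : (x[n-1] - x[n-2]) / (x[n-1] - x[0])
      n = 8..10  : (x[n-1] - x[n-2]) / (x[n-1] - x[1])
      n = 11..13 : (x[n-1] - x[n-3]) / (x[n-1] - x[1])
      n = 14..25 : (x[n-1] - x[n-3]) / (x[n-1] - x[2])
    """
    x = sorted(data)
    n = len(x)
    if not 3 <= n <= 25:
        raise ValueError("Dixon's test is used here for 3 to 25 observations")
    if not test_highest:
        # Mirror the data so that the lowest value plays the role of the highest.
        x = [-v for v in reversed(x)]
    if n <= 7:
        gap, skip = 1, 0
    elif n <= 10:
        gap, skip = 1, 1
    elif n <= 13:
        gap, skip = 2, 1
    else:
        gap, skip = 2, 2
    denominator = x[-1] - x[skip]
    if denominator == 0:
        raise ValueError("range is zero; the ratio is undefined")
    return (x[-1] - x[-1 - gap]) / denominator


# Illustrative (made-up) sample of seven observations.
sample = [2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 3.9]
print(round(dixon_tau(sample, test_highest=True), 3))   # prints 0.789
print(round(dixon_tau(sample, test_highest=False), 3))  # prints 0.053
```

The resulting ratio is compared with the tabulated critical value for the same sample size and alpha.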
Boxplot Rule
The boxplot rule is a visual test to inspect for outliers; see Figure 5 for an example of a boxplot. The interquartile range is enclosed in a box, and the 5% and 95% confidence intervals are indicated with error bars outside of the box. Values that lie outside of the confidence interval are possible outliers (Iglewicz, 1993).
95% confidence interval limit:
and 5% confidence interval limit:
Grubbs' Test
Grubbs’ test is recommended by the EPA as a statistical test for outliers (US EPA, 1992). The EPA suggests taking the logarithms of environmental data, which are often log-normally distributed. The data are ranked in ascending order and the mean and standard deviation are calculated. The lowest or highest data point can be tested as an outlier.
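The tau statistic for the smallest value is:
$$\tau = \frac{\bar{x} - x_{(1)}}{s}$$
The tau statistic for the largest value is:
$$\tau = \frac{x_{(n)} - \bar{x}}{s}$$
where $x_{(1)}$ and $x_{(n)}$ are the smallest and largest ranked observations, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.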
The tau statistic is compared with a critical tau value for the sample size and selected alpha (Taylor, 1987). If the tau statistic is greater than the critical tau, the null hypothesis is rejected and the conclusion is that the datum under consideration is an outlier. Figure 6 shows an example of a Grubbs' test.
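The two Grubbs' statistics can be sketched in Python. The sketch assumes the standard form, in which the distance of the smallest or largest observation from the sample mean is divided by the sample standard deviation. The function name, the `use_logs` option (which mirrors the EPA suggestion to take logarithms), and the data are illustrative assumptions. Critical values are not computed here.

```python
from math import log
from statistics import mean, stdev


def grubbs_tau(data, use_logs=False):
    """Return (tau_smallest, tau_largest) for the ranked data.

    If use_logs is True, natural logarithms of the (positive) data are
    taken first, as suggested for log-normally distributed data.
    """
    values = [log(x) for x in data] if use_logs else list(data)
    ordered = sorted(values)
    m = mean(ordered)
    s = stdev(ordered)  # sample standard deviation (n - 1 denominator)
    return (m - ordered[0]) / s, (ordered[-1] - m) / s


# Illustrative (made-up) measurements; 9.5 is visibly far from the rest.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.3, 2.2, 9.5]
tau_low, tau_high = grubbs_tau(sample)
print(round(tau_low, 3), round(tau_high, 3))  # prints 0.436 2.472
```

Each statistic is then compared with the tabulated critical tau for the sample size and the selected alpha.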
All of the statistical tests discussed above are used to determine whether experimental observations are statistical outliers in normally distributed data sets. If an observation is statistically determined to be an outlier, the EPA suggests determining an explanation for this outlier before its exclusion from further analysis (US EPA, 1992). If an explanation cannot be found, then the observation should be treated as an extreme but valid measurement and should be included in further analysis (US EPA, 1992).
The tests for normal data sets are easy to use and powerful; however, the tests for non-normal data are more difficult and not as powerful (Iglewicz, 1993). Some of these tests are included in Barnett and Lewis (1984). In many situations the data can be transformed to approximate a normal distribution and can then be analyzed using the techniques presented above (Iglewicz, 1993).
Send comments or suggestions to:
Student Authors: Agata Fallon, afallon@vt.edu, and Christine Spada, cspada@vt.edu
Faculty Advisor: Daniel Gallagher, dang@vt.edu
Copyright © 1997 Daniel Gallagher
Last Modified: 09-10-1997