Detection and Accommodation of Outliers in Normally Distributed Data Sets

by Agata Fallon and Christine Spada


Table of contents

  1. Introduction
  2. Sources of outliers
  3. Managing outliers
  4. Labeling outliers
  5. Accommodating outliers
  6. Detecting outliers
  7. Conclusions
  8. Variables
  9. Links to related topics
  10. References


Introduction

Anyone who has heard the statement ‘You cannot just throw data away without an explanation or a reason’ is familiar with the headaches caused by outliers. Outliers are observations that appear to be inconsistent with the remainder of the collected data (Iglewicz, 1993). The term outlier is used collectively for discordant observations and for contaminants. A discordant observation is an observation that appears surprising or discrepant to the investigator (Iglewicz, 1993). A contaminant is an observation from a different distribution than the rest of the data. Contaminants may or may not be noticed by the investigator (Barnett, 1984). Figure 1 shows some examples of outliers.


Figure 1. Examples of outliers. Note: outliers are indicated in red.

Sources of outliers

Possible sources of outliers are recording and measurement errors, an incorrect distribution assumption, unknown data structure, or a novel phenomenon (Iglewicz, 1993). Recording and measurement errors are often the first suspected source of outliers. An incorrect assumption about the data distribution can lead to mislabeling data as outliers. Data that do not fit well into the assumed distribution may fit well into a different distribution, as shown in Figure 2.


Figure 2. The effect of assumed data distribution on the presence of apparent outliers.

Unknown data structure and correlations can cause apparent outliers. A data set could be made up of subsets that are subject to different mechanisms and should be analyzed independently of each other, as shown in Figure 3.


Figure 3. Effect of data subsets on the presence of apparent outliers

A data set indicative of a novel phenomenon is often labeled as an outlier. For example, the measurements indicating the existence of the hole in the ozone layer were initially thought to be outliers and were automatically discarded. This oversight delayed the discovery of the phenomenon by several years (Berthouex, 1994).


Managing outliers

There are two ways of managing data outliers. In the laboratory, good record keeping for each experiment is recommended. All data should be recorded with any possible explanation or additional information. In the data analysis, robust statistical methods are recommended. These methods are minimally affected by outliers and are introduced in a later part of this report.


Labeling outliers

The first step in data analysis is to label suspected outliers for further study. Three different methods are available to the investigator for normally distributed data: the z-score method, the modified z-score method, and the boxplot method (Iglewicz, 1993; Barnett, 1984). These techniques are based on robust regression methods. All of the experimental observations are standardized, and the standardized values outside a predetermined bound are labeled as outliers (Rousseeuw, 1987).

In a z-score test, the mean and standard deviation of the entire data set are used to obtain a z-score for each data point, according to the following formula:

$$z_i = \frac{x_i - \bar{x}}{s}$$

where $x_i$ is the i-th observation, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.

A test heuristic states that an observation whose z-score has an absolute value greater than three should be labeled as an outlier. This method is not a reliable way of labeling outliers since both the mean and the standard deviation are affected by the outliers.
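
The z-score labeling rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample data are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed. The sample also illustrates the weakness noted above. With ten observations the largest possible |z| is (n - 1)/sqrt(n), roughly 2.85, so even a gross outlier cannot exceed the bound of three.

```python
import statistics


def z_scores(data):
    """Return the classical z-score (x - mean) / s for every observation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation, n - 1 denominator
    return [(x - mean) / s for x in data]


def label_by_z_score(data, bound=3.0):
    """Return the observations whose |z| exceeds the bound."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > bound]


# Hypothetical sample: nine similar readings and one gross error.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 2.3, 2.1, 2.2, 50.0]
flagged = label_by_z_score(sample)
print(flagged)  # prints [] because the outlier inflates both mean and s
```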

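In a modified z-score test, the z-score is determined from outlier-resistant estimators. The median of the absolute deviations about the median (MAD) is such an estimator:

$$\mathrm{MAD} = \operatorname{median}_i \left( \lvert x_i - \tilde{x} \rvert \right)$$

where $\tilde{x}$ is the sample median. The MAD is calculated and used in place of the standard deviation in the z-score calculation, as shown below in Figure 4.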


Figure 4. Modified z-score calculation.


The test heuristic states that an observation whose modified z-score has an absolute value greater than 3.5 should be labeled as an outlier. This is a reliable test since the parameters used to calculate the modified z-score are minimally affected by the outliers.
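
The modified z-score rule can be sketched as follows. The exact formula from Figure 4 is not reproduced in this text, so the sketch assumes the commonly quoted form M_i = 0.6745 (x_i - median) / MAD; the constant 0.6745, the function names, and the sample data are assumptions made for illustration.

```python
import statistics


def modified_z_scores(data, k=0.6745):
    """Return k * (x - median) / MAD for every observation.

    k = 0.6745 is an assumed scaling constant (see the lead-in above).
    """
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [k * (x - med) / mad for x in data]


def label_by_modified_z_score(data, bound=3.5):
    """Return the observations whose |modified z| exceeds the bound."""
    scores = modified_z_scores(data)
    return [x for x, m in zip(data, scores) if abs(m) > bound]


# Hypothetical sample: nine similar readings and one gross error.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 2.3, 2.1, 2.2, 50.0]
flagged = label_by_modified_z_score(sample)
print(flagged)  # prints [50.0]
```

Because the median and the MAD barely move when one value is grossly wrong, the suspect reading stands out clearly here, whereas its ordinary z-score on the same sample stays below three.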

In a boxplot test, a graphical representation of the data is used to label an observation as an outlier, as shown in Figure 5. A box is drawn around the interquartile range. A line inside the box indicates the median value. Error bars are drawn at the 5% and the 95% confidence intervals. Any data outside the error bars are plotted as single points and labeled as possible outliers.


Figure 5. Example of a boxplot.

Accommodating outliers

Outliers can sometimes be accommodated in the data analysis. This process prevents the outliers from biasing the estimated population parameters. Some ways of accommodating outliers are the use of trimmed means, scale estimators, or confidence intervals. In calculating a trimmed mean, a fixed percentage of the data is dropped from each end of the ordered data set, and the mean is calculated for the remaining data. This trimming drops the outliers from the data and often increases the efficiency of estimating the population mean. In calculating scale estimators, the median of the absolute deviations about the sample median (MAD) is used to calculate a measure of variability in the sample. This measure of variability is resistant to outliers and can be used in place of the standard deviation, as was done in the modified z-score test. For example, for normally distributed data the standard deviation can be estimated as:

$$s_{\mathrm{MAD}} = \frac{\mathrm{MAD}}{0.6745}$$
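
The trimmed mean and the MAD-based scale estimate described above can be sketched as follows. The trimming fraction of 10%, the function names, the sample data, and the normal-consistency constant 0.6745 are assumptions chosen for illustration.

```python
import statistics


def trimmed_mean(data, fraction=0.1):
    """Mean after dropping `fraction` of the observations from each end."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(fraction * len(ordered))  # number dropped from each end
    return statistics.mean(ordered[k:len(ordered) - k])


def mad_scale(data, k=0.6745):
    """Outlier-resistant estimate of the standard deviation, MAD / k."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return mad / k


# Hypothetical sample: nine similar readings and one gross error.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 2.3, 2.1, 2.2, 50.0]
print(statistics.mean(sample), trimmed_mean(sample))
print(statistics.stdev(sample), mad_scale(sample))
```

On this sample the ordinary mean is pulled to about 7 and the standard deviation to about 15, while the trimmed mean stays near 2.2 and the MAD-based scale near 0.15.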



The confidence interval can be adjusted using the Winsorized variance to minimize the effect of the outliers. This type of variance utilizes the trimmed mean in place of the population mean.


Detecting outliers

There are numerous tests for identifying outliers. Four common outlier tests for normal distributions are the Rosner test, the Dixon test, the Grubbs test, and the boxplot rule. These techniques are based on hypothesis testing rather than regression methods.

Rosner's Test

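Rosner’s Test for detecting up to k outliers can be used when the number of data points is 25 or more. This test identifies outliers that are both high and low; it is therefore always two-tailed (Gibbons, 1994). The data are ranked in ascending order and the mean and standard deviation are determined. The procedure entails removing from the data set the observation, x, that is farthest from the mean. Then a test statistic, R, is calculated:

$$R_{i+1} = \frac{\lvert x^{(i)} - \bar{x}^{(i)} \rvert}{s^{(i)}}$$

where $\bar{x}^{(i)}$ and $s^{(i)}$ are the mean and standard deviation of the data remaining after the $i$ most extreme observations have been removed, and $x^{(i)}$ is the remaining observation farthest from $\bar{x}^{(i)}$.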

The R statistic is then compared with a critical value (Gilbert, 1987). The null hypothesis, stating that the data fit a normal distribution, is then tested. If R is less than the critical value, the null hypothesis cannot be rejected, and no outliers are detected. If R is greater than the critical value, the null hypothesis is rejected and the presence of k outliers is accepted. This test can also be used with log-normally distributed data when the logarithms of the data are used for computation.
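
The sequence of R statistics can be sketched as below. This is only an illustration under stated assumptions: the function name and the sample data are hypothetical, the sample standard deviation (n - 1 denominator) is assumed and should be checked against the convention of the critical-value table in use, and the critical values themselves are not reproduced here.

```python
import statistics


def rosner_statistics(data, k):
    """Return [(removed_value, R), ...] for the k most extreme observations.

    At each step R = |x - mean| / s is computed for the observation farthest
    from the mean of the remaining data, and that observation is then removed.
    """
    remaining = list(data)
    if k < 1 or len(remaining) - k < 2:
        raise ValueError("k must be at least 1 and leave two or more points")
    results = []
    for _ in range(k):
        mean = statistics.mean(remaining)
        s = statistics.stdev(remaining)  # assumed convention, see lead-in
        if s == 0:
            raise ValueError("remaining observations are all identical")
        extreme = max(remaining, key=lambda x: abs(x - mean))
        results.append((extreme, abs(extreme - mean) / s))
        remaining.remove(extreme)
    return results


# Hypothetical sample of 25 readings, one of which is suspect.
readings = [10.0 + 0.1 * (i % 5) for i in range(24)] + [25.0]
results = rosner_statistics(readings, k=2)
for value, r in results:
    print(value, round(r, 3))
```

Each R in the returned list would then be compared with the tabulated critical value for the corresponding step, sample size, and significance level.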


Dixon's Test

Dixon’s test is generally used for detecting a small number of outliers (Gibbons, 1994). This test can be used when the sample size is between 3 and 25 observations. The data are ranked in ascending order; then, based on the sample size, the tau statistic for the highest value or the lowest value is computed.



The tau statistic is compared to a critical value at a chosen value of alpha (Gibbons, 1994). If the tau statistic is less than the critical value, the null hypothesis is not rejected, and the conclusion is that no outliers are present. If the tau statistic is greater than the critical value, the null hypothesis is rejected, and the conclusion is that the most extreme value is an outlier. To check for other outliers, the Dixon test can be repeated; however, the power of this test decreases as the number of repetitions increases.
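
The equations for the tau statistic are not reproduced in this text. The sketch below assumes the commonly tabulated Dixon ratios, in which the gap between the suspect value and a near neighbor is divided by a range that depends on the sample size (3 to 7, 8 to 10, 11 to 13, and 14 to 25 observations). These sample-size bands, the function name, and the example data are assumptions and should be checked against the table of critical values being used.

```python
def dixon_statistic(data, test_highest=True):
    """Return the assumed Dixon ratio for the highest or lowest observation."""
    x = sorted(data)
    n = len(x)
    if not 3 <= n <= 25:
        raise ValueError("Dixon's test is used here for 3 to 25 observations")
    if not test_highest:
        # Mirror the data so the lowest value can be treated as the highest.
        x = sorted(-v for v in x)
    if n <= 7:
        gap, spread = x[-1] - x[-2], x[-1] - x[0]
    elif n <= 10:
        gap, spread = x[-1] - x[-2], x[-1] - x[1]
    elif n <= 13:
        gap, spread = x[-1] - x[-3], x[-1] - x[1]
    else:
        gap, spread = x[-1] - x[-3], x[-1] - x[2]
    if spread == 0:
        raise ValueError("observations are identical; the ratio is undefined")
    return gap / spread


# Hypothetical sample of five readings.
small_sample = [1.0, 2.0, 3.0, 4.0, 10.0]
tau_high = dixon_statistic(small_sample)                      # (10 - 4) / (10 - 1)
tau_low = dixon_statistic(small_sample, test_highest=False)   # (2 - 1) / (10 - 1)
print(tau_high, tau_low)
```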

Boxplot Rule

The boxplot rule is a visual test to inspect for outliers; see Figure 5 for an example of a boxplot. The interquartile range is enclosed in a box, and the 5% and 95% confidence intervals are indicated with error bars outside of the box. Values that lie outside of the confidence interval are possible outliers (Iglewicz, 1993).

95% confidence interval limit: and 5% confidence interval limit:

Grubbs' Test

Grubbs’ test is recommended by the EPA as a statistical test for outliers (US EPA, 1992). The EPA suggests taking the logarithms of environmental data, which are often log-normally distributed. The data are ranked in ascending order and the mean and standard deviation are calculated. The lowest or highest data point can be tested as an outlier.


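The tau statistic for the smallest value is:

$$\tau = \frac{\bar{x} - x_1}{s}$$

The tau statistic for the largest value is:

$$\tau = \frac{x_n - \bar{x}}{s}$$

where $x_1$ and $x_n$ are the smallest and largest observations, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation.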

The tau statistic is compared with a critical tau value for the sample size and selected alpha (Taylor, 1987). If the tau statistic is greater than the critical tau value, the null hypothesis is rejected and the conclusion is that the datum under consideration is an outlier. Figure 6 shows an example of a Grubbs' test.


Figure 6. Grubbs' test calculation.
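
The Grubbs' statistics for the lowest and highest observations can be sketched as follows. The function name, the optional `use_logs` flag (for the log transformation mentioned above), and the sample data are hypothetical, and the critical tau values are not reproduced here.

```python
import math
import statistics


def grubbs_statistics(data, use_logs=False):
    """Return (tau_low, tau_high) for the smallest and largest observations."""
    # Optionally work on logarithms for log-normally distributed data.
    values = [math.log(x) for x in data] if use_logs else list(data)
    if len(values) < 3:
        raise ValueError("at least three observations are needed")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("observations are identical; tau is undefined")
    tau_low = (mean - min(values)) / s
    tau_high = (max(values) - mean) / s
    return tau_low, tau_high


# Hypothetical sample: nine similar readings and one suspect high value.
sample = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 2.3, 2.1, 2.2, 50.0]
tau_low, tau_high = grubbs_statistics(sample)
print(round(tau_low, 3), round(tau_high, 3))
```

Here the statistic for the largest value is about 2.85, which would then be compared with the tabulated critical tau for ten observations at the selected alpha.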

Conclusions

All of the statistical tests discussed above are used to determine whether experimental observations are statistical outliers in normally distributed data sets. If an observation is statistically determined to be an outlier, the EPA suggests determining an explanation for this outlier before its exclusion from further analysis (US EPA, 1992). If an explanation cannot be found, then the observation should be treated as an extreme but valid measurement and it should be retained in further analysis (US EPA, 1992).

The tests for normal data sets are easy to use and powerful; however, the tests for non-normal data are more difficult and not as powerful (Iglewicz, 1993). Some of these tests are included in Barnett and Lewis (1984). In many situations the data can be transformed to approximate a normal distribution and can then be analyzed using the techniques presented above (Iglewicz, 1993).

Variables

Links to related topics at other Web pages

  1. Outliers from Multivariate Statistics: a Practical Guide
  2. Outliers and Data Having Undue Influence
  3. Outliers and Influence
  4. Novel Graphical Model for Identification of Outliers in a Time Series
  5. Q-Test

References

  1. Barnett, V. and Lewis, T.: 1984, Outliers in Statistical Data, John Wiley & Sons, New York.
  2. Berthouex, P. M. and Brown, L. C.: 1994, Statistics for Environmental Engineers, CRC Press, London.
  3. U.S. Environmental Protection Agency: 1992, Statistical Training Course for Ground-Water Monitoring Data Analysis, EPA/530-R-93-003, Office of Solid Waste, Washington, DC.
  4. Gilbert, R. O.: 1987, Statistical Methods for Environmental Pollution Monitoring, Van Nostrand Reinhold, New York.
  5. Gibbons, R. D.: 1994, Statistical Methods for Groundwater Monitoring, John Wiley & Sons, New York.
  6. Iglewicz, B. and Hoaglin, D. C.: 1993, How to Detect and Handle Outliers, American Society for Quality Control, Milwaukee, WI.
  7. Rousseeuw, P. J. and Leroy, A. M.: 1987, Robust Regression and Outlier Detection, John Wiley & Sons, New York.
  8. Taylor, J. K.: 1987, Quality Assurance of Chemical Measurements, Lewis Publishers, Chelsea, MI.



Sampling & Monitoring Primer Table of Contents

Send comments or suggestions to:
Student Authors: Agata Fallon, afallon@vt.edu, and Christine Spada, cspada@vt.edu
Faculty Advisor: Daniel Gallagher, dang@vt.edu
Copyright © 1997 Daniel Gallagher
Last Modified: 09-10-1997