# How to Use Statistics to Identify Outliers in Data

When modeling, it is important to clean the data sample to ensure that the observations best represent the problem.

Sometimes a dataset can contain extreme values that are outside the range of what is expected and unlike the other data. These are called outliers, and machine learning modeling and model skill in general can often be improved by understanding and even removing these outlier values.

In this tutorial, you will discover more about outliers and two statistical methods that you can use to identify and filter outliers from your dataset.

After completing this tutorial, you will know:

- That an outlier is an unlikely observation in a dataset and may have one of many causes.
- That standard deviation can be used to identify outliers in Gaussian or Gaussian-like data.
- That the interquartile range can be used to identify outliers in data regardless of the distribution.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- What are Outliers?
- Test Dataset
- Standard Deviation Method
- Interquartile Range Method

## What are Outliers?

An outlier is an observation that is unlike the other observations.

It is rare, or distinct, or does not fit in some way.

Outliers can have many causes, such as:

- Measurement or input error.
- Data corruption.
- True outlier observation (e.g. Michael Jordan in basketball).

There is no precise way to define and identify outliers in general because of the specifics of each dataset. Instead, you, or a domain expert, must interpret the raw observations and decide whether a value is an outlier or not.

Nevertheless, we can use statistical methods to identify observations that appear to be rare or unlikely given the available data.

This does not mean that the values identified are outliers and should be removed. But, the tools described in this tutorial can be helpful in shedding light on rare events that may require a second look.

A good tip is to consider plotting the identified outlier values, perhaps in the context of non-outlier values, to see if there is any systematic relationship or pattern to the outliers. If there is, perhaps they are not outliers and can be explained, or perhaps the outliers themselves can be identified more systematically.

## Test Dataset

Before we look at outlier identification methods, let’s define a dataset we can use to test the methods.

We will generate a population of 10,000 random numbers drawn from a Gaussian distribution with a mean of 50 and a standard deviation of 5.

Numbers drawn from a Gaussian distribution will have outliers. That is, by virtue of the distribution itself, there will be a few values that will be a long way from the mean, rare values that we can identify as outliers.

We will use the *randn()* function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range.

The pseudorandom number generator is seeded to ensure that we get the same sample of numbers each time the code is run.

```python
# generate gaussian data
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# summarize
print('mean=%.3f stdv=%.3f' % (mean(data), std(data)))
```

Running the example generates the sample and then prints the mean and standard deviation. As expected, the values are very close to the expected values.

```
mean=50.049 stdv=4.994
```

## Standard Deviation Method

If we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can use the standard deviation of the sample as a cut-off for identifying outliers.

The Gaussian distribution has the property that the standard deviation from the mean can be used to reliably summarize the percentage of values in the sample.

For example, the range within one standard deviation of the mean covers about 68% of the data.

So, if the mean is 50 and the standard deviation is 5, as in the test dataset above, then all data in the sample between 45 and 55 will account for about 68% of the data sample. We can cover more of the data sample if we expand the range as follows:

- 1 Standard Deviation from the Mean: 68%
- 2 Standard Deviations from the Mean: 95%
- 3 Standard Deviations from the Mean: 99.7%

A value that falls outside of 3 standard deviations is part of the distribution, but it is an unlikely or rare event at approximately 1 in 370 samples.
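The coverage figures above follow from the Gaussian cumulative distribution function. As a minimal sketch (the helper name `coverage` is hypothetical and not part of the tutorial's code), the fraction of a Gaussian that lies within *k* standard deviations of the mean is `erf(k / sqrt(2))`, which can be checked with the Python standard library:

```python
# fraction of a Gaussian within k standard deviations of the mean
from math import erf
from math import sqrt

def coverage(k):
    # hypothetical helper: P(|X - mean| <= k * stdv) for a Gaussian X
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    inside = coverage(k)
    # expected number of values beyond the cut-off in a sample of 10,000
    expected_outside = 10000 * (1 - inside)
    print('k=%d coverage=%.4f expected outside=%.1f' % (k, inside, expected_outside))
```

For *k* = 3 the fraction outside the limits is about 0.0027, which is where the figure of roughly 1 in 370 comes from.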

Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples of data, perhaps a value of 2 standard deviations (95%) can be used, and for larger samples, perhaps a value of 4 standard deviations (about 99.99%) can be used.

Let’s make this concrete with a worked example.

Sometimes, the data is standardized first (e.g. to a Z-score with zero mean and unit variance) so that the outlier detection can be performed using standard Z-score cut-off values. This is a convenience and is not required in general, and we will perform the calculations in the original scale of the data here to make things clear.
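To illustrate the standardization option, here is a minimal sketch that converts the test data to Z-scores and applies a cut-off of 3 directly to the standardized values. The variable names `z_scores` and `z_outliers` are assumptions made for this illustration.

```python
# identify outliers using z-scores (standardized values)
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
from numpy import absolute
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# standardize to zero mean and unit variance
z_scores = (data - mean(data)) / std(data)
# flag values more than 3 standard deviations from the mean
z_outliers = data[absolute(z_scores) > 3]
print('Identified outliers: %d' % len(z_outliers))
```

Because the Z-score is just a rescaling of the data, this should flag the same observations as working in the original scale, apart from possible floating-point differences exactly at the boundary.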

We can calculate the mean and standard deviation of a given sample, then calculate the cut-off for identifying outliers as more than 3 standard deviations from the mean.

```python
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# identify outliers
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
```

We can then identify outliers as those examples that fall outside of the defined lower and upper limits.

```python
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
```

Alternately, we can filter out those values from the sample that are not within the defined limits.

```python
# remove outliers
outliers_removed = [x for x in data if x > lower and x < upper]
```

We can put this all together with our sample dataset prepared in the previous section.

The complete example is listed below.

```python
# identify outliers with standard deviation
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# identify outliers
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if x > lower and x < upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```

Running the example will first print the number of identified outliers and then the number of observations that are not outliers, demonstrating how to identify and filter out outliers respectively.

```
Identified outliers: 29
Non-outlier observations: 9971
```
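As an alternative to the list comprehensions, the same limits can be applied with a NumPy boolean mask, which keeps the results as arrays. This is a sketch only, and the name `is_outlier` is an assumption introduced here.

```python
# identify and remove outliers with a boolean mask
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate the limits as before
data_mean, data_std = mean(data), std(data)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
# True where a value falls outside the limits
is_outlier = (data < lower) | (data > upper)
# select outliers and non-outliers with the mask and its inverse
outliers = data[is_outlier]
outliers_removed = data[~is_outlier]
print('Identified outliers: %d' % len(outliers))
print('Non-outlier observations: %d' % len(outliers_removed))
```

One small difference from the list comprehension version is that a value exactly equal to a limit is kept here as a non-outlier, so the two groups always add up to the full sample.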

So far we have only talked about univariate data with a Gaussian distribution, i.e. a single variable. You can use the same approach if you have multivariate data, i.e. data with multiple variables, each with a different Gaussian distribution.

You can imagine bounds in two dimensions that would define an ellipse if you have two variables. Observations that fall outside of the ellipse would be considered outliers. In three dimensions, this would be an ellipsoid, and so on into higher dimensions.

Alternately, if you knew more about the domain, perhaps an outlier may be identified by exceeding the limits on one or a subset of the data dimensions.
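One way to realize the elliptical bound is the Mahalanobis distance, which measures how far an observation is from the mean in units that account for the covariance of the variables. This technique is not named in the tutorial; the sketch below uses it as an assumption, with a made-up two-variable dataset and a cut-off of 3 carried over from the univariate case.

```python
# flag multivariate outliers outside an ellipse (Mahalanobis distance)
import numpy as np

# assumed test data: two independent Gaussian variables
rng = np.random.default_rng(1)
X = np.column_stack((5 * rng.standard_normal(10000) + 50,
                     10 * rng.standard_normal(10000) + 100))
# sample mean vector and covariance matrix
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
# squared Mahalanobis distance of each row from the mean
diff = X - mu
d2 = np.sum((diff @ np.linalg.inv(cov)) * diff, axis=1)
# assumed cut-off: a distance of 3, borrowed from the univariate rule
is_outlier = np.sqrt(d2) > 3
print('Identified outliers: %d' % is_outlier.sum())
```

Note that a distance of 3 does not cover 99.7% of the data in more than one dimension. In two dimensions it covers roughly 98.9%, so the cut-off may need adjusting for the number of variables.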

## Interquartile Range Method

Not all data is normal or normal enough to treat it as being drawn from a Gaussian distribution.

A good statistic for summarizing a sample of data with a non-Gaussian distribution is the Interquartile Range, or IQR for short.

The IQR is calculated as the difference between the 75th and the 25th percentiles of the data and defines the box in a box and whisker plot.

Remember that percentiles can be calculated by sorting the observations and selecting values at specific indices. The 50th percentile is the middle value, or the average of the two middle values for an even number of examples. If we had 10,000 samples, then the 50th percentile would be the average of the 5000th and 5001st values.
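The sort-and-select description can be checked directly. The sketch below computes the 50th percentile of the test dataset by hand and compares it with NumPy's *percentile()* function; the variable names are assumptions for this illustration.

```python
# calculate the 50th percentile by sorting and selecting values
from numpy.random import seed
from numpy.random import randn
from numpy import sort
from numpy import percentile
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# sort the observations
ordered = sort(data)
n = len(ordered)
# average the 5000th and 5001st values (zero-based indices 4999 and 5000)
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0
print('manual=%.3f percentile()=%.3f' % (manual_median, percentile(data, 50)))
```

By default, *percentile()* interpolates linearly between neighbouring values when the requested percentile falls between two observations, which for the 50th percentile of an even-sized sample is the same as averaging the two middle values.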

We refer to these percentiles as quartiles (“*quart*” meaning 4) because they divide the data into four groups at the 25th, 50th and 75th percentiles.

The IQR defines the middle 50% of the data, or the body of the data.

The IQR can be used to identify outliers by defining limits on the sample values that are a factor *k* of the IQR below the 25th percentile or above the 75th percentile. The common value for the factor *k* is the value 1.5. A factor k of 3 or more can be used to identify values that are extreme outliers or “*far outs*” when described in the context of box and whisker plots.

On a box and whisker plot, these limits are drawn as fences on the whiskers (or the lines) that are drawn from the box. Values that fall outside of these limits are drawn as dots.
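As a rough sketch of how the factor *k* changes the limits, the hypothetical helper `iqr_limits()` below (not part of the tutorial's code) computes the fences for a given *k* and counts the values outside them for both *k* = 1.5 and *k* = 3.

```python
# compare IQR limits for different values of the factor k
from numpy.random import seed
from numpy.random import randn
from numpy import percentile

def iqr_limits(values, k=1.5):
    # hypothetical helper: limits k * IQR beyond the 25th and 75th percentiles
    q25, q75 = percentile(values, 25), percentile(values, 75)
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off

# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
for k in (1.5, 3.0):
    lower, upper = iqr_limits(data, k)
    # count the values that fall outside the limits
    n_outside = sum(1 for x in data if x < lower or x > upper)
    print('k=%.1f lower=%.3f upper=%.3f outside=%d' % (k, lower, upper, n_outside))
```

A larger *k* widens the limits, so it can only flag the same number of values or fewer.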

We can calculate the percentiles of a dataset using the *percentile()* NumPy function that takes the dataset and specification of the desired percentile. The IQR can then be calculated as the difference between the 75th and 25th percentiles.

```python
# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
```

We can then calculate the cutoff for outliers as 1.5 times the IQR and subtract this cut-off from the 25th percentile and add it to the 75th percentile to give the actual limits on the data.

```python
# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
```

We can then use these limits to identify the outlier values.

```python
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
```

We can also use the limits to filter out the outliers from the dataset.

```python
# remove outliers
outliers_removed = [x for x in data if x > lower and x < upper]
```

We can tie all of this together and demonstrate the procedure on the test dataset.

The complete example is listed below.

```python
# identify outliers with interquartile range
from numpy.random import seed
from numpy.random import randn
from numpy import percentile
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate interquartile range
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if x > lower and x < upper]
print('Non-outlier observations: %d' % len(outliers_removed))
```

Running the example first prints the identified 25th and 75th percentiles and the calculated IQR. The number of outliers identified is printed followed by the number of non-outlier observations.

```
Percentiles: 25th=46.685, 75th=53.359, IQR=6.674
Identified outliers: 81
Non-outlier observations: 9919
```

The approach can be used for multivariate data by calculating the limits on each variable in the dataset in turn, and taking outliers as observations that fall outside of the rectangle or hyper-rectangle.
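The per-variable limits described above can be sketched with NumPy by computing the percentiles along the column axis. The two-variable dataset and the names `X`, `is_outlier` and `X_removed` are assumptions made for this illustration.

```python
# identify multivariate outliers with per-variable IQR limits
import numpy as np

# assumed test data: two independent Gaussian variables
rng = np.random.default_rng(1)
X = np.column_stack((5 * rng.standard_normal(10000) + 50,
                     10 * rng.standard_normal(10000) + 100))
# calculate the interquartile range of each column
q25, q75 = np.percentile(X, 25, axis=0), np.percentile(X, 75, axis=0)
cut_off = (q75 - q25) * 1.5
# one lower and one upper limit per variable
lower, upper = q25 - cut_off, q75 + cut_off
# a row is an outlier if any variable falls outside its limits
is_outlier = ((X < lower) | (X > upper)).any(axis=1)
# keep only the rows inside the rectangle
X_removed = X[~is_outlier]
print('Identified outliers: %d' % is_outlier.sum())
print('Non-outlier observations: %d' % X_removed.shape[0])
```

Because a row is flagged when any single variable is out of range, the number of flagged rows tends to grow as more variables are added.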

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- Develop your own Gaussian test dataset and plot the outliers and non-outlier values on a histogram.
- Test out the IQR based method on a univariate dataset generated with a non-Gaussian distribution.
- Choose one method and create a function that will filter out outliers for a given dataset with an arbitrary number of dimensions.

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Articles

- Outlier on Wikipedia
- Anomaly detection on Wikipedia
- 68–95–99.7 rule on Wikipedia
- Interquartile range
- Box plot on Wikipedia

## Summary

In this tutorial, you discovered outliers and two statistical methods that you can use to identify and filter outliers from your dataset.

Specifically, you learned:

- That an outlier is an unlikely observation in a dataset and may have one of many causes.
- That standard deviation can be used to identify outliers in Gaussian or Gaussian-like data.
- That the interquartile range can be used to identify outliers in data regardless of the distribution.

The post How to Use Statistics to Identify Outliers in Data appeared first on Machine Learning Mastery.
