Outliers: The Data Rebels
Learn what outliers are, how to find them, and how to decide what to do when one appears in your data.
An outlier is a value in a dataset that is very different from the rest of the values. It can be much higher or much lower than the other values.
For example, in a list of student heights from a class, most heights might be between 150 cm and 175 cm. If one student is recorded as 250 cm, that value is an outlier: it does not fit with the others.
Why Outliers Happen
- A real but rare value: a genuine measurement that just happens to be unusual (a very tall student, a very expensive purchase).
- A data entry mistake: someone wrote 250 instead of 175, or added an extra zero.
- A measurement error: the tool was broken or used in the wrong way.
Not every outlier is a mistake. Some are real and useful. The goal is to notice them first, then decide what they mean.
One value that is very far from the others can change the mean (average) a lot. The median, on the other hand, is much less affected.
Imagine the daily allowance of 5 students (in EGP):
20, 25, 30, 30, 35
- Mean = (20 + 25 + 30 + 30 + 35) ÷ 5 = 28 EGP
- Median = middle value = 30 EGP
Now imagine one student typed an extra zero and wrote 350 instead of 35:
20, 25, 30, 30, 350
- Mean = (20 + 25 + 30 + 30 + 350) ÷ 5 = 91 EGP
- Median = middle value = 30 EGP (no change)
One outlier changed the mean from 28 to 91, more than three times as much! The median stayed exactly the same.
The mean is sensitive to outliers. The median is not. When outliers are present, the median is usually a safer way to describe the typical value.
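The allowance example can be checked with a few lines of Python. This is a minimal sketch that uses the standard `statistics` module; the variable names are chosen only for illustration.

```python
from statistics import mean, median

# Daily allowances in EGP, taken from the example above.
allowances = [20, 25, 30, 30, 35]
# The same list with the last value mistyped as 350.
with_typo = [20, 25, 30, 30, 350]

mean_before, median_before = mean(allowances), median(allowances)
mean_after, median_after = mean(with_typo), median(with_typo)

print("Mean:  ", mean_before, "->", mean_after)      # Mean:   28 -> 91
print("Median:", median_before, "->", median_after)  # Median: 30 -> 30
```

Only the mean moves when the mistyped value is added; the median stays at 30.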
To decide if a value is really an outlier, we can use a simple rule that uses the quartiles of the data. This is called the IQR Rule.
The IQR rule and the boxplot are not the only ways to find outliers. Statisticians also use methods like the Z-score, the Modified Z-score, and several others built into machine-learning libraries. In this topic, we focus only on the boxplot and the IQR rule because they are simple, visual, and among the most widely used in data analysis. As you grow in the field, you will meet the other methods too.
What Are Quartiles?
When we sort the data from smallest to largest, the quartiles split it into four equal parts:
- Q1 (first quartile): the value at the 25% mark.
- Q2 (second quartile): the value at the 50% mark. This is the median.
- Q3 (third quartile): the value at the 75% mark.
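As a rough illustration, the quartiles can be computed with Python's standard `statistics.quantiles` function. The dataset below is hypothetical and invented for this sketch. Different tools use slightly different conventions for finding the 25% and 75% marks, so other software may give slightly different values for Q1 and Q3 on the same data.

```python
from statistics import quantiles

# Hypothetical dataset, invented for illustration only.
values = [10, 20, 20, 30, 30, 40, 40, 50, 90]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" counts the smallest value as the 0% mark and the
# largest value as the 100% mark, and interpolates in between.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

print("Q1 =", q1)           # Q1 = 20.0
print("Q2 (median) =", q2)  # Q2 (median) = 30.0
print("Q3 =", q3)           # Q3 = 40.0
```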
The IQR (Interquartile Range)
The IQR is the distance between Q1 and Q3, so IQR = Q3 − Q1. It tells us how spread out the middle half of the data is.
The Outlier Limits (Fences)
We then build two limits, one below the data and one above it:
- Lower Limit = Q1 − (1.5 × IQR)
- Upper Limit = Q3 + (1.5 × IQR)
Any value smaller than the Lower Limit or larger than the Upper Limit is treated as an outlier.
Suppose a dataset has Q1 = 20 and Q3 = 40.
- IQR = 40 − 20 = 20
- Lower Limit = 20 − (1.5 × 20) = 20 − 30 = −10
- Upper Limit = 40 + (1.5 × 20) = 40 + 30 = 70
Any value below −10 or above 70 is an outlier. A value of 90, for example, is outside the upper limit, so it is an outlier.
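The same arithmetic can be written as two small functions. This is a sketch only: the names `iqr_fences` and `is_outlier` are made up for this example, and the multiplier 1.5 is the one used by the IQR rule in this topic.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_limit, upper_limit) for the IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3):
    """True if the value falls outside the two limits."""
    lower, upper = iqr_fences(q1, q3)
    return value < lower or value > upper


lower, upper = iqr_fences(20, 40)
print(lower, upper)             # -10.0 70.0
print(is_outlier(90, 20, 40))   # True
print(is_outlier(65, 20, 40))   # False
```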
A boxplot is a chart that shows Q1, Q2 (the median), and Q3 as a box, with lines (called whiskers) reaching out to the smallest and largest values that are still inside the limits. Any value outside the limits is drawn as a separate dot, and that dot is an outlier.
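The numbers behind a boxplot can be computed without drawing anything. The sketch below assumes a common convention in which each whisker ends at the most extreme data value that still lies inside the limits. The function name `boxplot_summary` and the dataset are hypothetical, and a plotting library may compute the quartiles a little differently.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return the numbers a boxplot would display for the given values."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values inside the limits decide where the whiskers end.
    inside = [v for v in data if lower <= v <= upper]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Values outside the limits are drawn as separate dots.
        "outliers": [v for v in data if v < lower or v > upper],
    }


# Hypothetical dataset, invented for illustration only.
summary = boxplot_summary([10, 20, 20, 30, 30, 40, 40, 50, 90])
print(summary["whisker_low"], summary["whisker_high"])  # 10 50
print(summary["outliers"])                              # [90]
```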
Finding an outlier is only the first step. The next step is to decide what to do with it. There are three common choices:
1. Remove It
Use this only when you are sure the value is a mistake (for example, an obvious typing error like a height of 999 cm).
2. Correct It
If you can find the right value (by checking the original record), replace the wrong value with the correct one.
3. Keep and Report
If the value is unusual but real, keep it. Mention it in your report and explain that it might affect the average.
Never remove a value just because it looks "too big" or "too small." Always check first. Real outliers can carry important information.
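The three choices can be sketched as one small helper. The function `handle_outlier` and its parameters are hypothetical, written only to show the idea. The decision itself still has to come from a person who has checked the original record.

```python
def handle_outlier(values, index, action, corrected=None):
    """Return a new list after applying one of the three choices.

    action is "remove", "correct", or "keep".
    """
    result = list(values)  # copy, so the original data is never changed
    if action == "remove":
        del result[index]
    elif action == "correct":
        if corrected is None:
            raise ValueError("a corrected value is required")
        result[index] = corrected
    elif action != "keep":
        raise ValueError(f"unknown action: {action}")
    return result


allowances = [20, 25, 30, 30, 350]
# The original record shows the student meant 35, so we correct the value.
print(handle_outlier(allowances, 4, "correct", corrected=35))
# [20, 25, 30, 30, 35]
```

Working on a copy keeps the original data available, which makes it easier to report exactly what was changed.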
When you write a report or build a chart, follow these simple steps to handle outliers in a clear and honest way:
- Sort the data and check if any value looks very different from the rest.
- Use the IQR Rule to confirm whether it is really an outlier.
- Decide if the value is a mistake, a real rare value, or unclear.
- If you keep the outlier, report both the mean and the median so the reader sees the full picture.
- Mention any value you removed and explain why.
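The first steps of this checklist can be sketched as a short script that sorts the data, applies the IQR Rule, and reports both the mean and the median. The function name `outlier_report` is made up for this example, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`, which is one of several conventions. Deciding whether a flagged value is a mistake is still up to you.

```python
from statistics import mean, median, quantiles


def outlier_report(values):
    """Flag values with the IQR Rule and report the mean and the median."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": mean(data),
        "median": median(data),
        "lower_limit": lower,
        "upper_limit": upper,
        "flagged": [v for v in data if v < lower or v > upper],
    }


report = outlier_report([20, 25, 30, 30, 350])
print(report["flagged"])                             # [350]
print(report["mean"], report["median"])              # 91 30
print(report["lower_limit"], report["upper_limit"])  # 17.5 37.5
```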
Summary Comparison
| Measure | Affected by Outliers? | Best Used When |
|---|---|---|
| Mean | Yes, very much | The data has no extreme values. |
| Median | Very little | The data has outliers or is skewed. |
| IQR | Very little | You want to measure the spread of the middle of the data. |
- An outlier is a value that is far from the rest of the data.
- The mean is strongly affected by outliers; the median is much less affected.
- The IQR Rule uses Q1, Q3, and the formula 1.5 × IQR to set upper and lower limits.
- For each outlier, decide whether to remove, correct, or keep and report it.
- Always be honest in reports: explain any values you removed and show both the mean and the median when outliers are present.