06.09.2021 Views

Lies, Damned Lies, or Statistics- How to Tell the Truth with Statistics, 2017a

Lies, Damned Lies, or Statistics- How to Tell the Truth with Statistics, 2017a

Lies, Damned Lies, or Statistics- How to Tell the Truth with Statistics, 2017a

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

26 1. ONE-VARIABLE STATISTICS: BASICS<br />

We can’t pick <strong>the</strong> x max and x min as those starting points, since <strong>the</strong>y will be <strong>the</strong> outliers<br />

<strong>the</strong>mselves, as we have noticed. So we will use our earlier idea of a value which is typical<br />

f<strong>or</strong> <strong>the</strong> larger part of <strong>the</strong> data, <strong>the</strong> quartile Q 3 ,andQ 1 f<strong>or</strong> <strong>the</strong> c<strong>or</strong>responding lower part of<br />

<strong>the</strong> data.<br />

Now we need <strong>to</strong> decide how far is far enough away from those quartiles <strong>to</strong> count as an<br />

outlier. If <strong>the</strong> data already has a lot of variation, <strong>the</strong>n a new data value would have <strong>to</strong> be<br />

quite far in <strong>or</strong>der f<strong>or</strong> us <strong>to</strong> be sure that it is not out <strong>the</strong>re just because of <strong>the</strong> variation already<br />

in <strong>the</strong> data. So our measure of far enough should be in terms of a measure of spread of <strong>the</strong><br />

data.<br />

Looking at <strong>the</strong> last section, we see that only <strong>the</strong> IQR is a measure of spread which is<br />

insensitive <strong>to</strong> outliers – and we definitely don’t want <strong>to</strong> use a measure which is sensitive<br />

<strong>to</strong> <strong>the</strong> outliers, one which would have been affected by <strong>the</strong> very outliers we are trying <strong>to</strong><br />

define.<br />

All this goes <strong>to</strong>ge<strong>the</strong>r in <strong>the</strong> following<br />

DEFINITION 1.5.13. [The 1.5 IQR Rule f<strong>or</strong> Outliers] Starting <strong>with</strong> a quantitative<br />

dataset whose first and third quartiles are Q 1 and Q 3 and whose inter-quartile range is<br />

IQR, a data value x is [officially, from now on] called an outlier if xQ 3 +1.5 IQR.<br />

Notice this means that x is not an outlier if it satisfies Q 1 − 1.5 IQR ≤ x ≤ Q 3 +1.5 IQR.<br />

EXAMPLE 1.5.14. Let’s see if <strong>the</strong>re were any outliers in <strong>the</strong> test sc<strong>or</strong>e dataset from<br />

Example 1.3.1. We found <strong>the</strong> quartiles and IQR in Example 1.5.7, so from <strong>the</strong> 1.5 IQR<br />

Rule, a data value x will be an outlier if<br />

<strong>or</strong> if<br />

xQ 3 +1.5 IQR =88+1.5 · 18 = 115 .<br />

Looking at <strong>the</strong> stemplot in Table 1.3.1, we conclude that <strong>the</strong> data values 25, 25, and40 are<br />

<strong>the</strong> outliers in this dataset.<br />

EXAMPLE 1.5.15. Applying <strong>the</strong> same method <strong>to</strong> <strong>the</strong> data in Example 1.4.2, using <strong>the</strong><br />

quartiles and IQR from Example 1.5.8, <strong>the</strong> condition f<strong>or</strong> an outlier x is<br />

<strong>or</strong><br />

xQ 3 +1.5 IQR =9.5+1.5 · 10.69575 = 25.543625 .<br />

Since none of <strong>the</strong> data values satisfy ei<strong>the</strong>r of <strong>the</strong>se conditions, <strong>the</strong>re are no outliers in this<br />

dataset.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!