# Lesson 2 - Summarizing Data: Measures of Central Tendency and Measures of Variability, Box Plot

We will first talk about the important concepts of statistical inference. Then a few descriptive measures of the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive measures of the other important characteristic of a data set, measure of variability, will be discussed. This lesson will be concluded by a discussion of box plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.

Again, this lesson will focus on simple examples that can be calculated or drawn by hand.  This lesson will be followed by another lesson that will work through many of these procedures using Minitab.

Lesson 2 Objectives

Upon successful completion of this lesson, you will be able to:

• conceptualize statistical inference.
• use appropriate summary measures to describe different data sets.
• construct and use box plots.

Introduction to Lesson 2
by Course Author Dr. Mosuk Chow - (length 2:30)

To summarize a data set, we want to report different attributes of the data set. One important attribute is the central tendency of the data; the other is how spread out the data is. Further attributes, such as the shape of the data, can also be reported. In this lesson, we will mainly discuss the first two attributes, central tendency and spread.

For the central tendency, we are talking about where the center of the data is located.

For the spread, we are talking about the variability of the data.

# 2.1 - Measures of Central Tendency and Skewness

Unit Summary

• Measures of Central Tendency
• Mean
• Median
• Mode
• Trimmed Mean
• Skewness
• Adding and Multiplying Constants

An Introduction to Statistical Methods and Data Analysis, (See your course schedule.)

### Measures of Central Tendency

Three of the many ways to measure central tendency are:

1. Mean: the average of the data
2. Median: the middle value of the ordered data
3. Mode: the value that occurs most often in the data

In most research experimental situations, examination of all members of a population is not typically conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a representative subset of the population.


Descriptive measures of population are parameters. Descriptive measures of a sample are statistics. For example, a sample mean is a statistic and a population mean is a parameter. The sample mean is usually denoted by $$\bar{y}$$:

$\bar{y}=\frac{y_1+y_2+\ldots+y_n}{n}=\frac{\sum^n_{i=1} y_i}{n}$

where n is the sample size and the $$y_i$$ are the measurements. One may need to use the sample mean to estimate the population mean since usually only a random sample is drawn and we don't know the population mean.

### A Note on Notation!

What if we say we used $$x_i$$ for our measurements instead of $$y_i$$?  Is this a problem?  No.  The formula would simply look like this:

$\bar{x}=\frac{x_1+x_2+\ldots+x_n}{n}=\frac{\sum^n_{i=1} x_i}{n}$

The formulas are exactly the same.  The letters that you select to denote the measurements are up to you.  For instance, many textbooks use x instead of y to denote the measurements.

The point is to understand how the calculation that is expressed in the formula works.  In this case, the formula is calculating the mean by summing all of the observations and dividing by the number of observations.

There is some notation that you will come to see as standard, e.g., n will always denote the sample size.  We will make a point of letting you know what these are.  However, when it comes to the variables, these labels can (and do) vary.

For example, in one study x may be used to denote weight and y may be used to denote height, (or the reverse may be used!), but n will always be used to denote sample size in each case.

Note that for the data set:

1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1

Steps to finding the median for a set of data:

1. Arrange the data in increasing order
2. Find the location of median in the ordered data by (n + 1) / 2
3. The value that represents the location found in Step 2 is the median. NOTE: if the sample size is an odd number then the location point will produce a median that is an observed value as in the example above.  If sample size is an even number, then the location will require one to take the mean of two numbers to calculate the median.  The result may or may not be an observed value as the example below illustrates.
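The three steps above can be sketched in Python. This is only an illustration of the (n + 1) / 2 location rule; the function name `median_by_location` is our own and not part of any library.

```python
def median_by_location(values):
    """Median via the (n + 1) / 2 location rule described in the steps above."""
    ordered = sorted(values)             # Step 1: arrange in increasing order
    n = len(ordered)
    location = (n + 1) / 2               # Step 2: 1-based location of the median
    lower = ordered[int(location) - 1]   # value at the whole-number part of the location
    if location == int(location):        # odd n: the location is an observed value
        return lower
    upper = ordered[int(location)]       # even n: average the two neighbouring values
    return (lower + upper) / 2


print(median_by_location([1, 1, 2, 3, 13]))  # n = 5, location 3   -> 2
print(median_by_location([1, 1, 2, 3]))      # n = 4, location 2.5 -> 1.5
```

In the second call the median, 1.5, is not an observed value, which is the even-sample-size case mentioned in the note.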

Mean, median and mode are usually not equal. When the data is symmetric, the mean is equal to the median.

4. Trimmed Mean

One shortcoming of the mean is that it is easily affected by extreme values.

Consider the aptitude test scores of ten children below:

95, 78, 69, 91, 82, 76, 76, 86, 88, 80

Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1

If the entry 91 is mistakenly recorded as 9, the mean would be 73.9, which is very different from 82.1.

On the other hand, let us see the effect of the mistake on the median value:

The original data set in increasing order are:

69, 76, 76, 78, 80, 82, 86, 88, 91, 95

With n = 10, the median position is found by (10 + 1) / 2 = 5.5. Thus, the median is the average of the fifth (80) and sixth (82) ordered value and the median = 81

The data set (with 91 coded as 9) in increasing order is:

9, 69, 76, 76, 78, 80, 82, 86, 88, 95
where the median = 79

The medians of the two sets are not that different. Therefore the median is not that affected by the extreme value 9.

Measures that are not that affected by extreme values are called resistant.

A variation of the mean is the trimmed mean. A 10% trimmed mean drops the highest 10%, the lowest 10%, and averages the remaining. Let's calculate the trimmed mean for the data we were looking at above:

(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)
The 10% trimmed mean = 82.13

(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)
The 10% trimmed mean = 79.38

The 10% trimmed mean of the two sets is not that different. The trimmed mean is not as affected by the extreme value 9 as the mean.
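As a rough sketch, the trimmed mean calculation can be written in Python as follows. The function name `trimmed_mean` is our own, and rounding the number of dropped values down when n times the proportion is not a whole number is an assumption of this sketch; software packages may handle that case differently.

```python
def trimmed_mean(values, proportion=0.10):
    """Drop the lowest and highest `proportion` of the values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)    # values dropped from EACH end (assumption: round down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
miscoded = [9 if s == 91 else s for s in scores]   # 91 mistakenly recorded as 9

print(trimmed_mean(scores))     # 82.125, reported above as 82.13
print(trimmed_mean(miscoded))   # 79.375, reported above as 79.38
```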

After reading this lesson you should know that there are quite a few options when one wants to describe central tendency, for example, mean, median, mode and trimmed mean. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily affected by extreme values. One remedy is to use the trimmed mean to estimate the central tendency. Remember, however, that this is very different from saying that one can trim data. Unless data points are known mistakes, one should not remove them from the data set! One should keep the extreme points and use more resistant measures. For example, use the sample median to estimate the population median. Or, use the sample trimmed mean to estimate the population trimmed mean. Again, this is very different from saying that it is OK to trim data from a data set.

### Skewness

Skewness is a measure of the degree of asymmetry of a distribution.

1. Symmetric

Mean, median, and mode are all the same here;  the distribution is mound shaped, and no skewness is apparent.  The distribution is described as symmetric.

The above distribution is symmetric.

2. Skewed Left

Mean to the left of the median, long tail on the left.

The above distribution is skewed to the left.

3. Skewed Right

Mean to the right of the median, long tail on the right.

The above distribution is skewed to the right.

When one has very skewed data, it is better to use the median as measure of central tendency since the median is not much affected by extreme values.

### Example: The Skewed Nature of Salary Data

Salary distributions are almost always positively skewed, with a few people that make the most money. To illustrate this, consider your favorite sports team or even the company for which you work. There will be one or two players or personnel that earn the “big bucks”, followed by others who earn less. This will produce a shape that is skewed to the right. Knowing this can be a useful aid in negotiating a higher salary.

When one interviews for a position and the discussion gets around to compensation, it is common that the interviewer states an offer that is “typical for someone in your position”. That is, they are offering you the average salary for someone with your particular skill set (e.g. little experience). But is this average the mode, median, or mean? The company – for whom business is business! – will want to pay you the least they can while you prefer to earn the most you can. Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or median. You simply need to ask which “average” the offer refers to and what the mean salary is, since the mean would be the highest of the three values. Once you have these averages, you can begin to negotiate toward the highest number.

What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant? Consider for example an instructor who curves an exam by adding five points to each student’s score. What effect does this have on the mean and the median? Adding a constant to each value changes the mean and the median by that same constant. For example, if 5 were added to each of the 10 aptitude scores above, the mean of the new data set would be 87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).

Similarly, if each observed data value is multiplied by a constant, the new mean and median change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were doubled, then the new mean and new median would be double the original mean and median. As we will learn shortly, the effect is not the same on the variance!
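A quick check of both claims with Python's standard `statistics` module, using the ten aptitude scores from above. The variable names are our own; this is a sketch for verification, not a required procedure.

```python
from statistics import mean, median

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
curved = [s + 5 for s in scores]     # add a constant of 5 to every score
doubled = [2 * s for s in scores]    # multiply every score by 2

print(mean(scores), median(scores))      # original mean and median
print(mean(curved), median(curved))      # both shift up by 5
print(mean(doubled), median(doubled))    # both are doubled
```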

Why would you want to know this? One reason, especially for those moving onward to more applied statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods a required assumption is that the data is normal, or very nearly bell-shaped. When the data is not normal, statisticians will transform the data using numerous techniques, e.g. a logarithmic transformation. But the log cannot be taken of all values; for instance, the log of 0 is undefined. However, if we add a constant to all the data values, making them all greater than zero, then a log can be taken without risk. We just need to remember the original data was transformed!!

# 2.2 - Measures of Variability

Unit Summary

• Measures of Variability
• Range
• Interquartile Range (IQR)
• Variance and Standard Deviation
• Adding and Multiplying Constants
• Empirical Rule
• How to Roughly Approximate Standard Deviation
• Coefficient of Variation
• Z-score, Z-value, or Z

An Introduction to Statistical Methods and Data Analysis, (see course schedule).

### Measures of Variability


If you can use two numbers to summarize Jessica's weight data, which two characteristics will you use as measures?

Why do we want to know variability?

There are many ways to describe variability including:

• Range
• Interquartile range (IQR)
• Variance and Standard deviation

Let's look at each of these in turn.

A. Range: R = maximum - minimum

1. Easy to calculate
2. Very much affected by extreme values (range is not a resistant measure of variability)

B. Interquartile range (IQR)

In order to talk about interquartile range, we need to first talk about percentiles.

The pth percentile of a data set is a value such that, after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100 - p)% are above it.

Thus, the median is the 50th percentile. Fifty percent of the data values fall at or below the median.

Also, Q1 = lower quartile = the 25th percentile and Q3 = upper quartile = the 75th percentile.

The interquartile range is the difference between upper and lower quartiles and denoted as IQR.

IQR = Q3 - Q1 = upper quartile - lower quartile = 75th percentile - 25th percentile.

Details about how to compute IQR will be given in Lesson 2.3.

Note: IQR is not affected by extreme values. It is thus a resistant measure of variability.

C. Variance and Standard Deviation

Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded for six trials at each vending machine:

Pieces of candy from vending machine A:

1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3

Pieces of candy from vending machine B:

2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3

Dotplots for the pieces of candy from vending machine A and vending machine B:

They have the same center, but what about their spreads? One way to compare their spreads is to compute their standard deviations. In the following section, we are going to talk about how to compute the sample variance and the sample standard deviation for a data set.

Variance is the average squared distance from the mean.

Population variance is defined as:

${\sigma}^2=\sum_{i=1}^N \frac{(y_i-\mu)^2}{N}$

In this formula μ is the population mean and the summation is over all possible values of the population. N is the population size.

The sample variance that is computed from the sample and used to estimate $$\sigma^2$$ is:

$s^2=\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}$

Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by $$\bar{y}$$, the $$y_i$$'s tend to be closer to $$\bar{y}$$ than to μ. To compensate, we divide by a smaller number, n - 1.  The sample variance (and therefore the sample standard deviation) is the common default calculation used by software.  When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that this is sample data and therefore calculate the sample variance and sample standard deviation.

For example, let's find $$s^2$$ for the data set from vending machine A: 1, 2, 3, 3, 4, 5

$\bar{y}=\frac{1+2+3+3+4+5}{6}=3$

$s^2=\frac{(y_1-\bar{y})^2+\cdots +(y_n-\bar{y})^2}{n-1}=\frac{(1-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(4-3)^2+(5-3)^2}{6-1}=2$
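The same calculation can be sketched in Python, with the population formula (divide by N) alongside the sample formula (divide by n - 1) for contrast. The function names are our own.

```python
def population_variance(values):
    """Average squared distance from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((y - mu) ** 2 for y in values) / n


def sample_variance(values):
    """Squared distances from the sample mean, dividing by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)


machine_a = [1, 2, 3, 3, 4, 5]
print(sample_variance(machine_a))       # 2.0, matching the hand calculation
print(population_variance(machine_a))   # smaller, because the same sum is divided by 6 instead of 5
```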

Calculate $$s^2$$ for the data set from vending machine B yourself and check that it is smaller than the $$s^2$$ for data set A.

Standard Deviation

The population standard deviation is denoted by σ and found by $$\sigma=\sqrt{{\sigma}^2}$$. It has the same unit as the $$y_i$$'s. This is a desirable property since one may think about the spread in terms of the original unit.

σ is estimated by the sample standard deviation s :

$s=\sqrt{s^2}$

For the data set A,

$$s=\sqrt{2}=1.414$$ pieces of candy.

Calculate the standard deviation for the data set from vending machine B yourself.

The standard deviation is approximately the average distance the values of a data set are from the mean, and is a very useful measure. One reason is that it has the same unit of measurement as the data itself (e.g. if a sample of student heights were in inches then so, too, would be the standard deviation).  The variance would be in squared units, for example inches². Also, the empirical rule, which will be explained in the following section, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.

What happens to measures of variability if we add or multiply each observation in a data set by a constant? We learned previously about the effect such actions have on the mean and the median, but do variation measures behave similarly? Not really.

When we add a constant to all values we are basically shifting the data upward (or downward if we subtract a constant). This has the result of moving the middle but leaving the variability measures (e.g. range, IQR, variance, standard deviation) unchanged.

On the other hand, if one multiplies each value by a constant this does affect measures of variation. The variance is multiplied by the square of the constant, while the standard deviation, range, and IQR are multiplied by the constant. For example, if the observed values of Machine A in the example above were multiplied by three, the new variance would be 18 (the original variance of 2 multiplied by 9). The new standard deviation would be 4.242 (the original standard deviation of 1.414 multiplied by 3).  The range and IQR would also change by a factor of 3.
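These effects can be checked on the Machine A data with Python's standard `statistics` module. The helper `data_range` and the variable names are our own; the IQR is left out of this sketch because its calculation is covered in Lesson 2.3.

```python
from statistics import stdev, variance


def data_range(values):
    return max(values) - min(values)


machine_a = [1, 2, 3, 3, 4, 5]
shifted = [y + 5 for y in machine_a]   # add a constant
tripled = [3 * y for y in machine_a]   # multiply by a constant

# Adding a constant leaves every measure of variability unchanged.
print(variance(shifted), data_range(shifted))
# Multiplying by 3 multiplies the variance by 9 and the SD and range by 3.
print(variance(tripled), stdev(tripled), data_range(tripled))
```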

### Empirical Rule

Empirical Rule is sometimes referred to as the 68-95-99.7% Rule. If the set of measurements follows a bell-shaped distribution, then

• $$\bar{y} \pm s$$ contains about 68% of the data
• $$\bar{y} \pm 2s$$ contains about 95% of the data
• $$\bar{y} \pm 3s$$ contains about 99.7% (nearly all) of the data
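To see how such percentages are obtained for an actual sample, here is a small Python sketch that counts the share of observations inside $$\bar{y} \pm ks$$. The function name `fraction_within` is our own, and the ten aptitude scores from Lesson 2.1 are used only to show the mechanics; with so few values the results need not match 68% and 95% closely.

```python
from statistics import mean, stdev


def fraction_within(values, k):
    """Share of observations lying inside y_bar +/- k * s."""
    y_bar, s = mean(values), stdev(values)
    inside = [y for y in values if y_bar - k * s <= y <= y_bar + k * s]
    return len(inside) / len(values)


scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
for k in (1, 2, 3):
    print(k, fraction_within(scores, k))
```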

Review of the Empirical Rule
by Course Author Dr. Mosuk Chow - (length 2:00)

Summary Transcript

In Lesson 2.2 we describe the empirical rule. The empirical rule helps us to provide an estimate for the standard deviation. The empirical rule says that for a bell-shaped curve roughly 68% of the data falls within one standard deviation of the sample mean. Roughly 95% of the data falls within two standard deviations of the mean. And almost all of the data will fall within three standard deviations of the sample mean.

Using the empirical rule, if your data is roughly bell shaped, then one way to find a rough estimate for the standard deviation of the data set is to find the range and divide it by four.

The reason we divide by four instead of dividing by six is because this will give us a more conservative estimate.  In this case, "conservative" means a larger estimate as we'd prefer to overestimate instead of underestimate.

One important point, whenever we want to find out the standard deviation of the data set, we should use the formula for this.

The following five examples (a-e) show that the empirical rule is not that far off even when the underlying distribution is not bell shaped.

a. For the following graph, $$\bar{y}=5.5$$, $$s =1.49$$

60% within $$\bar{y} \pm s$$
(5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)

94% within $$\bar{y} \pm 2s$$
(5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)

100% within $$\bar{y} \pm 3s$$
(5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)

b. For the following graph, $$\bar{y}=5.5$$, $$s=2.07$$

64% within $$\bar{y} \pm s$$
96% within $$\bar{y} \pm 2s$$
100% within $$\bar{y} \pm 3s$$

c. For the following graph, $$\bar{y}=5.5$$, $$s=2.89$$

60% within $$\bar{y} \pm s$$
100% within $$\bar{y} \pm 2s$$
100% within $$\bar{y} \pm 3s$$

d. For the following graph, $$\bar{y}=3.49$$, $$s=1.87$$

75% within $$\bar{y} \pm s$$
96% within $$\bar{y} \pm 2s$$
98.5% within $$\bar{y} \pm 3s$$

e. For the following graph, $$\bar{y}=2.57$$, $$s=1.87$$

87% within $$\bar{y} \pm s$$
95% within $$\bar{y} \pm 2s$$
97.6% within $$\bar{y} \pm 3s$$

### Approximating the Standard Deviation


How can one find an approximate value of s without going through the detailed computation? It follows from the empirical rule that approximately 95% of the measurements (almost all) lie in $$\bar{y} \pm 2s$$, an interval that is 4s wide. So:

Range ≈ 4s

Approximate value of $$s\approx \frac{range}{4}$$

Why don't we say $$\bar{y} \pm 3s$$ contains all and divide by 6 to obtain the approximate value of s?

It is important to remember that one has to use the formula:

$$s=\sqrt{\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}}$$

to compute the sample standard deviation. The approximation $$s\approx \frac{range}{4}$$ only gives a rough estimate of s.

For example, the actual ages (in years) of 36 millionaires sampled, arranged in increasing order, are:

31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
71, 71, 74, 75, 77, 79, 79, 79

The data range is from 31 to 79. Thus, using the 'shortcut' formula to approximate the value of s is as follows: (79-31) / 4 = 12 years.
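To compare the rough estimate with the value from the full formula, the following sketch uses Python's standard `statistics.stdev`, which computes the sample standard deviation with n - 1 in the denominator. The variable names are our own.

```python
from statistics import stdev

ages = [31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
        54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
        71, 71, 74, 75, 77, 79, 79, 79]

rough_s = (max(ages) - min(ages)) / 4   # range / 4
exact_s = stdev(ages)                   # sample standard deviation from the formula

print(rough_s)              # 12.0
print(round(exact_s, 2))
```

The formula gives a value of about 13.4 years, so the shortcut is in the right neighborhood but is only a rough estimate.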

### Shortcut Method for Calculating the Standard Deviation

Instead of using the formula for calculating the variance and standard deviation that involves comparing each observation to the mean, there is a shortcut method to calculating the variance and standard deviation.  This shortcut method is as follows:

1. Sum all the values in the data set.
2. Square this sum.
3. Divide this squared sum by the total number of observations, n, (call this the average sum squared).
4. Square each value in the data set.
5. Sum these squared values (called the sum of squares).
6. Subtract the average sum squared from the sum of squares.
7. Divide this difference by n - 1; this is the variance.
8. Take the square root to get the standard deviation.

For example, recall the data results for Vending Machine A at the beginning of this lesson: 1, 2, 3, 3, 4, and 5.  We calculated the variance to be 2 and the standard deviation to be 1.414.  Using the shortcut method:

1. 1 + 2 + 3 + 3 + 4 + 5 = 18
2. 18*18 = 324
3. 324/6 = 54
4. 1, 4, 9, 9, 16, and 25
5. 1 + 4 + 9 + 9 + 16 + 25 = 64
6. 64 - 54 = 10
7. 10/5 = 2
8. Square root of 2 equals 1.414
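The eight steps can be written as a short Python sketch; the function name `shortcut_variance` is our own, and the comments map each line back to the numbered steps.

```python
from math import sqrt


def shortcut_variance(values):
    n = len(values)
    total = sum(values)                              # step 1: sum the values
    avg_sum_squared = total ** 2 / n                 # steps 2-3: square the sum, divide by n
    sum_of_squares = sum(y ** 2 for y in values)     # steps 4-5: square each value, then sum
    difference = sum_of_squares - avg_sum_squared    # step 6
    return difference / (n - 1)                      # step 7: the variance


machine_a = [1, 2, 3, 3, 4, 5]
s2 = shortcut_variance(machine_a)
print(s2)                    # 2.0
print(round(sqrt(s2), 3))    # step 8: 1.414
```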

### Coefficient of Variation

Above we considered three measures of variation: Range, Interquartile Range (IQR),  and Variance (and its square root counterpart - Standard Deviation).  These are all measures we can calculate from one quantitative variable e.g. height, weight.  But how can we compare dispersion (i.e. variability) of data from two or more distinct populations that have vastly different means?  A popular statistic to use in such situations is the Coefficient of Variation or CV.  This is a unit-free statistic and one where the higher the value the greater the dispersion.  The calculation of CV is:

CV = Standard Deviation / Mean

### Example: Comparing Prices

You are shopping for toilet tissue.  As you compare prices of various brands, some offer price per roll while others offer price per sheet.  You are interested in determining which pricing method has less variability so you sample several of each and calculate the mean and standard deviation for the sampled items that are priced per roll, and the mean and standard deviation for the sampled items that are priced per sheet.  The table below summarizes your results.

| Item | Mean | Standard Deviation |
|---|---|---|
| Price per Roll | 0.9196 | 0.4233 |
| Price per Sheet | 0.01134 | 0.00553 |

Comparing the standard deviations, the Per Sheet prices appear to have much less variability.  However, the mean is also much smaller.  The coefficient of variation allows us to make a relative comparison of the variability of these two pricing schemes:

$CV_{Roll}= 0.4233 / 0.9196 = 0.46 \; \text{and} \; CV_{Sheet} = 0.00553 / 0.01134 = 0.49$

Relatively speaking, the variation for Price per Sheet is greater than the variability for Price per Roll.
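The comparison can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is our own; the inputs are the means and standard deviations from the table above.

```python
def coefficient_of_variation(sd, mean):
    """CV = standard deviation / mean (unit-free)."""
    return sd / mean


cv_roll = coefficient_of_variation(0.4233, 0.9196)
cv_sheet = coefficient_of_variation(0.00553, 0.01134)

print(round(cv_roll, 2), round(cv_sheet, 2))   # 0.46 0.49
```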

Another example to consider is hotel pricing.  Think of prices for luxury and budget hotels.  Which do you think would have the higher average cost per night?  Which would have the greater standard deviation?  The CV would allow you to compare this dispersion in costs in relative terms by accounting for the fact that the luxury hotels would have a greater mean and standard deviation.

### Z-value, Z-score, or Z

The Z-value, sometimes referred to as the Z-score or simply Z, represents the number of standard deviations an observation is from the mean for a set of data.  To find the Z-score for a particular observation we apply the following formula:

Z = (observed value – mean) / SD

### Example: Exam Scores

For a recent final exam the mean was 68.55 with a standard deviation of 15.45.

If you scored an 80%: Z = (80 - 68.55) / 15.45 = 0.74, which means your score of 80 was 0.74 SD above the mean.

If you scored a 60%: Z = (60 - 68.55) / 15.45 = -0.55, which means your score of 60 was 0.55 SD below the mean.
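The two calculations above, as a minimal Python sketch; the function name `z_score` is our own.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies from the mean."""
    return (observed - mean) / sd


exam_mean, exam_sd = 68.55, 15.45
print(round(z_score(80, exam_mean, exam_sd), 2))   # 0.74  (above the mean)
print(round(z_score(60, exam_mean, exam_sd), 2))   # -0.55 (below the mean)
```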

Is it always good to have a positive Z score? It depends on the question.

For exams you would want a positive Z-score (indicates you scored higher than the mean). However, if one was analyzing days of missed work then a negative Z-score would be more appealing as it would indicate the person missed less than the mean number of days.

Characteristics of Z-scores

1. The scores can be positive or negative
2. For data that is symmetric (i.e. bell shaped) or nearly symmetric, a common use of Z-scores is to flag as potential outliers any observations whose Z-scores are beyond ± 3.
3. The maximum possible Z-score for a set of data is $$(n-1)/\sqrt{n}$$.
4. The sum of all squared Z-scores for a set of data is n - 1.
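Characteristics 3 and 4 can be checked numerically. The sketch below assumes the Z-scores are computed with the sample mean and the sample standard deviation (n - 1 in the denominator), and uses a made-up data set of ten 0s and a single 1, the kind of arrangement that attains the maximum.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(values):
    y_bar, s = mean(values), stdev(values)   # sample mean and sample SD
    return [(y - y_bar) / s for y in values]


data = [0] * 10 + [1]   # hypothetical data: one value far from ten identical values
n = len(data)
z = z_scores(data)

print(max(z), (n - 1) / sqrt(n))        # largest Z-score vs. the stated maximum
print(sum(v ** 2 for v in z), n - 1)    # sum of squared Z-scores vs. n - 1
```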

# 2.3 - Box Plots

Unit Summary

• How to Compute a Five Number Summary
• How to Compute IQR
• Skeletal Box Plots
• Matching the Shape of the Distribution and the Corresponding Box Plot
• Side-by-Side Box Plots

An Introduction to Statistical Methods and Data Analysis, (see course schedule).

### How to Compute a Five Number Summary


We want a graph that is not as detailed as a histogram, but still shows:

1. the skewness of the distribution
2. the central location
3. the variability

To create this plot we need the following:

• minimum value,
• Q1 (lower quartile),
• Q2 (median),
• Q3 (upper quartile), and
• maximum value.

This list is also called the five number summary.

NOTE: The method we will demonstrate for calculating Q1 and Q3 may differ from the method described in our textbook. The results may sometimes be different from the results in our textbook, but will always be the same as Minitab's result (which we will calculate later).

Recall that the mean is not a resistant measure of central location (a resistant measure is one that is not greatly affected by extreme observations), but the median is. Neither the range nor the standard deviation is a resistant measure of spread, but the IQR is. Thus, in the box plot we use the median and IQR.

How do we compute quartiles? There are two steps to follow:

1. Find the location of the desired quartile:

If there are n data, arranged in increasing order, then the first quartile is at position $$\frac{1}{4} (n+1)$$, second quartile (i.e. the median) is at position $$\frac{2}{4} (n+1)$$. The third quartile is at position $$\frac{3}{4} (n+1)$$.

2. Find the value in that position for the ordered data.

Once we find the first and the third quartiles, we can compute the interquartile range (IQR) by:

IQR = Q3 - Q1

Roughly speaking, IQR gives the range of the middle 50% of the observations.

The final exam scores of 18 students are (in increasing order):

24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97

In this example, n = 18.

For Q1, its position is:

$\frac{18+1}{4}=4.75$

The actual value of Q1:

Q1 = 67 (4th position) + 0.75 · (71 - 67) = 70

For the median, its position is:

$\frac{18+1}{2}=9.5$

The actual value of the median:

Q2 = 82 (9th position) + 0.5 · (83 - 82) = 82.5

For Q3, its position is:

$\frac{3(18+1)}{4}=14.25$

Q3 = 88 + 0.25 · (92 - 88) = 89

Thus the five number summary is:

| min | Q1 | Median (Q2) | Q3 | max |
|---|---|---|---|---|
| 24 | 70 | 82.5 | 89 | 97 |
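The position-and-interpolate procedure used above can be sketched in Python. The function name `quartile` is our own. This sketch does not guard against very small samples, where the computed position can fall before the first or after the last observation, and other textbooks and software may use different quartile rules.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, 3) located at position q * (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    position = q * (n + 1) / 4        # 1-based position, possibly fractional
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]        # value at the whole-number position
    if fraction == 0:
        return lower
    upper = ordered[whole]            # next ordered value
    return lower + fraction * (upper - lower)


scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
five_number_summary = (min(scores), quartile(scores, 1), quartile(scores, 2),
                       quartile(scores, 3), max(scores))
print(five_number_summary)   # (24, 70.0, 82.5, 89.0, 97)
```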

Five number summary: min, Q1, median, Q3, and max.

Using the five number summary, one can construct a skeletal box plot.

1. Mark the five number summary above the horizontal axis with vertical lines.
2. Connect Q1, Q2, Q3 to form a box, then connect the box to min and max with a line to form the whisker.

Here is a hand-drawn image of the skeletal box plot of the final exam score:

NOTE: Most statistical software does NOT create skeletal box plots but instead opts for the more detailed box plot described below.  Box plots from statistical software are more detailed than skeletal box plots because they also show outliers. However, if there are no outliers, what is produced by the software is essentially the skeletal box plot.  The following terminology will prepare us to understand and draw this more detailed type of box plot.

Potential outliers are observations that lie outside the lower and upper limits.

Lower limit = Q1 - 1.5 · IQR
Upper limit = Q3 + 1.5 · IQR

Adjacent values are the most extreme values that are not potential outliers. For the final exam score data:

IQR = Q3 - Q1 = 89 - 70 = 19.

Lower limit = Q1 - 1.5 · IQR = 70 - 1.5 · 19 = 41.5
Upper limit = Q3 + 1.5 · IQR = 89 + 1.5 · 19 = 117.5

Since 24 is below the lower limit of 41.5, it is a potential outlier.
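A short Python sketch of the limits, potential outliers, and adjacent values for the final exam scores. The quartiles Q1 = 70 and Q3 = 89 are taken from the five number summary above, and the variable names are our own.

```python
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
q1, q3 = 70, 89                    # from the five number summary

iqr = q3 - q1                      # 19
lower_limit = q1 - 1.5 * iqr       # 41.5
upper_limit = q3 + 1.5 * iqr       # 117.5

potential_outliers = [y for y in scores if y < lower_limit or y > upper_limit]
inside = [y for y in scores if lower_limit <= y <= upper_limit]
adjacent_values = (min(inside), max(inside))   # most extreme values that are not outliers

print(potential_outliers)   # [24]
print(adjacent_values)      # (58, 97)
```

In the more detailed box plot, the whiskers extend to the adjacent values (58 and 97) and the potential outlier 24 is plotted as a separate point.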

Statistical software will create a box plot of final exam score that may look like this:

### Matching the Shape of the Distribution and the Corresponding Box Plot

• A symmetric distribution and box plot:

• A left-skewed distribution and box plot:

• A right-skewed distribution and box plot:

### Side-by-Side Box Plots

When you have quantitative data that can be broken down by levels of a categorical variable, side-by-side boxplots offer an excellent graphic representation.  [NOTE: referring back to lesson one, think of having a quantitative response variable and a categorical explanatory variable.]  The graphs can be compared for shape, outliers, variability, etc.  For example, you may have a group of subjects for which you have heights and sex.  You are interested in comparing the heights of the females to those of the males.  The side-by-side boxplot produces an excellent visual comparison.  Below is such a boxplot comparing heights for a women's and men's basketball team.  By visual inspection one can conclude that the median height of males is higher and the IQR is slightly larger for the males - the latter based on the width of the box.

### Use a Boxplot or Z-score to identify outliers?

Which method is better for identifying outliers?  Each method has both strengths and weaknesses.

• For symmetric or nearly symmetric data the two methods give very similar results, although box plots are more critical (i.e. more likely to flag an outlier).  For normal data, roughly 0.7% of observations fall outside the box plot limits and are flagged as potential outliers, whereas about 0.3% have Z-scores beyond ±3.
• Sample size does play a role in this.  For Z-scores, you need a sample size of at least 11 for a Z-score beyond ±3 to be possible, since the maximum possible Z-score is $$(n-1)/\sqrt{n}$$. Conversely, for larger sample sizes, even perfectly normal data will produce Z-scores outside ±3.  For instance, with normal data and a sample size of 1000, you can expect 2-3 observations to have Z-scores outside ±3.
• Caution should be used when the data is skewed. Often the boxplot will show more potential outliers, but that does not make the Z-score a better choice!
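The percentages and the minimum sample size quoted in this comparison can be reproduced for a standard normal distribution with Python's `statistics.NormalDist` (available from Python 3.8). This is a sketch for exactly normal data; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()             # mean 0, standard deviation 1

# Box plot rule: flag values beyond Q3 + 1.5 * IQR (or below Q1 - 1.5 * IQR).
q1, q3 = std_normal.inv_cdf(0.25), std_normal.inv_cdf(0.75)
upper_limit = q3 + 1.5 * (q3 - q1)
p_boxplot = 2 * (1 - std_normal.cdf(upper_limit))   # both tails, by symmetry

# Z-score rule: flag values with |Z| > 3.
p_zscore = 2 * (1 - std_normal.cdf(3))

# Smallest n for which the maximum possible Z-score, (n - 1) / sqrt(n), reaches 3.
min_n = next(n for n in range(2, 100) if (n - 1) / sqrt(n) >= 3)

print(round(100 * p_boxplot, 2), "% flagged by the box plot rule")
print(round(100 * p_zscore, 2), "% flagged by the Z-score rule")
print(min_n, "is the smallest sample size that allows |Z| >= 3")
print(round(1000 * p_zscore, 1), "expected |Z| > 3 values in a normal sample of 1000")
```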

# 2.4 - Practice Problems

1. Our statistics department surveyed a random sample of 5 staff personnel and 5 faculty on how often during a week they used public transportation in traveling for work.  The table below reflects the responses.

| Staff | Faculty |
|---|---|
| 4 | 2 |
| 2 | 2 |
| 5 | 0 |
| 1 | 5 |
| 1 | 3 |

a. What sampling method was used to gather this data?  What population of interest is best represented by the samples?

b. Calculate by hand the mean and standard deviation for number of times a week public transportation was used by staff and faculty.

c. Based on means and standard deviations, do you think there is a statistically significant difference between these two means?  Explain.

2.   The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student – faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took the students to complete the required modular material:

16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 6 5 23 21 10 17 15

Here is a link to the data (hours.txt) for the time it took students to complete the required material.

a. Calculate by hand the five number summary for these recorded completion times.  Helpful hint: you can use software such as Excel to sort the data.

b. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.

c. Calculate the standard deviation, s, by using the approximation formula and compare that answer to the actual standard deviation of 8.45.

d. The mean for this data set is 16. Using the actual s of 8.45, construct the intervals and check whether the Empirical Rule applies to this data set.