Lesson 2 - Summarizing Data: Measures of Central Tendency and Measures of Variability, Box Plot

We will first talk about the important concepts of statistical inference. Then we give a few descriptive measures of the most important characteristic of a data set, central tendency. After that, we discuss a few descriptive measures of the other important characteristic of a data set, variability. The lesson concludes with a discussion of box plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.

Lesson 2 Objectives

Upon successful completion of this lesson, you will be able to:

  • conceptualize statistical inference.
  • use appropriate summary measures to describe different data sets.
  • construct and use box plots.

Introduction to Lesson 2
by Course Author Dr. Mosuk Chow - (length 2:30)


Lesson 2.1 - Data Description: Measures of Central Tendency and Skewness

Unit Summary

  • Measures of Central Tendency
    • Mean
    • Median
    • Mode
    • Trimmed Mean
  • Skewness

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.4. (Skip group data median and group data mean.)

 

Measures of Central Tendency

Three of the many ways to measure central tendency are:

1. Mean: the average of the data
2. Median: the middle value of the ordered data
3. Mode: the value that occurs most often in the data

Examination of all members of a population is not typically conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a representative subset of the population.


Descriptive measures of a population are parameters. Descriptive measures of a sample are statistics. For example, the sample mean is a statistic and the population mean is a parameter. The sample mean is usually denoted by \(\bar{y}\):

\[\bar{y}=\frac{y_1+y_2+\ldots+y_n}{n}=\frac{\sum^n_{i=1} y_i}{n}\]

where \(n\) is the sample size and the \(y_i\) are the measurements. Because usually only a random sample is drawn and the population mean is unknown, the sample mean is used to estimate the population mean.

Consider the data set:

1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1

The mean, median, and mode are usually not equal. When the distribution of the data is symmetric, the mean is equal to the median.
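
As a quick check, the three measures for the data set above can be computed with Python's standard `statistics` module. This is only a minimal sketch; the variable names are ours and are not part of the lesson.

```python
import statistics

# Data set from the example above
data = [1, 1, 2, 3, 13]

mean = statistics.mean(data)      # (1 + 1 + 2 + 3 + 13) / 5
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # value that occurs most often

print(mean, median, mode)
```

The single large value 13 pulls the mean up to 4, while the median stays at 2.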

4. Trimmed Mean

One shortcoming of the mean: it is easily affected by extreme values.

Aptitude test scores of ten children:

95, 78, 69, 91, 82, 76, 76, 86, 88, 80

Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1

If the entry 91 is mistakenly recorded as 9, the mean will be 73.9, very different from 82.1.

On the other hand, let us see the effect of the mistake on the median value:

The original data set in increasing order is:

69, 76, 76, 78, 80, 82, 86, 88, 91, 95
median = 81

The data set (with 91 coded as 9) in increasing order is:

9, 69, 76, 76, 78, 80, 82, 86, 88, 95
median = 79

The medians of the two data sets are close. The median is hardly affected by the extreme value 9.

Measures that are not strongly affected by extreme values are called resistant.

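A variation of the mean is the trimmed mean. A 10% trimmed mean drops the highest 10% and the lowest 10% of the observations and averages the rest. With ten observations, this means dropping the single smallest and the single largest value.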

(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)
10% trimmed mean = 82.13

(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)
10% trimmed mean = 79.38

The 10% trimmed means of the two sets are fairly close. The trimmed mean is less affected by the extreme value 9 than the mean is.
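
The comparison between the mean and the 10% trimmed mean can be reproduced with a short Python sketch. The function `trimmed_mean` is our own illustrative helper; it assumes that the number of values dropped from each end is the trimming proportion times n, truncated to a whole number. Statistical packages may handle fractional counts differently.

```python
def trimmed_mean(values, proportion):
    """Average after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Aptitude test scores from the example above
scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
# Same scores with the entry 91 mistakenly recorded as 9
miscoded = [9 if x == 91 else x for x in scores]

for label, data in [("original", scores), ("miscoded", miscoded)]:
    print(label, sum(data) / len(data), trimmed_mean(data, 0.10))
```

The plain mean drops from 82.1 to 73.9, while the trimmed mean only moves from 82.125 to 79.375.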

Review of Measures of Central Tendency
by Course Author Dr. Mosuk Chow - (length 3:55)


 

Skewness

Skewness is a measure of the degree of asymmetry of a distribution.

1. Symmetric

Mean, median, and mode are all the same here; mound shaped, no skewness (symmetric).

The above distribution is symmetric.

2. Skewed Left

Mean to the left of the median, long tail on the left.

The above distribution is skewed to the left.

3. Skewed Right

Mean to the right of the median, long tail on the right.

The above distribution is skewed to the right.

When one has very skewed data, it is better to use the median as the measure of central tendency, since the median is not much affected by extreme values.

Lesson 2.2 - Data Description: Measures of Variability

Unit Summary

  • Measures of Variability
    • Range
    • Interquartile Range (IQR)
    • Variance and Standard Deviation
  • Empirical Rule
  • How to Roughly Approximate Standard Deviation

 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.5. (Skip quantile plot.)

 

Measures of Variability

Ponder the following:

If you can use two numbers to summarize Jessica's weight data, which two characteristics will you use as measures?

Why do we want to know variability?

Many ways to describe variability

A. Range: R = maximum - minimum

1. Easy to calculate
2. Strongly affected by extreme values (the range is not a resistant measure of variability)
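
To see how strongly the range reacts to a single bad value, the sketch below reuses the aptitude test scores from Lesson 2.1, with and without the entry 91 miscoded as 9. The helper name `data_range` is ours.

```python
def data_range(values):
    """Range R = maximum - minimum."""
    return max(values) - min(values)

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
miscoded = [9 if x == 91 else x for x in scores]  # 91 recorded as 9

print(data_range(scores))    # 95 - 69
print(data_range(miscoded))  # 95 - 9
```

One miscoded entry more than triples the range, from 26 to 86.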

B. Interquartile range (IQR)

In order to talk about interquartile range, we need to first talk about percentile.

The pth percentile of the data set is a measurement such that after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100-p)% above it.

Thus, the median is the 50th percentile.

Also, Q1 = lower quartile = 25th percentile and Q3 = upper quartile = 75th percentile.

The interquartile range, denoted IQR, is the difference between the upper and lower quartiles.

IQR = Q3 - Q1 = upper quartile - lower quartile = 75th percentile - 25th percentile.

Details about how to compute IQR will be given in Lesson 2.3.

Note: IQR is not affected by extreme values. It is thus a resistant measure of variability.

C. Variance and Standard Deviation

Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded:

Pieces of candy from vending machine A:

1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3

Pieces of candy from vending machine B:

2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3

Dotplots for pieces of candy from vending machine A and vending machine B:

They have the same center, but what about their spreads? One way to compare their spreads is to compute their standard deviations. In the following section, we are going to talk about how to compute the sample variance and the sample standard deviation for a data set.

Variance is the average squared distance from the mean.

Population variance is defined as:

\[{\sigma}^2=\sum_{i=1}^N \frac{(y_i-\mu)^2}{N}\]

In this formula μ is the population mean and the summation is over all possible values of the population. N is the population size.

The sample variance that is computed from the sample and used to estimate \(\sigma^2\) is:

\[s^2=\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}\]

Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by \(\bar{y}\), the \(y_i\)'s tend to be closer to \(\bar{y}\) than to μ. To compensate, we divide by a smaller number, n - 1.
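
The effect of the divisor can be checked exactly on a tiny example. The sketch below is an illustration under our own assumptions: a hypothetical population {1, 2, 3}, and samples of size 2 drawn with replacement so that all 9 ordered samples are equally likely. It averages the two candidate estimates over every possible sample and compares them with \(\sigma^2\), using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical population, chosen only for illustration
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((y - mu) ** 2 for y in population) / N  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    ybar = Fraction(sum(sample), len(sample))
    return sum((y - ybar) ** 2 for y in sample)

n = 2
samples = list(product(population, repeat=n))  # all 9 ordered samples of size 2

# Average of each estimate over all equally likely samples
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

In this example, dividing by n gives 1/3 on average, which is too small, while dividing by n - 1 gives 2/3 on average, which matches \(\sigma^2\).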

For example, find \(s^2\) for data set A: 1, 2, 3, 3, 4, 5

\[\bar{y}=\frac{1+2+3+3+4+5}{6}=3\]

\[s^2=\frac{(y_1-\bar{y})^2+\cdots +(y_n-\bar{y})^2}{n-1}=\frac{(1-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(4-3)^2+(5-3)^2}{6-1}=2\]
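
The same calculation can be written as a short Python sketch. The function `sample_variance` is our own helper that follows the formula above; `statistics.variance` from the standard library also divides by n - 1 and is used here only as a cross-check.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    ybar = sum(values) / n
    return sum((y - ybar) ** 2 for y in values) / (n - 1)

data_a = [1, 2, 3, 3, 4, 5]  # vending machine A

s2 = sample_variance(data_a)
print(s2)
print(statistics.variance(data_a))  # standard-library cross-check
```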

 

Calculate \(s^2\) for data set B yourself and check that it is smaller than the \(s^2\) for data set A. Work out your answer first.

Standard Deviation

\(\sigma=\sqrt{{\sigma}^2}\) has the same unit as the \(y_i\)'s. This is a desirable property since one may think about the spread in terms of the original unit.

σ is estimated by the sample standard deviation s :

\[s=\sqrt{s^2}\]

For the data set A,

\(s=\sqrt{2}\approx 1.414\) pieces of candy.

Calculate the standard deviation for data set B. Work out your answer first.

The standard deviation is very useful. One reason is that it has the same unit as the measurements. Also, the empirical rule, which will be explained in the following section, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.

Empirical Rule

Empirical Rule: if the set of measurements follows a bell-shaped distribution, then

\(\bar{y} \pm s\) contains about 68% of the data
\(\bar{y} \pm 2s\) contains about 95% of the data
\(\bar{y} \pm 3s\) contains almost all of the data
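
The percentages in the rule can be compared with what a data set actually shows. The sketch below is only an illustration: `fraction_within` is our own helper, and it is applied to the ten aptitude test scores from Lesson 2.1, a sample far too small to look bell shaped, so only rough agreement should be expected.

```python
import statistics

def fraction_within(values, k):
    """Fraction of the values that lie in the interval ybar +/- k*s."""
    ybar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    lower, upper = ybar - k * s, ybar + k * s
    inside = [y for y in values if lower <= y <= upper]
    return len(inside) / len(values)

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]

for k in (1, 2, 3):
    print(k, fraction_within(scores, k))
```

For these scores, 70% of the values fall within one standard deviation of the mean and all of them fall within two.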

Review of the Empirical Rule
by Course Author Dr. Mosuk Chow - (length 2:00)


The following five examples (a-e) show that the empirical rule is not that far off even when the underlying distribution is not bell shaped.

a. For the following graph, \(\bar{y}=5.5\), \(s =1.49\)

60% within \(\bar{y} \pm s\)
(5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)

94% within \(\bar{y} \pm 2s\)
(5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)

100% within \(\bar{y} \pm 3s\)
(5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)

b. For the following graph, \(\bar{y}=5.5\), \(s=2.07\)

64% within \(\bar{y} \pm s\)
96% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)

c. For the following graph, \(\bar{y}=5.5\), \(s=2.89\)

60% within \(\bar{y} \pm s\)
100% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)

d. For the following graph, \(\bar{y}=3.49\), \(s=1.87\)

75% within \(\bar{y} \pm s\) 
96% within \(\bar{y} \pm 2s\) 
98.5% within \(\bar{y} \pm 3s\)

e. For the following graph, \(\bar{y}=2.57\), \(s=1.87\)

87% within \(\bar{y} \pm s\)
95% within \(\bar{y} \pm 2s\)
97.6% within \(\bar{y} \pm 3s\)

Ponder the following:

How can one find an approximate value of s without going through the detailed computation? It follows from the empirical rule that approximately 95% of the measurements (almost all) lie in \(\bar{y} \pm 2s\), an interval of width 4s. Thus:

\(range \approx 4s\)

Approximate value of s: \(s\approx \frac{range}{4}\)

Why don't we say \(\bar{y} \pm 3s\) contains all and divide by 6 to obtain the approximate value of s?

 

It is important to remember that one has to use the formula:

\(s=\sqrt{\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}}\)

to compute the sample standard deviation. The approximation \(s\approx \frac{range}{4}\) only gives a rough estimate of s.

For example, the actual ages (in years) of 36 millionaires sampled, arranged in increasing order, are:

31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
71, 71, 74, 75, 77, 79, 79, 79

The data range is from 31 to 79. The approximate value of s is thus: (79-31) / 4 = 12 years.

Minitab command to compute the descriptive statistics for the data set:

1. Stat > Basic Statistics > Display Descriptive Statistics
2. Specify the quantitative variable in the variable text box
3. Select OK

Descriptive statistics: age of millionaires

Variable   N    Mean    Median   TrMean   StDev   SEMean   Minimum   Maximum   Q1      Q3
age        36   58.53   59.50    58.75    13.36   2.23     31.00     79.00     48.00   68.75

The standard deviation is 13.36 years, which differs somewhat from the approximate value of 12 years. The approximation is not very close because the histogram of the data is not very bell shaped.
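
For readers without Minitab, the rough estimate and the exact sample standard deviation can be compared with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are ours.

```python
import statistics

# Ages (in years) of the 36 millionaires listed above
ages = [31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
        54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
        71, 71, 74, 75, 77, 79, 79, 79]

rough_s = (max(ages) - min(ages)) / 4  # range / 4 approximation
exact_s = statistics.stdev(ages)       # sample standard deviation (divides by n - 1)

print(rough_s, round(exact_s, 2))
```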

Lesson 2.3 - Box Plots

Unit Summary

  • How to Compute a Five Number Summary
  • How to Compute IQR
  • Skeletal Box Plot
  • Box Plot

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.6.

 

How to Compute a Five Number Summary

Ponder the following:

We want a graph that is not as detailed as a histogram, but still shows:

1. the skewness of the distribution
2. the central location
3. the variability

 

We need: min, Q1 (lower quartile), Q2 (median), Q3 (upper quartile), and max. This list is also called the five number summary.

Note: We do not follow our textbook's way to calculate Q1, Q2, and Q3.

The results may sometimes be different from the results in our textbook, but will always be the same as Minitab's result.

Recall that the mean is not a resistant measure of the central location but the median is. Neither the range nor the standard deviation is a resistant measure of the spread, but the IQR is. Thus, in the box plot we use the median and the IQR.

How do we compute quartiles? There are two steps to follow:

  1. Find the location of the desired quartile:

    If there are n observations, arranged in increasing order, then the first quartile is at position \(\frac{1}{4} (n+1)\), the second quartile is at position \(\frac{2}{4} (n+1)\), and the third quartile is at position \(\frac{3}{4} (n+1)\).

     

  2. Find the value in that position for the ordered data.

    Once we find the first and the third quartiles, we can compute the interquartile range (IQR) by:

    IQR = Q3 - Q1

    Roughly speaking, IQR gives the range of the middle 50% of the observations.

    The final exam scores of 18 students are (in increasing order, with the locations of Q1, Q2, and Q3 marked):

    24, 58, 61, 67, [Q1] 71, 73, 76, 79, 82, [Q2] 83, 85, 87, 88, 88, [Q3] 92, 93, 94, 97

    In this example, n = 18.

    For Q1, its position is:

    \(\frac{18+1}{4}=4.75\)

    The actual value of Q1:

    Q1 = 67 (4th position) + 0.75 · (71 - 67) = 70

    For Q2, its position is:

    \(\frac{18+1}{2}=9.5\)

    The actual value of Q2:

    Q2 = 82 (9th position) + 0.5 · (83 - 82) = 82.5

    For Q3, its position is:

    \(\frac{3(18+1)}{4}=14.25\)

    The actual value of Q3:

    Q3 = 88 (14th position) + 0.25 · (92 - 88) = 89

    Thus the five number summary is:

    min   Q1   Q2     Q3   max
    24    70   82.5   89   97

Five number summary: min, Q1, Q2, Q3, and max.
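
The two-step quartile rule above can be sketched in Python. The function `quartile` is our own illustrative helper: it finds the position \(\frac{q}{4}(n+1)\) and interpolates linearly between the two neighbouring ordered values. Clamping to the smallest or largest value when the position falls outside the data is our assumption for very small samples. The standard-library call `statistics.quantiles` with `method="exclusive"` uses the same (n + 1)-based positions and serves as a cross-check.

```python
import statistics

def quartile(ordered, q):
    """q-th quartile (q = 1, 2, 3) at position q(n+1)/4, interpolating linearly."""
    n = len(ordered)
    position = q * (n + 1) / 4
    whole = int(position)          # position of the value just below
    fraction = position - whole    # how far to move toward the next value
    if whole < 1:                  # assumption: clamp to the minimum
        return ordered[0]
    if whole >= n:                 # assumption: clamp to the maximum
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

# Final exam scores of 18 students, in increasing order
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82,
          83, 85, 87, 88, 88, 92, 93, 94, 97]

q1, q2, q3 = (quartile(scores, q) for q in (1, 2, 3))
five_number_summary = (min(scores), q1, q2, q3, max(scores))

print(five_number_summary)
print(statistics.quantiles(scores, n=4, method="exclusive"))  # cross-check
```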

Using the five number summary, one can construct a skeletal box plot.

  1. Mark the five number summary above the horizontal axis with vertical lines.
  2. Connect Q1, Q2, Q3 to form a box, then connect the box to min and max with lines to form the whiskers.

The skeletal box plot of the final exam score:

Box plots are more detailed than skeletal box plots by also showing outliers. The following terminology will prepare us to draw the box plot.

Potential outliers are observations that lie outside the lower and upper limits.

Lower limit = Q1 - 1.5 · IQR
Upper limit = Q3 + 1.5 · IQR

Adjacent values are the most extreme values that are not potential outliers. For the final exam score data:

IQR = Q3 - Q1 = 89 - 70 = 19.

Lower limit = Q1 - 1.5 · IQR = 70 - 1.5 · 19 = 41.5
Upper limit = Q3 + 1.5 · IQR = 89 + 1.5 · 19 = 117.5

Lower adjacent value = 58
Upper adjacent value = 97

Since 24 lies below the lower limit of 41.5, it is a potential outlier.
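
The limits, potential outliers, and adjacent values for the final exam scores can be verified with the following Python sketch. It assumes the quartiles are computed with `statistics.quantiles` using `method="exclusive"`, which reproduces Q1 = 70 and Q3 = 89 for these data; the variable names are ours.

```python
import statistics

# Final exam scores of 18 students, in increasing order
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82,
          83, 85, 87, 88, 88, 92, 93, 94, 97]

q1, _, q3 = statistics.quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Potential outliers lie outside the limits
potential_outliers = [y for y in scores if y < lower_limit or y > upper_limit]

# Adjacent values are the most extreme observations inside the limits
inside = [y for y in scores if lower_limit <= y <= upper_limit]
lower_adjacent, upper_adjacent = min(inside), max(inside)

print(lower_limit, upper_limit)
print(potential_outliers, lower_adjacent, upper_adjacent)
```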

Minitab command for a box plot: Graph > Box plot.

Box plot of final exam score:

How to tell the shape of the distribution by the box plot:

Lesson 2 - Homework

Practice Problems:

1. In a packing plant, a machine packs cartons with jars. The times it takes each of two machines, a new one and an old one, to pack 10 cartons are recorded. The results (machine.txt), in seconds, are shown in the following table:

New machine: 42.1, 41.3, 42.4, 43.2, 41.8, 41.0, 41.8, 42.8, 42.3, 42.7
Old machine: 42.7, 43.8, 42.5, 43.1, 44.0, 43.6, 43.3, 43.5, 41.7, 44.1

a. Compute the mean and standard deviation for the time to pack a carton for each machine.

b. Plot the data for each machine.

c. Describe the data for the two machines.

2. The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student-faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took the students to complete the required modular material:

16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 4 5 23 21 10 17 15

Here is a link to the data (hours.txt) for the time it took students to complete the required material.

a. Calculate the mode, the median, and the mean for these recorded completion times.

b. Guess the value of s.

c. Compute s by using the shortcut formula and compare your answer to that of part (b) above.

d. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.

e.  Construct the intervals and check whether the Empirical Rule applies to this data set.

 


ASK!  If you have a question about any part of these practice problems, please post your question to the discussion forum in ANGEL. 

 

Homework Problems to Submit

Now, find Homework 2 in ANGEL and submit it to the Dropbox by the due date.

If there are data referred to in the homework problems, you will also find these data files in ANGEL.