We will first talk about the important concepts of statistical inference. Then a few descriptive measures of the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive measures of the other important characteristic of a data set, variability, will be discussed. The lesson concludes with a discussion of box plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.
Lesson 2 Objectives 
Upon successful completion of this lesson, you will be able to:

Introduction to Lesson 2
by Course Author Dr. Mosuk Chow  (length 2:30)
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.4. (Skip group data median and group data mean.)
Three of the many ways to measure central tendency are:
1. Mean: the average of the data
2. Median: the middle value of the ordered data
3. Mode: the value that occurs most often in the data
Examination of all members of a population is not typically conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a representative subset of the population.
Descriptive measures of a population are parameters. Descriptive measures of a sample are statistics. For example, the sample mean is a statistic and the population mean is a parameter. The sample mean is usually denoted by \(\bar{y}\):
\[\bar{y}=\frac{y_1+y_2+\ldots+y_n}{n}=\frac{\sum^n_{i=1} y_i}{n}\]
where n is the sample size and the y_{i} are the measurements. Because usually only a random sample is drawn and the population mean is unknown, the sample mean is used to estimate the population mean.
For example, for the data set:
1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1
Mean, median, and mode are usually not equal. When the data are symmetric, the mean is equal to the median.
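As a quick check of the small example above, the three measures can be computed with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are ours and are not part of the lesson.

```python
import statistics

data = [1, 1, 2, 3, 13]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # value that occurs most often

print(mean, median, mode)
```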
4. Trimmed Mean
One shortcoming of mean: Means are easily affected by extreme values.
Aptitude test scores of ten children:
95, 78, 69, 91, 82, 76, 76, 86, 88, 80
Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1
If the entry 91 is mistakenly recorded as 9, the mean will be 73.9, very different from 82.1.
On the other hand, let us see the effect of the mistake on the median value:
The original data set in increasing order is:
69, 76, 76, 78, 80, 82, 86, 88, 91, 95
median = 81
The data set (with 91 coded as 9) in increasing order is:
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
median = 79
The medians of the two sets are not that different. The median is not much affected by the extreme value 9.
Measures that are not that affected by extreme values are called resistant.
A variation of the mean is the trimmed mean. A 10% trimmed mean drops the highest 10% and the lowest 10% of the data and averages the remaining values. In the lists below, the dropped values are shown in parentheses.
(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)
10% trimmed mean = 82.13
(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)
10% trimmed mean = 79.38
The 10% trimmed means of the two sets are not that different. The trimmed mean is not as affected by the extreme value 9 as the mean is.
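The comparison of the mean, median, and 10% trimmed mean on the two versions of the aptitude scores can be reproduced with a short Python sketch. The helper `trimmed_mean` is a hypothetical function written for this illustration (it is not a library routine), and it assumes the trimming proportion times n is a whole number, as it is here.

```python
import statistics

def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the sorted values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
miscoded = [9 if x == 91 else x for x in scores]  # 91 mistakenly recorded as 9

for label, data in [("original", scores), ("miscoded", miscoded)]:
    print(label,
          statistics.mean(data),
          statistics.median(data),
          trimmed_mean(data, 0.10))
```

The mean moves by more than eight points between the two versions, while the median and the trimmed mean move by only two or three points, which is what "resistant" means in practice.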
Review of Measures of Central Tendency
by Course Author Dr. Mosuk Chow  (length 3:55)
Skewness is a measure of degree of asymmetry of the distribution.
1. Symmetric
Mean, median, and mode are all the same here; mound shaped, no skewness (symmetric).
The above distribution is symmetric.
2. Skewed Left
Mean to the left of the median, long tail on the left.
The above distribution is skewed to the left.
3. Skewed Right
Mean to the right of the median, long tail on the right.
The above distribution is skewed to the right.
When one has very skewed data, it is better to use the median as measure of central tendency since the median is not much affected by extreme values.
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.5. (Skip quantile plot.)
Ponder the following, then click on the icon to the left to display the statistical answer.
If you can use two numbers to summarize Jessica's weight data, which two characteristics will you use as measures?
Why do we want to know variability?
Many ways to describe variability
A. Range: R = maximum - minimum
1. Easy to calculate
2. Strongly affected by extreme values (the range is not a resistant measure of variability)
B. Interquartile range (IQR)
In order to talk about the interquartile range, we first need to talk about percentiles.
The pth percentile of the data set is a measurement such that, after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100 - p)% are above it.
Thus, the median is the 50th percentile.
Also, Q_{1} = lower quartile = 25th percentile and Q_{3} = upper quartile = 75th percentile.
The interquartile range, denoted IQR, is the difference between the upper and lower quartiles.
IQR = Q_{3} - Q_{1} = upper quartile - lower quartile = 75th percentile - 25th percentile.
Details about how to compute IQR will be given in Lesson 2.3.
Note: IQR is not affected by extreme values. It is thus a resistant measure of variability.
C. Variance and Standard Deviation
Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded:
Pieces of candy from vending machine A:
1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3
Pieces of candy from vending machine B:
2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3
Dotplots for pieces of candy from vending machine A and vending machine B:
They have the same center, but what about their spreads? One way to compare their spreads is to compute their standard deviations. In the following section, we are going to talk about how to compute the sample variance and sample standard deviation for a data set.
Variance is the average squared distance from the mean.
Population variance is defined as:
\[{\sigma}^2=\sum_{i=1}^N \frac{(y_i-\mu)^2}{N}\]
In this formula μ is the population mean and the summation is over all possible values of the population. N is the population size.
The sample variance, which is computed from the sample and used to estimate σ^{2}, is:
\[s^2=\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}\]
Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by \(\bar{y}\), the y_{i}'s tend to be closer to \(\bar{y}\) than to μ. To compensate, we divide by a smaller number, n - 1.
For example, find \(s^2\) for data set A: 1, 2, 3, 3, 4, 5
\[\bar{y}=\frac{1+2+3+3+4+5}{6}=3\]
\[s^2=\frac{(y_1-\bar{y})^2+\cdots +(y_n-\bar{y})^2}{n-1}=\frac{(1-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(4-3)^2+(5-3)^2}{6-1}=2\]
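The same calculation for data set A can be written out in Python. This is a minimal sketch: `sample_variance` is a name we chose for the illustration, and the standard library's `statistics.variance` (which also uses the n - 1 divisor) is shown only as a cross-check.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

machine_a = [1, 2, 3, 3, 4, 5]

s2 = sample_variance(machine_a)
print(s2)                              # 2.0
print(statistics.variance(machine_a))  # library version, same n - 1 definition
```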
Calculate s^{2} for data set B yourself and check that it is smaller than the s^{2} for data set A.
Standard Deviation
\(\sigma=\sqrt{{\sigma}^2}\) has the same unit as the y_{i}'s. This is a desirable property since one may think about the spread in terms of the original unit.
σ is estimated by the sample standard deviation s:
\[s=\sqrt{s^2}\]
For the data set A,
\(s=\sqrt{2}=1.414\) pieces of candy.
Calculate the standard deviation for data set B yourself.
The standard deviation is very useful. One reason is that it has the same unit as the measurements. Also, the empirical rule, which will be explained in the following section, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.
Empirical Rule: if the set of measurements follows a bell-shaped distribution, then
\(\bar{y} \pm s\) contains about 68% of the data
\(\bar{y} \pm 2s\) contains about 95% of the data
\(\bar{y} \pm 3s\) contains almost all of the data
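One way to see the rule at work is to count, for a given data set, the fraction of measurements inside each interval. The sketch below does this for simulated bell-shaped data; the simulated sample, its mean of 50 and standard deviation of 10, and the helper name `coverage` are all assumptions made for the illustration and do not come from the lesson.

```python
import random
import statistics

def coverage(values, k):
    """Fraction of values within k sample standard deviations of the sample mean."""
    y_bar = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [y for y in values if y_bar - k * s <= y <= y_bar + k * s]
    return len(inside) / len(values)

random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(50, 10) for _ in range(10_000)]  # simulated bell-shaped data

for k in (1, 2, 3):
    print(k, round(coverage(sample, k), 3))
```

The same `coverage` function can be applied to any list of measurements to check how closely the 68%, 95%, and "almost all" figures hold.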
Review of the Empirical Rule
by Course Author Dr. Mosuk Chow  (length 2:00)
The following five examples (a-e) show that the empirical rule is not that far off even when the underlying distribution is not bell shaped.
a. For the following graph, \(\bar{y}=5.5\), \(s=1.49\)
60% within \(\bar{y} \pm s\)
(5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)
94% within \(\bar{y} \pm 2s\)
(5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)
100% within \(\bar{y} \pm 3s\)
(5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)
b. For the following graph, \(\bar{y}=5.5\), \(s=2.07\)
64% within \(\bar{y} \pm s\)
96% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)
c. For the following graph, \(\bar{y}=5.5\), \(s=2.89\)
60% within \(\bar{y} \pm s\)
100% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)
d. For the following graph, \(\bar{y}=3.49\), \(s=1.87\)
75% within \(\bar{y} \pm s\)
96% within \(\bar{y} \pm 2s\)
98.5% within \(\bar{y} \pm 3s\)
e. For the following graph, \(\bar{y}=2.57\), \(s=1.87\)
87% within \(\bar{y} \pm s\)
95% within \(\bar{y} \pm 2s\)
97.6% within \(\bar{y} \pm 3s\)
Ponder the following, then click on the icon to the left to display the statistical application example.
How can one find an approximate value of s without going through the detailed computation? It follows from the empirical rule that approximately 95% of the measurements (almost all) lie in \(\bar{y} \pm 2s\), an interval of width 4s. Thus:
Range ≈ 4s
Approximate value of s: \(s\approx \frac{range}{4}\)
Why don't we say \(\bar{y} \pm 3s\) contains all and divide by 6 to obtain the approximate value of s?
It is important to remember that one has to use the formula:
\(s=\sqrt{\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}}\)
to compute the sample standard deviation. The formula \(s\approx \frac{range}{4}\) gives only a rough estimate of s.
For example, the actual ages (in years) of 36 sampled millionaires, arranged in increasing order, are:
31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
71, 71, 74, 75, 77, 79, 79, 79
The data range is from 31 to 79. The approximate value of s is thus: (79 - 31) / 4 = 12 years.
Minitab command to compute the descriptive statistics for the data set:
1. Stat > Basic Statistics > Display Descriptive Statistics
2. Specify the quantitative variable in the variable text box
3. Select OK
Descriptive statistics: age of millionaires
Variable   N    Mean    Median  TrMean  StDev   SEMean  Minimum  Maximum  Q_{1}   Q_{3}
age        36   58.53   59.50   58.75   13.36   2.23    31.00    79.00    48.00   68.75
The standard deviation is 13.36, which differs from the approximate value of 12. The approximation is not that close because the histogram of the data is not very bell shaped.
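The comparison between the range/4 estimate and the exact sample standard deviation can also be reproduced outside Minitab. The following Python sketch uses only the standard library; the variable names are ours.

```python
import statistics

ages = [31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
        54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
        71, 71, 74, 75, 77, 79, 79, 79]

rough_s = (max(ages) - min(ages)) / 4  # range / 4 shortcut
exact_s = statistics.stdev(ages)       # sample standard deviation (n - 1 divisor)

print(f"range/4 estimate: {rough_s:.2f}")
print(f"sample std dev:   {exact_s:.2f}")
```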
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, chapter 3.6.
Ponder the following, then click the icon to the left to display the statistical application example.
We want a graph that is not as detailed as a histogram, but still shows:
1. the skewness of the distribution
2. the central location
3. the variability
We need: min, Q_{1} (lower quartile), Q_{2} (median), Q_{3} (upper quartile), and max. This list is also called the five number summary.
Note: We do not follow our textbook's way to calculate Q_{1}, Q_{2}, and Q_{3}.
The results may sometimes be different from the results in our textbook, but will always be the same as Minitab's result.
Recall that the mean is not a resistant measure of the central location but the median is. Neither the range nor the standard deviation is a resistant measure of the spread, but the IQR is. Thus, in the box plot we use the median and IQR.
How do we compute quartiles? There are two steps to follow: find the position of the quartile, then find its value.
If there are n observations arranged in increasing order, then the first quartile is at position \(\frac{1}{4} (n+1)\), the second quartile is at position \(\frac{2}{4} (n+1)\), and the third quartile is at position \(\frac{3}{4} (n+1)\). When a position is not a whole number, the value is found by interpolating between the two neighboring observations.
Once we find the first and the third quartiles, we can compute the interquartile range (IQR) by:
IQR = Q_{3} - Q_{1}
Roughly speaking, IQR gives the range of the middle 50% of the observations.
The final exam scores of 18 students are (in increasing order):
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
In this example, n = 18.
For Q_{1}, its position is:
\(\frac{18+1}{4}=4.75\)
The actual value of Q_{1}:
Q_{1} = 67 (4th position) + 0.75 · (71 - 67) = 70
For Q_{2}, its position is:
\(\frac{18+1}{2}=9.5\)
The actual value of Q_{2}:
Q_{2} = 82 (9th position) + 0.5 · (83 - 82) = 82.5
For Q_{3}, its position is:
\(\frac{3(18+1)}{4}=14.25\)
The actual value of Q_{3}:
Q_{3} = 88 (14th position) + 0.25 · (92 - 88) = 89
Thus the five number summary is:
min   Q_{1}   Q_{2}   Q_{3}   max
24    70      82.5    89      97
Five number summary: min, Q_{1}, Q_{2}, Q_{3}, and max.
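The (n + 1) position rule used in this example can be sketched in Python as follows. The function name `quartile_by_position` is hypothetical and written only for this illustration; it assumes the input list is already sorted in increasing order.

```python
def quartile_by_position(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting from 1, with linear interpolation."""
    n = len(sorted_values)
    position = fraction * (n + 1)
    lower = int(position)      # whole part: rank of the lower neighbor
    weight = position - lower  # fractional part: how far to move toward the next value
    if lower < 1:              # position falls before the first observation
        return sorted_values[0]
    if lower >= n:             # position falls at or after the last observation
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + weight * (above - below)

scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]

q1 = quartile_by_position(scores, 0.25)
q2 = quartile_by_position(scores, 0.50)
q3 = quartile_by_position(scores, 0.75)
five_number_summary = (min(scores), q1, q2, q3, max(scores))
print(five_number_summary)  # (24, 70.0, 82.5, 89.0, 97)
```

Other software may use a different interpolation rule and therefore report slightly different quartiles for the same data. Python's `statistics.quantiles`, with its default `method='exclusive'`, appears to follow the same (n + 1) rule, though that is worth confirming against its documentation.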
Using the five number summary, one can construct a skeletal box plot.
The skeletal box plot of the final exam score:
Box plots are more detailed than skeletal box plots by also showing outliers. The following terminology will prepare us to draw the box plot.
Potential outliers are observations that lie outside the lower and upper limits.
Lower limit = Q_{1} - 1.5 · IQR
Upper limit = Q_{3} + 1.5 · IQR
Adjacent values are the most extreme values that are not potential outliers. For the final exam score data:
IQR = Q_{3} - Q_{1} = 89 - 70 = 19.
Lower limit = Q_{1} - 1.5 · IQR = 70 - 1.5 · 19 = 41.5
Upper limit = Q_{3} + 1.5 · IQR = 89 + 1.5 · 19 = 117.5
Lower adjacent value = 58
Upper adjacent value = 97
Since 24 lies below the lower limit, it is a potential outlier.
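The limits, adjacent values, and potential outliers for the exam scores can be checked with a few lines of Python. This is a sketch under the assumption that the quartiles Q_{1} = 70 and Q_{3} = 89 from the worked example are taken as given; the variable names are ours.

```python
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]
q1, q3 = 70, 89  # quartiles taken from the worked example

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Potential outliers lie outside the limits; adjacent values are the
# most extreme observations that are still inside them.
outliers = [y for y in scores if y < lower_limit or y > upper_limit]
inside = [y for y in scores if lower_limit <= y <= upper_limit]
lower_adjacent, upper_adjacent = min(inside), max(inside)

print(iqr, lower_limit, upper_limit)   # 19 41.5 117.5
print(lower_adjacent, upper_adjacent)  # 58 97
print(outliers)                        # [24]
```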
Minitab command for a box plot: Graph > Box plot.
Box plot of final exam score:
How to tell the shape of the distribution by the box plot:
1. In a packing plant, a machine packs cartons with jars. The times it takes each of two machines, a new one and an old one, to pack 10 cartons are recorded. The results (machine.txt), in seconds, are shown in the following table:
New machine                     Old machine
42.1  41.3  42.4  43.2  41.8    42.7  43.8  42.5  43.1  44.0
41.0  41.8  42.8  42.3  42.7    43.6  43.3  43.5  41.7  44.1
a. Compute the mean and standard deviation for the time to pack a carton for each machine.
b. Plot the data for each machine.
c. Describe the data for the two machines.
2. The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student-faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took these students to complete the required modular material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 4 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the time it took students to complete the required material.
a. Calculate the mode, the median, and the mean for these recorded completion times.
b. Guess the value of s.
c. Compute s by using the shortcut formula and compare your answer to that of part (b) above.
d. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.
e. Construct the intervals and check whether the Empirical Rule applies to this data set.
Now, find Homework 2 in ANGEL and submit it to the Dropbox by the due date.
If there are data referred to in the homework problems, you will also find these data files in ANGEL.