We will first talk about the important concepts of statistical inference. Then a few descriptive measures of the most important characteristic of a data set, central tendency, will be given. After that, a few descriptive measures of the other important characteristic of a data set, variability, will be discussed. This lesson will conclude with a discussion of box plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.
Again, this lesson will focus on simple examples that can be calculated or drawn by hand. This lesson will be followed by another lesson that will work through many of these procedures using Minitab.
Lesson 2 Objectives 
Upon successful completion of this lesson, you will be able to:

Introduction to Lesson 2
by Course Author Dr. Mosuk Chow  (length 2:30)
To summarize a data set, we want to report different attributes of the data set. One important attribute is the central tendency of the data; the other important attribute is how spread out the data is. Another attribute to report is the shape of the data. In this lesson, we will mainly discuss the first two attributes, central tendency and spread.
For the central tendency, we are talking about where the center of the data is located.
For the spread, we are talking about the variability of the data.
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (See your course schedule.)
Three of the many ways to measure central tendency are:
1. Mean: the average of the data
2. Median: the middle value of the ordered data
3. Mode: the value that occurs most often in the data
In most research experimental situations, examination of all members of a population is not typically conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a representative subset of the population.
Let's take a closer look at what this diagram implies with Dr. Wiesner.
Descriptive measures of population are parameters. Descriptive measures of a sample are statistics. For example, a sample mean is a statistic and a population mean is a parameter. The sample mean is usually denoted by \(\bar{y}\):
\[\bar{y}=\frac{y_1+y_2+\ldots+y_n}{n}=\frac{\sum^n_{i=1} y_i}{n}\]
where \(n\) is the sample size and the \(y_i\) are the measurements. One may need to use the sample mean to estimate the population mean since usually only a random sample is drawn and we don't know the population mean.
A Note on Notation!
What if we used \(x_i\) for our measurements instead of \(y_i\)? Is this a problem? No. The formula would simply look like this: \[\bar{x}=\frac{x_1+x_2+\ldots+x_n}{n}=\frac{\sum^n_{i=1} x_i}{n}\] The formulas are exactly the same. The letters that you select to denote the measurements are up to you. For instance, many textbooks use x instead of y to denote the measurements. The point is to understand how the calculation that is expressed in the formula works. In this case, the formula is calculating the mean by summing all of the observations and dividing by the number of observations. There is some notation that you will come to see as standard, e.g., n will always denote the sample size. We will make a point of letting you know what these are. However, when it comes to the variables, these labels can (and do) vary. For example, in one study x may be used to denote weight and y may be used to denote height (or the reverse may be used!), but n will always be used to denote sample size in each case.
Note that for the data set:
1, 1, 2, 3, 13
mean = 4, median = 2, mode = 1
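As a quick check of the three values above, here is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only and are not part of the lesson.

```python
import statistics

data = [1, 1, 2, 3, 13]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # value that occurs most often

print("mean =", mean, "median =", median, "mode =", mode)
```

Note how the single large value 13 pulls the mean above the median.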
Steps to finding the median for a set of data:
Mean, median and mode are usually not equal. When the data is symmetric, the mean is equal to the median.
4. Trimmed Mean
One shortcoming of the mean is that it is easily affected by extreme values.
Consider the aptitude test scores of ten children below:
95, 78, 69, 91, 82, 76, 76, 86, 88, 80
Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1
If the entry 91 is mistakenly recorded as 9, the mean would be 73.9, which is very different from 82.1.
On the other hand, let us see the effect of the mistake on the median value:
The original data set in increasing order are:
69, 76, 76, 78, 80, 82, 86, 88, 91, 95
With n = 10, the median position is found by (10 + 1) / 2 = 5.5. Thus, the median is the average of the fifth (80) and sixth (82) ordered values, and the median = 81.
The data set (with 91 coded as 9) in increasing order is:
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
where the median = 79.
The medians of the two sets are not that different. Therefore the median is not that affected by the extreme value 9.
Measures that are not that affected by extreme values are called resistant.
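The comparison above can be reproduced with a short sketch. It assumes the ten aptitude scores from this example and uses Python's standard `statistics` module; the variable names are hypothetical.

```python
import statistics

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
# Same scores with the entry 91 mistakenly recorded as 9.
scores_with_error = [9 if x == 91 else x for x in scores]

# How far each measure moves because of the single recording error.
mean_shift = statistics.mean(scores) - statistics.mean(scores_with_error)
median_shift = statistics.median(scores) - statistics.median(scores_with_error)

print("mean moves by", round(mean_shift, 1))
print("median moves by", median_shift)
```

The mean moves by 8.2 points while the median moves by only 2, which is what "resistant" means in practice.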
A variation of the mean is the trimmed mean. A 10% trimmed mean drops the highest 10%, the lowest 10%, and averages the remaining. Let's calculate the trimmed mean for the data we were looking at above:
(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)
The 10% trimmed mean = 82.13
(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)
The 10% trimmed mean = 79.38
The 10% trimmed means of the two sets are not that different. The trimmed mean is not as affected by the extreme value 9 as the mean.
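The trimming procedure can be sketched as follows. The function name `trimmed_mean` is hypothetical, and truncating the number of dropped values to a whole number is an assumption; with n = 10 and 10% trimming it drops exactly one value from each end, as in the example.

```python
def trimmed_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the ordered values, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
scores_with_error = [9 if x == 91 else x for x in scores]

tm_original = trimmed_mean(scores, 0.10)
tm_with_error = trimmed_mean(scores_with_error, 0.10)
print(round(tm_original, 2), round(tm_with_error, 2))
```

Trimming changes only the summary calculation; the data set itself is left intact.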
After reading this lesson you should know that there are quite a few options when one wants to describe central tendency, for example, mean, median, mode and trimmed mean. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily affected by extreme values. One remedy is to use the trimmed mean to estimate the central tendency. Remember, however, that this is very different from saying that one can trim data. Unless data points are known mistakes, one should not remove them from the data set! One should keep the extreme points and use more resistant measures. For example, use the sample median to estimate the population median. Or, use the sample trimmed mean to estimate the population trimmed mean. Again, this is very different from saying that it is OK to trim data from a data set.
Skewness is a measure of the degree of asymmetry of the distribution.
1. Symmetric
Mean, median, and mode are all the same here; the distribution is mound shaped, and no skewness is apparent. The distribution is described as symmetric.
The above distribution is symmetric.
2. Skewed Left
Mean to the left of the median, long tail on the left.
The above distribution is skewed to the left.
3. Skewed Right
Mean to the right of the median, long tail on the right.
The above distribution is skewed to the right.
When one has very skewed data, it is better to use the median as the measure of central tendency since the median is not much affected by extreme values.
Salary distributions are almost always positively skewed, with a few people that make the most money. To illustrate this, consider your favorite sports team or even the company for which you work. There will be one or two players or personnel that earn the “big bucks”, followed by others who earn less. This will produce a shape that is skewed to the right. Knowing this can be a useful aid in negotiating a higher salary.
When one interviews for a position and the discussion gets around to compensation, it is common that the interviewer states an offer that is “typical for someone in your position”. That is, they are offering you the average salary for someone with your particular skill set (e.g. little experience). But is this average the mode, median, or mean? The company, for whom business is business, will want to pay you the least they can while you prefer to earn the most you can. Since salaries tend to be skewed to the right, the offer will most likely reflect the mode or median. You simply need to ask to which “average” the offer refers and what the mean is, since the mean would be the highest of the three values. Once you have these averages, you can begin to negotiate toward the highest number.
What happens to the mean and median if we add a constant to each observation in a data set, or multiply each observation by a constant? Consider, for example, an instructor who curves an exam by adding five points to each student’s score. What effect does this have on the mean and the median? Adding a constant to each value has the intended effect of changing the mean and median by that constant. For example, if 5 were added to each of the 10 aptitude scores above, the mean of the new data set would be 87.1 (the original mean of 82.1 plus 5) and the new median would be 86 (the original median of 81 plus 5).
Similarly, if each observed data value was multiplied by a constant, the new mean and median would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original scores were doubled, then the new mean and new median would be double the original mean and median. As we will learn shortly, the effect is not the same on the variance!
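A small sketch makes both effects concrete, using the ten aptitude scores from this lesson. The variable names are illustrative and not part of the course materials.

```python
import statistics

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]

shifted = [x + 5 for x in scores]  # every score curved up by 5 points
doubled = [2 * x for x in scores]  # every score multiplied by 2

shifted_mean, shifted_median = statistics.mean(shifted), statistics.median(shifted)
doubled_mean, doubled_median = statistics.mean(doubled), statistics.median(doubled)

print("add 5:   mean", shifted_mean, "median", shifted_median)
print("times 2: mean", doubled_mean, "median", doubled_median)
```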
Why would you want to know this? One reason, especially for those moving onward to more applied statistics (e.g. Regression, ANOVA), is transforming data. For many applied statistical methods a required assumption is that the data is normal, or very nearly bell-shaped. When the data is not normal, statisticians will transform the data using numerous techniques, e.g. a logarithmic transformation. But the log cannot be taken of all values; for instance, the log of 0 is undefined. However, if we add a constant to all the data values, making them all greater than zero, then a log can be taken without risk. We just need to remember that the original data was transformed!
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (see course schedule).
Think about the following questions.
If you can use two numbers to summarize Jessica's weight data, which two characteristics will you use as measures?
Why do we want to know variability?
There are many ways to describe variability including:
Let's look at each of these in turn.
A. Range: R = maximum - minimum
- Easy to calculate
- Very much affected by extreme values (range is not a resistant measure of variability)
B. Interquartile range (IQR)
In order to talk about interquartile range, we need to first talk about percentiles.
The pth percentile of the data set is a measurement such that, after the data are ordered from smallest to largest, at most p% of the data are below this value and at most (100 - p)% are above it.
Thus, the median is the 50th percentile. Fifty percent of the data values fall at or below the median.
Also, Q_{1} = lower quartile = the 25th percentile and Q_{3} = upper quartile = the 75th percentile.
The interquartile range is the difference between upper and lower quartiles and denoted as IQR.
IQR = Q_{3} - Q_{1} = upper quartile - lower quartile = 75th percentile - 25th percentile.
Details about how to compute IQR will be given in Lesson 2.3.
Note: IQR is not affected by extreme values. It is thus a resistant measure of variability.
C. Variance and Standard Deviation
Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded for six trials at each vending machine:
Pieces of candy from vending machine A:
1, 2, 3, 3, 5, 4
mean = 3, median = 3, mode = 3
Pieces of candy from vending machine B:
2, 3, 3, 3, 3, 4
mean = 3, median = 3, mode = 3
Dotplots for the pieces of candy from vending machine A and vending machine B:
They have the same center, but what about their spreads? One way to compare their spreads is to compute their standard deviations. In the following section, we are going to talk about how to compute the sample variance and the sample standard deviation for a data set.
Variance is the average squared distance from the mean.
Population variance is defined as:
\[{\sigma}^2=\sum_{i=1}^N \frac{(y_i-\mu)^2}{N}\]
In this formula μ is the population mean and the summation is over all possible values of the population. N is the population size.
The sample variance that is computed from the sample and used to estimate \(\sigma^2\) is:
\[s^2=\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}\]
Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by \(\bar{y}\), the y_{i}'s tend to be closer to \(\bar{y}\) than to μ. To compensate, we divide by a smaller number, n - 1. The sample variance (and therefore the sample standard deviation) is the common default calculation used by software. When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that this is sample data and therefore calculate the sample variance and sample standard deviation.
For example, let's find \(s^2\) for the data set from vending machine A: 1, 2, 3, 3, 4, 5
\[\bar{y}=\frac{1+2+3+3+4+5}{6}=3\]
\[s^2=\frac{(y_1-\bar{y})^2+\cdots +(y_n-\bar{y})^2}{n-1}=\frac{(1-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(4-3)^2+(5-3)^2}{6-1}=2\]
Calculate \(s^2\) for the data set from vending machine B yourself and check that it is smaller than the \(s^2\) for data set A.
Standard Deviation
The population standard deviation is denoted by σ and found by \(\sigma=\sqrt{{\sigma}^2}\). It has the same unit as the y_{i}'s. This is a desirable property since one may think about the spread in terms of the original unit.
σ is estimated by the sample standard deviation s :
\[s=\sqrt{s^2}\]
For the data set A,
\(s=\sqrt{2}=1.414\) pieces of candy.
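The hand calculation for machine A can be mirrored in a few lines. This is a sketch only: `sample_variance` is a hypothetical helper written to follow the formula above, and the standard library's `statistics.variance` (which also divides by n - 1) is used as a cross-check.

```python
import math
import statistics

machine_a = [1, 2, 3, 3, 4, 5]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

s2 = sample_variance(machine_a)             # sample variance
s = math.sqrt(s2)                           # sample standard deviation
library_s2 = statistics.variance(machine_a) # standard-library cross-check

print("s^2 =", s2, "s =", round(s, 3))
```

Dividing by n instead (the population formula, `statistics.pvariance`) would give a smaller value for the same six numbers.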
Calculate the standard deviation for the data set from vending machine B.
The standard deviation is approximately the average distance the values of a data set are from the mean, and is a very useful measure. One reason is that it has the same unit of measurement as the data itself (e.g. if a sample of student heights were in inches then so, too, would be the standard deviation). The variance would be in squared units, for example inches\(^2\). Also, the empirical rule, which will be explained in the following section, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.
What happens to measures of variability if we add or multiply each observation in a data set by a constant? We learned previously about the effect such actions have on the mean and the median, but do variation measures behave similarly? Not really.
When we add a constant to all values we are basically shifting the data upward (or downward if we subtract a constant). This has the result of moving the middle but leaving the variability measures (e.g. range, IQR, variance, standard deviation) unchanged.
On the other hand, if one multiplies each value by a constant, this does affect measures of variation. The new variance is the original variance multiplied by the square of the constant, while the standard deviation, range, and IQR are multiplied by the constant. For example, if the observed values of Machine A in the example above were multiplied by three, the new variance would be 18 (the original variance of 2 multiplied by 9). The new standard deviation would be 4.242 (the original standard deviation of 1.414 multiplied by 3). The range and IQR would also change by a factor of 3.
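Both claims can be checked numerically on the machine A data. A minimal sketch using Python's standard `statistics` module; the shift of 5 is an arbitrary illustrative constant.

```python
import statistics

machine_a = [1, 2, 3, 3, 4, 5]

shifted = [y + 5 for y in machine_a]  # add a constant: spread should not change
tripled = [3 * y for y in machine_a]  # multiply by 3: variance x 9, SD x 3

var_original = statistics.variance(machine_a)
var_shifted = statistics.variance(shifted)
var_tripled = statistics.variance(tripled)
sd_tripled = statistics.stdev(tripled)
range_tripled = max(tripled) - min(tripled)

print(var_original, var_shifted, var_tripled, round(sd_tripled, 3), range_tripled)
```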
The Empirical Rule is sometimes referred to as the 68-95-99.7% Rule. If the set of measurements follows a bell-shaped distribution, then
\(\bar{y} \pm s\) contains about 68% of the data
\(\bar{y} \pm 2s\) contains about 95% of the data
\(\bar{y} \pm 3s\) contains almost all (about 99.7%) of the data
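One way to see the rule at work is to count how much of a bell-shaped sample falls in each interval. The sketch below uses simulated data, which is an assumption made purely for illustration: 10,000 draws from a normal distribution with mean 50 and standard deviation 10, with a fixed seed. None of these values come from the lesson, and `fraction_within` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulated sample is repeatable
data = [random.gauss(50, 10) for _ in range(10_000)]  # simulated bell-shaped data

y_bar = statistics.mean(data)
s = statistics.stdev(data)

def fraction_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    return sum(abs(v - center) <= k * spread for v in values) / len(values)

within = {k: fraction_within(data, y_bar, s, k) for k in (1, 2, 3)}
for k, frac in within.items():
    print(f"within {k} s of the mean: {frac:.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997, though the exact digits depend on the simulated sample.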
Review of the Empirical Rule
by Course Author Dr. Mosuk Chow  (length 2:00)
Summary Transcript
In Lesson 2.2 we describe the empirical rule. The empirical rule helps us to provide an estimate for the standard deviation. The empirical rule says that for a bell-shaped curve roughly 68% of the data falls within one standard deviation of the sample mean. Roughly 95% of the data falls within two standard deviations of the mean. And almost all of the data will fall within three standard deviations of the sample mean.
Using the empirical rule, if your data is roughly bell shaped, then one way to find a rough estimate for the standard deviation of the data set is to find the range and divide it by four.
The reason we divide by four instead of six is that this gives a more conservative estimate. In this case, "conservative" means a larger estimate, as we'd prefer to overestimate rather than underestimate.
One important point, whenever we want to find out the standard deviation of the data set, we should use the formula for this.
The following five examples (a-e) show that the empirical rule is not that far off even when the underlying distribution is not bell shaped.
a. For the following graph, \(\bar{y}=5.5\), \(s=1.49\)
60% within \(\bar{y} \pm s\): (5.5 - 1.49, 5.5 + 1.49) = (4.01, 6.99)
94% within \(\bar{y} \pm 2s\): (5.5 - 2.98, 5.5 + 2.98) = (2.52, 8.48)
100% within \(\bar{y} \pm 3s\): (5.5 - 4.47, 5.5 + 4.47) = (1.03, 9.97)
b. For the following graph, \(\bar{y}=5.5\), \(s=2.07\)
64% within \(\bar{y} \pm s\)
96% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)
c. For the following graph, \(\bar{y}=5.5\), \(s=2.89\)
60% within \(\bar{y} \pm s\)
100% within \(\bar{y} \pm 2s\)
100% within \(\bar{y} \pm 3s\)
d. For the following graph, \(\bar{y}=3.49\), \(s=1.87\)
75% within \(\bar{y} \pm s\)
96% within \(\bar{y} \pm 2s\)
98.5% within \(\bar{y} \pm 3s\)
e. For the following graph, \(\bar{y}=2.57\), \(s=1.87\)
87% within \(\bar{y} \pm s\)
95% within \(\bar{y} \pm 2s\)
97.6% within \(\bar{y} \pm 3s\)
Approximating the Standard Deviation
Think about the following question.
How can one find an approximate value of s without going through the detailed computation? It follows from the empirical rule that approximately 95% of the measurements (almost all) lie within \(\bar{y} \pm 2s\), an interval of width 4s. Therefore:
Range ≈ 4s
Approximate value of \(s\approx \frac{range}{4}\)
Why don't we say \(\bar{y} \pm 3s\) contains all and divide by 6 to obtain the approximate value of s?
It is important to remember that one has to use the formula:
\(s=\sqrt{\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}}\)
to compute the sample standard deviation. The approximation \(s\approx \frac{range}{4}\) only gives a rough estimate of s.
For example, the actual ages (in years) of 36 millionaires sampled, arranged in increasing order, are:
31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
71, 71, 74, 75, 77, 79, 79, 79
The data range is from 31 to 79. Thus, using the 'shortcut' formula, the approximate value of s is (79 - 31) / 4 = 12 years.
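To see how rough the shortcut is, the sketch below compares range / 4 with the sample standard deviation computed from the formula, using the 36 ages listed above. It relies on Python's standard `statistics.stdev`, which divides by n - 1; the variable names are illustrative.

```python
import statistics

ages = [
    31, 38, 39, 39, 42, 42, 45, 47, 48, 48, 48, 52, 52, 53,
    54, 55, 57, 59, 60, 61, 64, 64, 66, 66, 67, 68, 68, 69,
    71, 71, 74, 75, 77, 79, 79, 79,
]

approx_s = (max(ages) - min(ages)) / 4  # shortcut: range / 4
actual_s = statistics.stdev(ages)       # sample standard deviation (n - 1 divisor)

print("range / 4 =", approx_s)
print("sample s  =", round(actual_s, 2))
```

For these ages the two values differ by a bit more than a year, which illustrates why the shortcut should only be treated as a rough estimate.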
Above we considered three measures of variation: Range, Interquartile Range (IQR), and Variance (and its square root counterpart, Standard Deviation). These are all measures we can calculate from one quantitative variable, e.g. height or weight. But how can we compare the dispersion (i.e. variability) of data from two or more distinct populations that have vastly different means? A popular statistic to use in such situations is the Coefficient of Variation, or CV. This is a unit-free statistic, and the higher its value, the greater the dispersion. The calculation of CV is:
CV = Standard Deviation / Mean
You are shopping for toilet tissue. As you compare prices of various brands, some offer price per roll while others offer price per sheet. You are interested in determining which pricing method has less variability so you sample several of each and calculate the mean and standard deviation for the sampled items that are priced per roll, and the mean and standard deviation for the sampled items that are priced per sheet. The table below summarizes your results.
Item | Mean | Standard Deviation
Price per Roll | 0.9196 | 0.4233
Price per Sheet | 0.01134 | 0.00553
Comparing the standard deviations, the Per Sheet prices appear to have much less variability. However, their mean is also much smaller. The coefficient of variation allows us to make a relative comparison of the variability of these two pricing schemes:
\[CV_{Roll}= 0.4233 / 0.9196 = 0.46 \; \text{and} \; CV_{Sheet} = 0.00553 / 0.01134 = 0.49\]
Relatively speaking, the variation for Price per Sheet is greater than the variability for Price per Roll.
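The same comparison in code, as a minimal sketch. The function name `coefficient_of_variation` is hypothetical; the inputs are the means and standard deviations from the table above.

```python
def coefficient_of_variation(sd, mean):
    """Unit-free measure of relative spread: standard deviation divided by the mean."""
    return sd / mean

cv_roll = coefficient_of_variation(0.4233, 0.9196)
cv_sheet = coefficient_of_variation(0.00553, 0.01134)

print("CV per roll: ", round(cv_roll, 2))
print("CV per sheet:", round(cv_sheet, 2))
```

Because the units cancel, the two CVs can be compared directly even though the raw standard deviations differ by a factor of roughly 75.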
Another example to consider is hotel pricing. Think of prices for luxury and budget hotels. Which do you think would have the higher average cost per night? Which would have the greater standard deviation? The CV would allow you to compare this dispersion in costs in relative terms by accounting for the fact that the luxury hotels would have a greater mean and standard deviation.
The Z-value, sometimes referred to as the Z-score or simply Z, represents the number of standard deviations an observation is from the mean for a set of data. To find the Z-score for a particular observation we apply the following formula:
Z = (observed value – mean) / SD
For a recent final exam the mean was 68.55 with a standard deviation of 15.45.
If you scored an 80%: Z = (80 - 68.55) / 15.45 = 0.74, which means your score of 80 was 0.74 SD above the mean.
If you scored a 60%: Z = (60 - 68.55) / 15.45 = -0.55, which means your score of 60 was 0.55 SD below the mean.
Is it always good to have a positive Z score? It depends on the question.
For exams you would want a positive Z-score (indicates you scored higher than the mean). However, if one was analyzing days of missed work then a negative Z-score would be more appealing as it would indicate the person missed less than the mean number of days.
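The two exam calculations can be written as a small function. This is a sketch; `z_score` is a hypothetical name, and the mean and standard deviation are the ones quoted for the final exam above.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observation lies above (+) or below (-) the mean."""
    return (observed - mean) / sd

exam_mean, exam_sd = 68.55, 15.45

z_80 = z_score(80, exam_mean, exam_sd)
z_60 = z_score(60, exam_mean, exam_sd)

print(round(z_80, 2), round(z_60, 2))
```

The sign carries the direction: a score below the mean yields a negative Z.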
Characteristics of Z-scores
Unit Summary 

Reading Assignment
An Introduction to Statistical Methods and Data Analysis, (see course schedule).
Think about the following question.
We want a graph that is not as detailed as a histogram, but still shows:
1. the skewness of the distribution
2. the central location
3. the variability
To create this plot we need the following:
This list is also called the five number summary.
NOTE: The method we will demonstrate for calculating Q_{1} and Q_{3} may differ from the method described in our textbook. The results may sometimes be different from the results in our textbook, but will always be the same as Minitab's result (which we will calculate later).
Recall that the mean is not a resistant measure of the central location, but the median is (a resistant measure is one that is not greatly affected by extreme observations). Neither the range nor the standard deviation is a resistant measure of the spread, but the IQR is. Thus, in the box plot we use the median and IQR.
How do we compute quartiles? There are two steps to follow:
If there are n data, arranged in increasing order, then the first quartile is at position \(\frac{1}{4} (n+1)\), second quartile (i.e. the median) is at position \(\frac{2}{4} (n+1)\). The third quartile is at position \(\frac{3}{4} (n+1)\).
Once we find the first and the third quartiles, we can compute the interquartile range (IQR) by:
IQR = Q_{3} - Q_{1}
Roughly speaking, IQR gives the range of the middle 50% of the observations.
The final exam scores of 18 students are (in increasing order):
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
In this example, n = 18.
For Q_{1}, its position is:
\[\frac{18+1}{4}=4.75\]
The actual value of Q_{1}:
Q_{1} = 67 (4th position) + 0.75 · (71 - 67) = 70
For the median, its position is:
\[\frac{18+1}{2}=9.5\]
The actual value of the median:
Q_{2} = 82 (9th position) + 0.5 · (83 - 82) = 82.5
For Q_{3}, its position is:
\[\frac{3(18+1)}{4}=14.25\]
Q_{3} = 88 (14th position) + 0.25 · (92 - 88) = 89
Thus the five number summary is:
min | Q_{1} | Median (Q_{2}) | Q_{3} | max
24 | 70 | 82.5 | 89 | 97
Five number summary: min, Q_{1}, median, Q_{3}, and max.
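The position-and-interpolate procedure used above can be sketched in code. The function `quartile` is a hypothetical helper written for this illustration, and clamping positions that fall outside the data to the minimum or maximum is an assumption that only matters for very small samples.

```python
def quartile(values, which):
    """Quartile via the (n + 1) position rule with linear interpolation.

    `which` is 1, 2, or 3 for Q1, the median, and Q3.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = which * (n + 1) / 4  # 1-based position, possibly fractional
    lower = int(position)           # whole part of the position
    fraction = position - lower     # fractional part used for interpolation
    if lower < 1:                   # assumption: clamp to the minimum
        return ordered[0]
    if lower >= n:                  # assumption: clamp to the maximum
        return ordered[-1]
    below = ordered[lower - 1]      # value at the whole-number position
    above = ordered[lower]          # next ordered value
    return below + fraction * (above - below)

scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]

five_number_summary = (
    min(scores), quartile(scores, 1), quartile(scores, 2), quartile(scores, 3), max(scores)
)
print(five_number_summary)
```

Python's `statistics.quantiles(scores, n=4)` with its default "exclusive" method uses the same (n + 1) positions and should agree with these quartiles; other software may use a different convention and give slightly different values.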
Using the five number summary, one can construct a skeletal box plot.
Here is a hand-drawn image of the skeletal box plot of the final exam score:
NOTE: Most statistical software packages do NOT create a skeletal box plot but instead produce the more detailed box plot described below. Box plots from statistical software are more detailed than skeletal box plots because they also show outliers. However, if there are no outliers, what is produced by the software is essentially the skeletal box plot. The following terminology will prepare us to understand and draw this more detailed type of box plot.
Potential outliers are observations that lie outside the lower and upper limits.
Lower limit = Q_{1} - 1.5 · IQR
Upper limit = Q_{3} + 1.5 · IQR
Adjacent values are the most extreme values that are not potential outliers. For the final exam score data:
IQR = Q_{3} - Q_{1} = 89 - 70 = 19.
Lower limit = Q_{1} - 1.5 · IQR = 70 - 1.5 · 19 = 41.5
Upper limit = Q_{3} + 1.5 · IQR = 89 + 1.5 · 19 = 117.5
Lower adjacent value = 58
Upper adjacent value = 97
Since 24 lies below the lower limit of 41.5, it is a potential outlier.
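The limits, potential outliers, and adjacent values for the exam scores can be computed as in the sketch below. It takes Q1 = 70 and Q3 = 89 from the worked example rather than recomputing them, and the variable names are illustrative.

```python
scores = [24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97]

q1, q3 = 70, 89  # quartiles taken from the worked example
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations outside the limits are flagged as potential outliers.
potential_outliers = [x for x in scores if x < lower_limit or x > upper_limit]

# Adjacent values: the most extreme observations that are still inside the limits.
inside = [x for x in scores if lower_limit <= x <= upper_limit]
lower_adjacent, upper_adjacent = min(inside), max(inside)

print(lower_limit, upper_limit, potential_outliers, lower_adjacent, upper_adjacent)
```

In the detailed box plot, the whiskers extend to the adjacent values and each potential outlier is drawn as a separate point.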
Statistical software will create a box plot of final exam score that may look like this:
Matching the Shape of the Distribution and the Corresponding Box Plot
When you have quantitative data that can be broken down by levels of a categorical variable, side-by-side box plots offer an excellent graphic representation. [NOTE: referring back to lesson one, think of having a quantitative response variable and a categorical explanatory variable.] The graphs can be compared for shape, outliers, variability, etc. For example, you may have a group of subjects for which you have their heights and sex. You are interested in comparing the heights of the females to those of the males. The side-by-side box plot produces an excellent visual comparison. Below is such a box plot comparing heights for a women's and a men's basketball team. By visual inspection one can conclude that the median height of males is higher and that the IQR is slightly larger for the males, the latter based on the width of the box.
Which method is better for identifying outliers? Both methods have both positive and negative features.
1. Our statistics department surveyed a random sample of 5 staff personnel and 5 faculty on how often during a week they used public transportation in traveling for work. The table below reflects the responses.
Staff | Faculty
4 | 2
2 | 2
5 | 0
1 | 5
1 | 3
a. What sampling method was used to gather this data? What population of interest is best represented by the samples?
b. Calculate by hand the mean and standard deviation for number of times a week public transportation was used by staff and faculty.
c. Based on means and standard deviations, do you think there is a statistically significant difference between these two means? Explain.
2. The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student-faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took the students to complete the required modular material:
16 8 33 21 34 17 12 14 27 6
33 25 16 7 15 18 25 29 19 27
5 12 29 22 14 25 21 17 9 4
12 15 13 11 6 9 26 5 16 5
9 11 5 6 5 23 21 10 17 15
Here is a link to the data (hours.txt) for the time it took students to complete the required material.
a. Calculate by hand the five number summary for these recorded completion times. Helpful hint: you can use software such as Excel to sort the data.
b. Do we expect the Empirical Rule to describe adequately the variability of these data? Explain.
c. Calculate the standard deviation, s, by using the shortcut formula and compare that answer to the actual standard deviation of 8.45.
d. The mean for this data set is 16. Using the actual s of 8.45, construct the intervals and check whether the Empirical Rule applies to this data set.
If you have a question about any part of these practice problems, please post your question to the course discussion forum.