In the beginning of this course (in the very first lesson!), we learned how to distinguish between discrete and continuous data. Discrete data are, again, data with a finite or countably infinite number of possible outcomes. Continuous data, on the other hand, are data which come from an interval of possible outcomes. Examples of discrete data include the number of siblings a randomly selected person has, the total on the faces of a pair of six-sided dice, and the number of students you need to ask before you find one who loves Stat 414. Examples of continuous data include:
In each of these examples, the resulting measurement comes from an interval of possible outcomes. Recall that the measurement tool is often the restricting factor with continuous data. That is, if I say I weigh 120 pounds, I don't actually weigh exactly 120 pounds... that's just what my scale tells me. In reality, I might weigh 120.01284027401307 pounds... that's where the interval of possible outcomes comes in. That is, the possible measurements cannot be put into one-to-one correspondence with the integers.
In this lesson, we'll investigate (or in some cases, review?) ways of summarizing continuous data. We'll summarize the data graphically using histograms, stem-and-leaf plots, and box plots. We've already discussed a couple of ways of summarizing continuous data numerically via the sample mean and sample variance. Here, we'll investigate how to summarize continuous data numerically using order statistics and various functions of order statistics.
One more thing here... we'll be learning how to summarize data by hand. In reality, you would rarely, if ever, do that in practice. Maybe if you were stranded on a desert island? In reality, 999 times out of 1000, you and I are going to use statistical software to calculate percentiles and to create histograms, stem-and-leaf plots, and box plots. What's important here is that you just get the idea of how such graphs are created and such statistics are calculated, so that you know what they tell you when you encounter them.
The material on this page should look awfully familiar as we briefly investigated histograms in the first lesson of the course. We review them again briefly here.
38 50 38 40 35 52 45 50 40 32 40 47 70 55 51
43 40 45 45 55 37 50 45 45 55 50 45 35 52 32
45 50 40 40 50 41 41 40 40 46 45 40 43 45 42
45 45 48 45 45 35 45 45 40 45 40 40 45 35 52
Recall that although the numbers look discrete, they are technically continuous. The measuring tools, which consisted of a piece of string and a ruler, were the limiting factors in getting more refined measurements. In most cases, it appears as if nose lengths come in five millimeter increments... 35, 40, 45, 55... but that's, again, just measurement error. In order to create a histogram of these continuous measurements, we will use the following guidelines.
First, you have to group the data into a set of classes, typically of equal length. There are many, many sets of rules for defining the classes. For our purposes, we'll just rely on our common sense — having too few classes is as bad as having too many.
Here's what the work would look like for our nose length example if we used 5 mm classes centered at 30, 35, ... 70:
And, here is what the density histogram would look like:
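Note that a density histogram is just a modified relative frequency histogram. A density histogram is defined so that the area of each rectangle equals the relative frequency of the corresponding class, and the area of the entire histogram equals 1.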
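To make the bookkeeping concrete, here is a minimal Python sketch that tallies the 60 nose lengths listed above into the 5 mm classes centered at 30, 35, ..., 70 and converts each frequency into a density. The class boundaries (27.5, 32.5, ..., 72.5) and the helper name `density_table` are our own assumptions for illustration. The density of a class is taken to be its relative frequency divided by the class width, so that the areas of the rectangles add up to 1.

```python
# Nose lengths (in mm) copied from the data listed above.
nose_lengths = [
    38, 50, 38, 40, 35, 52, 45, 50, 40, 32, 40, 47, 70, 55, 51,
    43, 40, 45, 45, 55, 37, 50, 45, 45, 55, 50, 45, 35, 52, 32,
    45, 50, 40, 40, 50, 41, 41, 40, 40, 46, 45, 40, 43, 45, 42,
    45, 45, 48, 45, 45, 35, 45, 45, 40, 45, 40, 40, 45, 35, 52,
]


def density_table(data, centers, width):
    """Return (center, frequency, relative frequency, density) per class."""
    n = len(data)
    rows = []
    for c in centers:
        # Assumed class boundaries: half a class width on either side.
        lower, upper = c - width / 2, c + width / 2
        freq = sum(1 for x in data if lower <= x < upper)
        rel_freq = freq / n
        rows.append((c, freq, rel_freq, rel_freq / width))
    return rows


WIDTH = 5
table = density_table(nose_lengths, range(30, 75, WIDTH), WIDTH)
for center, freq, rel_freq, density in table:
    print(f"{center:>3}  {freq:>2}  {rel_freq:.3f}  {density:.4f}")

# Total area of the rectangles (density times width, summed over classes).
total_area = sum(density * WIDTH for _, _, _, density in table)
print(f"total area = {total_area:.3f}")
```

Drawing one rectangle per class, with the density as its height, would give the density histogram.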
We've previously learned that the sample mean can be thought of as the "center" of a set of data, while the sample standard deviation indicates "how spread out" the data are from the sample mean. Now, if a histogram is "mound shaped" or "bell-shaped," then we can use the sample mean, sample standard deviation, and what is called the Empirical Rule to determine three intervals for which we would expect approximately 68%, 95%, and 99.7% of the data to fall.
The Empirical Rule tells us that if a histogram is at least approximately bell-shaped, then:
(1) approximately 68% of the data are in the interval:
\((\bar{x}-s,\bar{x}+s)\)
(2) approximately 95% of the data are in the interval:
\((\bar{x}-2s,\bar{x}+2s)\)
(3) approximately 99.7% of the data are in the interval:
\((\bar{x}-3s,\bar{x}+3s)\)
The federal government's average income from federal income taxes (on a per capita basis) for each of the 50 states in fiscal year 1991 is \$1252.44 with a standard deviation of \$393.75. Assuming the data are approximately bell-shaped, use the Empirical Rule to determine three intervals for which we would expect approximately 68%, 95%, and 99.7% of the data to fall.
Solution. The Empirical Rule tells us that we can expect 68% of the per capita taxes to fall between:
\(\bar{x}-s=\$ 1252.44-\$ 393.75=\$ 858.69\) and \(\bar{x}+s=\$ 1252.44+\$ 393.75=\$ 1646.19\)
The Empirical Rule also tells us that we can expect 95% of the per capita taxes to fall between:
\(\bar{x}-2s=\$ 1252.44-2(\$ 393.75)=\$ 464.94\) and \(\bar{x}+2s=\$ 1252.44+2(\$ 393.75)=\$ 2039.94\)
The Empirical Rule also tells us that we can expect 99.7% (virtually all!) of the per capita taxes to fall between:
\(\bar{x}-3s=\$ 1252.44-3(\$ 393.75)=\$ 71.19\) and \(\bar{x}+3s=\$ 1252.44+3(\$ 393.75)=\$ 2433.69\)
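The arithmetic above is easy to script. The following is just a sketch in Python; the function name `empirical_rule_intervals` is our own invention, and the results are rounded to cents only to match the way the dollar amounts are reported here.

```python
def empirical_rule_intervals(mean, sd):
    """Return {coverage label: (low, high)} for mean +/- 1, 2, and 3 sd."""
    coverages = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        label: (round(mean - k * sd, 2), round(mean + k * sd, 2))
        for k, label in coverages.items()
    }


# Per capita federal income tax example: mean $1252.44, sd $393.75.
intervals = empirical_rule_intervals(1252.44, 393.75)
for label, (low, high) in intervals.items():
    print(f"about {label} of the data: ${low:.2f} to ${high:.2f}")
```

Keep in mind that the intervals are only meaningful if the histogram of the data is at least approximately bell-shaped.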
A random sample of 64 people were selected to take the Stanford-Binet Intelligence Test. After each person completed the test, they were assigned an intelligence quotient (IQ) based on their performance on the test. The resulting 64 IQs are as follows:
Once the data are obtained, it might be nice to summarize the data. We could, of course, summarize the data using a histogram. One primary disadvantage of using a histogram to summarize data is that the original data aren't preserved in the graph. A stem-and-leaf plot, on the other hand, summarizes the data and preserves the data at the same time.
The basic idea behind a stem-and-leaf plot is to divide each data point into a stem and a leaf. We could divide our first data point, 111, for example, into a stem of 11 and a leaf of 1. We could divide 85 into a stem of 8 and a leaf of 5. We could divide 83 into a stem of 8 and a leaf of 3. And so on. To create the plot then, we first create a column of numbers containing the ordered stems. Our IQ data set produces stems 6, 7, 8, 9, 10, 11, 12, 13, and 14. Once the column of stems is written down, we work our way through each number in the data set, and write its leaf in the row headed by its stem.
Here's what our stem-and-leaf plot would look like after adding the first five numbers 111, 85, 83, 98, and 107:
and here's what the completed stem-and-leaf plot would look like after adding all 64 leaves to the nine stems:
Now, rather than looking at a list of 64 unordered IQs, we have a nice picture of the data that quite readily tells us that:
That's all well and good, but we could do better. First and foremost, no one in their right mind is going to want to create too many of these stem-and-leaf plots by hand. Instead, you'd probably want to let some statistical software, such as Minitab or SAS, do the work for you. Here's what Minitab's stem-and-leaf plot of the 64 IQs looks like:
Hmmm.... how does the plot differ from ours? First, Minitab tells us that there are n = 64 numbers and that the leaf unit is 1.0. Then, ignoring the first column of numbers for now, the second column contains the stems from 6 to 14. Note, though, that Minitab uses two rows for each of the stems 7, 8, 9, 10, 11, 12, and 13. Minitab takes an alternative approach here that we could have taken as well. When you opt to use two rows for each stem, the first row is reserved for the leaves 0, 1, 2, 3, and 4, while the second row is reserved for the leaves 5, 6, 7, 8, and 9. For example, note that the first 9 row contains the 0 to 4 leaves, while the second 9 row contains the 5 to 9 leaves. The decision to use one or two rows for the stems depends on the data. Sometimes the one row per stem option produces the better plot, and sometimes the two rows per stem option produces the better plot.
Do you notice any other differences between Minitab's plot and our plot? Note that the leaves in Minitab's plot are ordered. That's right... Minitab orders the data before producing the plot, thereby creating what is called an ordered stem-and-leaf plot.
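If you are curious how software might go about building an ordered stem-and-leaf plot, here is a minimal Python sketch. It is not Minitab's algorithm, just one simple way to do it, assuming nonnegative data, one row per stem, and a leaf equal to the last digit once the data are expressed in leaf units. The function name `stem_and_leaf` and the `leaf_unit` parameter are our own.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=1):
    """Map each stem to its sorted list of leaves (one row per stem)."""
    rows = defaultdict(list)
    for x in sorted(data):                # sorting first orders the leaves
        scaled = round(x / leaf_unit)     # express x in leaf units
        stem, leaf = divmod(scaled, 10)   # the last digit is the leaf
        rows[stem].append(leaf)
    return dict(rows)


# The first five IQs mentioned above.
first_five = [111, 85, 83, 98, 107]
plot = stem_and_leaf(first_five)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

For the five IQs, the rows come out as 8 | 3 5, 9 | 8, 10 | 7, and 11 | 1, which is our hand-made plot with the leaves in order.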
Now, back to that first column of numbers appearing in Minitab's plot. That column contains what are called depths. The depths are the frequencies accumulated from the top of the plot and the bottom of the plot until they converge in the middle. For example, the first number in the depths column is a 1. It comes from the fact that there is just one number in the first (6) stem. The second number in the depths column is also a 1. It comes from the fact that there is 1 leaf in the first (6) stem and 0 leaves in the second (the first 7) stem, and so 1 + 0 = 1. The third number in the depths column is a 3. It comes from the fact that there is 1 leaf in the first (6) stem, 0 leaves in the second (the first 7) stem, and 2 leaves in the third (the second 7) stem, and so 1 + 0 + 2 = 3. Minitab continues accumulating numbers down the column until it reaches 32 in the last 9 stem. Then, Minitab starts accumulating from the bottom of the plot. The 5 in the depths column comes, for example, from the fact that there is 1 leaf in the last (14) stem, 1 leaf in the second 13 stem, 0 leaves in the first 13 stem, 1 leaf in the second 12 stem, and 2 leaves in the first 12 stem, and so 1 + 1 + 0 + 1 + 2 = 5.
Let's take a look at another example.
Let's consider a random sample of 20 concentrations of calcium carbonate (\(\text{CaCO}_3\)) in milligrams per liter.
Create a stem-and-leaf plot of the data.
Solution. Let's take the efficient route, as most anyone would likely take in practice, by letting Minitab generate the plot for us:
Minitab tells us that the leaf unit is 0.1, so that the stem of 127 and leaf of 8 represents the number 127.8. The depths column contains something a little different here, namely the 7 with parentheses around it. It seems that Minitab's algorithm for calculating the depths differs a bit here. It still accumulates the values from the top and the bottom, but it stops in each direction when it reaches the row containing the middle value (median) of the sample. The frequency of that row containing the median is simply placed in parentheses. That is, the median of the 20 numbers is 131.45. Therefore, because the 131 stem contains 7 leaves, the depths column for that row contains a 7 in parentheses.
In our previous example, the median of the 64 IQs is 99.5. Because 99.5 falls between two rows of the display, namely between the second 9 row (which ends with 99) and the first 10 row (which begins with 100), Minitab calculates the depths instead as described in that example, and omits the whole "parentheses around the frequency of the median row" thing.
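The depths rule described in these two examples can be sketched in a few lines of Python. To be clear, this is our reading of the rule as stated here, not Minitab's actual code, and the function name `depths` and the row counts below are hypothetical (only the first three counts, 1, 0, and 2, echo the IQ plot).

```python
def depths(row_counts):
    """Depth labels for the rows of a stem-and-leaf plot, top to bottom."""
    n = sum(row_counts)
    lower_mid = (n + 1) // 2   # position of the (lower) middle value
    upper_mid = n // 2 + 1     # position of the (upper) middle value
    labels = []
    seen = 0                   # number of leaves in the rows above
    for count in row_counts:
        first, last = seen + 1, seen + count
        if count > 0 and first <= lower_mid and upper_mid <= last:
            labels.append(f"({count})")    # this row holds the median
        elif last < upper_mid:
            labels.append(str(last))       # accumulate from the top
        else:
            labels.append(str(n - seen))   # accumulate from the bottom
        seen = last
    return labels


# Median inside a single row: that row's frequency goes in parentheses.
print(depths([1, 0, 2, 3, 2]))
# Median between two rows: no parentheses anywhere.
print(depths([1, 0, 2, 1, 2, 2]))
```

The first call mimics the calcium carbonate situation, and the second mimics the IQ situation, in which the two middle values sit in different rows.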
The primary advantage of creating an ordered stem-and-leaf plot is that you can readily read what are called the order statistics right off of the plot. If we have a sample of n observations represented as:
\(x_1,x_2,x_3,\cdots,x_n\)
then when the observations are ordered from smallest to largest, the resulting ordered data are called the order statistics of the sample, and are represented as:
\(y_1 \leq y_2 \leq y_3 \leq \cdots \leq y_n\)
That is, \(y_1\), the smallest data point, is the first order statistic. The second smallest data point, \(y_2\), is the second order statistic. And so on, until we reach the largest data point and nth order statistic, \(y_n\). From the order statistics, it is rather easy to find the sample percentiles.
Definition. If 0 < p < 1, then the (100p)th sample percentile has approximately np sample observations less than it, and n(1−p) sample observations greater than it. Some sample percentiles have special names:
The interquartile range (IQR) is the difference between the first and third quartiles.
Here's the typical method used for finding a particular sample percentile:
1. Arrange the sample data in increasing order. That is, determine the order statistics:
\(y_1 \leq y_2 \leq y_3 \leq \cdots \leq y_n\)
2. If (n+1)p is an integer, then the (100p)th sample percentile is the (n+1)pth order statistic.
3. If (n+1)p is not an integer, but rather equals r plus some proper fraction, a/b say, then use a weighted average of the rth and (r+1)st order statistics. That is, define the (100p)th sample percentile as:
\(\tilde{\pi}_p=y_r+\left(\dfrac{a}{b}\right)(y_{r+1}-y_r)\)
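The three steps above translate almost line for line into code. Here is a minimal Python sketch, with a function name (`sample_percentile`) of our own choosing. It uses exact fractions so that "is (n+1)p an integer?" is answered without floating-point fuzz, and it raises an error when (n+1)p falls below 1 or above n, a case the method above does not address.

```python
from fractions import Fraction
from math import floor


def sample_percentile(data, p):
    """(100p)th sample percentile via the (n+1)p rule described above."""
    y = sorted(data)                        # order statistics y_1 <= ... <= y_n
    n = len(y)
    position = (n + 1) * Fraction(str(p))   # exact value of (n+1)p
    r = floor(position)                     # integer part r
    frac = position - r                     # proper fraction a/b (0 if integer)
    if r < 1 or r > n or (frac > 0 and r == n):
        raise ValueError("(n+1)p is outside the range of the order statistics")
    if frac == 0:
        return y[r - 1]                     # y_r (Python lists are 0-indexed)
    # Weighted average of the rth and (r+1)st order statistics.
    return float(y[r - 1] + frac * (y[r] - y[r - 1]))


# Hypothetical sample of n = 4 values, so (n+1)p = 1.25, 2.5, 3.75.
tiny = [4, 1, 3, 2]
print([sample_percentile(tiny, p) for p in (0.25, 0.5, 0.75)])
```

With this tiny sample, none of the three positions is an integer, so every percentile is found by interpolating between two neighboring order statistics.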
Let's try this method out on an example or two.
Let's return to our random sample of 64 people selected to take the Stanford-Binet Intelligence Test. The resulting 64 IQs were sorted as follows:
That is, the first order statistic is \(y_1 = 68\), the second order statistic is \(y_2 = 75\), and the 64th order statistic is \(y_{64} = 141\). Find the 25th sample percentile, the 50th sample percentile, the 75th sample percentile, and the interquartile range.
Solution. Here, we have n = 64 IQs. To find the 25th sample percentile, we need to consider p = 0.25. In that case:
\((n+1)p=(64+1)(0.25)=(65)(0.25)=16.25\)
Because 16.25 is not an integer, we are going to need to interpolate linearly between the 16th order statistic (91) and 17th order statistic (91). That is, the 25th sample percentile (or first quartile) is 91, as determined by:
\(\tilde{\pi}_{0.25}=y_{16}+(0.25)(y_{17}-y_{16})=91+0.25(91-91)=91\)
To find the 50th sample percentile, we need to consider p = 0.50. In that case:
\((n+1)p=(64+1)(0.5)=(65)(0.5)=32.5\)
Because 32.5 is not an integer, we are going to need to interpolate linearly between the 32nd order statistic (99) and 33rd order statistic (100). That is, the 50th sample percentile (or second quartile or median) is 99.5, as determined by:
\(\tilde{\pi}_{0.5}=y_{32}+(0.5)(y_{33}-y_{32})=99+0.5(100-99)=99.5\)
To find the 75th sample percentile, we need to consider p = 0.75. In that case:
\((n+1)p=(64+1)(0.75)=(65)(0.75)=48.75\)
Because 48.75 is not an integer, we are going to need to interpolate linearly between the 48th order statistic (107) and 49th order statistic (107). That is, the 75th sample percentile (or third quartile) is 107 as determined by:
\(\tilde{\pi}_{0.75}=y_{48}+(0.75)(y_{49}-y_{48})=107+0.75(107-107)=107\)
The interquartile range IQR is then 107−91 = 16.
Let's return again to our IQ data, but this time suppose that the person deemed to have the largest IQ (141) couldn't take the pressure of the test and fainted before completing the test. In that case, the sorted data of the now n = 63 IQs look like this:
You should notice that the once largest observation (141) no longer exists in the data set. Find the 25th sample percentile, the 50th sample percentile, the 75th sample percentile, and the interquartile range.
Solution. Here, we have n = 63 IQs. To find the 25th sample percentile, we need to consider p = 0.25. In that case:
\((n+1)p=(63+1)(0.25)=(64)(0.25)=16\)
Because 16 is an integer, the 25th sample percentile (or first quartile) is readily determined to be the 16th order statistic, that is, 91.
To find the 50th sample percentile, we need to consider p = 0.50. In that case:
\((n+1)p=(63+1)(0.5)=(64)(0.5)=32\)
Because 32 is an integer, the 50th sample percentile (or second quartile or median) is readily determined to be the 32nd order statistic, that is 99.
To find the 75th sample percentile, we need to consider p = 0.75. In that case:
\((n+1)p=(63+1)(0.75)=(64)(0.75)=48\)
Because 48 is an integer, the 75th sample percentile (or third quartile) is readily determined to be the 48th order statistic, that is, 107.
The interquartile range IQR is then again 107−91 = 16.
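Different software packages use different percentile conventions, so it pays to know which one you are getting. In Python's standard library, `statistics.quantiles` with `method="exclusive"` (its default) places the quartiles at positions (n+1)/4, 2(n+1)/4, and 3(n+1)/4 among the order statistics, which is the rule we have been using. Here is a quick check on a small hypothetical sample of n = 7 values, for which (n+1)p is an integer for each quartile, just as in the n = 63 example.

```python
import statistics

sample = [15, 2, 9, 13, 4, 10, 7]   # hypothetical data, n = 7

# n=4 asks for the three cut points that split the data into fourths.
q1, median, q3 = statistics.quantiles(sample, n=4, method="exclusive")
iqr = q3 - q1
print(q1, median, q3, iqr)
```

Since (n+1)p equals 2, 4, and 6 here, the quartiles are simply the 2nd, 4th, and 6th order statistics of the sample.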
On the last page, we learned how to determine the first quartile, the median, and the third quartile for a sample of data. These three percentiles, along with a data set's minimum and maximum values, make up what is called the five-number summary. One nice way of graphically depicting a data set's five-number summary is by way of a box plot (or box-and-whisker plot).
Here are some general guidelines for drawing a box plot:
Drawn as such, a box plot does a nice job of dividing the data graphically into fourths. Note, for example, that the horizontal length of the box is the interquartile range IQR, the left whisker represents the first quarter of the data, and the right whisker represents the fourth quarter of the data.
Let's return to our random sample of 64 people selected to take the Stanford-Binet Intelligence Test. The resulting 64 IQs were sorted as follows:
We previously determined that the first quartile is 91, the median is 99.5, and the third quartile is 107. The interquartile range IQR is 16. Use these numbers, as well as the minimum value (68) and maximum value (141) to create a box plot of these data.
Solution. By following the guidelines given above, a hand-drawn box plot of these data looks something like this:
In reality, you will probably almost always want to use a statistical software package, such as Minitab, to create your box plots. If we ask Minitab to create a box plot for this data set, this is what we get:
Hmm. How come Minitab's box plot looks different than our box plot? Well, by default, Minitab creates what is called a modified box plot. In a modified box plot, the box is drawn just as in a standard box plot, but the whiskers are defined differently. For a modified box plot, the whiskers are the lines that extend from the left and right of the box to the adjacent values. The adjacent values are defined as the lowest and highest observations that are still inside the region defined by the following limits: a lower limit of Q1 − 1.5 × IQR and an upper limit of Q3 + 1.5 × IQR.
In this example, the lower limit is calculated as Q1 − 1.5 × IQR = 91−1.5(16) = 67. Therefore, in this case, the lower adjacent value turns out to be the same as the minimum value, 68, because 68 is the lowest observation still inside the region defined by the lower bound of 67. Now, the upper limit is calculated as Q3 + 1.5 × IQR = 107+1.5(16) = 131. Therefore, the upper adjacent value is 128, because 128 is the highest observation still inside the region defined by the upper bound of 131. In general, values that fall outside of the adjacent value region are deemed outliers. In this case, the IQs of 136 and 141 are greater than the upper adjacent value, and are thus deemed outliers. In Minitab's modified box plots, outliers are identified using asterisks.
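The limits, adjacent values, and outliers are easy to compute once the quartiles are known. Here is a minimal Python sketch, with a function name (`modified_boxplot_parts`) of our own choosing. It assumes that an observation sitting exactly on a limit counts as inside the region, which the definition above leaves open. The list of IQs below is only a handful of the values quoted in this lesson, not the full sample of 64, while the quartiles are the ones found for the full sample.

```python
def modified_boxplot_parts(data, q1, q3):
    """Limits, adjacent values, and outliers for a modified box plot."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Assumption: observations exactly on a limit count as inside.
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)
    return {
        "limits": (lower_limit, upper_limit),
        "adjacent": (min(inside), max(inside)),
        "outliers": outliers,
    }


# A few of the 64 IQs quoted in the text (not the whole data set).
some_iqs = [68, 91, 99, 100, 107, 128, 136, 141]
parts = modified_boxplot_parts(some_iqs, q1=91, q3=107)
print(parts)
```

The whiskers of the modified box plot would then run from the box out to the two adjacent values, with each outlier plotted individually.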
Let's return to the example in which we have a random sample of 20 concentrations of calcium carbonate (\(\text{CaCO}_3\)) in milligrams per liter:
With a little bit of work, it can be shown that the five-number summary is as follows:
Use the five-number summary to create a box plot of these data.
Solution. By following the guidelines given above, a hand-drawn box plot of these data looks something like this:
In this case, the interquartile range IQR is 132.7 − 130.12 = 2.58. Therefore, the lower limit is calculated as Q1 − 1.5 × IQR = 130.12−1.5(2.58) = 126.25. Therefore, the lower adjacent value is the same as the minimum value, 127.8, because 127.8 is the lowest observation still inside the region defined by the lower bound of 126.25. The upper limit is calculated as Q3 + 1.5 × IQR = 132.7+1.5(2.58) = 136.57. Therefore, the upper adjacent value is the same as the maximum value, 134.8, because 134.8 is the highest observation still inside the region defined by the upper bound of 136.57. Because the lower and upper adjacent values are the same as the minimum and maximum values, respectively, the box plot looks the same as the modified box plot:
Histograms and box plots can be quite useful in suggesting the shape of a probability distribution. Here, we'll concern ourselves with three possible shapes: symmetric, skewed left, or skewed right.
Definition. For a distribution that is skewed left, the bulk of the data values (including the median) lie to the right of the mean, and there is a long tail on the left side. For a distribution that is skewed right, the bulk of the data values (including the median) lie to the left of the mean, and there is a long tail on the right side. For a distribution that is symmetric, approximately half of the data values lie to the left of the mean, and approximately half of the data values lie to the right of the mean.
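The definition hinges on where the median sits relative to the mean, which suggests a quick numerical check to go along with the pictures. The following Python sketch is only a rough heuristic of our own: the function name `suggest_shape` and the `tolerance` cutoff (how far apart the mean and median must be, in standard deviations, before we call the data skewed) are arbitrary assumptions, and all three data sets are made up.

```python
import statistics


def suggest_shape(data, tolerance=0.05):
    """Rough shape suggestion from the gap between the mean and the median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    spread = statistics.stdev(data)
    # Express the mean-minus-median gap in standard deviation units.
    gap = (mean - median) / spread if spread > 0 else 0.0
    if gap > tolerance:
        return "skewed right"    # median lies to the left of the mean
    if gap < -tolerance:
        return "skewed left"     # median lies to the right of the mean
    return "roughly symmetric"


# Hypothetical data sets, invented for illustration.
balanced = [1, 2, 2, 3, 3, 3, 4, 4, 5]
easy_exam = [99, 99, 98, 98, 97, 96, 90, 75]
lifetimes = [1, 1, 2, 2, 3, 4, 10, 25]
for name, values in [("balanced", balanced), ("easy_exam", easy_exam),
                     ("lifetimes", lifetimes)]:
    print(name, "->", suggest_shape(values))
```

A check like this is no substitute for looking at a histogram or box plot, but it can flag which direction a long tail is likely to run.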
The following examples probably illustrate symmetry and skewness of distributions better than any formal definitions can.
Consider a random sample of weights (in pounds) of 40 female college students:
Do these data suggest that the distribution of female weights is symmetric, skewed right, or skewed left?
Solution. The histogram:
and box plot of the 40 weights:
suggest that the distribution of female weights is symmetric.
Consider a random sample of 26 grades on an easy statistics exam:
Do these data suggest that the distribution of exam scores is symmetric, skewed right, or skewed left?
Solution. The histogram:
and box plot of the 26 grades:
suggest that the distribution of easy exam scores is skewed to the left.
Consider the lifetimes (in years) of a random sample of 39 Energizer bunnies:
Do these data suggest that the distribution of lifetimes of Energizer bunnies is symmetric, skewed right, or skewed left?
Solution. The histogram:
and box plot of the lifetimes of 39 Energizer bunnies:
suggest that the distribution of lifetimes of Energizer bunnies is skewed to the right.