34.3 - Stratified Random Sampling

Printer-friendly versionPrinter-friendly version

In the two previous sections, we were concerned with taking a random sample from a data set without regard to whether an observation comes from a particular subgroup. When you are conducting a survey, it often behooves you to make sure that your sample contains a certain number of observations from each particular subgroup. We'll concern ourselves with such a restriction here. That is, in this section, we'll focus on ways of using SAS to obtain a stratified random sample, in which a subset of observations are selected randomly from each subgroup of observations as determined by the value of a stratifying variable. We'll also go back to sampling without replacement, in which once an observation is selected it cannot be selected again.

Selecting a Stratified Sample of Equal-Sized Groups

We'll first focus on the situation in which an equal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable.

Example 34.9. The following code illustrates how to select a stratified random sample of equal-sized groups. Specifically, the code tells SAS to randomly select 5 observations from each of the three subgroups — State College, Port Matilda, Bellefonte — as determined by the value of the variable city:

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, how does the program work? In order to select a stratified random sample in SAS, we basically use code similar to selecting equal-sized random samples without replacement, except now we process within each subgroup. More specifically, here's how the program works step-by-step:

  • The sole purpose of the FREQ procedure is to determine the number of observations in the stat482.mailing data set that correspond to each level of the stratification variable city (hence, "table city"). The OUT = option tells SAS to create a data set called bycount that contains the variable city and two variables that contain the number (count) and percentage (percent) of records for each level of city.
  • The SORT procedure merely sorts the stat482.mailing data set by city and stores the sorted result in a temporary data set called mailing so that it can be processed by city in the next DATA step.
  • Merge, by city, the sorted data set mailing with the bycount data set, so that the number of observations per subgroup is available. Since the percentage of observations is not needed, drop it from the data set on input.
  • The rest of the code in the DATA step should look very familiar. That is, once the number of observations per subgroup in the original stat482.mailing data set is available, you can randomly select records from the subgroup as you would select equal-sized random samples without replacement, except you select within city (hence, "by city"). Every time SAS reads in a new city (hence, "if first.city"), the number of observations still needed in the subgroup's sample (k) is set to the number of observations desired in each of the subgroups (5, here).

Note that, again, the random = ranuni( ) and propn = k/n assignments are made here only so their values can be printed. In another situation, these values would be incorporated directly in the IF statement: if ranuni( ) le k/n then do; Additionally, k and count could be dropped from the output data set, but are kept here, so their values can be printed for educational purposes.

Example 34.10. The following code illustrates an alternative way of randomly selecting a stratified random sample of equal-sized groups. The code, while less efficient — because it requires that the data set be processed twice and sorted once — may feel more natural and intuitive to you:

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, how does the program work? In summary, here's how this the approach works:

  • The first DATA step uses an IF-THEN-ELSE statement in conjunction with OUTPUT statements to divide the original mailing data set up into several data sets based on the value of city. (Here, we create three data sets, one for each city... namely, scollege, pmatilda, and bellefnt.)
  • Then, the macro select exactly mimics the creation of the sample3A data set in Example 10.5 on the Random Sampling Without Replacement page. That is, the macro generates a random number for each observation, the data set is sorted by the random number, and then the first num observations are selected.
  • Then, call the macro select three times once for each of the city data sets .... scollege, pmatilda, and bellefnt .... selecting five observations from each.
  • Finally, the final DATA step concatenates the three data sets, bellefnt, scollege, and pmatilda, with 5 observations each back into one data set called sample6A with the 15 randomly selected observations..

Lo and behold, when all is said and done, we have another stratified random sample of equal-sized groups! Approach #2 checked off. Now, onto one last approach!

Example 34.11. The following code illustrates yet another alternative way of randomly selecting a stratified random sample of equal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly five observations from each of the three city subgroups in the permanent SAS data set mailing:

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, the specifics about the code. The only things that should look new here are the STRATA statement and the form of the SAMPSIZE statement. The STRATA statement tells SAS to partition the input data set stat482.mailing into nonoverlapping groups defined by the variable city. The NOTSORTED option does not tell SAS that the data set is unsorted. Instead, the NOTSORTED option tells SAS that the observations in the data set are arranged in city groups, but the groups are not necessarily in alphabetical order. The SAMPSIZE statement tells SAS that we are interested in sampling five observations from each of the city groups.

Selecting a Stratified Sample of Unequal-Sized Groups

Now, we'll focus on the situation in which an unequal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable. If there are an unequal number of observations for each subgroup in the original data set, this sampling scheme may be accomplished by selecting the same proportion of observations from each subgroup. Again, we'll sample without replacement, in which once an observation is selected it cannot be re-selected.

To select a stratified random sample of unequal-sized groups, we could use the code from Example 10.10 by passing the different group sample sizes into the macro select. Alternatively, we could create a data set containing two count variables ...one that contains the number of observations in each subgroup (n) ...and the other that contains the number of observations that need to be selected from each subgroup (k). Once the data set is created, we could merge it with the original data set, and select observations randomly as we have done previously for random samples without replacement. That's the strategy that the following example uses.

Example 34.12. The following code illustrates how to select a stratified random sample of unequal-sized groups. Specifically, the code tells SAS to randomly select 5, 6, and 8 observations, respectively, from each of the three subgroups — Bellefonte, Port Matilda, and State College — as determined by the value of the variable city:

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.

Now, how does the program work? The key to understanding the program is to understand the first DATA step. The remainder of the program is much like code we've seen before, like that in say Example 10.4, in which a random sample is selected without replacement. Now, the first DATA step creates a temporary data set called nselect that contains three variables city, n, and k:

  • To count the number of observations n from each city, we use a counter variable n in conjunction with the last.city variable. By default, SAS sets n to 0 on the first iteration of the DATA step, and then increases n by 1 for each subsequent iteration of the DATA step until it counts the number of observations for one of the levels of city.
  • To tell SAS the number of observations to select from each city, we use an INPUT statement in conjunction with a DATALINES statement. The numbers k are listed in the order of city ...so here we tell SAS we want to randomly select 5 observations from Bellefonte, 6 observations from Port Matilda, and 8 observations from State College.
  • To write the numbers n and k to the new data set nselect, we use the last.city variable in a subsetting IF statement. So here, when SAS finds the last record within a city subgroup, n and k are written to the nselect data set, and n is reset to 0 in preparation for counting the number of observations for the next city in the data set.

The second DATA step creates a temporary data set called sample7 by merging the stat482.mailing data set with the nselect data set. After merging, the code then randomly selects the deemed number of observations from each city just as we did previously for random samples without replacement.

Example 34.13. The following code illustrates an alternative way of randomly selecting a stratified random sample of unequal-sized groups. In selecting such a sample, rather than specifying the desired number sampled from each group, we could tell SAS to select an equal proportion of observations from each group. The following code does just that. Specifically, the code tells SAS to randomly select 25% of the observations from each of the three subgroups — Bellefonte, Port Matilda, and State College:

In this case, it probably makes most sense to first compare the code here with the code from the previous example. The only difference you should see is that rather than using an INPUT and DATALINES statement to read in the number of observations, k, to be selected from each of the subgroups, here we use the ceiling function, ceil( ), to determine k. Specifically, k is calculated using:

k=ceil(0.25*n); 

Now, if you think about it, if I tell you to select 25% of the n = 16 observations in a subgroup, you'd tell me that we need to select 4 observations. But what if I tell you to select 25% of the n = 15 observations in a subgroup? If you calculate 25% of 15, you get 3.75. Hmmm.... how can you select 3.75 observations? That's where the ceiling function comes in to play. The ceiling function, ceil(argument), returns the smallest integer that is greater than or equal to the argument. So, for example, ceil(3.75) equals 4 ... as does of course ceil(3.1), ceil(3.25), and ceil(3.87) ...you get the idea.

That's it ... that's all there is to it. Once k is determined using the ceiling function, the rest of the program is identical to the program in the previous example.

Now, try it out... launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, 25% of the observations from Bellefonte, Port Matilda, and State College. In this case, that translates to 4, 4, and 6 observations, respectively.

Example 34.14. The following code illustrates yet another alternative way... you've gotta be kidding me! ...of randomly selecting a stratified random sample of unequal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 5, 6, and 8 observations, respectively, from each of the three city subgroups in the permanent SAS data set stat482.mailing:

Straightforward enough! The only difference between this code and the code in Example 10.11 is that here the sample sizes are specified at 5, 6, and 8 rather than 5, 5, and 5. Note that you must list the stratum sample size values in the order in which the strata appear in the input data set. Like I said, straightforward enough.

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.