Reading assignment for Lesson 7: Ch. 12.1-12.3 of Sampling by Steven Thompson, 3rd edition.
In Section 7.1, we introduce cluster and systematic sampling and show their similar structure. Graphical representations of primary units and secondary units are given. Notations are introduced.
In Section 7.2, unbiased estimators and ratio estimators for cluster sampling are provided for the case where primary units are selected by simple random sampling (srs). Basic principles for obtaining estimators of low variance are discussed. Then we discuss why and when we would use cluster sampling. That is followed by an example showing how to compute the ratio estimator and the unbiased estimator when primary units are selected by srs.
In Section 7.3, cluster sampling with primary units selected by probabilities proportional to size is discussed. Then an example is given.
Lesson 7 Objectives 
Upon successful completion of this lesson, you will be able to:

Unit Summary 

On the surface, systematic and cluster sampling are very different. In fact, the two designs share the same structure: the population is partitioned into primary units, each primary unit being composed of secondary units. Whenever a primary unit is included in the sample, the y-values of every secondary unit within it are observed.
Example: A one-in-three systematic sample, where we randomly pick one of the first three units and then take every third unit from there on.
Randomly pick a value from {1, 2, 3}. For example, if 2 is chosen, then we will pick {2, 5, 8, 11, 14}. The set {2, 5, 8, 11, 14} is an example of a primary unit.
It is not uncommon to have a systematic sample of size 1, such as the above 1 in 3 systematic sample. We just sample 1 primary unit.
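The one-in-three example can be sketched in a few lines of Python. This is only an illustration: the population size of 15 units (labeled 1 to 15) is an assumption inferred from the set {2, 5, 8, 11, 14}, and the function name `systematic_primary_units` is hypothetical.

```python
# Enumerate the primary units of a 1-in-k systematic sample.
# Assumption: the population consists of 15 units labeled 1..15
# (the text does not state the population size).

def systematic_primary_units(population_size, k):
    """Map each possible random start (1..k) to the primary unit it selects."""
    return {
        start: list(range(start, population_size + 1, k))
        for start in range(1, k + 1)
    }

primary_units = systematic_primary_units(population_size=15, k=3)
for start, unit in primary_units.items():
    print(start, unit)
```

Picking one start at random from {1, 2, 3} is the same as picking one of these three primary units at random, which is why the sample has size 1 in terms of primary units.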
In the following two graphs, we provide examples for two configurations of primary units:
The above figure has 50 primary units (PSU)
(the colored rectangle is an example of a primary unit of cluster sampling)
The above figure has 25 primary units (PSU)
(the colored units collectively are an example of a primary unit in systematic sampling)
Primary units (PSUs) may be different from observation units. One can view systematic sampling as a sampling of primary units: once a primary unit is selected, its whole cluster of secondary units is selected with it.
For example, suppose a systematic sample is drawn from a batch of produced computer chips. The first 400 chips are fine but, due to a machine fault, the last 300 chips are defective. Systematic sampling selects evenly across the defective and nondefective items and therefore gives a very accurate estimate of the fraction of defective items.
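A small Python sketch makes the claim concrete. It assumes the batch is exactly 700 chips (400 good followed by 300 defective) and uses a 1-in-7 systematic sample; the sampling interval is an illustrative choice, not something given in the text.

```python
# Assumptions (not stated in the text): the batch has exactly 700 chips,
# and we draw a 1-in-7 systematic sample.
BATCH = [0] * 400 + [1] * 300   # 1 marks a defective chip, in production order
K = 7                           # sampling interval

true_fraction = sum(BATCH) / len(BATCH)

# One estimate for each possible random start, i.e. for each primary unit.
estimates = []
for start in range(K):
    sample = BATCH[start::K]
    estimates.append(sum(sample) / len(sample))

print(round(true_fraction, 4))
print([round(e, 2) for e in estimates])
```

Whichever start is drawn, the estimate stays within about one percentage point of the true fraction, because every primary unit spans the whole production run.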
Cluster Sampling and Systematic Sampling
A cluster/systematic sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.
Notations of cluster and systematic sampling:
N : the number of primary units in the population
n : the number of primary units in the sample
M_{i} : the number of secondary units in the ith primary unit
\(M=\sum\limits_{i=1}^N M_i\): the total number of secondary units in the population
y_{ij} : the value of the variable of interest for the jth secondary unit in the ith primary unit
\(y_i=\sum\limits_{j=1}^{M_i}y_{ij}\): the total (i.e., sum) of the y-values in the ith primary unit
For Fig. 1 below, N = 50, n = 10, M_{i} = 8
Fig. 1
For Fig. 2 below, N = 25, n = 2, M_{i} = 16
Fig. 2
Figure 1 shows an example of cluster sampling and Figure 2 shows an example of systematic sampling. The secondary units of a primary unit in cluster sampling are close together, whereas the secondary units of a primary unit in systematic sampling are spread apart.
More Notation:
Thus, the population total is:
\(\tau=\sum\limits_{i=1}^N \sum\limits_{j=1}^{M_i}y_{ij}=\sum\limits_{i=1}^N y_i\)
The population mean per primary unit is:
\(\mu_1=\tau/N\)
The population mean per secondary unit is
\(\mu=\tau/M\) .
Unit Summary 

When the primary units are selected by simple random sampling, frequently used estimators among many possible estimators are:
A. Unbiased estimator
\(\hat{\tau}=N\cdot \bar{y}=\dfrac{N\cdot \sum\limits_{i=1}^n y_i}{n}\)
Recall that y_{i} is the total of the y-values in the ith primary unit.
\(\hat{V}ar(\hat{\tau})=N\cdot (N-n)\dfrac{s^2_u}{n}\)
where \(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n(y_i-\bar{y})^2\)
To estimate the mean per primary unit, τ / N, the mean and variance equations are given below:
\(\bar{y}=\dfrac{\hat{\tau}}{N}\), \(Var(\bar{y})=\dfrac{1}{N^2} Var(\hat{\tau})\)
To estimate the mean per secondary unit, the mean and variance equations are given below:
\(\hat{\mu}=\dfrac{\hat{\tau}}{M}\), \(Var(\hat{\mu})=\dfrac{1}{M^2} Var(\hat{\tau})\)
B. Ratio Estimator
If the primary unit total y_{i} is highly correlated with the primary unit size M_{i }, a ratio estimator based on size may be efficient.
\(\hat{\tau}_r=r \cdot M,\quad M=\sum\limits_{i=1}^N M_i\)
where \(r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i},\quad \hat{V}ar(\hat{\tau}_r)=\dfrac{N(N-n)}{n(n-1)}\sum\limits_{i=1}^n (y_i-rM_i)^2\)
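To see how the two estimators behave, the sketch below enumerates every possible simple random sample from a tiny population and averages each estimator over all of them. Under srs every sample is equally likely, so these averages are the exact expectations. The population values are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical population of N = 4 primary units (illustrative values only).
y_totals = [10, 22, 28, 40]   # y_i: primary unit totals
sizes = [2, 4, 6, 8]          # M_i: primary unit sizes
N, n = len(y_totals), 2
M = sum(sizes)                # total number of secondary units
tau = sum(y_totals)           # true population total

unbiased, ratio = [], []
for sample in combinations(range(N), n):
    y_sum = sum(y_totals[i] for i in sample)
    m_sum = sum(sizes[i] for i in sample)
    unbiased.append(N * y_sum / n)    # tau-hat = N * y-bar
    ratio.append(M * y_sum / m_sum)   # tau-hat_r = r * M

mean_unbiased = sum(unbiased) / len(unbiased)
mean_ratio = sum(ratio) / len(ratio)
print(tau, mean_unbiased, round(mean_ratio, 3))
print(sorted(unbiased))
print([round(t, 2) for t in sorted(ratio)])
```

In this toy population the unbiased estimator averages exactly to the true total, while the ratio estimator is slightly off on average but varies far less from sample to sample, because the unit totals are nearly proportional to the unit sizes.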
Since every secondary unit is observed within a selected primary unit, the within-primary-unit variance does not enter into the variances of the estimators. For example,
\(\hat{V}ar(\hat{\tau})=N(N-n)\cdot \dfrac{s^2_u}{n}\)
where \(s^2_u=\dfrac{1}{n-1}\sum\limits_{i=1}^n (y_i-\bar{y})^2\)
Thus, to obtain estimators of low variance, the population should be partitioned into primary units whose totals y_{i} are as similar to one another as possible, so that \(s^2_u\) is small.
With natural populations of spatially distributed plants, animals, or minerals, and human populations, the above condition is typically satisfied by systematic sampling where each cluster contains units that are far apart. Cluster sampling is more often than not carried out for reasons of convenience or practicality rather than to obtain the lowest variances.
Will cluster sampling give us a more precise estimator? In most cases, the answer is no.
We often use cluster sampling out of necessity, even though it gives us a larger variance.
If the objective of sampling is to obtain a specified amount of information about a population parameter at minimum cost, cluster sampling sometimes gives more information per unit cost than simple random sampling, stratified sampling, or systematic sampling, because the cost of sampling units within a cluster may be much lower.
Cluster sampling is an effective design in two different scenarios:
Example of Cluster Sampling using a Ratio Estimator
A sociologist wants to estimate the average yearly vacation budget for each household in a certain city. It is given that there are 3,100 households in the city. The sociologist marked off the city into 400 blocks and treated them as 400 clusters. He then randomly sampled 24 clusters and interviewed every household living in each sampled cluster. The data are given in the table below:
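| Cluster | Number of households M_{i} | Total vacation budget per cluster y_{i} |
|---|---|---|
| 1 | 7 | 12,000 |
| 2 | 9 | 15,000 |
| 3 | 5 | 8,000 |
| 4 | 8 | 13,000 |
| 5 | 12 | 18,000 |
| 6 | 5 | 7,000 |
| 7 | 4 | 6,000 |
| 8 | 8 | 13,000 |
| 9 | 14 | 22,000 |
| 10 | 6 | 9,800 |
| 11 | 3 | 7,000 |
| 12 | 13 | 18,000 |
| 13 | 8 | 12,340 |
| 14 | 4 | 5,000 |
| 15 | 6 | 8,900 |
| 16 | 9 | 14,000 |
| 17 | 3 | 4,000 |
| 18 | 10 | 11,400 |
| 19 | 4 | 5,000 |
| 20 | 7 | 13,000 |
| 21 | 6 | 8,900 |
| 22 | 5 | 8,700 |
| 23 | 7 | 10,000 |
| 24 | 6 | 9,200 |
| Total | 169 | 259,240 |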

To use Minitab to plot the cluster total versus the cluster size:
Mtb > scatterplot
then choose the cluster total as the Y variable and the cluster size as the X variable
To use Minitab to display descriptive statistics:
Mtb > Stat > Display Descriptive Statistics
Here is a plot of this data so that we can see if the cluster size is proportional to the total for the cluster.
Minitab output of descriptive statistics:
The ratio estimator for a cluster sample (ratio-to-size):
If primary unit total y_{i} is highly correlated with cluster size M_{i} , a ratio estimator based on size may be efficient. The ratio estimator of the population total is:
\(\hat{\tau}_r=r\cdot M \quad \text{where } r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\)
The ratio estimator is biased, but the bias is small when the sample size is large. Its estimated variance is:
\(\hat{V}ar(\hat{\tau}_r)=\dfrac{N(N-n)}{n(n-1)}\sum\limits_{i=1}^n (y_i-rM_i)^2\)
To estimate the population mean per secondary unit we have: μ = τ / M
The ratio estimator is:
\(\hat{\mu}_r=\dfrac{\hat{\tau}_r}{M}=r\)
\(\hat{V}ar(\hat{\mu}_r)=\dfrac{N(N-n)}{n(n-1)}\cdot \dfrac{1}{M^2} \sum\limits_{i=1}^n (y_i-rM_i)^2\)
Back to the example. To estimate the average yearly vacation budget for each household we will use:
\(\hat{\mu}_r=r=\dfrac{\sum\limits_{i=1}^n y_i}{\sum\limits_{i=1}^n M_i}\)
In this example we see that N = 400, the total number of blocks, and n = 24. M in this case is as follows:
\(M=\sum\limits_{i=1}^N M_i=3100\)
Find the ratio estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the ratio estimator.
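As a check on a hand calculation, the ratio estimate and its estimated variance can be computed from the table with a short Python sketch. The data, N = 400, n = 24 and M = 3100 come from the example; the variable names are illustrative.

```python
# Data from the vacation budget table (24 sampled clusters).
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]                   # M_i
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]    # y_i
N, n, M = 400, len(sizes), 3100

# mu-hat_r = r = (sum of y_i) / (sum of M_i)
r = sum(totals) / sum(sizes)

# Estimated variance of mu-hat_r, using the residuals y_i - r * M_i.
ss_resid = sum((y - r * m) ** 2 for y, m in zip(totals, sizes))
var_mu_r = N * (N - n) / (n * (n - 1)) * ss_resid / M ** 2
se_mu_r = var_mu_r ** 0.5

print(f"ratio estimate of mean budget per household: {r:.2f}")
print(f"estimated variance: {var_mu_r:.2f}, standard error: {se_mu_r:.2f}")
```

The estimate is simply 259,240 / 169, the sampled total budget divided by the number of sampled households.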
If we used the unbiased estimator would our variance be larger or smaller?
For this example, we also want to compute the unbiased estimator for comparison purposes.
Find the unbiased estimator for the average yearly vacation budget for each household in that city. Also, find the estimated variance for the unbiased estimator.
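The same data can be run through the unbiased estimator for comparison. The sketch below uses the example's values (N = 400, n = 24, M = 3100) and recomputes the ratio estimator's variance alongside; the variable names are illustrative.

```python
# Data from the vacation budget table (24 sampled clusters).
sizes = [7, 9, 5, 8, 12, 5, 4, 8, 14, 6, 3, 13,
         8, 4, 6, 9, 3, 10, 4, 7, 6, 5, 7, 6]                   # M_i
totals = [12000, 15000, 8000, 13000, 18000, 7000, 6000, 13000,
          22000, 9800, 7000, 18000, 12340, 5000, 8900, 14000,
          4000, 11400, 5000, 13000, 8900, 8700, 10000, 9200]    # y_i
N, n, M = 400, len(totals), 3100

# Unbiased estimator: tau-hat = N * y-bar, mu-hat = tau-hat / M.
y_bar = sum(totals) / n
tau_hat = N * y_bar
mu_hat = tau_hat / M

# s_u^2 is the sample variance of the primary unit totals.
s2_u = sum((y - y_bar) ** 2 for y in totals) / (n - 1)
var_tau_hat = N * (N - n) * s2_u / n
var_mu_hat = var_tau_hat / M ** 2

# Ratio estimator variance, for comparison.
r = sum(totals) / sum(sizes)
ss_resid = sum((y - r * m) ** 2 for y, m in zip(totals, sizes))
var_mu_r = N * (N - n) / (n * (n - 1)) * ss_resid / M ** 2

print(f"unbiased estimate of mean budget per household: {mu_hat:.2f}")
print(f"estimated variance (unbiased): {var_mu_hat:.2f}")
print(f"estimated variance (ratio):    {var_mu_r:.2f}")
```

The two variance formulas differ only in which sum of squares they use: deviations of the cluster totals from their mean for the unbiased estimator, and residuals from the line through the origin with slope r for the ratio estimator.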
Remark 1: The variance of the unbiased estimator is much larger than that of the ratio estimator, so we should be very unhappy using the unbiased estimate here. When the cluster total is roughly proportional to the cluster size, it is better to use the ratio estimator than the unbiased estimator.
Remark 2: Can we compute the variance using the formula for simple random sampling? Unfortunately, no! The data would have to have been collected by simple random sampling for that formula to apply. Note: it is a big mistake to compute a variance with a formula that does not match the sampling scheme!
Here is the code for R for this example:
Datafile: Vacation.txt
R code: Chapter7_Vacation Budget.R.txt
Unit Summary 

The primary units are selected with probabilities proportional to size:
\(p_i=M_i/M\)
The Hansen-Hurwitz (p.p.s.) estimator is:
\(\hat{\tau}_p=\dfrac{M}{n}\sum\limits_{i=1}^n \left(\dfrac{y_i}{M_i}\right)\)
Denote \(\bar{y}_i=\dfrac{y_i}{M_i}\). Then
\(\hat{V}ar(\hat{\tau}_p)=\dfrac{M^2}{n(n-1)}\sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu}_p)^2\) where
\(\hat{\mu}_p=\dfrac{\hat{\tau}_p}{M}\) is unbiased for μ.
Thus we also see that:
\(\hat{V}ar(\hat{\mu}_p)=\dfrac{1}{n(n-1)}\sum\limits_{i=1}^n (\bar{y}_i-\hat{\mu}_p)^2\)
Example: Estimating population mean per secondary unit when primary units are selected by pps
From the "Total number of computer help requests" example in Lesson 3.1, 3 clusters out of 10 clusters are sampled (n = 3) with replacement. The data are:
y_{1} = 420, y_{2} = 1785, y_{3} = 2198
M_{1} = 650, M_{2} = 2840, M_{3} = 3200
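Assuming the task is the one named in the example heading (estimating the population mean per secondary unit), the calculation can be sketched in Python. Because \(\hat{\mu}_p=\hat{\tau}_p/M\) reduces to the average of the \(\bar{y}_i\), the total M is not needed. The variable names are illustrative.

```python
# Data from the example: three clusters drawn with replacement,
# with probabilities proportional to size.
y_totals = [420, 1785, 2198]   # y_i: cluster totals
sizes = [650, 2840, 3200]      # M_i: cluster sizes
n = len(y_totals)

# y-bar_i = y_i / M_i, the mean per secondary unit within each sampled cluster.
unit_means = [y / m for y, m in zip(y_totals, sizes)]

# mu-hat_p = tau-hat_p / M = (1/n) * sum of y-bar_i
mu_p = sum(unit_means) / n

# Estimated variance: sum of squared deviations of y-bar_i about mu-hat_p, over n(n-1).
var_mu_p = sum((yb - mu_p) ** 2 for yb in unit_means) / (n * (n - 1))
se_mu_p = var_mu_p ** 0.5

print(f"mu_p = {mu_p:.4f}, variance = {var_mu_p:.6f}, standard error = {se_mu_p:.4f}")
```

With only n = 3 sampled clusters the variance estimate rests on two degrees of freedom, so any interval built from it should be read with caution.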
Remark: For an example of estimating the population total, refer to the earlier lecture notes on the Hansen-Hurwitz estimator and probabilities proportional to size, as used in the palm tree total examples.
Find the HW 7 assignment in the Homework folder in Canvas.