3.2 The Hansen-Hurwitz estimator

Unit Summary

  • Hansen-Hurwitz estimators (sampling with replacement)
  • How to sample with unequal probabilities (sampling with replacement)
  • Compute the Hansen-Hurwitz estimator
  • When and how to use p.p.s.
  • Palm tree total example

For this section, sampling is with replacement.

Think About It!

Why do we use or talk about sampling with replacement?

One reason: with-replacement sampling makes the draws independent, so the quantities \(y_i/p_i\) from different draws are independent and identically distributed, which keeps the estimator and its variance formulas below simple.

Let \(p_i,\ i = 1, \ldots, N\), denote the probability that population unit \(i\) is selected on any given draw.

The Hansen-Hurwitz estimator is:

\(\hat{\tau}_p=\dfrac{1}{n} \sum\limits^n_{i=1} \dfrac {y_i}{p_i}\)

Since \(E\left(\dfrac{y_i}{p_i}\right)=\tau \),

where \(\tau=\sum\limits^N_{i=1} y_i=\text{the population total}\),
it follows that \(E(\hat{\tau}_p)=\tau \),
and \(\hat{\tau}_p\) is an unbiased estimator of τ.
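
To see why the expectation above holds, note that on any single draw, population unit \(j\) is selected with probability \(p_j\), so

\begin{align}
E\left(\dfrac{y_i}{p_i}\right) &= \sum\limits^N_{j=1} p_j \cdot \dfrac{y_j}{p_j} = \sum\limits^N_{j=1} y_j = \tau
\end{align}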

Since \(Var\left(\dfrac{y_i}{p_i}\right)=\sum\limits^N_{i=1} p_i \left(\dfrac{y_i}{p_i}-\tau \right)^2\) and the \(n\) draws are independent, \(Var(\hat{\tau}_p)=\dfrac{1}{n}\sum\limits^N_{i=1} p_i \left(\dfrac{y_i}{p_i}-\tau \right)^2\)

An unbiased estimator for \(Var(\hat{\tau}_p)\) is:

\(\hat{V}ar(\hat{\tau}_p)=\dfrac{1}{n}\cdot \dfrac{\sum\limits^n_{i=1} \left(\dfrac{y_i}{p_i}-\hat{\tau}_p \right)^2}{n-1}\)

and an approximate (1 - α) 100% confidence interval for τ is:

\(\hat{\tau}_p \pm t \cdot \sqrt{\hat{V}ar(\hat{\tau}_p)}\)

where \(t\) is the upper \(\alpha/2\) critical value of the \(t\) distribution with \(n - 1\) degrees of freedom.

For the population mean μ = τ/N, one uses:

\(\hat{\mu}_p=\dfrac{1}{N} \left(\dfrac{1}{n}\cdot \sum\limits^n_{i=1}\dfrac{y_i}{p_i}\right)=\dfrac{\hat{\tau}_p}{N}\)

\(E(\hat{\mu}_p)=\dfrac{\tau}{N}= \mu\)

\(\hat{V}ar(\hat{\mu}_p)=\dfrac{1}{N^2}\cdot \hat{V}ar(\hat{\tau}_p)\)
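
These formulas are simple enough to compute by hand, but a short script makes them easy to reuse. Below is a minimal sketch in Python with NumPy and SciPy (the function name hansen_hurwitz and its dictionary output are our own choices for illustration, not notation from the text); it takes the observed values and their draw probabilities and returns \(\hat{\tau}_p\), its estimated variance, the t-based confidence interval, and, if \(N\) is supplied, the corresponding estimates for μ.

    import numpy as np
    from scipy import stats

    def hansen_hurwitz(y, p, N=None, alpha=0.05):
        """Hansen-Hurwitz estimates from a with-replacement sample.
        y[i] is the value observed on the i-th draw and p[i] is the
        draw probability of the unit selected on that draw."""
        y = np.asarray(y, dtype=float)
        p = np.asarray(p, dtype=float)
        n = len(y)
        ratios = y / p                               # the y_i / p_i terms
        tau_hat = ratios.mean()                      # (1/n) * sum of y_i / p_i
        var_tau = ratios.var(ddof=1) / n             # unbiased estimate of Var(tau_hat)
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)     # t critical value with n - 1 df
        half_width = t * np.sqrt(var_tau)
        out = {"tau_hat": tau_hat, "var_tau": var_tau,
               "ci": (tau_hat - half_width, tau_hat + half_width)}
        if N is not None:                            # estimates for the mean mu = tau / N
            out["mu_hat"] = tau_hat / N
            out["var_mu"] = var_tau / N ** 2
        return out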

How do we perform unequal probability sampling according to given pi?

Example 1

Estimate the total number of computer help requests for last year in a large firm.

The director of the computer support department plans to sample 3 divisions of a large firm that has 10 divisions, with varying numbers of employees per division. Since the number of computer support requests within each division should be highly correlated with the number of employees in that division, the director decides to use unequal probability sampling with replacement, with \(p_i\) proportional to the number of employees in that division.

[Table: the 10 divisions and their numbers of employees (15,650 employees in total); divisions 2, 5, and 8 have 650, 2,840, and 3,200 employees, respectively.]

A. How do we practically implement unequal probability sampling according to the given pi's?

B. With the divisions selected by probability proportional to size, how do we construct the Hansen-Hurwitz estimator for τ?

Answer to A:

[Table: each division assigned a block of consecutive integers from 1 to 15,650, with block length equal to its number of employees.]

We can carry out the probability proportional to size selection by using Minitab:

1. Generate a column C1 that contains the values 1 through 15,650

Calc > Make patterned data > Simple set of numbers


2. Sample 3 values with replacement from the column that contains 1-15650

    Calc > Random Data > Sample from columns



The values generated by Minitab are 1085, 6261, and 9787. These numbers fall into the blocks for division 2, division 5, and division 8.

So we decide to sample division 2, division 5, and division 8. We check the records to find the number of requests for these divisions. The results are given below:

For division 2, y1 = the number of requests = 420
For division 5, y2 = the number of requests = 1785
For division 8, y3 = the number of requests = 2198

(For the random sample shown in this example, the divisions are distinct. For other random samples, the same division may be selected more than once.)

The basic assumption is that the number of requests is proportional to the size of the division.
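
A code version of the selection procedure may also be helpful. The sketch below (Python with NumPy, written purely for illustration; only the total of 15,650 employees and the counts 650, 2,840, and 3,200 for divisions 2, 5, and 8 come from the example, and the remaining employee counts are hypothetical) draws integers from 1 to 15,650 with replacement and maps each draw to the division whose block of integers contains it:

    import numpy as np

    def pps_sample_with_replacement(sizes, n, rng):
        """Draw n units with replacement, with probability proportional to size.
        Each unit is assigned a block of consecutive integers whose length equals
        its size; a uniform integer draw then picks the block it lands in."""
        cum = np.cumsum(sizes)                          # upper end of each unit's block
        draws = rng.integers(1, cum[-1] + 1, size=n)    # integers 1, ..., total size
        return np.searchsorted(cum, draws) + 1          # 1-based unit (division) labels

    # Hypothetical employee counts for the 10 divisions; only the total (15,650)
    # and divisions 2, 5, and 8 (650, 2,840, 3,200) are taken from the example.
    employees = [1000, 650, 1500, 2000, 2840, 1200, 1460, 3200, 900, 900]
    rng = np.random.default_rng(seed=1)
    print(pps_sample_with_replacement(employees, n=3, rng=rng))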

Answer to B:

We compute the Hansen-Hurwitz estimator for Example 1 as follows:


The Hansen-Hurwitz estimator for τ is

\begin{align}
\hat{\tau}_p &=\dfrac{1}{3}\left(420 \cdot \dfrac{15650}{650}+1785 \cdot \dfrac{15650}{2840}+2198 \cdot \dfrac{15650}{3200}\right) \\  &=\dfrac{1}{3}(10112.31+9836.36+10749.59)\\
&=10232.75 \\
\end{align}

Each of the values 10112.31, 9836.36, and 10749.59 is fairly close to the others, so it looks like the variance will not be too large.

\begin{align}
\hat{V}ar(\hat{\tau}_p) &=\dfrac{1}{3}\cdot \dfrac{\sum\limits^3_{i=1} \left(\dfrac{y_i}{p_i}-\hat{\tau}_p \right)^2}{3-1}\\
 &=\dfrac{1}{3}\cdot \dfrac{1}{2}((10112.31-10232.75)^2+(9836.36-10232.75)^2+(10749.59-10232.75)^2)\\
 &=73125.74\\
\end{align}

\(\hat{S}D(\hat{\tau}_p)=270.418\)

(See Example 1 on p. 68-69 in the text to see an example of when a unit is chosen more than once.)
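
As a check, the same numbers come out of the hansen_hurwitz sketch given earlier in this section (again, an illustrative helper, not part of the text):

    y = [420, 1785, 2198]
    p = [650 / 15650, 2840 / 15650, 3200 / 15650]
    est = hansen_hurwitz(y, p)
    print(est["tau_hat"])            # approximately 10232.75
    print(est["var_tau"] ** 0.5)     # approximately 270.42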

Note that in this example the \(p_i\) are chosen proportional to the values of a known positive auxiliary variable such as size, \(p_i=\dfrac{x_i}{\sum x_i}\). When the \(p_i\) are chosen this way, the Hansen-Hurwitz estimator is also called the p.p.s. (probability proportional to size) estimator.

Now, we need to ask ourselves: when and why would we use unequal probability sampling? Let's think about the 'when' first.

When would we elect to use p.p.s.? What if we were sampling Penn State departments? They are of very different sizes: some are very large and others are very small. Would we automatically choose p.p.s.? The idea is that the quantity you are interested in has to be related to size. If it is, then you would want to use p.p.s. However, if what you are interested in has nothing to do with the size of the department, then there is no reason to use p.p.s.

Now, let us address the 'why'. By definition,

\(\tau=\sum\limits_{i=1}^N y_i\) and  \(Var(\hat{\tau}_p)=\dfrac{1}{n}\sum\limits^N_{i=1} p_i \left(\dfrac{y_i}{p_i}-\tau \right)^2\)

For the special and unrealistic case where \(y_i / p_i\) is constant, that constant must be τ and \(Var(\hat{\tau}_p)\) is zero. Therefore, we want \(y_i / p_i\) to be close to constant. In reality, the \(y_i\) are unknown prior to sampling, so we cannot choose the \(p_i\) proportional to the \(y_i\). However, if we know that \(y_i\) is approximately proportional to a known variable such as \(x_i\), then we can choose \(p_i\) proportional to \(x_i\), and \(\hat{\tau}_p\) will have low variance.
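
To spell out the exact-proportionality case (a small added derivation): if \(y_i = c\,x_i\) for some constant \(c\) and \(p_i = x_i \big/ \sum_j x_j\), then

\begin{align}
\dfrac{y_i}{p_i} &= \dfrac{c\, x_i}{x_i \big/ \sum\limits^N_{j=1} x_j} = c \sum\limits^N_{j=1} x_j = \sum\limits^N_{j=1} y_j = \tau
\end{align}

for every unit \(i\), so every possible sample gives \(\hat{\tau}_p = \tau\) and \(Var(\hat{\tau}_p) = 0\).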

Example: Palm Trees

We want to estimate the total number of palm trees on 100 islands in a tropical paradise. The area of each island is known and it is reasonable to think that the number of palm trees on each island is approximately proportional to the size of the island.

The sizes of the islands are known (e.g., island 1 is 1 square mile, island 29 is 5 square miles, and island 36 is 2 square miles), and the total area of these 100 islands is 100 square miles. We find that p1, ... , pN are:

[Figure: the interval (0, 1) divided into 100 subintervals, one per island, where the width of island i's subinterval equals pi.]

How can we sample 4 islands by probabilities p1, ... , p100?

Answer:

  1. Assign an interval of width pi to the ith unit.
  2. Generate 4 random numbers from a uniform distribution on (0, 1).
  3. Select the units whose intervals contain the random numbers.
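
A short Python sketch of these three steps is given below (illustrative only; the vector p of island selection probabilities would be built from the known island areas, and the function name is our own):

    import numpy as np

    def sample_by_intervals(p, n, rng):
        """Assign unit i a subinterval of (0, 1) of width p[i], draw n
        Uniform(0, 1) numbers, and return the (1-based) units whose
        intervals contain them; units may repeat (with replacement)."""
        cum = np.cumsum(p)                    # right endpoints of the intervals
        u = rng.random(n)                     # n Uniform(0, 1) draws
        return np.searchsorted(cum, u) + 1    # 1-based unit labels

    # For the palm-tree example, p would have 100 entries (area / 100 sq. miles),
    # e.g. p[0] = 0.01, p[28] = 0.05, p[35] = 0.02.
    # islands = sample_by_intervals(p, n=4, rng=np.random.default_rng())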

In this example, we use Minitab (Calc > Random Data > Uniform) and get: 0.335257, 0.0065551, 0.401869, 0.318977

The units selected are islands 29, 1, 36, and 29 (since 0.335257 falls between 0.31 and 0.36, 0.0065551 falls between 0 and 0.01, 0.401869 falls between 0.40 and 0.42, and 0.318977 falls between 0.31 and 0.36). The measurements (yi) are:

i      size (sq. miles)   pi      yi
1      1                  0.01    14
29     5                  0.05    50
29     5                  0.05    50
36     2                  0.02    25

Given these results, we can now estimate the total number of palm trees on all of the islands:

 \begin{align}
\hat{\tau}_p &=\dfrac{1}{4}\left(\dfrac{14}{0.01}+\dfrac{50}{0.05}+\dfrac{50}{0.05}+\dfrac{25}{0.02}\right) \\
&=\dfrac{1}{4}(1400+1000+1000+1250)\\
&=1162.5 \\
\end{align}

\begin{align}
\hat{V}ar(\hat{\tau}_p) &=\dfrac{1}{n(n-1)} \sum\limits^n_{i=1} \left(\dfrac{y_i}{p_i}-\hat{\tau}_p \right)^2\\
 &=\dfrac{1}{4\cdot3}[ (1400-1162.5)^2+(1000-1162.5)^2+(1000-1162.5)^2+(1250-1162.5)^2]\\
 &=9739.58\\
\end{align}

\(\hat{S}D(\hat{\tau}_p)=98.69\)

If we are interested in the mean number of trees per island in that population, then

\(\hat{\mu}_p=\dfrac{\hat{\tau}_p}{N}=\dfrac{1162.5}{100}=11.625\)

\begin{align}
\hat{V}ar(\hat{\mu}_p) &=\dfrac{1}{N^2} \cdot \hat{V}ar(\hat{\tau}_p)\\
&=\dfrac{1}{(100)^2}\cdot 9739.58\\
&=0.973958\\
\end{align}

\(\hat{S}D(\hat{\mu}_p)=0.987\)
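
The palm-tree computations can likewise be checked with the hansen_hurwitz sketch from earlier in the section (an illustrative helper, not part of the text):

    y = [14, 50, 50, 25]
    p = [0.01, 0.05, 0.05, 0.02]
    est = hansen_hurwitz(y, p, N=100)
    print(est["tau_hat"])            # 1162.5
    print(est["var_tau"] ** 0.5)     # approximately 98.69
    print(est["mu_hat"])             # 11.625
    print(est["var_mu"] ** 0.5)      # approximately 0.987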