Sampling procedures for harvesting sectors

Regional Economic Data Collection Program for Gulf Coast Alaska

OMB: 0648-0571

2005 Gulf Coast Alaska Fisheries Economic Activity Survey

SAMPLING PROCEDURES FOR HARVESTING SECTORS

The overall project objective is to estimate employment and labor income for each of three
disaggregated harvesting sectors using data to be collected via a mail survey. Using ex-vessel
revenue information, an unequal probability sampling (UPS) procedure will be employed to
determine the sampling plan for each of the three harvesting sectors. The procedure is described
below.

In the literature, there exist many methods for conducting UPS without replacement (see, for
example, Brewer and Hanif 1983; Sarndal 1992). One critical weakness of most of these methods
is that variance estimation is very difficult because the structure of the second-order inclusion
probabilities (πij) is complicated. One method that overcomes this problem is Poisson sampling.
However, under Poisson sampling the sample size is a random variable, which increases the
variability of the estimates produced. An alternative that is similar to Poisson sampling but
overcomes this weakness is Pareto sampling (Rosen 1997), which yields a fixed sample size.

In this project, there are two tasks in estimating the population parameters using UPS without
replacement. First, the optimal sample size needs to be determined. Second, once the optimal
sample size is determined, the population parameters and confidence intervals need to be
estimated. For the first task, we will use the variance of the Horvitz-Thompson (HT) estimator
under Poisson sampling, as described in Part I below. For the second task, we will use the Pareto
sampling method described in Part II below (Slanta 2006). In determining the optimal sample
size in Part I, we will use information on an auxiliary variable (ex-vessel revenue). To estimate
the population parameters in Part II, we will use actual response sample information on the
variables of interest (employment and labor income).

Part I: Estimating Sample Size
Step 1: Estimation of Optimal Sample Size (n*)
(A) Obtaining Initial Probabilities
To obtain the initial values of the inclusion probabilities (πi) for unit i in the population, we
multiply the auxiliary value of unit i (Xi, i.e., the ex-vessel value of vessel i in the population) by
a proportionality constant (t):

$$\pi_i = t X_i \tag{1}$$

where πi : probability of vessel i being included in the survey sample
      Xi : value of the auxiliary variable (ex-vessel value of vessel i in the population)

Here, t is given by

$$t = \frac{\sum_{i=1}^{N} X_i}{V + \sum_{i=1}^{N} X_i^2} \tag{2}$$

where N : population size
      V : desired variance of the HT estimator of the population total (Poisson variance)

Here, V is given as:

$$V = \left( \frac{\varepsilon X}{z_{1-(\alpha/2)}} \right)^2$$

where ε is the error allowed by the investigator [e.g., if ε is 0.1, then an error of 10% of the
true population total ($X = \sum_{i=1}^{N} X_i$) is allowed], and z is the 1 − (α/2) quantile of the
standard normal distribution. Therefore, choosing a desired variance V is equivalent to setting
the values of ε and z. The value of V calculated using

$$V = \sum_{i=1}^{N} \frac{(1 - \pi_i) X_i^2}{\pi_i}$$

(Poisson variance; Brewer and Hanif 1983, page 82), with the πi's being the final values of the N
inclusion probabilities obtained from Step 1, will be equal to the desired variance given at the
beginning of Step 1.
Some of the resulting πi’s could be larger than one. The number of certainty units (i.e., the
number of units for which πi >1) is denoted C1. If πi > 1, then we force this inclusion probability
to equal one (πi = 1).
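
The calculation in (A) can be sketched in a few lines of Python. This is an illustrative sketch
only: the function names, the revenue values, and the choices ε = 0.1 and α = 0.05 are
assumptions made for the example and are not taken from the survey.

```python
from statistics import NormalDist

def desired_variance(x, eps, alpha):
    """V = (eps * X / z)^2, with X the population total and z the 1 - alpha/2 normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (eps * sum(x) / z) ** 2

def initial_probabilities(x, v):
    """Equations (1) and (2): pi_i = t * X_i, forced to one where t * X_i exceeds one."""
    t = sum(x) / (v + sum(xi ** 2 for xi in x))
    return [min(1.0, t * xi) for xi in x]

# Hypothetical ex-vessel revenues for a seven-vessel population (illustrative values only).
revenues = [900.0, 400.0, 120.0, 80.0, 50.0, 30.0, 20.0]
V = desired_variance(revenues, eps=0.1, alpha=0.05)
pi = initial_probabilities(revenues, V)
certainty_count = sum(p == 1.0 for p in pi)  # C1 in the text
```

With these made-up revenues, only the largest vessel becomes a certainty unit after the first
pass; the iteration in (B) then recomputes t for the remaining units.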

(B) Iterations and Determination of Optimal Sample Size
We recalculate t using the noncertainty units (i.e., the units for which πi < 1) obtained in (A)
above, i.e.,

$$t = \frac{\sum_{i=1}^{M_1} X_i}{V + \sum_{i=1}^{M_1} X_i^2} \tag{2'}$$

where M1 : number of noncertainty units from (A), where M1 = N – C1.

Using equation (1) above, we calculate the inclusion probabilities for the noncertainty units by
multiplying the t value [from equation (2’)] by the ex-vessel values of the noncertainty units. If
the resulting πi’s are larger than one, we force them to equal one. The resulting numbers of
certainty and noncertainty units are denoted C2 ( = C1 + additional number of certainty units) and
M2 ( = M1 – additional number of certainty units), respectively, where C2 + M2 = N. Next, for
M2 units of noncertainty, we calculate the t and πi’s again. This is an iterative process. We
continue this process until the noncertainty population stabilizes (i.e., until there is no additional
certainty unit).
If the noncertainty population stabilizes after the kth iteration, there will be Ck certainty units
and Mk noncertainty units, with Ck + Mk = N. Summing the probabilities over all of these
certainty and noncertainty units, we obtain the optimal sample size (n*) as:

$$n^* = \sum_{i=1}^{N} \pi_i \tag{3}$$

At this stage the optimal sample size may not be an integer. We also compute the optimal sample
size under simple random sampling (SRS), nsrs, and compare it with n*.
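
The iteration in (B) can be sketched as follows. The code is a minimal illustration under
stated assumptions: the function names, the `max_iter` safeguard, and the revenue values are
hypothetical, and a unit is treated as a certainty as soon as t·Xi reaches one.

```python
from statistics import NormalDist

def optimal_sample_size(x, v, max_iter=1000):
    """Repeat equations (1) and (2') until no new certainty unit appears."""
    certain = [False] * len(x)
    t = 0.0
    for _ in range(max_iter):
        free = [xi for xi, c in zip(x, certain) if not c]
        if not free:
            break
        # Equation (2'): t is recomputed from the current noncertainty units only.
        t = sum(free) / (v + sum(xi ** 2 for xi in free))
        newly = [(not c) and t * xi >= 1.0 for xi, c in zip(x, certain)]
        if not any(newly):
            break
        certain = [c or n for c, n in zip(certain, newly)]
    pi = [1.0 if c else t * xi for xi, c in zip(x, certain)]
    return pi, sum(pi)  # equation (3): n* is the sum of the probabilities

def poisson_variance(x, pi):
    """Poisson variance of the HT total: sum of (1 - pi_i) * X_i^2 / pi_i."""
    return sum((1 - p) * xi ** 2 / p for xi, p in zip(x, pi))

# Hypothetical revenues; eps = 0.1 and alpha = 0.05 are assumed for illustration.
revenues = [900.0, 400.0, 120.0, 80.0, 50.0, 30.0, 20.0]
z = NormalDist().inv_cdf(0.975)
V = (0.1 * sum(revenues) / z) ** 2
pi, n_star = optimal_sample_size(revenues, V)
```

For this example the process stabilizes with three certainty units and n* of roughly 4.9, and
`poisson_variance(revenues, pi)` reproduces the desired variance V, as stated in Step 1.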
Step 2: Determining Number of Mailout Surveys
(A) Adjustment of Probabilities
Once the optimal sample size (n*) is determined in Step 1, we divide n* by the expected response
rate (obtained from previous studies) to determine the number of surveys that need to be mailed
out to achieve n*. The number thus derived is denoted na (this number may still not be an integer
value). We next adjust the inclusion probabilities for the Mk noncertainty units obtained in Step 1
above as:

$$\pi_i = (n_a - C_k) \left[ \frac{\pi_i}{\sum_{j=1}^{M_k} \pi_j} \right] \tag{4}$$

If the resulting probabilities are larger than one (πi > 1), we make them certainties (πi = 1). The
resulting numbers of certainty and noncertainty units are denoted Ck+1 and Mk+1, respectively.
Next, we adjust the probabilities of the new set of noncertainty units (Mk+1) in a similar way
using equation (4') below:

$$\pi_i = (n_a - C_{k+1}) \left[ \frac{\pi_i}{\sum_{j=1}^{M_{k+1}} \pi_j} \right] \tag{4'}$$

We continue this process until the noncertainty population stabilizes. The resulting numbers of
certainty and noncertainty units are Cq and Mq, respectively.
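
The adjustment in (A) can be sketched as below. The function name, the starting probabilities,
and the 80% response rate are assumptions chosen for illustration; the sketch also assumes the
target exceeds the number of certainty units.

```python
def scale_to_total(pi, target, max_iter=1000):
    """Apply equations (4)/(4') repeatedly so that the probabilities sum to `target`."""
    pi = list(pi)
    for _ in range(max_iter):
        free = [i for i, p in enumerate(pi) if p < 1.0]  # noncertainty units
        n_certain = len(pi) - len(free)
        free_sum = sum(pi[i] for i in free)
        if not free or free_sum == 0:
            break
        factor = (target - n_certain) / free_sum
        for i in free:
            pi[i] = min(1.0, factor * pi[i])  # probabilities above one become certainties
        if all(pi[i] < 1.0 for i in free):
            break  # no new certainty unit: the noncertainty population has stabilized
    return pi

# Hypothetical Step 1 output and an assumed expected response rate of 80%.
pi_step1 = [1.0, 1.0, 1.0, 0.85, 0.53, 0.32, 0.21]
n_star = sum(pi_step1)
n_a = n_star / 0.8
pi_adjusted = scale_to_total(pi_step1, n_a)
```

In this example two further units become certainties during the rescaling, and the adjusted
probabilities sum to na.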
(B) Apply Minimum Probability Rule
At this point, we impose a minimum probability rule. UPS can produce excessively large weights
(= 1/πi) for units with small inclusion probabilities, and if such units report a large value, then
the population estimate and its variance would be very large. To avoid this problem, we can
impose a minimum value on the inclusion probabilities. If m is the minimum imposed probability,
then we do the following:
If πi < m, then set πi = m for each i, where i = 1, ..., N.
The value of m is determined arbitrarily. The only cost involved in using this rule is a small
increase in sample size.
(C) Finding an Integer Value for Sample Size
Next, we add up all the resulting inclusion probabilities. The resulting sum is denoted nb (≥ na),
which may not be an integer value. We then adjust the probabilities again for the noncertainty
units, including the units for which the minimum probability was imposed, as:

$$\pi_i = (n_c - C_q) \left[ \frac{\pi_i}{\sum_{j=1}^{M_q} \pi_j} \right] \tag{5}$$

where nc is the smallest integer value larger than nb (e.g., if nb = 15.3, then nc = 16). Finally, we
add up the resulting (certainty and noncertainty) probabilities. The sum of all these probabilities
is the final survey sample size (i.e., the number of surveys to be sent out), and is denoted nm (=
nc).
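
Steps (B) and (C) can be sketched together. The function name, the probabilities, and the
minimum m = 0.05 are hypothetical; `math.ceil` is used for nc, which matches the rounding
described above whenever nb is not already an integer.

```python
import math

def finalize_probabilities(pi, m):
    """Impose the minimum probability m, then rescale noncertainty units per equation (5)."""
    floored = [max(p, m) for p in pi]  # minimum probability rule
    n_b = sum(floored)
    n_c = math.ceil(n_b)
    free = [i for i, p in enumerate(floored) if p < 1.0]
    c_q = len(floored) - len(free)  # number of certainty units
    final = list(floored)
    if free:
        free_sum = sum(floored[i] for i in free)
        for i in free:
            final[i] = (n_c - c_q) * floored[i] / free_sum
    return final, n_c

# Hypothetical Step 2(A) output with an assumed minimum probability of 0.05.
pi_after_a = [1.0, 1.0, 0.62, 0.35, 0.18, 0.04, 0.02]
pi_final, n_m = finalize_probabilities(pi_after_a, m=0.05)
```

Here nb is 3.25, so nc = 4, and the final probabilities sum to the integer mailout size nm = 4.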

Part II: Estimation of Population Parameters and Confidence Intervals

Step 3: Implementation of Pareto Sampling
After the mailout sample size (nm) for each sector is determined in Step 2, the mailout sample is
selected from each sector’s population using Pareto sampling. The probability of each unit
(vessel) being in the sample in a given sector is proportional to the unit’s (vessel’s) ex-vessel
revenue. Because the majority of gross revenue within each sector comes from a small number
of vessels, a simple random sample of vessels would likely capture only a small portion of the
total ex-vessel value.
According to Brewer and Hanif (1983), there are fifty different approaches that are used for
UPS. Most of these approaches suffer from the weakness that it is very hard to estimate the
variance. Poisson sampling overcomes this problem, and is relatively easy to implement.
However, the limitation of Poisson sampling is that the sample size is a random variable.
Therefore, in this project, we will use Pareto sampling (Rosen 1997 and Saavedra 1995) which
overcomes the limitation of Poisson sampling. The mailout sample size will be nm as determined
in Step 2 (C) above. We will use the inclusion probabilities obtained from Equation (5) above in
implementing Pareto sampling.
The procedure of this sampling method (Block and Crowe 2001) is briefly described here:
1. Determine the probability of selection (πi) for each unit i as in Equation (5) above.
2. Generate a Uniform (0,1) random variable Ui for each unit i.
3. Calculate Qi = Ui (1 – πi) / [πi (1 – Ui)].
4. Sort the units in ascending order by Qi, and select the nm smallest ones for the sample.

From the above, it is clear that we will have a fixed sample size with Pareto sampling.
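
The four steps above can be sketched in Python. The function name, the seed, and the
probabilities are assumptions for illustration; the sketch assumes every πi is strictly
positive.

```python
import random

def pareto_sample(pi, n, rng=None):
    """Select n units by ranking on Q_i = U_i (1 - pi_i) / [pi_i (1 - U_i)]."""
    rng = rng or random.Random()
    ranked = []
    for i, p in enumerate(pi):
        u = rng.random()  # uniform on [0, 1), so 1 - u is never zero
        # Certainty units (pi_i = 1) get Q_i = 0 and are therefore ranked first.
        q = u * (1.0 - p) / (p * (1.0 - u))
        ranked.append((q, i))
    ranked.sort()
    return sorted(i for _, i in ranked[:n])

# Hypothetical final probabilities summing to n_m = 4.
pi_final = [1.0, 1.0, 0.992, 0.56, 0.288, 0.08, 0.08]
sample = pareto_sample(pi_final, n=4, rng=random.Random(12345))
```

Whatever random numbers are drawn, the function returns exactly n unit indices, and the two
certainty units are always among them.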
Step 4: Mailing out Surveys and Obtaining Actual Response Sample
Next, we will send out the surveys to the nm units (vessel owners). The actual response sample
will then be obtained; its size is denoted r.
Step 5: Estimation of Population Parameters (Population Total)
Using the information in the actual response sample, we calculate population parameters for the
variables of interest (employment and labor income in our project), not for ex-vessel revenue,
using the HT estimator (Horvitz and Thompson 1952). We are interested in estimating the
population totals (not population means) of the variables of interest. The HT estimator is given
as:

$$\hat{Y}_{HT} = \sum_{i=1}^{r} w_i y_i \tag{6}$$

where r  : number of respondents
      wi : weight for the ith unit (= 1/πi). Note that the weights are calculated here using the
           information on the auxiliary variable, not that on the variables of interest
      yi : response sample data of the ith unit (employment or labor income)
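
As a small worked illustration of equation (6), the sketch below computes the HT total for four
hypothetical respondents. The function name and all data values are made up for the example.

```python
def horvitz_thompson_total(y, pi):
    """Equation (6): sum of w_i * y_i over respondents, with w_i = 1 / pi_i."""
    return sum(yi / p for yi, p in zip(y, pi))

# Hypothetical employment counts and inclusion probabilities for r = 4 respondents.
employment = [12.0, 8.0, 5.0, 3.0]
pi_resp = [1.0, 0.8, 0.5, 0.25]
ht_total = horvitz_thompson_total(employment, pi_resp)
```

The weights here are 1, 1.25, 2, and 4, so the estimated total is 12 + 10 + 10 + 12 = 44.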

However, the HT estimator needs to be adjusted for non-response. The estimator is adjusted in
the following way.
$$\hat{Y} = \left( \frac{\sum_{j=1}^{N} X_j}{\sum_{i=1}^{r} w_i X_i} \right) \hat{Y}_{HT} \tag{7}$$

where N  : population size
      Xi : auxiliary variable of the ith unit (respondents only)

Usually, we apply this adjustment to the certainties separately from the noncertainties, and then
add the two together to get a final estimate. If there are no respondents within either of the two
groups (certainty units or noncertainty units), then we collapse the two groups before applying
the adjustment. Specifically, the final estimate of the population total is given by:

$$\hat{Y} = \left( \frac{\sum_{j=1}^{N_1} X_j}{\sum_{i=1}^{r_1} w_i X_i} \right) \sum_{i=1}^{r_1} w_i y_i + \left( \frac{\sum_{j=1}^{N_2} X_j}{\sum_{i=1}^{r_2} w_i X_i} \right) \sum_{i=1}^{r_2} w_i y_i \tag{8}$$

where N1 : number of certainty units in the population
      N2 : number of noncertainty units in the population
      r1 : number of respondents from certainty units
      r2 : number of respondents from noncertainty units, and
N1 + N2 = N and r1 + r2 = r.
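
The nonresponse adjustment in equations (7) and (8) can be sketched as follows. The function
names and all data values are hypothetical, and the sketch assumes each group has at least one
respondent (so no collapsing of groups is needed).

```python
def ratio_adjusted_total(y, x, pi, x_population_total):
    """Equation (7): scale the HT total of y by (population total of X) / (HT total of X)."""
    ht_y = sum(yi / p for yi, p in zip(y, pi))
    ht_x = sum(xi / p for xi, p in zip(x, pi))
    return (x_population_total / ht_x) * ht_y

def grouped_adjusted_total(certainty, noncertainty):
    """Equation (8): adjust certainty and noncertainty groups separately, then add."""
    return sum(ratio_adjusted_total(*group) for group in (certainty, noncertainty))

# Each group: (y of respondents, X of respondents, pi of respondents, population total of X).
certainty_group = ([12.0, 9.0], [900.0, 400.0], [1.0, 1.0], 1700.0)
noncertainty_group = ([5.0, 3.0], [80.0, 30.0], [0.5, 0.25], 300.0)
y_hat = grouped_adjusted_total(certainty_group, noncertainty_group)
```

For these made-up values the certainty group contributes 21 × 1700/1300 and the noncertainty
group contributes 22 × 300/280, for a total of about 51.0.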
Step 6: Estimation of Variance for ŶHT and Ŷ

Here we will calculate the variances of the population estimates for the variables of interest. The
variance estimate for Pareto sampling is given in Rosen (1997, Equation (4-11), p. 173) as:
$$\mathrm{Var}(\hat{Y}_{HT}) = \frac{n_m}{n_m - 1} \left\{ \left[ \sum_{i=1}^{n_m} (1 - \pi_i) \left( \frac{y_i}{\pi_i} \right)^2 \right] - \frac{\left[ \sum_{i=1}^{n_m} y_i \left( \frac{1 - \pi_i}{\pi_i} \right) \right]^2}{\sum_{i=1}^{n_m} (1 - \pi_i)} \right\} \tag{9}$$
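
Equation (9) can be evaluated directly, as in the sketch below. The function name and data are
hypothetical, and the example assumes all nm sampled units responded so that the sums run over
the full sample.

```python
def pareto_ht_variance(y, pi):
    """Equation (9): variance estimate of the HT total under Pareto sampling."""
    n_m = len(y)
    first = sum((1 - p) * (yi / p) ** 2 for yi, p in zip(y, pi))
    second = sum(yi * (1 - p) / p for yi, p in zip(y, pi))
    denom = sum(1 - p for p in pi)
    if denom == 0:
        return 0.0  # every unit is a certainty, so there is no sampling variability
    return n_m / (n_m - 1) * (first - second ** 2 / denom)

# Hypothetical data for n_m = 4 sampled units.
employment = [12.0, 8.0, 5.0, 3.0]
pi_sample = [1.0, 0.8, 0.5, 0.25]
var_ht = pareto_ht_variance(employment, pi_sample)
```

With these values the first bracket is 178, the squared second bracket is 256, and the
denominator is 1.45, giving a variance estimate of about 1.93.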

Since we have adjusted for nonresponse, we need to incorporate the variability due to
nonresponse into the variance. If we assume that the response mechanism is fixed, then we
have a ratio estimator and its variance can be found in Hansen, Hurwitz, and Madow (1953, page
514). This variance is based on a Taylor expansion, and is given as:

$$\mathrm{Var}(\hat{Y}) = \hat{Y}^2 \left( \frac{\hat{\sigma}^2(A)}{A^2} + \frac{\hat{\sigma}^2(B)}{B^2} - \frac{2\,\mathrm{COV}(A, B)}{AB} \right) \tag{10}$$

where

$$A = \sum_{i=1}^{r} w_i y_i$$

$$B = \sum_{i=1}^{r} w_i X_i$$

$$\hat{\sigma}^2(A) = \frac{n_m}{n_m - 1} \left\{ \left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i y_i)^2 \right] - \frac{\left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i y_i) \right]^2}{\sum_{i=1}^{n_m} (1 - \pi_i)} \right\}$$

$$\hat{\sigma}^2(B) = \frac{n_m}{n_m - 1} \left\{ \left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i X_i)^2 \right] - \frac{\left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i X_i) \right]^2}{\sum_{i=1}^{n_m} (1 - \pi_i)} \right\}$$

$$\mathrm{COV}(A, B) = \frac{n_m}{n_m - 1} \left\{ \left[ \sum_{i=1}^{r} (1 - \pi_i) w_i^2 y_i X_i \right] - \frac{\left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i y_i) \right] \left[ \sum_{i=1}^{r} (1 - \pi_i)(w_i X_i) \right]}{\sum_{i=1}^{n_m} (1 - \pi_i)} \right\}.$$
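
The components of equation (10) share one algebraic form, which the sketch below exploits. The
function names and data are hypothetical. Following the formulas above, the numerators are
summed over the r respondents while the denominator is summed over all nm mailed units, so the
sketch takes both sets of probabilities; it assumes at least one mailed unit is a noncertainty.

```python
def _weighted_moment(u, v, pi_resp, pi_mailout):
    """Shared form of sigma^2(A), sigma^2(B), COV(A, B) with u_i = w_i*y_i, v_i = w_i*X_i."""
    n_m = len(pi_mailout)
    cross = sum((1 - p) * ui * vi for ui, vi, p in zip(u, v, pi_resp))
    sum_u = sum((1 - p) * ui for ui, p in zip(u, pi_resp))
    sum_v = sum((1 - p) * vi for vi, p in zip(v, pi_resp))
    denom = sum(1 - p for p in pi_mailout)  # sum over all n_m mailed units
    return n_m / (n_m - 1) * (cross - sum_u * sum_v / denom)

def ratio_estimator_variance(y, x, pi_resp, pi_mailout, x_population_total):
    """Equation (10): Taylor-expansion variance of the nonresponse-adjusted total."""
    wy = [yi / p for yi, p in zip(y, pi_resp)]
    wx = [xi / p for xi, p in zip(x, pi_resp)]
    a, b = sum(wy), sum(wx)
    y_hat = x_population_total / b * a  # equation (7)
    var_a = _weighted_moment(wy, wy, pi_resp, pi_mailout)
    var_b = _weighted_moment(wx, wx, pi_resp, pi_mailout)
    cov_ab = _weighted_moment(wy, wx, pi_resp, pi_mailout)
    return y_hat ** 2 * (var_a / a ** 2 + var_b / b ** 2 - 2 * cov_ab / (a * b))

# Hypothetical data: three respondents out of four mailed noncertainty units.
revenue = [120.0, 80.0, 30.0]
employment = [8.0, 5.0, 3.0]
pi_resp = [0.8, 0.5, 0.25]
pi_mailout = [0.8, 0.6, 0.5, 0.25]
var_y = ratio_estimator_variance(employment, revenue, pi_resp, pi_mailout, 300.0)
```

A useful sanity check on this form: if yi were exactly proportional to Xi, the three terms in
equation (10) would cancel and the estimated variance would be zero.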

Step 7: Calculation of Confidence Intervals
Confidence intervals are calculated using the response sample statistics obtained in Steps 5 and 6.
We choose only one sample, but if many independent samples were chosen, then we would
expect that, on average, approximately 100(1-α)% of the confidence intervals constructed in the
following manner would contain the true value:

$$\left( \hat{Y} - z_{\alpha/2} \sqrt{\mathrm{Var}(\hat{Y})},\; \hat{Y} + z_{\alpha/2} \sqrt{\mathrm{Var}(\hat{Y})} \right) \tag{11}$$

where Ŷ : estimated population total for employment or labor income.

Note that it is possible to use t-statistics if the sample size is small.
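
The interval in equation (11) can be computed as in the sketch below, which uses the normal
quantile only; the t-statistic variant mentioned above is not covered. The function name and
the input values are assumptions for illustration.

```python
import math
from statistics import NormalDist

def confidence_interval(estimate, variance, alpha=0.05):
    """Equation (11): estimate plus or minus z times the square root of the variance."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of the standard normal
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Hypothetical estimate of 100 with an estimated variance of 25.
lower, upper = confidence_interval(100.0, 25.0, alpha=0.05)
```

For α = 0.05 the quantile is about 1.96, so the interval in this example is roughly
(90.2, 109.8).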

File Type	application/pdf
File Title	C:\PRA\OMB83I pre-ps.WP6.wpd
Author	rroberts
File Modified	2007-04-25
File Created	2007-04-25