ECDS Sample Design

Attachment H.pdf

2017 Early Career Doctorates Survey

ECDS Sample Design

OMB: 3145-0235


Attachment H: ECDS Sample Design

Attachment H – ECDS Sample Design
ECDS Sample Design Plan
The Early Career Doctorates Survey (ECDS) plans to collect data from about 18,000 early career
doctorates (ECD). The sample design will be a two-stage stratified sample of U.S. academic institutions,
federally funded research and development centers (FFRDCs), and the National Institutes of Health (NIH)
Intramural Research Programs (IRPs), and individuals working at these institutions. At the first stage, we
will select approximately 350 institutions with selection probability proportional to their size (PPS),
where the size measure is described later in this document. We expect 300 of these institutions will
participate in the survey and provide lists of the ECD working at their institution. These lists will then be
used as the sampling frame for the second (individual) stage of data collection. At the second stage,
individual sample members will be selected within institutions such that their overall (unconditional)
selection probabilities are equal across sample members within each of the following domains of analysis:
employment sector (institution type), postdoc status, sex, citizenship, and race/ethnicity.
This self-weighting sample design is also known as an equal probability of selection method (epsem)
sample. The sampling weight is calculated as the inverse of the probability of selection. When the probability
of selection is equal for all sampling units, their sampling weights are also equal (constant). When the
sampling weights vary across units, this variability increases the variance of the estimates. Thus, a
sample with equal selection probabilities will be more efficient than one with unequal selection probabilities.
The ECDS has several domains of interest, some of which are relatively small in the population. A
completely proportional sample allocation would achieve equal probabilities of selection across all
domains and strata, but to achieve adequate precision across all of the domains with a proportionate
allocation would require an extremely large sample size. Therefore, to achieve adequate precision within
and across domains while controlling the total sample size across domains, sampling rates will be allowed
to vary across the domains and strata. The domains of analysis, specification of sample size, and selection
of institutions (first-stage sampling) and ECD (second-stage sampling), treatment of missing frame
variables, and sample release strategy are discussed in detail below.
In summary, the steps in sample selection are as follows:
(a) Step 1: Determine the domain minimum effective sample size based on pre-specified
coefficients of variation (CVs) by domain of analysis, in order to allocate the sample of ECD by domain.
(b) Step 2: Determine the sample size of institutions (first-stage sampling) by sampling stratum based
on the expected number of ECD respondents per institution.
(c) Step 3: Calculate composite measure of size for the first-stage sampling for each institution in the
frame, determine certainty institutions, and draw sample of non-certainty institutions.
(d) Step 4: Collect list of potential ECD from each sampled institution. Impute missing values in the
sampling variables as necessary, and evaluate imputation results.
(e) Step 5: Calculate the second-stage sampling rate for each institution by sampling stratum and
domain, and draw samples of ECD.


A. Domain of Analysis and Sample Size
The first exercise is to allocate a sample size of approximately 16,750 ECD in the U.S. academic
institutions,[1] 850 ECD in the FFRDCs, and 400 ECD in the NIH IRPs into each domain of analysis.[2] Note
that these are respondent sample sizes and will need to be inflated by the anticipated response and
eligibility rates. The allocation to the analytic domains is determined based on the level of detail needed
in the ECDS tables and the information available on the frame, and the sample sizes are determined so as
to produce estimates with specific precision defined by the desired CV within each domain. Because it is
important that the analyses that are produced for these domains are supported by adequate sample sizes,
the domain population size information (when available) can be used to allocate the sample across
domains; this gives the flexibility to over- or under-sample certain domains.[3] Therefore, based on the
specific precision requirement, a threshold or minimum effective sample size should be determined for
each domain.
For the U.S. academic institutions, the domains of interest are cells defined by the Institution Type,
Postdoc Status, Sex, and Citizenship-Race-Ethnicity variables as shown in Table 1. A priori counts of
ECD by gender, citizenship, and race/ethnicity are not available for the FFRDCs and NIH IRP, and as a
result the same strata as in the U.S. academic institutions (GSS Substrata) cannot be constructed. Instead,
the composite size measures for these two strata will only include overall size and postdoc status.
Allocating the sample to the domains proportionally means that larger domains would get larger sample
sizes and smaller domains would get smaller sample sizes; this allocation would provide smaller
variances and would be efficient for estimation that cuts across domains. However, this sample size
allocation might end up with some small domains with too small of a sample size for analysis of interest,
and hence would not meet the pre-specified precision requirements. An alternative option is to allocate
the samples equally across domains. This equal sample allocation across domains usually has an
advantage of higher statistical power for tests for comparisons. However, this comes with a price that the
variance used in the analysis might be larger due to variation in the weights resulting from oversampling
or undersampling some domains. Therefore, the allocation of the ECDS sample started with an
approximately proportional sample allocation, which was then iterated to satisfy a minimum sample size
threshold at each domain level based on the required precision of analysis. This results in an allocation that
satisfies precision constraints for multiple domains but is no longer exactly proportional. For example,
small domains that are of interest for analysis are sampled at higher rates compared to domains that are
not as rare.

[1] The sampling frame for U.S. academic institutions can be developed from the National Science Foundation –
National Institutes of Health Survey of Graduate Students and Postdoctorates in Science and Engineering (GSS)
data.
[2] Ideally the sample should be allocated proportionally to the U.S. academic institutions, FFRDCs, and NIH IRPs. In
doing so, however, the FFRDCs and NIH IRPs would have a small sample size compared to the sample allocated to the
U.S. academic institutions. In such a case, the comparison across institution types would not be optimal (it may not detect a
meaningful difference for a given power of the test, or for a specified minimum detectable difference the power of
the test is low). This allocation is subject to change after discussion with the National Science Foundation (NSF).
[3] When information on domain size is not available and the sample size allocation across domains cannot be
controlled, a random sample might produce proportional sample sizes across domains, but this is not guaranteed. When the
sample is proportional, small domains will receive smaller sample sizes, which could lead to estimation issues such
as unreliable estimates.

Table 1. Domains of interest, expected coefficient of variation (CV), and associated minimum
sample sizes needed for a total sample size of 18,000 ECD from U.S. academic institutions (GSS
institutions), FFRDCs, and NIH IRPs

Domain level: GSS substrata × Postdoc Status × Sex × Citizenship-Race-Ethnicity for the first 2 strata
(Medical schools/centers, and Very high research activity) [c]

Category                                                          Min. sample size [a]  Expected CV [b]
Med-schools; Postdoc; Non-U.S. citizen; Female                    313                   0.06
Med-schools; Postdoc; Non-U.S. citizen; Male                      505                   0.06
Med-schools; Postdoc; U.S. citizen–White; Female                  222                   0.08
Med-schools; Postdoc; U.S. citizen–White; Male                    223                   0.08
Med-schools; Postdoc; U.S. citizen–Asian; Female                  109                   0.10
Med-schools; Postdoc; U.S. citizen–Asian; Male                    110                   0.10
Med-schools; Postdoc; U.S. citizen–Minority; Female               104                   0.10
Med-schools; Postdoc; U.S. citizen–Minority; Male                 102                   0.10
Med-schools; Non-Postdoc; Non-U.S. citizen; Female                77                    0.12
Med-schools; Non-Postdoc; Non-U.S. citizen; Male                  102                   0.12
Med-schools; Non-Postdoc; U.S. citizen–White; Female              429                   0.08
Med-schools; Non-Postdoc; U.S. citizen–White; Male                408                   0.08
Med-schools; Non-Postdoc; U.S. citizen–Asian; Female              178                   0.08
Med-schools; Non-Postdoc; U.S. citizen–Asian; Male                180                   0.08
Med-schools; Non-Postdoc; U.S. citizen–Minority; Female           143                   0.10
Med-schools; Non-Postdoc; U.S. citizen–Minority; Male             114                   0.10
Very-High-Research; Postdoc; Non-U.S. citizen; Female             341                   0.07
Very-High-Research; Postdoc; Non-U.S. citizen; Male               536                   0.07
Very-High-Research; Postdoc; U.S. citizen–White; Female           393                   0.07
Very-High-Research; Postdoc; U.S. citizen–White; Male             401                   0.07
Very-High-Research; Postdoc; U.S. citizen–Asian; Female           133                   0.12
Very-High-Research; Postdoc; U.S. citizen–Asian; Male             165                   0.10
Very-High-Research; Postdoc; U.S. citizen–Minority; Female        113                   0.12
Very-High-Research; Postdoc; U.S. citizen–Minority; Male          108                   0.10
Very-High-Research; Non-Postdoc; Non-U.S. citizen; Female         173                   0.12
Very-High-Research; Non-Postdoc; Non-U.S. citizen; Male           244                   0.08
Very-High-Research; Non-Postdoc; U.S. citizen–White; Female       1,067                 0.05
Very-High-Research; Non-Postdoc; U.S. citizen–White; Male         970                   0.05
Very-High-Research; Non-Postdoc; U.S. citizen–Asian; Female       239                   0.12
Very-High-Research; Non-Postdoc; U.S. citizen–Asian; Male         300                   0.08
Very-High-Research; Non-Postdoc; U.S. citizen–Minority; Female    178                   0.12
Very-High-Research; Non-Postdoc; U.S. citizen–Minority; Male      177                   0.12

Domain level: FFRDC × Postdoc Status

Category                                                          Min. sample size [a]  Expected CV [b]
FFRDC; Non-Postdoc                                                438                   0.05
FFRDC; Postdoc                                                    406                   0.05

Domain level: NIH IRP × Postdoc Status

Category                                                          Min. sample size [a]  Expected CV [b]
NIH IRP; Non-Postdoc                                              123                   0.12
NIH IRP; Postdoc                                                  287                   0.08
[a] The minimum sample size in this column is the sample size threshold that is set to ensure that all domains would
have effective sample sizes larger than or equal to the threshold sample sizes. In this exercise, the minimum sample
size is calculated based on the pre-specified expected CV under the conservative calculation using a proportion of 0.5
and the design effect calculated from Pilot ECDS data.
[b] Expected (or desired) CVs were provided by the NSF. The expected CVs are developed based on reviewing
analytical goals and the estimated CVs achievable under the full sample size of 18,000.
[c] Constraints were not set for the domains Postdoc Status x Sex x Citizenship-Race-Ethnicity in the “GSS high research activity”
and “GSS All other colleges and universities” strata. The population sizes are so small in these domains that achieving adequate
precision would require selecting a very high proportion of the ECD.

In Table 1, we present the list of domains of interest for analyses and tabulations for the U.S. academic
institutions. We proposed the minimum sample size by inflating the effective sample size by the design
effect due to unequal weight variation.[4] Precision in this table is expressed as the coefficient of variation
for estimating a proportion of an ECD characteristic within a domain, where the proportion is set at 0.5.
The computation of the minimum effective sample size in Table 1 is described below.
Suppose we want to estimate a proportion (or mean) of a certain characteristic for ECD within a certain
domain $d$, for example, the proportion of respondents who expressed a change in their career track interest
among U.S. citizen early career doctorates. Let $P_d$ denote the proportion of U.S. citizen early career
doctorates who expressed a change in their career track interest. The estimate of $P_d$, denoted by $p_d$, is
calculated based on a sample of size $n_d$, and this estimate has variance $\mathrm{Var}(p_d)$. For the purpose of
calculating the sample size, this variance can be expressed as:

\[ \mathrm{Var}(p_d) = \left(1 - \frac{n_d}{N_d}\right) \frac{P_d (1 - P_d)}{n_d} \, \mathrm{Deff}_d \]

where $N_d$ denotes the population size, and $\mathrm{Deff}_d$ is the overall design effect due to unequal weight
variation and from clustering as a result of the two-stage sampling. The design effect estimate $\mathrm{Deff}_d$
used in the allocation was obtained from the Pilot ECDS. This formula can be inverted for sample size
calculation:

\[ n_d = \frac{P_d (1 - P_d) \, \mathrm{Deff}_d}{\mathrm{Var}(p_d) + \dfrac{P_d (1 - P_d) \, \mathrm{Deff}_d}{N_d}}. \]

Replacing the variance with $\mathrm{CV}^2(p_d) = \mathrm{Var}(p_d) / P_d^2$, and using $P_d = 0.5$, this formula can be simplified
as:

\[ n_d = \frac{\mathrm{Deff}_d}{\mathrm{CV}^2(p_d) + \dfrac{\mathrm{Deff}_d}{N_d}}. \]
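As a rough numerical illustration of the simplified formula, the sketch below computes the minimum sample size for a target CV and then checks it by plugging the result back into the variance expression. The function names and the input values (a design effect of 1.5 and a domain of 2,000 ECD) are hypothetical and chosen only for illustration; they are not figures from the Pilot ECDS.

```python
import math


def min_sample_size(cv, deff, pop_size):
    """Minimum n_d = Deff_d / (CV^2 + Deff_d / N_d), assuming P_d = 0.5."""
    if cv <= 0 or deff <= 0 or pop_size <= 0:
        raise ValueError("cv, deff and pop_size must be positive")
    return deff / (cv ** 2 + deff / pop_size)


def variance_of_proportion(n, p, deff, pop_size):
    """Var(p_d) = (1 - n_d / N_d) * P_d (1 - P_d) / n_d * Deff_d."""
    return (1 - n / pop_size) * p * (1 - p) / n * deff


# Hypothetical inputs: target CV of 0.10, design effect 1.5, 2,000 ECD in the domain.
n_d = min_sample_size(cv=0.10, deff=1.5, pop_size=2000)

# Plugging n_d back into the variance formula should recover the target CV.
achieved_cv = math.sqrt(variance_of_proportion(n_d, 0.5, 1.5, 2000)) / 0.5
print(round(n_d, 1), round(achieved_cv, 4))
```

The round trip through the variance formula is a convenient check that the inversion and the $P_d = 0.5$ simplification agree with each other.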

To achieve the required minimum effective sample size above, the sample of 18,000 ECD needs to be
allocated to the domains by accounting for the increase in design effect due to weight variation within
domains. This is done iteratively in the following steps:
• Allocate 18,000 proportionally to all domains,
• adjusting for effective sample size in FFRDC and NIH IRP domains x Postdoc Status,
• adjusting for effective sample size in domains defined by GSS stratum (only for GSS medical
schools and centers, and GSS very high research activity university, ignoring GSS high research
activity and all other GSS colleges and university) x Postdoc x Gender x Race-Foreign,
• adjusting for effective sample size in domains defined by Postdoc x Gender x Race-Foreign,
• adjusting for effective sample size in domains defined by GSS substrata x Postdoc,
• adjusting for effective sample size in domains defined by GSS substrata,
• adjusting for effective sample size in domains defined by postdoc,
• adjusting for effective sample size in domains defined by citizenship-race-ethnicity,
• adjusting for effective sample size in domains defined by gender, and
• adjusting for effective sample size in overall domain.

[4] An effective sample size can be defined as the ratio of the actual sample size to the design effect due to unequal weight
variation: $n_{eff} = n / deff_w$, where $deff_w = n \sum_i w_i^2 / (\sum_i w_i)^2$. When there is no variability in the weights, $n_{eff} = n$.
The effective sample size is used here instead of just the sample size because when the weights vary within a
domain, this weight variation will increase the variance of estimates. The effective sample size takes such
weight variation into account.
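The effective sample size used in these steps can be computed directly from a set of weights. The following is a minimal sketch of the unequal-weighting design effect $n \sum_i w_i^2 / (\sum_i w_i)^2$ and the resulting effective sample size; the function names and the example weights are hypothetical.

```python
def weighting_deff(weights):
    """Design effect due to unequal weights: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    if n == 0:
        raise ValueError("weights must not be empty")
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def effective_sample_size(weights):
    """Effective sample size: actual sample size divided by the design effect."""
    return len(weights) / weighting_deff(weights)


# Equal weights: no loss, the effective size equals the actual size.
equal = [10.0] * 4
# Unequal weights (hypothetical): half the units carry three times the weight.
unequal = [1.0, 1.0, 3.0, 3.0]

print(weighting_deff(equal), effective_sample_size(equal))
print(weighting_deff(unequal), effective_sample_size(unequal))
```

With the unequal weights above, four sampled units carry the information of only 3.2 equally weighted units, which is why the allocation inflates sample sizes in domains with variable weights.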

In each step above, the sample size allocation takes into account the design effect due to unequal weight
variation, to ensure that the minimum effective sample size would produce precision that meets the
pre-specified CV. The adjustments are carried out as follows:
(a) Proportionally allocate the sample size of 18,000 to the 68 domains defined by 64 domains of
GSS Institution Type × Postdoc Status × Sex × Citizenship-Race-Ethnicity (the lowest domain
level) and the 4 domains of FFRDC/NIH × Postdoc Status.
(b) Calculate the design effects and the effective sample sizes (at the first cycle, the design effect is 1
because sample sizes are proportional). Check if any of the domains above has the effective
sample size less than that specified in the above table.
(c) For a given level of domains, adjust the sample size in domains where the effective sample size is
less than that specified as follows. Suppose in a specific level of domains, there are d1 domains
(d1>0) whose sample size is less than specified. For these d1 domains, calculate the
adjustment factor as:

\[ af = \frac{n_{min}}{n_{cell}}, \]

where $n_{min}$ and $n_{cell}$ are, respectively, the threshold/minimum effective sample size and the
original sample size in the sampling cell. Inflate the original sample size in the sampling cell in
these d1 domains by multiplying it by $af$; that is, $n_{adjust} = n_{cell} \times af$. For the remaining
domains, recalculate the sample size by allocating the remaining total sample size proportionally
to the remaining domains.

(d) For the next domain level, calculate the design effect and the effective sample size within each
domain. Check if any of these domains has a sample size less than specified. Suppose there are d2
domains (d2>0) whose sample size is less than specified. For the d2 domains where the effective
sample size is less than specified, calculate the adjustment factor $af$ and inflate the original
sample size in the sampling cell in these d2 domains by multiplying it by $af$. For the remaining
domains, recalculate the sample size by allocating the remaining sample size proportionally to
the remaining domains, while also keeping the minimum sample size assigned in the previous
iteration. That is, when allocating the sample proportionally, maintain the minimum sample size
requirement; this is done by using the distribution of the sample allocation in the previous step (which meets
the minimum sample size threshold).
(e) Repeat these processes for all other levels of domain. In each level of domain, check if any of
domains has the sample size below threshold, and then adjust as in (c) or (d).
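The core of step (c) can be sketched for a single domain level as follows. This is a simplified illustration, not the full multi-level procedure: it assumes a design effect of 1 (so effective and actual sample sizes coincide), handles only one level of domains, and uses hypothetical population counts and thresholds.

```python
def allocate_with_minimums(pop_sizes, minimums, total):
    """Proportional allocation with minimum thresholds for one domain level.

    Domains whose proportional share falls below their threshold are inflated
    by the adjustment factor (threshold / current size) and held fixed; the
    remaining sample is reallocated proportionally to the other domains.
    Assumes positive population sizes and a design effect of 1.
    """
    if sum(minimums.values()) > total:
        raise ValueError("thresholds exceed the total sample size")
    fixed = {}
    while True:
        free = [d for d in pop_sizes if d not in fixed]
        remaining = total - sum(fixed.values())
        free_pop = sum(pop_sizes[d] for d in free)
        alloc = {d: remaining * pop_sizes[d] / free_pop for d in free}
        short = [d for d in free if alloc[d] < minimums[d]]
        if not short:
            alloc.update(fixed)
            return alloc
        for d in short:
            adjustment = minimums[d] / alloc[d]   # adjustment factor
            fixed[d] = alloc[d] * adjustment      # inflated sample size


# Hypothetical domain sizes and thresholds.
pop = {"A": 9000, "B": 900, "C": 100}
mins = {"A": 100, "B": 100, "C": 100}
result = allocate_with_minimums(pop, mins, total=1000)
print({d: round(n, 1) for d, n in result.items()})
```

In this toy case the two small domains are raised to their thresholds and the large domain absorbs the reduction, so the total stays fixed while the allocation is no longer exactly proportional.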
Note that if similar variables (Postdoc Status, Sex, and Citizenship-Race-Ethnicity) are available to
construct domains in the FFRDCs and NIH IRPs as in the GSS data, the population counts in these two
strata can be combined into the above exercise to allocate a total of 18,000 sampled ECD.
Table 2 shows the resulting numbers of responding, eligible ECD that are needed to satisfy the precision requirements.
Table 3 gives the corresponding sampling rates. Note that the sampling rates within the domains and first-stage
institution strata are not all constant; this variability in the sampling rates is a consequence of allocating
the fixed sample size of 18,000 to the strata and domains in order to satisfy multiple variance constraints.
Under this allocation, the design effect due to weight variation for the GSS is 1.09, and for the entire
sample (GSS, FFRDC, and NIH IRP combined) is 1.11. Note also that these are numbers of responding,
eligible ECD; the actual number to be sampled will be obtained by inflating by the anticipated response
and eligibility rates.
Table 2. Sample size allocation of 18,000 ECD by sampling domains

Non-Postdoc
                                  Foreign         White           Asian           Minority        Total
Stratum                           Female  Male    Female  Male    Female  Male    Female  Male    Non-Postdoc
GSS Medical schools and centers   124     102     468     520     302     164     143     114     1,937
GSS Very high research activity   281     244     1,068   972     343     300     178     177     3,563
GSS High research activity        182     163     800     835     175     165     181     142     2,643
GSS All other colleges and univ.  189     114     1,194   1,014   297     145     298     218     3,469
FFRDC                                                                                             439
NIH IRP                                                                                           123
Total                                                                                             12,174

Postdoc
                                  Foreign         White           Asian           Minority        Total    Stratum
Stratum                           Female  Male    Female  Male    Female  Male    Female  Male    Postdoc  Total
GSS Medical schools and centers   367     522     256     233     104     104     112     101     1,799    3,736
GSS Very high research activity   432     845     442     446     126     156     122     107     2,676    6,239
GSS High research activity        63      136     56      60      10      18      11      10      364      3,007
GSS All other colleges and univ.  59      114     35      46      9       14      7       10      294      3,763
FFRDC                                                                                             406      845
NIH IRP                                                                                           287      410
Total                                                                                             5,826    18,000

Table 3. Sampling rates for the second-stage strata for the allocation shown in Table 2

Non-Postdoc
                                  Foreign         White           Asian           Minority        Total
Stratum                           Female  Male    Female  Male    Female  Male    Female  Male    Non-Postdoc
GSS Medical schools and centers   13.7%   6.1%    4.0%    4.0%    8.8%    4.0%    5.3%    5.4%    4.9%
GSS Very high research activity   12.8%   5.5%    6.5%    5.0%    8.8%    5.9%    5.4%    6.0%    6.2%
GSS High research activity        17.8%   9.4%    9.4%    9.4%    14.5%   9.5%    9.8%    9.8%    10.0%
GSS All other colleges and univ.  12.4%   5.3%    5.4%    5.4%    10.1%   5.4%    5.6%    5.6%    5.8%
FFRDC                                                                                             8.9%
NIH IRP                                                                                           19.9%

Postdoc
                                  Foreign         White           Asian           Minority        Total
Stratum                           Female  Male    Female  Male    Female  Male    Female  Male    Postdoc
GSS Medical schools and centers   7.5%    6.7%    7.8%    6.8%    8.9%    8.5%    15.2%   15.6%   7.8%
GSS Very high research activity   7.4%    6.7%    9.0%    6.8%    8.9%    8.4%    12.8%   11.5%   7.6%
GSS High research activity        11.8%   10.8%   12.0%   10.6%   9.3%    9.0%    12.2%   10.2%   11.0%
GSS All other colleges and univ.  16.4%   14.4%   16.1%   14.6%   11.3%   11.9%   14.9%   12.7%   14.6%
FFRDC                                                                                             15.6%
NIH IRP                                                                                           16.4%


B. First-stage Sampling: Selection of Institutions
A total of approximately 300 responding institutions will be included in this survey. The sample of
institutions will be selected through probability proportional to size (PPS) sampling. First, the type of
institution (U.S. academic institution, FFRDC, and NIH IRP) serves as the sampling stratum in this first
stage of sampling (Primary Sampling Unit/PSU strata). The selection of GSS institutions will be independent of
the selection of the FFRDC institutions and NIH programs. All NIH IRPs (25 programs) will be selected
with certainty, while the institutions in the other strata will be sampled. The first-stage sampling strata
will also serve as the basis for domains of analysis; the population of institutions by stratum is given in
Table 4 below:
Table 4. Institution count in the population by stratum

Stratum  Description of type of institutions           Number of institutions  Expected number of responding
number                                                 in the population       institutions in the sample
1        GSS Medical schools and centers               172                     53
2        GSS Very high research activity universities  109                     76
3        GSS High research activity universities       98                      54
4        GSS All other colleges and universities       461                     67
5        FFRDC                                         43                      25
6        NIH IRP                                       25                      25
Total                                                  908                     300

For the purposes of this sampling plan, $h$, $i$, $j$, and $k$, respectively, indicate indexes for stratum, institution,
domain, and ECD as follows:
$h$ = index for the first-stage sampling stratum; $h = 1, \ldots, 6$ (U.S. academic institution, FFRDC,
NIH IRP)
$i$ = index for institution; $i = 1, \ldots, I_h$, where $I_h$ = the total number of eligible institutions in
stratum $h$ in the frame
$j$ = index for domain; $j = 1, \ldots, J$, where $J$ is the number of domains of interest
$k$ = index for ECD.

Under the PPS sampling, the measure of size for each eligible institution $i$ within stratum $h$ in the frame
will be determined as a composite measure of size $S_{hi}$ as follows (see Folsom, Potter, and Williams, 1987
for more details on composite size measures):

\[ S_{hi} = \sum_{j=1}^{J} n_{hj} \frac{N_{hij}}{N_{hj}} = \sum_{j=1}^{J} f_{hj} N_{hij} \qquad (1) \]

where

$f_{hj}$ = the sample fraction of ECD for domain $j$ in PSU stratum $h$; $f_{hj} = n_{hj} / N_{hj}$
$N_{hij}$ = the total number of ECD for domain $j$ in institution $i$ within PSU stratum $h$
$n_{hj}$ = the sample size of ECD allocated for domain $j$ in PSU stratum $h$
$N_{hj}$ = the total number of ECD for domain $j$ in PSU stratum $h$; $N_{hj} = \sum_{i=1}^{I_h} N_{hij}$.

Note that the composite measure of size $S_{hi}$ is a summation of the measures of size across the $J$ domains; that is,
$S_{hi} = \sum_{j=1}^{J} S_{hij}$, where

\[ S_{hij} = \frac{n_{hj} N_{hij}}{N_{hj}}. \qquad (2) \]

In addition, the sum of the composite measures of size across all institutions in the GSS frame constitutes the
total sample size of ECD in the first four strata (GSS strata), which is n = 16,748:

\[ \sum_{h=1}^{4} \sum_{i=1}^{I_h} S_{hi} = \sum_{h=1}^{4} \sum_{i=1}^{I_h} \sum_{j=1}^{J} n_{hj} \frac{N_{hij}}{N_{hj}} = \sum_{h=1}^{4} \sum_{j=1}^{J} n_{hj} \frac{\sum_{i=1}^{I_h} N_{hij}}{N_{hj}} = \sum_{h=1}^{4} \sum_{j=1}^{J} n_{hj} = \sum_{h=1}^{4} n_h = n. \qquad (3) \]

Similarly, the sums for the last two strata (FFRDC and NIH IRP) are 844 and 410, respectively.
The sample size of ECD allocated for each domain, $n_{hj}$, needs to be determined prior to sample selection
(done in the previous section), and the domain size $N_{hij}$ needs to be available.
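The composite measure of size in (1) and the summation property in (3) can be illustrated with a small calculation for a single stratum. The sketch below uses a hypothetical two-institution, two-domain frame; the institution names, domain labels, and counts are invented for illustration.

```python
def composite_sizes(domain_counts, domain_sample_sizes):
    """Composite size S_hi = sum_j f_hj * N_hij within one stratum.

    domain_counts: {institution: {domain: N_hij}}
    domain_sample_sizes: {domain: n_hj}
    Assumes every domain has a positive stratum total N_hj.
    """
    stratum_totals = {
        j: sum(counts.get(j, 0) for counts in domain_counts.values())
        for j in domain_sample_sizes
    }
    # Domain sampling fractions f_hj = n_hj / N_hj.
    fractions = {j: domain_sample_sizes[j] / stratum_totals[j] for j in stratum_totals}
    return {
        inst: sum(fractions[j] * counts.get(j, 0) for j in fractions)
        for inst, counts in domain_counts.items()
    }


# Hypothetical frame counts N_hij and domain allocations n_hj.
counts = {
    "U1": {"postdoc": 100, "non_postdoc": 300},
    "U2": {"postdoc": 50, "non_postdoc": 50},
}
allocation = {"postdoc": 30, "non_postdoc": 35}

sizes = composite_sizes(counts, allocation)
print(sizes, sum(sizes.values()))
```

As in (3), the composite sizes sum to the stratum sample size (here 30 + 35 = 65), so each institution's size measure can be read as the number of ECD it would contribute under the domain sampling fractions.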
Since all programs in NIH IRP will be selected, we do not need to calculate the selection probabilities as
will be done for the other strata. Given the composite measure of size $S_{hi}$ above, the selection
probability for each institution in the first five PSU strata can be determined as follows:

\[ \pi_{hi} = m_h \frac{S_{hi}}{\sum_{i=1}^{I_h} S_{hi}}, \qquad (4) \]

where

$m_h$ = the sample size of institutions (PSUs) allocated for stratum $h$.

For large institutions, the value of the selection probability above may be greater than 1. Such institutions
will be selected with certainty. We will identify the institutions selected with certainty in strata 1-5
iteratively. That is,
(a) In the first round of iteration, calculate the selection probabilities as in formula (4).
(b) Identify certainty institutions based on the selection probabilities calculated in (a), and set aside these
certainty institutions from the frame. Let $m_h^{C1}$ and $m_h^{NC1}$ denote, respectively, the sample
sizes of certainty institutions and non-certainty institutions identified at the first round of iteration,
where $m_h = m_h^{C1} + m_h^{NC1}$. (Note: the superscript C1 indicates certainty in the first round and NC1
indicates non-certainty in the first round.)
(c) After dropping the certainty institutions from the frame, recalculate the selection probability for
the non-certainty institutions:

\[ \pi_{hi} = (m_h - m_h^{C1}) \times \frac{S_{hi}}{\sum_{i=1}^{(I_h - m_h^{C1})} S_{hi}} \qquad (5) \]

(d) Continue with the second round of iteration; that is, identify the new certainty institutions $m_h^{C2}$ based
on the selection probability in (5), and recalculate the selection probability under the new sample
size.
(e) Repeat the process of calculating the selection probability and identifying the certainty institutions
until there are no more certainty institutions identified in the frame.
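Steps (a) through (e) can be sketched for a single stratum as below. This is an illustrative sketch only: the function name and the composite sizes are hypothetical, and it treats a probability of exactly 1 as a certainty selection as well, which is an assumption beyond the "greater than 1" wording above.

```python
def pps_probabilities_with_certainties(sizes, m):
    """Iteratively identify certainty units and compute PPS probabilities.

    sizes: {institution: composite size S_hi} for one stratum
    m: number of institutions to select in the stratum (m_h)
    Returns (probabilities, certainty_set).
    """
    certainty = set()
    while True:
        remaining = {i: s for i, s in sizes.items() if i not in certainty}
        m_noncertainty = m - len(certainty)
        total = sum(remaining.values())
        # Formulas (4)/(5): remaining sample size times the share of remaining size.
        probs = {i: m_noncertainty * s / total for i, s in remaining.items()}
        new_certainty = {i for i, p in probs.items() if p >= 1.0}
        if not new_certainty:
            break
        certainty |= new_certainty
    result = {i: 1.0 for i in certainty}
    result.update(probs)
    return result, certainty


# Hypothetical composite sizes; two rounds are needed before the set stabilizes.
sizes = {"A": 60.0, "B": 30.0, "C": 5.0, "D": 5.0}
probs, certain = pps_probabilities_with_certainties(sizes, m=3)
print(probs, sorted(certain))
```

In this example institution A exceeds 1 in the first round, B exceeds 1 only after A is set aside, and the remaining probabilities sum to the one non-certainty selection left in the stratum.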
Suppose $m_h^{C} = m_h^{C1} + m_h^{C2} + \cdots$ and $m_h^{NC}$ denote, respectively, the final sample sizes of certainty
institutions and non-certainty institutions, where $m_h = m_h^{C} + m_h^{NC}$. Among the remaining non-certainty
institutions in the frame, we draw a sample of $m_h^{NC}$ institutions in each stratum. At
the end of this process, the probability of selection is determined as:

Certainty U.S. academic institutions in stratum $h$: $\pi_{hi} = 1$

Certainty FFRDC: $\pi_{5i} = 1$

Certainty NIH IRP: $\pi_{6i} = 1$

Non-certainty U.S. academic institutions in stratum $h$:
\[ \pi_{hi} = m_h^{NC} \times \frac{S_{hi}}{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}} \]

Non-certainty FFRDC:
\[ \pi_{5i} = m_5^{NC} \times \frac{S_{5i}}{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}} \qquad (6) \]

C. Second-Stage Sampling: Selection of ECD
1. Sample Allocation
In this second-stage sample selection, we will select a total of 16,748 ECD from the U.S. academic
institutions, 844 ECD from the FFRDC, and 410 ECD from the NIH IRPs. The sample allocation for each
domain $n_j$ has been determined earlier (Table 2). Now, the goal in this stage is, first, to allocate $n_j$ to each
sampled institution so that this allocation will result in a self-weighting sample within domain. That is, at
the end of the sampling process, the unconditional selection probability of ECD is the same across ECD
within domain. Second, we will determine a sampling method for selecting ECD within sampled
institutions.
The following sample size allocation is exercised for the initial calculation:
• Initial sample size within institution:
To achieve a self-weighting sample within domain, the sample size in the certainty institutions
should be allocated proportionally based on the composite measure of size, while the sample size
in the non-certainty institutions should be allocated equally across non-certainty institutions as
follows:

Certainty U.S. academic institutions:
\[ n_{hi} = n_h \frac{S_{hi}}{\sum_{i=1}^{I_h} S_{hi}} = S_{hi} \]

Certainty FFRDC:
\[ n_{5i} = n_5 \frac{S_{5i}}{\sum_{i=1}^{I_5} S_{5i}} = S_{5i} \]

Certainty NIH IRP:
\[ n_{6i} = n_6 \frac{S_{6i}}{\sum_{i=1}^{I_6} S_{6i}} = S_{6i} \]

Non-certainty U.S. academic institutions:
\[ n_{hi} = \frac{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}}{m_h^{NC}} \]

Non-certainty FFRDC:
\[ n_{5i} = \frac{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}}{m_5^{NC}} \qquad (7) \]

• Sample size within institution and domain:
The allocation of sample size within institution to each domain (within institution) is:

\[ n_{hij} = n_{hi} \frac{S_{hij}}{S_{hi}}. \qquad (8) \]

The following expressions are obtained by substituting $n_{hi}$ and $S_{hij}$ in (8) with those in (7) and (2),
respectively:

Certainty U.S. academic institutions:
\[ n_{hij} = S_{hi} \frac{S_{hij}}{S_{hi}} = S_{hij} = \frac{n_{hj} N_{hij}}{N_{hj}} = N_{hij} \times f_{hj} \]

Certainty FFRDC:
\[ n_{5ij} = S_{5i} \frac{S_{5ij}}{S_{5i}} = S_{5ij} = n_{5j} N_{5ij} / N_{5j} = N_{5ij} \times f_{5j} \]

Certainty NIH IRP:
\[ n_{6ij} = S_{6i} \frac{S_{6ij}}{S_{6i}} = S_{6ij} = n_{6j} N_{6ij} / N_{6j} = N_{6ij} \times f_{6j} \]

Non-certainty U.S. academic institutions:
\[ n_{hij} = \frac{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}}{m_h^{NC}} \times \frac{S_{hij}}{S_{hi}} = \frac{1}{m_h^{NC}} \times \frac{n_{hj} N_{hij}}{N_{hj}} \times \frac{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}}{S_{hi}} = \frac{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}}{m_h^{NC}} \times \frac{N_{hij}}{S_{hi}} \times f_{hj} \]

Non-certainty FFRDC:
\[ n_{5ij} = \frac{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}}{m_5^{NC}} \times \frac{S_{5ij}}{S_{5i}} = \frac{1}{m_5^{NC}} \times \frac{n_{5j} N_{5ij}}{N_{5j}} \times \frac{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}}{S_{5i}} = \frac{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}}{m_5^{NC}} \times \frac{N_{5ij}}{S_{5i}} \times f_{5j} \]

To see whether the above sample allocations produce a self-weighting sample, we can calculate the
unconditional selection probability of ECD. The unconditional probability of ECD selection is the
product of the institution selection probability and the conditional ECD selection probability within
institution, where the conditional probability in the second stage is calculated as:

\[ \pi_{hjk|i} = \frac{n_{hij}}{N_{hij}}. \qquad (9) \]

Therefore, the unconditional selection probability of ECD $k$ in domain $j$ in institution $i$ and stratum $h$ can
be calculated as follows:

\[ \pi_{hijk} = \pi_{hi} \times \pi_{hjk|i} \qquad (10) \]

The following expressions are obtained by substituting $\pi_{hi}$, $\pi_{hjk|i}$, and $n_{hij}$ in (10) with those in (6), (9),
and (8), respectively:

Certainty U.S. academic institutions:
\[ \pi_{hijk} = \pi_{hi} \times \pi_{hjk|i} = 1 \times \frac{1}{N_{hij}} N_{hij} f_{hj} = f_{hj} \]

Certainty FFRDC:
\[ \pi_{5ijk} = \pi_{5i} \times \pi_{5jk|i} = 1 \times \frac{1}{N_{5ij}} N_{5ij} f_{5j} = f_{5j} \]

Certainty NIH IRP:
\[ \pi_{6ijk} = \pi_{6i} \times \pi_{6jk|i} = 1 \times \frac{1}{N_{6ij}} N_{6ij} f_{6j} = f_{6j} \]

Non-certainty U.S. academic institutions:
\[ \pi_{hijk} = \pi_{hi} \times \pi_{hjk|i} = (m_h - m_h^{C}) \frac{S_{hi}}{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}} \times \frac{1}{N_{hij}} \times \frac{\sum_{i=1}^{(I_h - m_h^{C})} S_{hi}}{m_h^{NC}} \times \frac{N_{hij}}{S_{hi}} \times f_{hj} = f_{hj} \]

Non-certainty FFRDC:
\[ \pi_{5ijk} = \pi_{5i} \times \pi_{5jk|i} = (m_5 - m_5^{C}) \frac{S_{5i}}{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}} \times \frac{1}{N_{5ij}} \times \frac{\sum_{i=1}^{(I_5 - m_5^{C})} S_{5i}}{m_5^{NC}} \times \frac{N_{5ij}}{S_{5i}} \times f_{5j} = f_{5j} \]

We can see that within domain $j$, the allocation in (8) results in equal selection probability within the
stratum but not across the strata. This is because the institutional strata are also analytic domains and
higher sampling rates are needed in some of the strata in order to satisfy the precision requirements.
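The cancellation behind this result can be checked numerically. The sketch below follows the allocation described in this section for a single institution and domain: the institution selection probability, the institution sample size, the domain share of that sample, and the conditional rate within the institution. All input values are hypothetical, and the function name is illustrative.

```python
def unconditional_rate(n_domain_in_inst, f_domain, s_inst, sum_s_noncertainty,
                       m_noncertainty, certainty):
    """Unconditional ECD selection probability for one institution and domain.

    n_domain_in_inst: N_hij, ECD in domain j at institution i
    f_domain: f_hj, domain sampling fraction in the stratum
    s_inst: S_hi, composite size of the institution
    sum_s_noncertainty: sum of S_hi over non-certainty institutions in the frame
    m_noncertainty: number of non-certainty institutions to select
    """
    s_domain = f_domain * n_domain_in_inst              # S_hij
    if certainty:
        pi_inst = 1.0                                   # selected with certainty
        n_inst = s_inst                                 # proportional allocation
    else:
        pi_inst = m_noncertainty * s_inst / sum_s_noncertainty
        n_inst = sum_s_noncertainty / m_noncertainty    # equal allocation
    n_domain_sample = n_inst * s_domain / s_inst        # n_hij
    conditional = n_domain_sample / n_domain_in_inst    # second-stage rate
    return pi_inst * conditional


# Hypothetical values: the result should equal f_hj = 0.10 in both cases.
rate_nc = unconditional_rate(200, 0.10, 35.0, 400.0, 5, certainty=False)
rate_c = unconditional_rate(200, 0.10, 35.0, 400.0, 5, certainty=True)
print(rate_nc, rate_c)
```

Both cases return the domain sampling fraction regardless of the institution's size, which is the self-weighting property within a stratum.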
The allocation in (8) can be adjusted to result in equal selection probability across strata as follows.
(In (11) and (12), the subscripts 1, 2, and 3 label the U.S. academic, FFRDC, and NIH IRP strata, respectively.)

Certainty U.S. academic institutions: $n_{1ij} = N_{1ij} \times f_j$

Certainty FFRDC: $n_{2ij} = N_{2ij} \times f_j$

Certainty NIH IRP: $n_{3ij} = N_{3ij} \times f_j$

Non-certainty U.S. academic institutions:
\[ n_{1ij} = \frac{\sum_{i=1}^{(I_1 - m_1^{C})} S_{1i}}{m_1^{NC}} \times \frac{N_{1ij}}{S_{1i}} \times f_j \]

Non-certainty FFRDC:
\[ n_{2ij} = \frac{\sum_{i=1}^{(I_2 - m_2^{C})} S_{2i}}{m_2^{NC}} \times \frac{N_{2ij}}{S_{2i}} \times f_j \qquad (11) \]

where $f_j$ is an overall sample fraction for domain $j$ calculated across all strata. (Note that the sample
allocation (11) may produce non-integer sample sizes. We will come back to this issue later.)
Now, if we substitute the sample allocation in (11) into (10), the resulting unconditional ECD selection
probabilities within domain are all equal to $f_j$, as shown below:

Certainty U.S. academic institutions:
\[ \pi_{1ijk} = \pi_{1i} \times \pi_{1jk|i} = 1 \times \frac{1}{N_{1ij}} N_{1ij} f_j = f_j \]

Certainty FFRDC:
\[ \pi_{2ijk} = \pi_{2i} \times \pi_{2jk|i} = 1 \times \frac{1}{N_{2ij}} N_{2ij} f_j = f_j \]

Certainty NIH IRP:
\[ \pi_{3ijk} = \pi_{3i} \times \pi_{3jk|i} = 1 \times \frac{1}{N_{3ij}} N_{3ij} f_j = f_j \]

Non-certainty U.S. academic institutions:
\[ \pi_{1ijk} = \pi_{1i} \times \pi_{1jk|i} = (240 - m_1^{C}) \frac{S_{1i}}{\sum_{i=1}^{(I_1 - m_1^{C})} S_{1i}} \times \frac{1}{N_{1ij}} \times \frac{\sum_{i=1}^{(I_1 - m_1^{C})} S_{1i}}{m_1^{NC}} \times \frac{N_{1ij}}{S_{1i}} \times f_j = f_j \]

Non-certainty FFRDC:
\[ \pi_{2ijk} = \pi_{2i} \times \pi_{2jk|i} = (40 - m_2^{C}) \frac{S_{2i}}{\sum_{i=1}^{(I_2 - m_2^{C})} S_{2i}} \times \frac{1}{N_{2ij}} \times \frac{\sum_{i=1}^{(I_2 - m_2^{C})} S_{2i}}{m_2^{NC}} \times \frac{N_{2ij}}{S_{2i}} \times f_j = f_j \qquad (12) \]

2. Sample Selection

Equation (11) gives the sample allocation $n_{hij}$ for selecting the ECD within sampled institutions;
however, these numbers are generally not integers. One could round them to integers and use the rounded
values as the sample sizes. However, the original sampling rate

$$\pi_{hj|i} = \frac{n_{hij}}{N_{hij}}, \qquad (13)$$

where the numerator is the unrounded institution-level domain sample size, would then not be retained.
To overcome this, we can implement PPS sampling with the sampling rate (13) used as the measure of
size. If these sampling rates are used as the measure of size in PPS sampling when selecting ECD, the
selection will result in random rounding that yields integer sample sizes. PPS sequential sampling,
with the frame sorted by PSU stratum, institution, and domain variables, can be used to select the
18,000 ECD.
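
The random-rounding idea can be illustrated with a short Python sketch. It is not the PPS sequential procedure named above; as a simpler stand-in it uses ordinary systematic sampling with unequal probabilities, which shares the property that matters here: each listed ECD keeps its unrounded rate as its selection probability, while the realized sample size is always the sum of the rates rounded down or up. The function name, the example rates, and the seed are illustrative assumptions, not part of the ECDS specification.

```python
import math
import random


def systematic_random_rounding(rates, rng):
    """Select list positions with inclusion probabilities `rates` (each in [0, 1]).

    Illustrative systematic selection: selection points sit at start, start + 1,
    start + 2, ... on the cumulative-rate scale, and a unit is taken when a point
    falls inside its interval of that scale.
    """
    start = rng.random()  # uniform random start in [0, 1)
    selected = []
    cumulative = 0.0
    for index, rate in enumerate(rates):
        lower = cumulative
        cumulative += rate
        # A selection point lies in (lower, cumulative] exactly when the floors differ.
        if math.floor(cumulative - start) > math.floor(lower - start):
            selected.append(index)
    return selected


# Hypothetical domain: 10 ECD on one institution list, each with rate 0.25,
# so the unrounded sample size is 2.5 and the draw yields 2 or 3 people.
example_rates = [0.25] * 10
example_sample = systematic_random_rounding(example_rates, random.Random(13))
```

Sorting the frame by stratum, institution, and domain before the pass, as the text describes, keeps the rounding error small within each of those groups rather than only in the overall total.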

D. Treatment of Missing Variables for Defining Domains
Selection of sampled ECD in the second stage of sampling will use stratification based on the following
variables:
• Postdoc Status (2 levels): Postdoc, Non-Postdoc
• Sex (2 levels): Male, Female
• Citizenship-Race-Ethnicity (4 levels): Non-U.S. citizen, U.S. citizen–White, U.S. citizen–Asian,
  U.S. citizen–Other

We will request that this information be included on the ECD lists from the institutions sampled in the
first stage of sampling. We expect to get complete information on postdoc status from most institutions;
however, some institutions may not provide it. Sex and Citizenship-Race-Ethnicity will be missing
entirely from some lists and for a subset of individuals on other lists. This section describes procedures
for imputing missing values prior to the selection of the second-stage sample members.
1. Imputation for Postdoc Status
We anticipate that almost all institutions will be able to provide the postdoc status for the individuals on
their ECD list (or tell us which job titles represent postdoc positions). In the pilot ECDS, only one
FFRDC did not provide this information. We will use the job titles and pilot ECDS responses to impute
postdoc status where missing in the frame.
2. Imputation for Sex
We anticipate that most institutions will be able to provide sex information for most individuals on their
ECD list, but as many as 10% of list members may be missing this information. Any missing sex data will
be imputed using several external databases. First, we will attempt to match the list member to the Survey
of Earned Doctorates (SED) and, if we are able to link to these data sets, use the sex of the individual
from the SED to fill in the missing ECD frame data. The linking process will be explained in a separate
section later.
For all remaining cases where we have a name, we propose to use the database of names by sex
maintained by the Social Security Administration (SSA) to impute the missing sex data. These databases
provide a list of first names for individuals born in the U.S. in a given year, along with the count by sex.
The databases include all name/sex combinations that occurred at least 5 times in a given year. A
description of the database is at http://www.ssa.gov/oact/babynames/limits.html, and the national-level
data is at http://www.ssa.gov/oact/babynames/names.zip. We would pool the names and counts to arrive
at percentages that are male and female for each name. We would start by using first name. If the
percentage of times a name is a given sex is very high (for example, greater than 90 percent), then any
individuals with missing sex and that first name will be assigned to that sex.
Next, the middle name will be examined for any ECD that still do not have sex assigned, and any ECD
whose middle name is in the list with a high percentage being one sex (for example, greater than 90
percent) will be assigned to that sex.
After this step, we will randomly assign sex to any remaining cases using the database of names. For each
ECD with missing sex whose first name appears on the list, we would generate a uniform random number,
compare this random number to the distribution, and impute the sex. For example, if 40 percent of a given
name is Male and 60 percent is Female, and the generated random number is 0.40 or less, then the sex
would be imputed as “Male,” and random numbers greater than 0.40 would impute the sex to “Female.”
Any cases with names that are not in the SSA Names by Sex database will be examined and assigned
manually. Some of these may be foreign names with entries in similar name-by-sex databases focusing on
names from other countries.
For cases without names, we will randomly assign sex based on the distribution of individuals by sex
within the institution. As with the name-based imputation, if 65 percent of the predicted number of ECD
at the focal institution were men based on the combined GSS and IPEDS data, and the generated random
number is 0.65 or less, then the sex would be imputed as “Male,” and random numbers greater than 0.65
would impute the sex to “Female.”
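
The name-based steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the lookup table `male_share_by_name` stands in for percentages pooled from the SSA files, its values are made up, and the function names are hypothetical. Names absent from the table return `None`, corresponding to the cases the text sends to manual review.

```python
import random


def confident_sex(name, male_share_by_name, threshold):
    """Return a sex only when the pooled share for `name` exceeds `threshold`."""
    if not name:
        return None
    share_male = male_share_by_name.get(name.lower())
    if share_male is None:
        return None
    if share_male > threshold:
        return "Male"
    if 1.0 - share_male > threshold:
        return "Female"
    return None


def impute_sex(first_name, middle_name, male_share_by_name, rng, threshold=0.90):
    """Apply the first-name rule, then the middle-name rule, then a random draw."""
    for name in (first_name, middle_name):
        sex = confident_sex(name, male_share_by_name, threshold)
        if sex is not None:
            return sex
    share_male = male_share_by_name.get(first_name.lower()) if first_name else None
    if share_male is None:
        return None  # not in the name table: left for manual review
    # Random assignment: draws at or below the male share impute "Male".
    return "Male" if rng.random() <= share_male else "Female"


# Made-up shares for illustration only (not actual SSA figures).
example_shares = {"james": 0.99, "mary": 0.005, "jordan": 0.40}
example_rng = random.Random(5)
example_result = impute_sex("Jordan", None, example_shares, example_rng)
```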
3. Imputation for Race/Ethnicity
Although most institutions track race/ethnicity, some institutions may not be willing or able to provide
it for many individuals on their ECD lists. When race/ethnicity is missing, we suggest using a
combination of logical editing and imputation to fill in the missing values. As with sex, we will attempt to
match the case to the SED and fill in missing race/ethnicity for the ECDS frames if an individual match
can be found.

For the remaining cases that are missing race/ethnicity but include a last name, we will use the U.S. Census
database of surnames (http://www2.census.gov/topics/genealogy/2000surnames/names.zip) that gives, for
each surname, the percentage of people with that surname who are white, black, Asian/Pacific Islander,
American Indian, or Hispanic. A description of the database is located here:
http://www.census.gov/topics/population/genealogy/data/2000_surnames.html. In reviewing this
database, we see that some surnames fall almost exclusively into only one of the race or ethnic groups.
We would extract names that are highly likely to be of one particular race or ethnicity (for example, more
than 80 percent Asian/Pacific Islander), and assign any ECD with missing race/ethnicity and that last name
to that race/ethnicity.
All surnames with missing race/ethnicity that are not in the Census database will be manually reviewed in
conjunction with the first and middle names to see if a logical assignment can be made (e.g., Hispanic or
Asian/Pacific Islander). Finally, a random assignment using the database of surnames would be used to
fill in any missing data that remain among the cases with surnames. For a given name, we will use the
percentages provided in the Census database for that name to randomly assign the case to a
race/ethnicity. For example, if the percentages for a particular name were as follows:
Race/ethnicity                      Percentage   Cumulative Percentage
White                                    73.35                   73.35
Black or African American                22.22                   95.57
Asian or Pacific Islander                 0.40                   95.97
American Indian or Alaskan Native         0.85                   96.82
Two or more races                         1.63                   98.45
Hispanic                                  1.55                  100.00

Then a case with missing race/ethnicity and a random number of 0.8 will be imputed as Black or African
American (since 0.8 = 80 percent falls between 73.35 and 95.57 percent).
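
The cumulative-percentage lookup in this example can be written as a few lines of Python. The percentages are the ones from the example table; the function name and the use of `bisect` are illustrative choices, not a description of the production procedure.

```python
import bisect
import itertools

# Percentages from the example table for one surname.
EXAMPLE_PERCENTAGES = [
    ("White", 73.35),
    ("Black or African American", 22.22),
    ("Asian or Pacific Islander", 0.40),
    ("American Indian or Alaskan Native", 0.85),
    ("Two or more races", 1.63),
    ("Hispanic", 1.55),
]


def impute_race_ethnicity(draw, percentages=EXAMPLE_PERCENTAGES):
    """Map a uniform draw in [0, 1) to the first category whose cumulative
    percentage reaches the draw."""
    labels = [label for label, _ in percentages]
    cumulative = list(itertools.accumulate(pct for _, pct in percentages))
    index = bisect.bisect_left(cumulative, draw * 100.0)
    # Guard against floating-point shortfall in the final cumulative total.
    return labels[min(index, len(labels) - 1)]


example_category = impute_race_ethnicity(0.8)  # the draw used in the text
```

With the draw of 0.8 from the text, the lookup lands in the second interval (73.35 to 95.57), matching the example.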

4. Imputation for Citizenship
Based on the results of the pilot survey, not all institutions will provide an indicator of whether an
individual is a U.S. citizen. The SED collects data on the citizenship status of doctorates in two
variables: (1) citizenship at birth, and (2) citizenship at doctoral graduation. Although citizenship
status could change between the time of graduation and the time of the survey, these data can be used to
impute missing citizenship status in the lists from institutions (footnote 5). For any ECD with missing
citizenship status who can be linked to the SED, we will use the two citizenship variables for imputation
as follows:

SED citizenship at birth   SED citizenship at graduation   Imputed citizenship status
U.S. citizen               U.S. citizen                    U.S. citizen
Non-U.S. citizen           U.S. citizen                    U.S. citizen
U.S. citizen               Non-U.S. citizen                Non-U.S. citizen
Non-U.S. citizen           Non-U.S. citizen                By number of years since graduation:
                                                             < 5 years: Non-U.S. citizen
                                                             ≥ 5 years: random imputation (below)

Footnote 5: Citizenship status will be collected during the ECDS survey, so we can assess the magnitude
of misclassification error that may occur during the sampling.

For cases where citizenship at birth and at graduation are both non-U.S. citizen, and the number of years
since graduation is greater than or equal to 5, random imputation will be based on the number of years
since graduation, on the assumption that the more years have passed, the more likely citizenship status
is to have changed. For example, we would assign cases 5, 6, or 7 years past graduation a probability of
0.4 of being a U.S. citizen, and cases 8, 9, or 10 years past graduation a probability of 0.6.
For cases that cannot be linked with the SED, we will use any indication of non-U.S. origin of the doctoral
degree provided on the frame to impute missing citizenship status as Non-U.S. citizen. Then we will use
the name-race/ethnicity database and impute missing citizenship status as U.S. citizen when the race is
White with a high percentage (greater than or equal to 90 percent). After that, any remaining cases with
missing citizenship status will be reviewed manually with the help of information available from the list.
For cases missing name and citizenship-race-ethnicity, citizenship-race-ethnicity will be imputed
randomly using the institution-level percentages derived from the GSS and IPEDS data when developing
the composite ECD size measures for each institution.
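
For cases that link to the SED, the decision rules above can be sketched in Python. The probabilities 0.4 and 0.6 are the example values from the text. The treatment of cases more than 10 years past graduation is not specified in the text, so the sketch simply reuses 0.6 for them; that choice, like the function name, is an assumption made for illustration.

```python
def impute_citizenship(us_at_birth, us_at_graduation, years_since_graduation, draw):
    """Impute citizenship for an SED-linked case.

    `draw` is a uniform random number in [0, 1) supplied by the caller.
    """
    if us_at_graduation:
        return "U.S. citizen"  # rows 1 and 2 of the table
    if us_at_birth:
        return "Non-U.S. citizen"  # row 3 of the table
    if years_since_graduation < 5:
        return "Non-U.S. citizen"
    # Example probabilities from the text; beyond 10 years is an assumption.
    probability_us = 0.4 if years_since_graduation <= 7 else 0.6
    return "U.S. citizen" if draw <= probability_us else "Non-U.S. citizen"


example_status = impute_citizenship(False, False, 6, 0.25)
```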
5. Linking Institution’s List of ECD with the SED Data
For ECD with earned doctorate degrees from U.S. institutions, cases with missing sex, race/ethnicity, or
citizenship status in ECDS lists will be linked to SED based on several key variables such as academic
institution of doctorate, doctorate degree year, last name, first name, birth year (if available), and
sampling variables. Combinations of these key variables will be used to maximize the linking rates. For
example, to obtain sex from the SED for cases with missing sex in the institution list, we will first link
the SED and ECD list using the largest set of variables available, for example:
- academic institution of doctorate, degree year, last name, first name, race/ethnicity, and birth year.

Remaining unlinked cases will be linked sequentially using progressively fewer key variables, as follows:
- academic institution of doctorate, degree year, last name, race/ethnicity, birth year,
- academic institution of doctorate, degree year, last name, first name, race/ethnicity,
- academic institution of doctorate, degree year, last name, race/ethnicity,
- academic institution of doctorate, degree year, last name, first name,
- academic institution of doctorate, degree year, last name,
- academic institution of doctorate, last name, first name, race/ethnicity,
- academic institution of doctorate, last name, race/ethnicity,
- academic institution of doctorate, last name, first name.

Similarly, for linking SED and the institution’s list to obtain race/ethnicity, we can use combinations of
sex, academic institution of doctorate, degree year, last name, first name, and birth year as key variables
for linking. To obtain citizenship status, we can use combinations of sex, race/ethnicity, academic
institution of doctorate, degree year, last name, first name, and birth year as key variables for linking.
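
The sequential linking for sex can be sketched as a loop over key sets, from most to least specific. The key-set order follows the list above. The field names, the record layout (plain dictionaries), and the rule of accepting a link only when exactly one SED record matches are assumptions made for this illustration.

```python
# Key sets in the order listed in the text, most specific first.
KEY_SETS_FOR_SEX = [
    ("institution", "degree_year", "last_name", "first_name", "race_ethnicity", "birth_year"),
    ("institution", "degree_year", "last_name", "race_ethnicity", "birth_year"),
    ("institution", "degree_year", "last_name", "first_name", "race_ethnicity"),
    ("institution", "degree_year", "last_name", "race_ethnicity"),
    ("institution", "degree_year", "last_name", "first_name"),
    ("institution", "degree_year", "last_name"),
    ("institution", "last_name", "first_name", "race_ethnicity"),
    ("institution", "last_name", "race_ethnicity"),
    ("institution", "last_name", "first_name"),
]


def link_to_sed(ecd_record, sed_records, key_sets=KEY_SETS_FOR_SEX):
    """Return (matching SED record, keys used), or (None, None) if no link is found."""
    for keys in key_sets:
        # Skip key sets that need a variable missing from the ECD record.
        if any(ecd_record.get(key) is None for key in keys):
            continue
        matches = [sed for sed in sed_records
                   if all(sed.get(key) == ecd_record[key] for key in keys)]
        if len(matches) == 1:  # assumption: only unique matches are accepted
            return matches[0], keys
    return None, None
```

Requiring a unique match is a conservative choice for a sketch; a production linkage would also need rules for name standardization and for resolving ties.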
6. Evaluation of the Imputation for Frame Data
The level of missing data in the frame variables is not known at this time for all of the variables because
they were not requested on the institution lists in the pilot survey. It will be important to evaluate the
imputation procedures and improve on them if possible. We suggest the following tabulations and analyses
of the missing frame data to evaluate the success of the imputation:
• tabulate counts and rates of missing data as each institution’s frame is received,
• tabulate counts of matches to the SED and Census databases,
• compare demographic distributions (including the imputed data) to distributions from IPEDS and
  the GSS,
• compare the data from the two sources using statistics such as Cohen’s kappa or the intraclass
  correlation when the variables (race, sex, etc.) are provided on the institution frames and the ECD
  also matches to the SED or Census database (i.e., the case matches the SED, or the name matches a
  Census database, for the variables being imputed). This should give an idea of how well the procedure
  works when we do not have frame data. When we have frame data for groups that may be difficult to
  impute, such as potential foreign doctorates, we can implement the imputation procedures for
  cases where the information is known as well as unknown to get an early look at how the
  procedures are working. That is, we would follow the same imputation procedures when we do have
  frame data; we would use the imputed values for evaluation of the procedures (but for sampling
  we would use the actual frame data),
• compare frame, imputed, and survey-collected data for the variables of interest after data
  collection is complete.
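
As an illustration of the agreement statistic mentioned above, the following Python sketch computes Cohen’s kappa for paired frame and imputed values. The function name and the toy data are assumptions; returning 1.0 when chance agreement is already perfect (where kappa is formally undefined) is a convention chosen for the sketch.

```python
from collections import Counter


def cohens_kappa(frame_values, imputed_values):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(frame_values) != len(imputed_values) or not frame_values:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(frame_values)
    observed = sum(a == b for a, b in zip(frame_values, imputed_values)) / n
    frame_counts = Counter(frame_values)
    imputed_counts = Counter(imputed_values)
    # Chance agreement from the product of the two marginal distributions.
    expected = sum(frame_counts[c] * imputed_counts[c] for c in frame_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for the degenerate single-category case
    return (observed - expected) / (1.0 - expected)


# Toy data: sex reported on the frame versus sex imputed for the same four cases.
frame_sex = ["Male", "Male", "Female", "Female"]
imputed_sex = ["Male", "Male", "Female", "Male"]
example_kappa = cohens_kappa(frame_sex, imputed_sex)
```

In the toy data, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.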

E. Inflating Sample Size to Account for Survey Nonresponse
The sample sizes given in the previous sections are the numbers of target completes; that is, the expected
numbers of eligible survey respondents. During fielding of the survey, however, we expect to have
nonrespondents, and not all of the sampled individuals will be eligible. Therefore, the respondent sample
sizes $n_{hij}$ need to be inflated by dividing by the expected response and eligibility rates. That is, the
initial sample size $n_{hij}^{init}$ for stratum $h$, institution $i$, and domain $j$ is
$n_{hij}^{init} = n_{hij} / R_{hij}$, where $R_{hij}$ is the overall expected response and eligibility rate in
stratum $h$, institution $i$, and domain $j$. Table 5 presents the expected nonresponse and ineligibility
rates for both stages 1 and 2 of sampling.
Table 5. Assumptions: Nonparticipation, Nonresponse, and Ineligibility

                                            Stage 1 Sampling    Stage 2 Sampling
Sampling stratum                            % Nonparticipating  % Ineligible  % Nonresponding
GSS
  Medical schools and centers                             15.0           3.0             25.0
  Very high research activity universities                10.0           2.0             17.5
  High research activity universities                     15.0           1.0             17.5
  All other colleges and universities                     20.0           1.0             20.0
FFRDC                                                     10.0           1.0             12.5
NIH IRP                                                    0.0           1.0             30.0

It is expected that some sampled institutions will not respond to our request to provide a list of ECD (we
call this stage-1 institution nonresponse). When the institution response rate is known or can be
estimated, the number of institutions to be sampled can be calculated as the target completes (i.e., the
number of institutions providing lists) divided by the institution response rate. For example, the initial
institutional sample size for GSS medical schools and centers is 63 institutions because the target
completes (institutions providing lists) from this stratum is 53 institutions and the estimated
institutional response rate is 85 percent (53 / 0.85 ≈ 62.4, rounded up to 63).
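
The inflation arithmetic can be checked with a short Python sketch using the Table 5 rates. Two points are inferences rather than statements from the text: the ECD response and eligibility rates are multiplied together, and results are rounded up. Both are adopted here because they reproduce the institution example above (63) and the initial ECD sample sizes shown in Table 6. The names are illustrative.

```python
import math

# (% nonparticipating at stage 1, % ineligible, % nonresponding at stage 2), from Table 5.
TABLE_5_RATES = {
    "GSS medical schools and centers": (15.0, 3.0, 25.0),
    "GSS very high research activity universities": (10.0, 2.0, 17.5),
    "GSS high research activity universities": (15.0, 1.0, 17.5),
    "GSS all other colleges and universities": (20.0, 1.0, 20.0),
    "FFRDC": (10.0, 1.0, 12.5),
    "NIH IRP": (0.0, 1.0, 30.0),
}


def inflate(target_completes, *loss_percentages):
    """Divide target completes by the product of retention rates and round up."""
    retained = 1.0
    for percent in loss_percentages:
        retained *= 1.0 - percent / 100.0
    return math.ceil(target_completes / retained)


nonparticipating, ineligible, nonresponding = TABLE_5_RATES["GSS medical schools and centers"]
initial_institutions = inflate(53, nonparticipating)       # stage 1
initial_ecd = inflate(3736, ineligible, nonresponding)     # stage 2
```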
The actual response rate during fielding may be lower or higher than estimated. When the actual response
rate is lower than estimated, then the target completes will not be achieved. On the other hand, when it is
higher than estimated, we might obtain many more completes than desired. To account for this institution
nonresponse, especially when the response rate is lower than expected, RTI will draw a larger initial
sample of institutions. These extra institutions will serve as reserve samples which may or may not be
released depending on the need. Table 6 shows the numbers of institutions and ECD initially sampled in
order to obtain the desired numbers of institutions providing lists and the desired numbers of responding,
eligible ECD.
Table 6. Example of initial and desired sample sizes for the academic institutions, FFRDC, and NIH
IRP (computed using the rates in Table 4)

                                                  Institution sample    ECD sample in institutions
                                                                        that provide lists
                                                  Initial     Target    Initial     Target
Stratum                                            sample  completes     sample  completes
1 - GSS Medical schools and centers                    63         53      5,136      3,736
2 - GSS Very high research activity universities                          7,717      6,239
3 - GSS High research activity universities                               3,682      3,007
4 - GSS All other colleges and universities                               4,752      3,763
Total GSS                                             296        250     21,287     16,745
5 - FFRDC                                                                   976        845
6 - NIH IRP                                                                 592        410
OVERALL TOTAL                                         349        300     22,855     18,000

For the institution sample, target completes are the institutions providing lists.

As discussed in this document, some large institutions will be sampled with certainty. These certainty
institutions will be placed in separate strata, and a proportional sample will be selected from within
each institution. Table 7 shows an example of the number of institutions providing lists (target
completes) and the number of responding, eligible ECD for each of the GSS institution strata, to
demonstrate the resulting certainty institution samples and the sample size of ECD within non-certainty
institutions.

Table 7. Sample size allocation for first-stage sampling, and estimated sampled ECD by
certainty/non-certainty institution

Population and sample of institutions
                                                       Population    Sample of institutions
Stratum                                       Institutions      ECD   Total  Non-certainty  Certainty
1 - Medical schools and centers                        172   62,854
2 - Very high research activity universities           109   92,847
3 - High research activity universities                      29,713
4 - All other colleges and universities                461   61,637
Total GSS                                              840  247,051     250            200
5 - FFRDC                                               43    7,520      25
6 - NIH IRP                                             25    2,368      25
OVERALL TOTAL                                          908  256,939     300

Sample ECD
                                                          Certainty institution                   Non-certainty institution
Stratum                                          Total  Sample size  Max per inst  Min per inst  Sample size  Sample size per inst
1 - Medical schools and centers                  3,736        1,397           128                      2,339
2 - Very high research activity universities     6,239        2,841           164                      3,398
3 - High research activity universities          3,007          348                                    2,659
4 - All other colleges and universities          3,763          107                                    3,656
Total GSS                                       16,745        4,693                                   12,052
5 - FFRDC                                          845                                                                          34
6 - NIH IRP                                        410                                                                          16
OVERALL TOTAL                                   18,000

We will likely set an upper bound on the number of ECD selected from the certainty institutions in order
to control the burden. After NSF and RTI have finalized the precision and sample sizes, we will be able to
identify the certainty institutions and can work with NSF to determine how many ECD to include from
each.
The institution response rates (i.e. the proportion that provide lists) may vary from those given in Table 5.
Rather than initially fielding all of the institutions shown in the first column of Table 6, we will first
select a larger sample of institutions; in this larger sample, the desired ECD will also be inflated for
purposes of calculation of the composite size measure so that the domain by stratum sampling rates are
the same as intended. Next, we will randomly partition the initially sampled institutions into a set of
mini-samples called replicates (or waves) for sample release, so that each replicate is a random subset
of the initially selected institutions. Under this approach, one typically releases several of the
replicates at the start of the data collection period; the number initially released is based on an
optimistic level of response, so that the release would be expected to yield a respondent sample that
falls short of the desired respondent quotas.
Fielding the institutions in waves will help control the number that we contact and the number from
which we obtain lists. The sample will be monitored, and once a better understanding of the realized
response rate is obtained we can estimate the additional sample size needed to reach the target number of
institutions that provide lists. Then, the number of replicates needed to reach the additional sample size
requirements is released at a subsequent point in the field collection. This process may occur in several
iterations until the end of the stage-1 survey, when the desired number of institutions providing lists is
achieved. Waves will be maintained and released separately for each of the institution strata and certainty
strata in order to have better control over the number of institutions from which we obtain lists. Because
the certainty institutions are so large and important to the survey, we may choose to release all of them at
the beginning of data collection.
Sampled institutions would be randomly assigned to replicates within a stratum. The number of
institutions in a replicate should be small enough to provide control over the sample size of institutions
that provide lists; for example, 5 to 10 institutions per wave might be a reasonable number for the ECDS.
We would prefer that the replicates within each stratum be close to equal in size. For example, GSS
stratum 3 calls for 54 institutions to provide lists (Table 3), and an initial sample of 64 institutions
is needed to obtain this number given the response rate assumptions. Here, we might sample 70
institutions (so that we are covered in case the response rate is lower than expected), randomly divide
the sample into 14 replicates of 5 institutions each, and initially field 11 replicates (55 institutions).
This would leave 3 replicates, and one or more could be released if needed.
Suppose there are a total of $M_h$ sample institutions across all replicates in stratum $h$, and in the
replicates that are released there are $m_h$ sample institutions. Also suppose there are $R_h$ replicates
and $r_h$ are released. Replicates or waves that are not released are not treated as nonrespondents for
either response calculations or weighting; they are treated the same as if they had not been sampled.
Weights for institutions would first be adjusted to account for the sample release by multiplying by the
factor $M_h / m_h$. Alternatively, if all of the replicates in a stratum contain the same number of
institutions, the factor could be $R_h / r_h$. This will be followed by an adjustment for institution
nonresponse. Then, the response-adjusted institution weights will be calibrated to the total
number of institutions on the frame within each of the first-stage strata.
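
The GSS stratum 3 example (70 institutions, 14 replicates of 5, 11 released) and the release adjustment factor can be sketched together in Python. The institution identifiers, the seed, and the function names are illustrative assumptions.

```python
import random


def assign_replicates(institution_ids, replicate_size, rng):
    """Randomly partition sampled institutions into replicates of a fixed size."""
    shuffled = list(institution_ids)
    rng.shuffle(shuffled)
    return [shuffled[start:start + replicate_size]
            for start in range(0, len(shuffled), replicate_size)]


def release_adjustment(total_sampled, total_released):
    """Weight factor M_h / m_h applied to institutions in released replicates."""
    if not 0 < total_released <= total_sampled:
        raise ValueError("released count must be positive and no larger than the sampled count")
    return total_sampled / total_released


sampled = [f"inst_{k:02d}" for k in range(70)]      # hypothetical stratum sample
replicates = assign_replicates(sampled, 5, random.Random(2017))
released, reserve = replicates[:11], replicates[11:]

M_h = sum(len(replicate) for replicate in replicates)   # 70 sampled
m_h = sum(len(replicate) for replicate in released)     # 55 released
factor = release_adjustment(M_h, m_h)
```

Because every replicate here has the same size, the factor 70/55 coincides with the replicate-count version 14/11 described in the text.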
As with any sampling scheme that inflates the number of units selected in order to account for
nonresponse and eligibility, the implementation of institution sample waves and release will change the
selection probabilities from those that are designed. However, if the expected and actual response rates
are similar, the nonresponse adjustments to the weights that are made after data collection is complete
should help to restore the weights so that the design effect due to unequal weighting for the respondents is
close to that anticipated in the sample design.

F. Adjusting the Sample Size Allocation for Discrepancies in Counts of ECD
There may be differences between the counts of ECD used during the institution sampling (first-stage
sample selection) and the counts used during the ECD sampling (second-stage sample selection).
During the list collection for the second-stage sampling frame construction, we will receive lists of ECD
with sampling variables from the sampled institutions. These lists will provide more accurate counts,
while the counts used for the first-stage sampling are estimates. To maintain the goal of an epsem sample
when the actual count based on the institution-provided ECD list is available, we can adjust the sample
size $n_{hij}$ as follows. Suppose $\tilde{N}_{hij}$ denotes the count of ECD provided by institution $i$
for domain $j$ in stratum $h$. The institution-level domain-specific sample size may be recalculated as
follows:

Certainty U.S. academic institutions: $\tilde{n}_{hij} = \left(\tilde{N}_{hij} / N_{hij}\right) \times n_{hij} = \tilde{N}_{hij} \times f_{hj}$

Certainty FFRDC: $\tilde{n}_{5ij} = \left(\tilde{N}_{5ij} / N_{5ij}\right) \times n_{5ij} = \tilde{N}_{5ij} \times f_{5j}$

Certainty NIH IRP: $\tilde{n}_{6ij} = \left(\tilde{N}_{6ij} / N_{6ij}\right) \times n_{6ij} = \tilde{N}_{6ij} \times f_{6j}$

Non-certainty U.S. academic institutions:
$$\tilde{n}_{hij} = \left(\tilde{N}_{hij} / N_{hij}\right) \times \frac{\sum_{i=1}^{I_h - m_h^{C}} S_{hi}}{m_h^{NC}\, S_{hi}} \times N_{hij} \times f_{hj} = \frac{\sum_{i=1}^{I_h - m_h^{C}} S_{hi}}{m_h^{NC}\, S_{hi}} \times \tilde{N}_{hij} \times f_{hj}$$

Non-certainty FFRDC:
$$\tilde{n}_{5ij} = \left(\tilde{N}_{5ij} / N_{5ij}\right) \times \frac{\sum_{i=1}^{I_5 - m_5^{C}} S_{5i}}{m_5^{NC}\, S_{5i}} \times N_{5ij} \times f_{5j} = \frac{\sum_{i=1}^{I_5 - m_5^{C}} S_{5i}}{m_5^{NC}\, S_{5i}} \times \tilde{N}_{5ij} \times f_{5j} \qquad (12)$$

Under the condition that $\tilde{n}_{hij} \le \tilde{N}_{hij}$ for all domains and institutions, we will
achieve equal weights within each stratum by domain:

Certainty U.S. academic institutions: $\tilde{\pi}_{hijk} = \pi_{hi} \times \tilde{\pi}_{hjk|i} = 1 \times \left(\tilde{n}_{hij} / \tilde{N}_{hij}\right) = \frac{\tilde{N}_{hij} \times f_{hj}}{\tilde{N}_{hij}} = f_{hj}$

Certainty FFRDC: $\tilde{\pi}_{5ijk} = \pi_{5i} \times \tilde{\pi}_{5jk|i} = 1 \times \left(\tilde{n}_{5ij} / \tilde{N}_{5ij}\right) = \frac{\tilde{N}_{5ij} \times f_{5j}}{\tilde{N}_{5ij}} = f_{5j}$

Certainty NIH IRP: $\tilde{\pi}_{6ijk} = \pi_{6i} \times \tilde{\pi}_{6jk|i} = 1 \times \left(\tilde{n}_{6ij} / \tilde{N}_{6ij}\right) = \frac{\tilde{N}_{6ij} \times f_{6j}}{\tilde{N}_{6ij}} = f_{6j}$

Non-certainty U.S. academic institutions:
$$\tilde{\pi}_{hijk} = \pi_{hi} \times \tilde{\pi}_{hjk|i} = \left(m_h - m_h^{C}\right) \times \frac{S_{hi}}{\sum_{i=1}^{I_h - m_h^{C}} S_{hi}} \times \frac{1}{\tilde{N}_{hij}} \times \frac{\sum_{i=1}^{I_h - m_h^{C}} S_{hi}}{m_h^{NC}\, S_{hi}} \times \tilde{N}_{hij} \times f_{hj} = f_{hj}$$

Non-certainty FFRDC:
$$\tilde{\pi}_{5ijk} = \pi_{5i} \times \tilde{\pi}_{5jk|i} = \left(m_5 - m_5^{C}\right) \times \frac{S_{5i}}{\sum_{i=1}^{I_5 - m_5^{C}} S_{5i}} \times \frac{1}{\tilde{N}_{5ij}} \times \frac{\sum_{i=1}^{I_5 - m_5^{C}} S_{5i}}{m_5^{NC}\, S_{5i}} \times \tilde{N}_{5ij} \times f_{5j} = f_{5j}$$

The second-stage sampling will take place on a rolling basis. That is, once the list of ECD is received
from a sampled institution, we will draw the sample of ECD within that institution. If we keep the
sampling rate $f_{hj}$ as in formula (12) fixed during the second-stage sampling, a consequence is that the
total number of sampled ECD may not be exactly 18,000; that is, it can be lower or higher than 18,000
depending on the number of institutions responding to the survey. We will monitor the
numbers of ECD sampled from each institution, stratum, and domain and may adjust the sample sizes
(after discussion with NSF staff) if it appears that the counts of ECD on the institution lists or the number
sampled will differ greatly from the sampling plan.
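
The recalculation in this section amounts to scaling the planned sample size by the ratio of the list count to the frame count, which leaves the within-institution sampling rate unchanged. A minimal Python sketch, with made-up counts and a hypothetical function name:

```python
def recalculated_sample_size(frame_count, list_count, planned_sample_size):
    """Scale the planned size n_hij by (list count / frame count)."""
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    return (list_count / frame_count) * planned_sample_size


# Hypothetical institution and domain: the frame estimated 200 ECD with a planned
# sample of 20 (rate 0.10); the list received from the institution has 260 ECD.
adjusted = recalculated_sample_size(frame_count=200, list_count=260, planned_sample_size=20.0)
adjusted_rate = adjusted / 260   # stays at the planned rate of 20 / 200
```

Holding the rate fixed in this way is what lets the total sample drift above or below 18,000 as the lists arrive, as noted in the text.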

G. Small Institutions
Some of the GSS and FFRDC institutions are too small to support the average sample sizes called for in
Table 6 (especially after inflating by the expected ECD response and eligibility rates). RTI and NSF
reviewed frame coverage and examined domain distributions when dropping the smallest institutions
from the frame, and NSF determined that institutions with fewer than 50 ECD could be dropped without
substantial loss of coverage. Other institutions that are too small to support the full sample will be
combined with another institution in the same institution stratum (ideally in the same GSS stratum and
state) for purposes of forming institution PSUs.
Combining institutions that are too small to support the full sample with other institutions to form PSUs
would be done prior to selection of the first-stage sample. Assuming that all institutions in a PSU provide
lists, the minimum number of ECD needed in a PSU would be the counts shown in the last column of
Table 7 inflated by the ECD eligibility and response rates (Table 5). In the first two GSS strata, we plan to
combine any institution with fewer than 100 ECD with a larger institution, and in the last two GSS strata
we plan to combine any institution with fewer than 75 ECD with a larger institution. If one or both of the
institutions in a PSU do not provide a list, then we may consider adding an additional PSU in that stratum
as part of the wave release. In any case, if only one of the set of institutions in a PSU provides a list,
then we will likely still select ECD from the cooperative institution.
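
The size thresholds for GSS institutions can be summarized in a small Python sketch. It covers only the thresholds stated above (drop below 50; combine below 100 in GSS strata 1 and 2, below 75 in GSS strata 3 and 4) and does not attempt the choice of partner institution. The function name and return labels are illustrative assumptions.

```python
def small_institution_action(gss_stratum, ecd_count):
    """Classify a GSS institution by its ECD count before first-stage selection."""
    if gss_stratum not in (1, 2, 3, 4):
        raise ValueError("gss_stratum must be 1, 2, 3, or 4")
    if ecd_count < 50:
        return "drop"
    combine_below = 100 if gss_stratum in (1, 2) else 75
    return "combine" if ecd_count < combine_below else "standalone"


example_action = small_institution_action(gss_stratum=2, ecd_count=80)
```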

H. Institutions that Cannot Identify their ECD
As in the pilot survey, not all institutions will be able to identify their ECD; some will instead provide
a list with variables such as year of degree and job type. We plan to classify individuals on these lists
according to the likelihood of being an ECD (“not likely,” “somewhat likely,” “likely”). Those that are
“not likely” to be an ECD will not be a part of the sample; we learned in the pilot survey that very few of
these were actually ECD. Those that are “somewhat likely” or “likely” will be sampled, with the sampling
rates set lower for those that are “somewhat likely” to be an ECD and higher for those that are “likely” to
be an ECD. A higher sampling rate for those that are “likely” to be an ECD means that a larger number of
sampled individuals will be eligible and actually ECD. This will increase the design effect due to unequal
weighting, but will also increase the proportion of the sampled individuals that are actually ECD and
therefore eligible for the survey.
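
The effect of unequal sampling rates on the design effect can be illustrated with Kish’s approximation, which depends only on the weights: n times the sum of squared weights, divided by the square of the sum of weights. The sampling rates and group sizes below are made up for illustration and are not ECDS design values.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting."""
    if not weights:
        raise ValueError("weights must be non-empty")
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical rates: "likely" sampled at 1 in 2 (weight 2), "somewhat likely"
# at 1 in 4 (weight 4), with 60 and 40 sampled individuals respectively.
example_weights = [2.0] * 60 + [4.0] * 40
example_deff = kish_design_effect(example_weights)
```

With equal weights the function returns 1; the wider the gap between the two sampling rates, the further the value rises above 1.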

File Type	application/pdf
Author	Brian Head
File Modified	2017-07-17
File Created	2017-07-17