Appendix 1b LaVange et al Sampling Design SOL

The Hispanic Community Health Study/ Study of Latinos (HCHS/SOL)(NHLBI)

OMB: 0925-0584

Sample Design and Cohort Selection in the Hispanic Community Health
Study/Study of Latinos
LISA M. LAVANGE, PHD, WILLIAM D. KALSBEEK, PHD, PAUL D. SORLIE, PHD,
LARISSA M. AVILÉS-SANTA, MD, MPH, ROBERT C. KAPLAN, PHD,
JANICE BARNHART, MD, MS, KIANG LIU, PHD, AIDA GIACHELLO, PHD, DAVID J. LEE, PHD,
JOHN RYAN, DRPH, MICHAEL H. CRIQUI, MD, MPH, AND JOHN P. ELDER, PHD, MPH

PURPOSE: The Hispanic Community Health Study (HCHS)/Study of Latinos (SOL) is a multicenter,
community-based cohort study of Hispanic/Latino adults in the United States. A diverse participant sample
is required that is both representative of the target population and likely to remain engaged throughout
follow-up. The choice of sample design, its rationale, and benefits and challenges of design decisions are
described in this study.
METHODS: The study design calls for recruitment and follow-up of a cohort of 16,000 Hispanics/Latinos
18–74 years of age, with 62.5% (10,000) over 44 years of age and adequate subgroup sample sizes to support
inference by Hispanic/Latino background. Participants are recruited in community areas surrounding four
field centers in the Bronx, Chicago, Miami, and San Diego. A two-stage area probability sample of households is selected with stratification and oversampling incorporated at each stage to provide a broadly diverse
sample, offer efficiencies in field operations, and ensure that the target age distribution is obtained.
CONCLUSIONS: Embedding probability sampling within this traditional, multisite cohort study design
enables competing research objectives to be met. However, the use of probability sampling requires developing solutions to some unique challenges in both sample selection and recruitment, as described here.
Ann Epidemiol 2010;20:642–649. © 2010 Elsevier Inc. All rights reserved.
KEY WORDS: Probability Sampling, Sampling Diverse Populations, Hispanic/Latino Health.

From the Collaborative Studies Coordinating Center, Department of
Biostatistics, Gillings School of Global Public Health, University of North
Carolina, Chapel Hill (L.M.L.); the Survey Research Unit, Department of
Biostatistics, Gillings School of Global Public Health, University of North
Carolina, Chapel Hill (W.D.K.); the Division of Cardiovascular Sciences,
National Heart, Lung and Blood Institute, National Institutes of Health,
Bethesda, MD (P.D.S., L.M.A.S.); the Department of Epidemiology and
Population Health, Albert Einstein College of Medicine, Bronx, NY
(R.C.K., J.B.); the Department of Preventive Medicine, Feinberg School of
Medicine, Northwestern University, Chicago, IL (K.L.); Midwest Latino
Health Research, Training and Policy Center, Jane Addams College of
Social Work, University of Illinois at Chicago (A.G.); the Department
of Epidemiology and Public Health, Sylvester Comprehensive Center,
University of Miami, FL (D.J.L.); the Department of Family Medicine
and Community Health, University of Miami, FL (J.R.); the Department
of Family and Preventive Medicine, University of California at San Diego,
La Jolla (M.H.C.); and the Graduate School of Public Health, San Diego
State University, CA (J.P.E.).
Address correspondence to: Lisa M. LaVange, PhD, CSCC, Department
of Biostatistics, UNC-CH, 137 E. Franklin St., Suite 203, Chapel Hill,
NC 27514. Tel.: +1-919-966-8333; Fax: +1-919-962-3265. E-mail: lisa_[email protected].
Received January 7, 2010; accepted May 15, 2010.
© 2010 Elsevier Inc. All rights reserved.
360 Park Avenue South, New York, NY 10010

INTRODUCTION
The Hispanic Community Health Study (HCHS)/Study of
Latinos (SOL) is a multicenter community-based cohort
study of Hispanics/Latinos in the United States (US). The
study objectives are to provide information on the health
status and disease burden of US Hispanics/Latinos and to
investigate relationships between baseline risk factors and
disease incidence during follow-up. A cohort of 16,000
Hispanics/Latinos 18–74 years of age will be enrolled and,
on completion of a comprehensive baseline examination,
followed annually to determine the incidence of clinical
events, including cardiovascular events and pulmonary
exacerbations. The study is funded by the National Heart,
Lung, and Blood Institute and six other institutes, centers,
or offices within the National Institutes of Health. Details
of the study design and its various components are described
by Sorlie et al. (1). This paper describes the sample design
used to identify and select households and persons for study
participation.
Two distinct analytical objectives motivated the
approach to sample selection. First, the study sample must
support estimates of prevalence of baseline risk factors,
both overall and by Hispanic/Latino background and other
demographic subgroups. Second, the sample must support
evaluation of the relationships between the various risk
factors and disease outcomes measured during follow-up.
To accomplish both objectives, a hybrid approach to cohort
identification and selection is used that combines deliberate
selection of community areas and random selection of
households within those areas. The rationale for the use of
probability sampling, details of the sample design, and the
impact of the sampling strategy on the recruitment process
are provided in the following sections.

Selected Abbreviations and Acronyms
HCHS = Hispanic Community Health Study
SOL = Study of Latinos
NHLBI = National Heart, Lung, and Blood Institute
MEPS = Medical Expenditure Panel Survey
SES = socio-economic status
BG = block group
PSU = primary sampling unit
DSF = delivery sequence file

1047-2797/$ - see front matter
doi:10.1016/j.annepidem.2010.05.006
AEP Vol. 20, No. 8
August 2010: 642–649

METHODS
The four communities included in HCHS/SOL are located
in the Bronx, Chicago, Miami, and San Diego. The sampled
area in each community was defined by a group of neighboring census tracts to provide geographical balance and
diversity with respect to Hispanic/Latino background.
Each community’s field center purposively selected its targeted tracts based on their proximity to the clinic, tract-level demographic distributions available from the 2000
Decennial U.S. Census, and local information about neighborhoods. The target population in HCHS/SOL corresponds to all noninstitutionalized Hispanic/Latino adults
18–74 years of age residing in the four sampled areas. Probability sampling within these areas is used to ensure a broad
representation of the target population and to minimize the
various sources of bias that may otherwise enter into the
cohort selection and recruitment process.
The Need for Probability Sampling
The design of a population-based sample must accommodate the specific informational needs of the study. For
HCHS/SOL, the selected sample should be broadly representative of the target population in that the sample mirrors
the full range of possible values for key outcome variables
while also providing adequate representation of important
combinations of predictor and outcome variables (2). Probability sampling provides a means for achieving such
balanced representation. Probability sampling also provides
a basis for making unbiased inference to target population
characteristics of interest. These advantages, however,
come at a cost. Probability sampling requires the exclusive
use of random selection so that the statistical probability
of choosing each sample member can be calculated (3).
Random selection requires enumeration of members of the
target population, or well-defined subsets thereof, and can
be costly to implement. As a result, study designs
incorporating more convenient methods of selection are
often used for population-based cohort studies.
Any sample design that uses nonrandom selection (e.g.,
a convenience sample) produces a nonprobability sample.
Quasi-probability samples that combine random and
nonrandom methods of selection (e.g., allowing interviewers to subjectively select a quota sample of households
within a random sample of neighborhoods) are also nonprobability samples (4, 5). Accompanying the simplicity
and lower cost associated with nonprobability sampling
are two problems. First, there is no direct theoretical basis
for making estimates of population characteristics from
the sample (6). Instead, one must either defend a model to
explain the generation of the sample data from some
underlying distribution or assume that the variability of
sample-based estimates is similar to that associated with
simple random sampling. Both assumptions are difficult to
verify. Second, nonprobability samples typically offer
a skewed reflection of the sampled population due to diminished participation by population sectors (2). Self-selected
samples exclude those more reluctant to volunteer and
who are less accessible; allowing interviewers to decide
who is selected can also exclude those not meeting personal
preferences, leading to potentially biased study results. The
magnitude of this bias is directly related to the extent of
under-representation in the sample and the degree to which
key study measurements on those included differ from those
not included. Although it is true that sources of error
unrelated to sample selection (e.g., nonresponse) can bias
the analysis of data from probability samples, nonprobability
samples are subject to these same nonsampling errors,
producing estimates with bias due to both nonrandom
selection and nonsampling sources (7).
To illustrate the potential for bias in nonprobability
samples, we compared health outcomes estimated from
a national probability sample to outcomes estimated from
a simulated clinic-users sample using data from the 2005
Medical Expenditure Panel Survey (MEPS). MEPS uses
a national probability sample of all civilian, noninstitutionalized U.S. residents (8). The subset of MEPS respondents
reporting one or more physician visits in the past year
(“clinic users”) mimics a convenience sample selected
through physician practices alone. Estimates of the number
of chronic conditions, average cost of physician visits,
obesity prevalence, and number of work days missed due
to illness/injury are provided in Figure 1 for the full sample
and the clinic-users sample, both overall and by race/
ethnicity. Sampling weights are incorporated in the analysis
to account for disproportionate sampling of population
subgroups in the MEPS study design. Estimates of the
clinic-users sample standardized by age, race/ethnicity, and
gender to the MEPS target population are also provided in
Figure 1 in an attempt to adjust for skewness in the
convenience sample. The findings show that estimates from
the clinic-users sample are consistently higher than those
from the probability sample. This selection bias is not unexpected due to the association between the criteria for selecting the clinic-users sample and the outcome measures (i.e.,
clinic users are more likely to have health problems than the
population as a whole). The fact that standardizing estimates from the clinic-users sample does not consistently
compensate for the skewed representation due to nonprobability sampling, however, is unexpected. Standardization
seems to offset the effect of selection bias for one outcome
(chronic conditions), partially compensate for the effect
in another (health care costs), and exacerbate the effect in
the remaining two measures (obesity and days lost). For
subgroup comparisons, the white–nonwhite difference in
obesity prevalence for the clinic-users sample overstates
the actual difference, and standardization exacerbates this
overstatement. Although this example highlights the
pitfalls of just one form of nonrandom selection, similar
results would be expected for other forms of convenience
sampling.

FIGURE 1. Results of a simulated comparison of probability and nonprobability samples using data from the 2005 MEPS. “Population”
corresponds to estimates based on the MEPS sample and therefore representing the 2005 US resident, noninstitutionalized population.
“Clinic Users” corresponds to estimates based on the subset of the MEPS sample reporting one or more physician visits in the past
year. “Clinic Users (standardized)” corresponds to estimates based on the MEPS clinic-users sample, standardized to the 2005 US resident,
noninstitutionalized population distributions for age, race/ethnicity, and gender.
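The mechanism behind these findings can be illustrated with a small deterministic sketch. All numbers below are invented for illustration and are not MEPS estimates; the class and function names are likewise hypothetical. The sketch assumes two age groups in which clinic users have worse outcomes than non-users, and compares the true population mean with the crude and age-standardized means computed from clinic users only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeGroup:
    pop_share: float      # share of the target population in this age group
    user_frac: float      # fraction of the group with at least one clinic visit
    mean_users: float     # mean outcome among clinic users
    mean_nonusers: float  # mean outcome among non-users


# Hypothetical values chosen only to make the arithmetic easy to follow.
GROUPS = [
    AgeGroup(pop_share=0.6, user_frac=0.5, mean_users=1.0, mean_nonusers=0.4),
    AgeGroup(pop_share=0.4, user_frac=0.8, mean_users=2.5, mean_nonusers=1.5),
]


def population_mean(groups):
    """True mean over users and non-users, weighted by population shares."""
    return sum(
        g.pop_share * (g.user_frac * g.mean_users
                       + (1.0 - g.user_frac) * g.mean_nonusers)
        for g in groups
    )


def crude_user_mean(groups):
    """Mean among clinic users only, with no adjustment."""
    numerator = sum(g.pop_share * g.user_frac * g.mean_users for g in groups)
    denominator = sum(g.pop_share * g.user_frac for g in groups)
    return numerator / denominator


def standardized_user_mean(groups):
    """Clinic-user means reweighted to the population age distribution."""
    return sum(g.pop_share * g.mean_users for g in groups)


if __name__ == "__main__":
    print(round(population_mean(GROUPS), 3))
    print(round(crude_user_mean(GROUPS), 3))
    print(round(standardized_user_mean(GROUPS), 3))
```

With these made-up inputs the three means are about 1.34, 1.77, and 1.60. Standardizing on age moves the clinic-users estimate toward the truth but cannot remove the difference between users and non-users within each age group, which is the part of the bias that standardization never sees.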

Rationale for Key Sample Design Features
A probability-based sampling strategy was chosen for
HCHS/SOL, with specific features dictated by the goals
and overall design of the study. First, the decision to identify
Hispanics/Latinos from the general residential population
made controlling the cost of face-to-face recruitment
a priority. The mode of recruitment and data collection is
an important cost factor in population-based studies, and,
although mail- and web-based methods are inexpensive,
nonsampling errors due to incomplete frame coverage and
nonresponse can occur. Telephone screening is also relatively inexpensive, but its exclusive use was impractical
for HCHS/SOL due to the declining use of telephone
land-lines and the fact that an extensive clinic visit is
a key component of data collection. Face-to-face sample
recruitment was seen as the only real option for HCHS/
SOL; consequently, steps to control the associated higher
costs were needed.
One obvious cost-saving measure was to sample
geographic clusters of households (i.e., census block groups)
at the first stage of a multistage sample to reduce the cost of
return visits to neighboring households. More substantial
cost savings were realized through oversampling of both
clusters and households within clusters most likely to be
Hispanic/Latino, thereby reducing the number of sampled
households that must be screened to achieve the study’s
sample size goals. Geographic clusters were stratified by
the proportion of the population found to be Hispanic/
Latino in the 2000 Census, and clusters in the “high concentration” stratum were selected at a higher rate than clusters
in the “low concentration” stratum at the first stage of
sample selection. An optimal delineation point between
high and low concentration was determined for each field
center using Cochran’s cumulative √f rule (9). Similarly,
household addresses within clusters were divided into two
strata, those associated with Hispanic/Latino surnames
versus all others. Hispanic/Latino surname addresses were
selected at a higher rate than other addresses at the second
stage. Oversampling in multiple stages of the selection
process in this way provides efficiencies in sample identification while still retaining the advantages of random
selection.
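The cumulative √f rule used to set the high/low concentration cut point can be sketched as follows. This is the textbook form of the rule for two strata, not the study's own code: the stratification variable is binned, the square roots of the bin frequencies are accumulated, and the cut point is the bin edge whose cumulative value is closest to half the total. The function name, the number of bins, and the 0 to 100 range are assumptions made for illustration.

```python
import math


def cum_sqrt_f_cutpoint(values, lo=0.0, hi=100.0, n_bins=20):
    """Return a two-stratum cut point using the cumulative sqrt(f) rule.

    values : stratification variable per cluster (e.g., % Hispanic/Latino)
    lo, hi : range covered by the equal-width bins (assumed, not from the paper)
    n_bins : number of equal-width bins (assumed, not from the paper)
    """
    if n_bins < 2:
        raise ValueError("need at least two bins to form two strata")
    width = (hi - lo) / n_bins

    # Frequency of clusters in each bin; a value equal to `hi` goes in the last bin.
    freq = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        freq[idx] += 1

    # Cumulative sum of sqrt(frequency) across bins.
    cum = []
    running = 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)

    # For two strata the target is half of the total cumulative sqrt(f).
    target = cum[-1] / 2.0

    # Pick the interior bin edge whose cumulative value is closest to the target.
    best = min(range(n_bins - 1), key=lambda i: abs(cum[i] - target))
    return lo + (best + 1) * width


if __name__ == "__main__":
    # Toy data: 16, 9, 4, and 1 clusters falling in four bins of width 25.
    toy = [10] * 16 + [30] * 9 + [60] * 4 + [90] * 1
    print(cum_sqrt_f_cutpoint(toy, n_bins=4))
```

In the toy data the square roots of the bin frequencies are 4, 3, 2, and 1, so the cumulative values are 4, 7, 9, and 10, and the edge closest to the halfway target of 5 is the upper edge of the first bin.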
Meeting the HCHS/SOL objectives requires adequate
representation of the socio-economic status (SES) distribution of residents of the defined community areas. Although
SES is an individual- or household-level characteristic, it is
rarely possible to stratify a sample of households by a direct
measure of SES. The next best option is to use census
measures such as educational attainment or household
income as a practical proxy indicator (10). To this end,
geographic clusters were stratified by the proportion of residents aged 25 years or older with at least a high school
education based on the 2000 Census. The high and low
SES delineation point was defined as the median value of
the distribution across clusters, and the first-stage sample
was allocated proportionately across strata to ensure broad
SES representation.
To meet the HCHS/SOL objective of identifying predictors of disease outcomes including cardiovascular events,
a target sample size of 10,000 persons 45–74 years of age
(62.5% of the full cohort) was set. Over-representation of
this age group required subsampling households or persons
within households according to the household’s age distribution. Such a procedure is best applied during screening, with
the intention of retaining a higher portion of discovered older Hispanics/Latinos than would occur if persons were
chosen at random. Subsampling according to age was
accomplished in one of two ways. Method 1 was designed
to keep all households intact, with no subsampling at the
person level, and was adopted at study start. With this
method, households in which the Hispanic/Latino adults
are all 45–74 years of age are selected with certainty (probability of selection = 1) within the first-stage cluster, and all
other households are subsampled with probability less than
1. Method 2 involves dividing each household into two
subclusters, Hispanics/Latinos 45–74 years of age and
Hispanics/Latinos 18–44 years of age. The 45–74 year
subclusters are selected with certainty (probability = 1),
whereas the 18–44 year subclusters are selected with probability of less than 1. This method involves subsampling
persons within a household rather than keeping households
intact, but can result in fewer households needing to be
screened and was adopted after study start for efficiency.
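The two age-subsampling methods can be sketched in a few lines. This is an illustration under stated assumptions rather than the screening software used in the field: each household is represented by a list of the ages of its Hispanic/Latino adults, the function names are hypothetical, and in Method 2 the 18–44 subcluster is assumed to be kept or dropped as a unit.

```python
import random


def method1(ages, p_mixed, rng):
    """Method 1: keep households intact.

    Households whose eligible adults are all 45-74 are retained with certainty;
    any other household is retained with probability p_mixed.
    Returns the ages of the retained adults (empty list if not retained).
    """
    eligible = [a for a in ages if 18 <= a <= 74]
    if not eligible:
        return []
    all_older = all(a >= 45 for a in eligible)
    if all_older or rng.random() < p_mixed:
        return eligible
    return []


def method2(ages, p_younger, rng):
    """Method 2: subsample persons within the household.

    Adults 45-74 are retained with certainty; the 18-44 subcluster is
    retained (as a unit, by assumption) with probability p_younger.
    """
    older = [a for a in ages if 45 <= a <= 74]
    younger = [a for a in ages if 18 <= a <= 44]
    keep_younger = bool(younger) and rng.random() < p_younger
    return older + (younger if keep_younger else [])


if __name__ == "__main__":
    rng = random.Random(2010)  # fixed seed so the sketch is repeatable
    household = [30, 50]       # one younger and one older adult
    print(method1(household, p_mixed=0.21, rng=rng))
    print(method2(household, p_younger=0.63, rng=rng))
```

Under Method 1 a mixed-age household contributes either everyone or no one, whereas under Method 2 the older adult is always retained, which is why fewer households need to be screened to reach the 45–74 target.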
The final design consideration was the need to compare
health characteristics by Hispanic/Latino background
among the four field centers. Valid comparisons require
comparability across sites in cohort recruitment, but not
necessarily identical probability sample designs. Indeed,
the same sample design structure was used (i.e., two-stage
stratified sampling of households with the same sampling
units and stratification variables in each stage), with some
allowance for how the strata were defined and the sample
allocated among the centers.
Sample Selection
A stratified two-stage area probability sample of household
addresses was selected in each of the four HCHS/SOL field
centers. A summary of the center-specific designs is presented in Table 1. At the first stage, a stratified simple
random sample of census block groups (BGs), which served
as primary sampling units (PSUs), was selected in each field
center. PSU sampling strata were defined by the cross-classification of (i) high and low Hispanic/Latino concentration, and (ii) high and low SES, defined above. The
distribution of BGs across strata and the oversampling ratios
for high and low Hispanic concentration strata are presented in Table 2. Special strata were created as needed to
target specific neighborhoods. In the Bronx, a fifth stratum
was defined as a portion of a high-rise housing complex
(named Co-op City) to provide additional income diversity,
and two additional strata were appended after study start to
increase coverage. In Miami, a fifth stratum was defined with
high expected concentrations of Central and South Americans, and a sixth stratum corresponding to an area with
a high concentration of Cuban residents was appended after
study start. Overall, 670 (72.4%) of the 925 BGs in the
target areas were selected for the PSU sample.
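As a quick arithmetic check on the first-stage figures, the sketch below recomputes the block-group sampling fractions from the frame and sample counts reported in Table 2. The dictionary and function names are illustrative and not part of the study's software.

```python
# (total BGs on census frame, selected BGs) per field center, as reported in Table 2
BG_COUNTS = {
    "Bronx": (376, 238),
    "Chicago": (170, 125),
    "Miami": (158, 147),
    "San Diego": (221, 160),
}


def sampling_fraction(total, selected):
    """Fraction of frame units that were selected."""
    if total <= 0 or not 0 <= selected <= total:
        raise ValueError("counts must satisfy 0 <= selected <= total and total > 0")
    return selected / total


# Center-specific first-stage sampling fractions.
FRACTIONS = {
    center: sampling_fraction(total, selected)
    for center, (total, selected) in BG_COUNTS.items()
}

# Pooled fraction across the four field centers.
TOTAL_FRAME = sum(total for total, _ in BG_COUNTS.values())
TOTAL_SELECTED = sum(selected for _, selected in BG_COUNTS.values())
OVERALL = sampling_fraction(TOTAL_FRAME, TOTAL_SELECTED)

if __name__ == "__main__":
    for center, fraction in FRACTIONS.items():
        print(f"{center}: {100 * fraction:.1f}%")
    print(f"Overall: {TOTAL_SELECTED} of {TOTAL_FRAME} = {100 * OVERALL:.1f}%")
```

The pooled result reproduces the 670 of 925 block groups (72.4%) quoted in the text, and the center-specific fractions show that Miami's first-stage sample covers a much larger share of its frame than the Bronx sample does.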
Separate stratified second-stage samples of household
addresses were selected within each sample PSU. Address
listings came from the delivery sequence file (DSF) available
from the U.S. Postal Service and obtained through MSG-Genesys (Ft. Washington, PA). The DSF addresses within
each sample BG were cross-referenced with telephone and
commercial mailing lists, and surname and telephone
number were appended where available. Table 2 provides
the second-stage oversampling ratios for the Hispanic/Latino surname strata used to achieve the final sample of
127,213 addresses.
The sample addresses in each field center were randomly
subsampled to form three waves corresponding to the 3 years
of recruitment. Thus, the yearly sample for each field center
was representative of the target community area, thereby
minimizing bias due to temporal trends.

TABLE 1. Summary of HCHS/SOL sample design features

Sampling Stage 1
Sampling unit and frame source(s): Sampling unit: BG as defined for the 2000 Census. Frame: list of BGs created from the designated community area at the site, defined in each stratum as a set of census tracts.
Stratification and stratum allocation:
- Explicit strata formed by cross-classification of: (i) “high”/“low” categories according to % Hispanic among total population in 2000; and (ii) “high”/“low” SES as measured by % of the population with at least a high school education. One or more special strata were defined in a subset of sites to target specific population subgroups.
- Disproportionately higher BG sampling rates for strata in the “high” % Hispanic category; proportionate allocation to “high” and “low” SES categories within each % Hispanic category.
Random selection method in each stratum: Simple random sampling

Sampling Stage 2
Sampling unit and frame source(s): Sampling unit: household address. Frame source: USPS listing of addresses available through MSG-Genesys.
Stratification and stratum allocation:
- Explicit strata formed by whether or not the occupant has a Hispanic surname.
- Disproportionately higher address sampling rate for Hispanic surname addresses; uniform stratum-specific address sampling rates among sample BGs.
Random selection method in each stratum: Simple random sampling

BG = block group; HCHS = Hispanic Community Health Study; SES = socio-economic status; SOL = Study of Latinos; USPS = United States Postal Service.
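The random split of each field center's sample addresses into three yearly waves can be sketched as below. The paper does not say whether the split was carried out within strata or block groups, so this minimal version simply shuffles the whole address list; the function name and seed are hypothetical.

```python
import random


def split_into_waves(addresses, n_waves=3, seed=0):
    """Randomly partition sampled addresses into n_waves groups of near-equal size.

    Because every address is equally likely to land in any wave, each wave is
    itself a random subsample of the full address sample.
    """
    rng = random.Random(seed)   # fixed seed so the split can be reproduced
    shuffled = list(addresses)
    rng.shuffle(shuffled)
    # Deal the shuffled addresses out in turn, one wave at a time.
    return [shuffled[i::n_waves] for i in range(n_waves)]


if __name__ == "__main__":
    waves = split_into_waves(range(10))
    for year, wave in enumerate(waves, start=1):
        print(f"Year {year}: {len(wave)} addresses")
```

Dealing out a shuffled list keeps the wave sizes within one address of each other while ensuring that no address appears in more than one wave.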

Design Modifications
A key feature of the HCHS/SOL sample design is the ability
to modify components to adapt to recruitment experiences.
The modifications made to date include the designation of
a sixth stratum in the Miami field center to append certain
block groups in the Hialeah neighborhood for increased
coverage of the Cuban population and designation of a sixth
and seventh strata in the Bronx to capture a neighborhood
adjoining the original target area, thereby increasing
coverage of the Bronx Hispanic/Latino community.
Approximately 6 months into recruitment, a decision
was made to apply Method 2 for oversampling adults 45–
74 years of age in lieu of Method 1, based on the need to
accept a higher proportion of households into the sample
and reduce recruitment time. The selection probabilities
for both methods of oversampling 45–74-year-olds during
household screening were based initially on 2005 American
Community Survey data for the geographic region of each
field center. The sample age distribution is monitored
continually as data on HCHS/SOL households accumulate,
and the selection probabilities are adjusted as needed.
Table 2 provides the subsampling rates for each method
applied to each field center.

Sample Size and Data Analysis
Each field center will enroll 4000 Hispanics/Latinos with
the prescribed age distribution, namely 2500 that are 45–
74 years of age and 1500 that are 18–44 years of age. In terms
of Hispanic/Latino background, the Bronx field center
sample is predominantly Puerto Rican and Dominican,
whereas the majority of participants in the San Diego site
are Mexican in origin. Study participants in the Miami field
center are Cuban and Central and South American, and
participants in the Chicago field center are Mexican,
Puerto-Rican, and Central and South American. A
minimum of 2000 participants in each of the four prespecified Hispanic/Latino groups (Mexican, Puerto Rican, Cuban, and Central and South American) is required to
support the analysis objectives, and sample sizes are monitored continuously to determine if adjustments to the
sampling strategy are needed.
The HCHS/SOL sample size will support a broad range of
analyses planned for the study. As an example, consider the
possible association of an exposure variable with incident
disease. The range of hazard ratios detectable with
approximately 90% power is provided in Table 3 by
event rate and the relative sample sizes of the low- and high-risk
groups. The estimates incorporate a design effect of 1.25 to account
for clustering in the sample, based on an average
cluster size (persons per block group) of 24 and an intraclass
correlation for incident disease of 0.011. Based on the entire
study cohort of 16,000, a hazard ratio of 1.6 could
be detected for an event occurring at the rate of 4 per 1000
person-years of follow-up and equally sized low- and high-risk
groups, e.g., for a continuous exposure variable dichotomized at the median value. With a population subgroup of
size 4000 (e.g., a single site or Hispanic/Latino subgroup),
the hazard ratio able to be detected in the same circumstances is 2.25. For higher levels of intraclass correlation,
power for both comparisons would decrease.

TABLE 2. Design characteristics of the HCHS/SOL sample

Design Characteristic                                              Bronx   Chicago     Miami San Diego     Total
Stage 1 (sampling block groups)
  Total BGs on census frame (n)                                      376       170       158       221       925
  Delineation point: “high” vs. “low” % Hispanic                   44.69     45.79     64.71     44.67         -
  Delineation point: “high” vs. “low” SES                          46.80     44.11     48.50     53.32         -
    (% ≥ high school education)
  Selected BGs (n)                                                   238       125       147       160       670
  By stratum
    Low concentration/high SES                                        14        19         9        51        93
    Low concentration/low SES                                          5         5         0        11        21
    High concentration/high SES                                       78        46        46        35       205
    High concentration/low SES                                       100        55        34        63       252
    Special stratum 1                                                  3         -        11         -        14
    Special stratum 2                                                 16         -        47         -        63
    Special stratum 3                                                 22         -         -         -        22
  BG oversampling ratio for % Hispanic strata: high/low*             2.5       2.5       1.6       2.0         -
Stage 2 (sampling addresses within block groups)
  Total addresses on USPS frame (n)                              188,932    83,950    98,072   126,769   497,723
  Total addresses in selected BGs (n)                            117,319    59,666    90,298    92,061   359,344
  Total selected addresses (n)                                    30,718    31,143    22,929    42,423   127,213
  Address oversampling ratio, by stratum
    Low concentration/high SES                                      10.0       3.0       2.0       3.0         -
    Low concentration/low SES                                        6.0       3.5         -       3.0         -
    High concentration/high SES                                      4.0       2.3       1.0       3.1         -
    High concentration/low SES                                       4.0       3.0       1.0       3.6         -
    Special stratum 1                                               15.0         -       1.0         -         -
    Special stratum 2                                                6.0         -       1.0         -         -
    Special stratum 3                                                4.0         -         -         -         -
Final sample (Hispanics/Latinos 18–74 years of age)
  Age oversampling during household screening
    Method 1: selection probability for mixed-age households        0.21      0.14      0.38      0.12         -
    Method 2: selection probability for younger adults              0.63      0.30      0.45      0.50         -
      within a household (aged 18–44 years)
  Targeted participant sample size                                  4000      4000      4000      4000    16,000
  Target distribution by background (%)
    Central and South American                                         8        11        35         2      2240
    Cuban                                                              1         1        60         0      2480
    Mexican                                                            7        61         1        97      6640
    Puerto Rican and Dominican                                        84        27         4         1      4640

Total = total count, all field centers combined.
BG = block group; HCHS = Hispanic Community Health Study; SES = socio-economic status; SOL = Study of Latinos; USPS = United States Postal Service.
*The oversampling ratio is calculated by dividing the sampling rate in a stratum isolating those to be oversampled by the sampling rate for the corresponding stratum isolating
those to be undersampled.
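The design effect quoted in the power discussion is consistent with the common approximation deff = 1 + (m − 1)ρ, where m is the average cluster size and ρ the intraclass correlation. The paper does not state which formula was used, so the sketch below is an assumption that happens to reproduce the reported value; the function names are illustrative.

```python
def design_effect(cluster_size, icc):
    """Approximate design effect due to clustering: 1 + (m - 1) * rho."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n, deff):
    """Sample size of a simple random sample with equivalent precision."""
    if deff <= 0:
        raise ValueError("deff must be positive")
    return n / deff


if __name__ == "__main__":
    deff = design_effect(cluster_size=24, icc=0.011)
    print(f"design effect: {deff:.3f}")
    # Using the rounded design effect of 1.25 reported in the paper.
    print(f"effective n, full cohort: {effective_sample_size(16000, 1.25):.0f}")
    print(f"effective n, one center:  {effective_sample_size(4000, 1.25):.0f}")
```

With 24 persons per block group and an intraclass correlation of 0.011 the approximation gives 1.253, which rounds to the 1.25 used in the paper. Under that design effect the cohort of 16,000 carries roughly the information of 12,800 independent observations, and a larger intraclass correlation would shrink this further, which is why power falls as the correlation rises.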
The use of multistage or clustered sampling creates
complexity in data analyses due to correlations among
sample units at the various stages of selection, here, correlations among households within the same block group and
correlations among individuals within the same household.
Similarly, oversampling through the use of differential probabilities of selection requires the use of sampling weights for
unbiased estimation of population characteristics. Whereas
clustering and unequal probabilities of selection tend to
increase the variability of population estimates and reduce
the power available for testing associations, stratification
at one or more stages of sample selection has the reverse
effect. Accurate estimation of variances and valid
statistical tests of hypotheses therefore require appropriately accounting for the HCHS/SOL sample design during
data analysis. Initial sampling weights will correspond to
the inverse probability of selection for each participant.
Nonresponse adjustments and calibration to known population totals (from the 2010 Decennial Census, when available) will be applied. Final sampling weights, stratification
variables, and cluster indicators will be available for design
specification during data analysis. A variety of statistical
methods that account for multi-stage sampling are available
(11, 12), and most standard statistical software packages are
able to accommodate probability sample designs (e.g., SAS
[SAS Institute Inc., Cary, NC], STATA [StataCorp LP,
College Station, TX]). Special purpose software (e.g.,
SUDAAN [RTI, Research Triangle Park, NC]) for complex
sample designs is also available.

TABLE 3. Hazard ratios detected with approximately 90% power by event rate and low/high risk group ratio

Rate in Low-Risk Group        Ratio of Low-Risk to High-Risk
in Person-Years                      1:1       3:1      15:1

Based on the total HCHS/SOL sample (n = 16,000)
  2/1000                            1.85      1.90      2.50
  4/1000                            1.60      1.65      2.05
  6/1000                            1.45      1.50      1.90
  8/1000                            1.40      1.45      1.75

Field center- or subgroup-specific (n = 4000)
  2/1000                            2.95      2.95      4.15
  4/1000                            2.25      2.35      3.20
  6/1000                            2.00      2.05      2.80
  8/1000                            1.85      1.90      2.55
  10/1000                           1.75      1.80      2.40

HCHS = Hispanic Community Health Study; SOL = Study of Latinos.
Assuming 3-year accrual and 2-year follow-up periods and a design effect due to clustering of 1.25.
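The initial sampling weights described in this section, defined as the inverse of each participant's probability of selection, can be sketched as follows. The stage probabilities in the example are hypothetical, the function names are illustrative, and the simple response-rate adjustment shown is only one of several ways a nonresponse adjustment could be made; the paper does not specify its adjustment or calibration procedure.

```python
def base_weight(p_block_group, p_address, p_age_subsample=1.0):
    """Inverse of the overall selection probability across the sampling stages.

    p_block_group   : probability the block group was selected (stage 1)
    p_address       : probability the address was selected within the block group (stage 2)
    p_age_subsample : probability of retention during age subsampling at screening
    """
    for p in (p_block_group, p_address, p_age_subsample):
        if not 0.0 < p <= 1.0:
            raise ValueError("selection probabilities must be in (0, 1]")
    return 1.0 / (p_block_group * p_address * p_age_subsample)


def nonresponse_adjusted(weight, response_rate):
    """Inflate a weight by the inverse response rate of its adjustment cell."""
    if not 0.0 < response_rate <= 1.0:
        raise ValueError("response_rate must be in (0, 1]")
    return weight / response_rate


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to keep the arithmetic simple.
    w = base_weight(p_block_group=0.5, p_address=0.25, p_age_subsample=0.5)
    print(w)                              # persons represented before adjustment
    print(nonresponse_adjusted(w, 0.5))   # after a 50% response rate in the cell
```

Participants from oversampled strata have larger selection probabilities and therefore smaller weights, which is how weighted estimates undo the deliberate over-representation of high-concentration block groups, Hispanic/Latino surname addresses, and adults 45–74 years of age.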
Sample Recruitment
Successful implementation of probability sampling requires
a systematic approach to recruitment to realize the benefits
of the sample design. If subjective factors such as interviewer
preference enter into the recruitment process, then the objectivity associated with random selection will not be
achieved. The goals of HCHS/SOL recruitment are to optimize the ability to establish contact with, determine eligibility of, and actively engage households at every sample
address, regardless of the neighborhood or living conditions
encountered in the field. Recruitment teams inform potential participants of the study objectives and associated benefits of their participation. The research nature of the study is
emphasized, including the information it is designed to
provide and the impact the study results may have on policy
making and health care for future generations of US
Hispanics/Latinos. Extensive community engagement
efforts provide the context for this information exchange,
including collaborations with community-based organizations and targeted media campaigns.
The recruitment protocol consists of three steps: (i)
initial mailings to sample addresses describing the study;
(ii) optional telephone contacts for households with telephone numbers available; and (iii) in-person contacts.
Once contact is established, a brief household screener is
administered via a digital hand-held device to determine
eligibility and implement the age subsampling procedure
(13). On obtaining agreement to participate, a roster of

household members is created, and individual eligibility
confirmed. Persons on active duty military service, not
currently living at home, planning to move from the area
in the next 6 months, or physically unable to attend the
clinic examination are considered ineligible.
Household- and individual-level screening and eligibility
rates and clinic participation rates are monitored continuously, and adjustments to selection parameters (for age) or
fielding of sample addresses (for SES and background) are
made as needed. At the conclusion of HCHS/SOL recruitment, final household and individual level participation rates
will be computed among those eligible for the study. A goal of
60% participation was set at the onset of recruitment.
DISCUSSION
Study design decisions are typically made to accommodate
competing priorities; the National Children’s Study
provides a recent example (14, 15). If the HCHS/SOL
research objectives were limited to baseline prevalence estimates and comparisons thereof, then a probability sample
representing a broad cross-section of US Hispanics/Latinos
would be the choice. Had the sole objective been to support
valid inference of relationships among baseline risk factors
and disease incidence during follow-up, then enrolling
a cohort most likely to remain active in the study for years
to come would be optimal. A hybrid design was chosen to
simultaneously meet both objectives, with communities
defined based on proximity to clinical centers and
Hispanic/Latino diversity, and probability sampling nested
within.
Two aspects of the HCHS/SOL design represent what we
believe to be a novel approach for epidemiology studies with
similar objectives. First, formal probability sampling
methods are embedded into the study design at each site,
thereby allowing the advantages of probability sampling to
be available within a traditional, multisite study model.
Second, methods to efficiently sample Hispanics/Latinos in
already enriched community areas are used, where efficiency
is defined in both a statistical and operational sense. Techniques are applied at each stage of sample selection such
that field operations are optimized without unnecessarily
sacrificing precision of estimates.
Several trade-offs occur as a result of incorporating probability sampling in the HCHS/SOL design. First, the lack of
recent census data on which to base key design parameters
produced some inefficiencies early in recruitment. Second,
areas with low Hispanic/Latino concentration are included
in the sample for diversity, although with lower representation due to undersampling, and their coverage can substantially increase field costs. Finally, the complexity of data
analyses is increased by the need to account for the sample
design.
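The added analytic complexity can be illustrated with a small sketch. When selection probabilities differ (for example, because low-concentration areas are undersampled), an estimate must weight each respondent by the inverse of the selection probability, and the unequal weights reduce precision. The Python code below is an illustration under simplifying assumptions, not the study's analysis method: it ignores stratification, clustering, and nonresponse adjustment, the function names and data are hypothetical, and the design effect shown is Kish's approximation for unequal weighting only.

```python
def weighted_mean(values, weights):
    """Weighted estimate of a mean (or a prevalence, for 0/1 values)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def kish_weighting_deff(weights):
    """Kish's approximate design effect due to unequal weights:
    n * sum(w^2) / (sum(w))^2. Equals 1 when all weights are equal."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical data: two respondents from an oversampled area
# (selection probability 0.5) and two from an undersampled area (0.1).
outcome = [1, 1, 0, 0]
selection_prob = [0.5, 0.5, 0.1, 0.1]
weights = [1.0 / p for p in selection_prob]

unweighted = sum(outcome) / len(outcome)
weighted = weighted_mean(outcome, weights)
deff = kish_weighting_deff(weights)
```

In this made-up example the unweighted prevalence overstates the contribution of the oversampled area, and the design effect exceeds 1, reflecting the loss of precision from unequal weights. A full analysis would also need variance estimates that reflect the strata and clusters of the design.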
In summary, the HCHS/SOL sampling strategy was
chosen to provide broad representation of the US
Hispanic/Latino population living in the communities
surrounding the four field centers with sufficient diversity
to support the research objectives. A rigorous recruitment
protocol is required to realize the benefits of probability
sampling; however, the design is flexible in that modifications can be incorporated with minimal disruption of
ongoing recruitment activities. The hybrid design used for
HCHS/SOL can serve as a model for the design of future
studies with similar objectives.
The Hispanic Community Health Study/Study of Latinos is funded by
contracts from the National Heart, Lung, and Blood Institute (NHLBI)
to the University of North Carolina (N01-HC65233), University of
Miami (N01-HC65234), Albert Einstein College of Medicine (N01-HC65235), Northwestern University (N01-HC65236), and San Diego
State University (N01-HC65237). The following Institutes/Centers/
Offices contribute to the HCHS/SOL through a transfer of funds to the
NHLBI: National Center on Minority Health and Health Disparities,
the National Institute on Deafness and Other Communication Disorders,
the National Institute of Dental and Craniofacial Research, the National
Institute of Diabetes and Digestive and Kidney Diseases, the National
Institute of Neurological Disorders and Stroke, and the Office of Dietary
Supplements.

REFERENCES
1. Sorlie PD, Avilés-Santa LM, Wassertheil-Smoller S, Kaplan RC, Daviglus
ML, Giachello AL, et al. Design and implementation of the Hispanic
Community Health Study/Study of Latinos. Ann Epidemiol. 2010;20:
629–641.
2. Kruskal W, Mosteller F. Representative sampling, III: the current statistical
literature. Int Stat Rev. 1979;47:245–265.
3. Särndal CE, Swensson B, Wretman J. Model Assisted Survey Sampling.
New York (NY): Springer-Verlag; 1992.
4. Kish L. Survey Sampling. Second Printing. New York (NY): John Wiley
and Sons; 1965.
5. Kish L. Statistical Design for Research. New York (NY): John Wiley &
Sons; 1987.
6. Groves RM, Fowler FJ, Couper MP, Lepkowski JM, Singer E, Tourangeau
R. Survey Methodology. New York (NY): John Wiley and Sons; 2004.
7. Lessler JT, Kalsbeek WD. Nonsampling Errors in Surveys. New York (NY):
John Wiley and Sons; 1992.
8. Ezzati-Rice TM, Rohde F, Greenblatt J. Sample Design of the Medical
Expenditure Panel Survey. Methodology Report 22. Rockville (MD):
Agency for Healthcare Research and Quality; 2008.
9. Cochran WG. Sampling Techniques. 3rd ed. New York (NY): John Wiley
and Sons; 1977.
10. Winkleby MA, Jatulis DE, Frank E, Fortmann SP. Socioeconomic status
and health: how education, income, and occupation contribute to risk
factors for cardiovascular disease. Am J Pub Health. 1992;82:816–820.
11. Korn EL, Graubard BI. Analysis of Health Surveys. New York (NY): John
Wiley and Sons; 1999.
12. Little RJ. To model or not to model? Competing modes of inference for
finite population sampling. J Am Stat Assoc. 2004;99:546–556.
13. Bryan H, Mehlman T, Gildner P. A Study Recruitment System Using
Ultra-Mobile Computers with Handwriting Recognition. Poster presentation at the Society for Clinical Trials 30th Annual Meeting, Atlanta, GA;
2009.
14. National Children’s Study. Final Report from the NCS Sampling Design
Workshop. 2004. National Children’s Study.
15. Michael RT, O’Muircheartaigh CA. Design priorities and disciplinary
perspectives: the case of the US National Children’s Study. J R Stat Soc
Ser A. 2008;171:465–480.


File Type: application/pdf
File Title: Sample Design and Cohort Selection in the Hispanic Community Health Study/Study of Latinos
Subject: Probability Sampling, Sampling Diverse Populations, Hispanic/Latino Health
Author: Lisa M. Lavange PhD
File Modified: 2014-05-28
File Created: 2010-07-03
