Appendix M - Detailed Sampling and Weighting Plan

2009 and 2011 Youth Risk Behavior Surveys

OMB: 0920-0493

M. Detailed Sampling and Weighting Plan

SAMPLING AND WEIGHTING PLAN

NATIONAL YOUTH RISK BEHAVIOR SURVEY

The objective of the sampling design is to support estimation of health risk behaviors in a nationally representative population of 9th through 12th graders, by gender and by age or grade. National estimates of the behaviors of all high school students are specifically required, as well as estimates by grade, by gender, and by race/ethnicity for white, black, and Hispanic youth. The 2009 YRBS will be the eleventh fielding of this national survey.

The sampling universe for the national survey will consist of public, Catholic, and other private school students in grades 9 through 12 in the 50 states and the District of Columbia.

M.1 Estimation and Justification of Sample Size

M.1.1 Overview

The YRBS studies are designed to produce most estimates accurate to within ±5 percent at 95 percent confidence. Overall estimates and estimates by grade or gender or race/ethnicity meet this standard as do certain finer-grained analyses such as grade by gender or race/ethnicity by gender. A looser design target of ±5 percent at 90 percent confidence was established for estimates by grade and race/ethnicity due to the practical difficulties of obtaining nationally representative samples of minority students at an affordable level.

We propose to replicate, in most respects, the sampling parameters used in the 2007 YRBS because they met the levels of precision required for CDC's purposes. Minor refinements in the sampling plan are expected to occur for subsequent rounds of data collection, driven by the changing demographics of the in-school population. Current trends of increasing percentages of black and Hispanic students will influence the design of future YRBS cycles in several areas:

The weighting function that oversamples minority students will gradually be adjusted downwards to give less weight to minority students, and thus to improve the statistical efficiency of overall survey estimates.

The stratum boundaries based on the percentage of minority students will be re-computed to minimize variances according to the cumulative square root rule.

The over-allocation of primary sampling units (PSUs) to high-minority strata will be gradually reduced, although the numbers of minority students in the sample will not change appreciably.

We may add one or two PSUs to the sample rather than, or in addition to, doubling sample sizes in schools with the highest percentages of minority students.

The proposed sample consists of 57 primary sampling units (PSUs), defined as a county or a group of counties. In each PSU, at least three different schools will contribute classes at each grade in the 9 through 12 range. The actual number of sampled schools will be greater than 57 x 3 = 171 because (1) some schools contain only some of the targeted grades (e.g., 7th through 9th) and (2) small schools are selected in a subset of the 57 PSUs over and above those initially selected. A small school has an enrollment that is insufficient to generate the equivalent of one full class section at each targeted grade contained in the school. We will select one class per school per targeted grade, except in schools with the highest concentrations of minority students, where we will select two classes per grade. We expect that approximately 200 sample schools will be selected to generate about 12,000 participating students.

M.1.2 Expected Confidence Intervals for the YRBS

Confidence intervals vary depending upon whether the prevalence estimate is for the full population or for a subset, such as a particular grade or gender. They also vary from one variable to another. Within a grouping, they also vary depending on the level of the estimate (e.g., variances of an estimated prevalence reach a maximum near percentage estimates of 50%), and the design effect associated with the measure. As a general guideline based on the prior YRBS studies, which had similar designs and sample sizes, we can expect the following levels of precision:

Estimates by grade or by gender, or pooling grades/genders, will be within ±5 percent at 95 percent confidence.

For minority group estimates by grade (e.g., 11th grade Hispanics), about 70 percent will be within 5 percent at 90 percent confidence and about 85 percent will be within 7 percent at 90 percent confidence.

The experience in using these data is that the levels of sampling errors have been suitable for the usage of the data.
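
As an illustration of how precision depends on sample size, prevalence level, and design effect, the following is a minimal sketch of the usual normal-approximation half-width for an estimated proportion. The design effect of 3.0 and the subgroup size of 3,000 are assumed values chosen only for the example; they are not taken from this plan.

```python
import math

def ci_half_width(p, n, deff=1.0, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    p    : estimated prevalence (0 to 1)
    n    : number of responding students in the group
    deff : design effect (variance inflation relative to simple random sampling)
    z    : critical value (1.96 for 95 percent, 1.645 for 90 percent)
    """
    return z * math.sqrt(deff * p * (1.0 - p) / n)

# Worst case for variance is p = 0.5. The deff and subgroup n are hypothetical.
overall = ci_half_width(p=0.5, n=12000, deff=3.0)
one_grade = ci_half_width(p=0.5, n=3000, deff=3.0)
print(f"overall: +/-{overall:.3f}, one grade: +/-{one_grade:.3f}")
```

Under these assumed inputs both half-widths fall below 0.05, which is the sense in which an estimate is "within ±5 percent at 95 percent confidence."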

M.1.3 School and Student Nonresponse

The average school participation rate over the prior 10 studies has been 77 percent; the average student participation rate has been 86 percent. To be conservative, we will assume average values in the 2009 YRBS sample design, subject to future re-evaluation.
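
The arithmetic implied by these two rates can be sketched as follows. The function names are illustrative, and treating the overall response rate as the simple product of the school and student rates is an assumption of this sketch.

```python
import math

SCHOOL_RATE = 0.77   # average school participation rate cited above
STUDENT_RATE = 0.86  # average student participation rate cited above

def combined_response_rate(school_rate, student_rate):
    """Overall response rate, assuming the two stages multiply."""
    return school_rate * student_rate

def students_to_select(target_participants, school_rate, student_rate):
    """Number of students to select so that the expected yield meets the target."""
    rate = combined_response_rate(school_rate, student_rate)
    return math.ceil(target_participants / rate)

rate = combined_response_rate(SCHOOL_RATE, STUDENT_RATE)
needed = students_to_select(12000, SCHOOL_RATE, STUDENT_RATE)
print(f"combined rate: {rate:.4f}, students to select: {needed}")
```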

M.2 SAMPLING METHODS

M.2.1 Overview

The sampling universe for the national YRBS will consist of all public, Catholic, and other private school students in grades 9 through 12 in the 50 states and the District of Columbia. The sample will be a three-stage cluster sample, stratified by racial/ethnic status and by urban versus rural location. PSUs are classified as "urban" if they are in one of the 54 largest MSAs in the U.S.; otherwise, they are classified as "rural". Additional, implicit stratification will be imposed by geography by sorting the PSU frame by state and by 5-digit ZIP Code (within state). Within each stratum, primary sampling units (PSUs), each defined as a county or a group of counties, will be chosen without replacement at the first stage. In subsequent sampling stages, a probabilistic selection of schools and students will be made from the sample PSUs. Exhibit M-1 presents a summary of the sampling design features.

Exhibit M-1. Key Sampling Design Features

Sampling Stage	Sampling Units	Sample Size (Approximate)	Stratification	Measure of Size
1	Counties or groups of counties	56-58	Urban vs. non-urban (2 strata); minority concentration (8 strata)	Aggregate school size in target grades
2	Schools	200 (>=3 per PSU)	Small vs. other	Weighted enrollment (increased for black, Hispanic groups)
3	Classes/students	1 or 2 classes per grade per school; 12,000 students		

Three strategies will be employed to achieve over-sampling of blacks and Hispanics: (1) larger sampling rates will be used in high-Hispanic and high-black strata; (2) a modified measure of size will be employed that increases the probability of selection of schools with high minority enrollments; and (3) two classes per grade (rather than one) will be selected in high-minority schools.

M.2.2 Measure of Size

The sampling approach will utilize Probability Proportional to Size (PPS) sampling methods to achieve over-sampling of blacks and Hispanics. In PPS sampling, when the measure of size is defined as the count of final-stage sampling units, and a fixed number of units are selected in the final stage, the result is an equal probability of selection for all members of the universe. For the YRBS, we approximate these conditions, and thus obtain a roughly self-weighting sample. This section describes the type of measure of size to be employed for selecting PSUs and schools with over-sampling of blacks and Hispanics.

A function of the form r_hH + r_bB + r_oO is used where the r's are the weighting factors for the Hispanic, black, and Other high school per-grade enrollment (H, B, and O, respectively). This function will increase the chances of schools with relatively large minority enrollments entering the sample, and also increase the probability of selection for high-minority PSUs.

The effectiveness of a weighted measure of size in achieving oversampling is dependent upon the distributions of black and Hispanic students in schools. For example, if U.S. schools had identical percentages of minorities in every school, then the sample of students from any sample of schools would mirror the national percentages and use of a weighted measure of size would fail to oversample blacks and Hispanics. We know this is not the case, however, as the distribution of high school students with respect to race and ethnicity follows that of the general population, and here we find a great deal of clustering by race and ethnicity. This observation is further borne out by the success of the use of a weighted measure of size in prior studies as an effective means of oversampling black and Hispanic students.

In 1990, Macro conducted a series of simulation studies that investigated the relationship of various weighting functions to the resulting numbers and percentages of minority students in the obtained samples.¹ In the 2007 YRBS cycle, the following weighting function was used for the measure of size:

2H + 2B + O

We will perform a new simulation study during the 2009 YRBS design for similar purposes, i.e., fine-tuning the measure of size coefficients.

The measure of size will be used to compute stratum and PSU sizes as well. This will have the effect of increasing the allocation of the sample to high‑minority strata and increasing the chances of PSUs with high minority concentrations getting into the sample.
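
The weighted measure of size can be sketched as below. The default coefficients are the 2007 values given above (2H + 2B + O); the two schools are hypothetical and are chosen to have the same total per-grade enrollment so that the effect of the weighting is visible.

```python
def measure_of_size(hispanic, black, other, r_h=2.0, r_b=2.0, r_o=1.0):
    """Weighted measure of size r_h*H + r_b*B + r_o*O.

    hispanic, black, other : per-grade enrollment counts for each group
    r_h, r_b, r_o          : weighting factors (defaults follow the 2007 function)
    """
    return r_h * hispanic + r_b * black + r_o * other

# Two hypothetical schools, each with 200 students per grade.
low_minority = measure_of_size(hispanic=20, black=20, other=160)
high_minority = measure_of_size(hispanic=80, black=60, other=60)
print(low_minority, high_minority, high_minority / low_minority)
```

With equal enrollments, the high-minority school in this example has a measure of size about 1.4 times that of the low-minority school, and so a correspondingly higher chance of selection under PPS sampling.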

M.2.3 Definition of Primary Sampling Units

In defining PSUs, several issues are considered:

Each PSU should be large enough to contain the requisite numbers of schools and students by grade.

Each PSU should be compact geographically so that field staff can go from school to school easily.

There should be recent data available to characterize the PSUs.

PSU definitions should be consistent with secondary sampling unit (school) definitions.

Generally, counties will be equivalent to PSUs, except where low population counties are combined to provide sufficient numbers of schools and students. Also, very large counties are divided into multiple PSUs so that no one county will be certain of selection. The variance estimation process is more efficient without the need to account for certainty PSUs. The method of dividing large PSUs will ensure that each sub-county PSU meets all of the criteria for a PSU.

County population figures will be aggregated from school enrollment data for the grades of interest. Enrollment data are being obtained from the most recent Common Core of Data from the National Center for Education Statistics, which are merged on a rolling basis into the current school and school district data files of Quality Education Data, Inc.

The 2009 PSU frame will be formed directly from counties using methods developed by Macro. The methods employ both student counts and geographic data to ensure that the PSUs being formed have the correct number of schools and students, and that the PSUs are compact geographically.

M.2.4 Stratification and Selection of PSUs

M.2.4.1 Definition of strata

The PSUs will be organized into 16 strata, based on urban/rural location (as defined above) and minority enrollment. The approach involves the computation of optimum stratum boundaries using the cumulative square root of “f” method developed by Dalenius and Hodges. The boundaries or cutoffs change as the frequency distribution (“f”) for the racial groupings changes from one survey cycle to the next. These rules are summarized below, and the boundaries computed for the 2007 YRBS are shown in Exhibit M-2.

If the percentage of Hispanic students in the PSU exceeds the percentage of black students, then the PSU is classified as Hispanic. Otherwise it is classified as black (Exhibit M-2, column (a)).
If the PSU is within one of the 54 largest MSAs in the U.S., it is classified as 'Urban'; otherwise it is classified as 'Rural' (Exhibit M-2, column (b)).
Hispanic Urban and Hispanic Rural PSUs are classified into four density groupings (Exhibit M-2, column (c)) depending upon the percentage of Hispanics in the PSU (Exhibit M-2, column (d)).
Black Urban and Black Rural PSUs are also classified into four groupings (Exhibit M-2, column (c)) depending upon the percentage of blacks in the PSU (Exhibit M-2, column (d)).
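
The classification rules above can be sketched as a small function. The cut points are the upper boundaries of density groups 1 through 3 in Exhibit M-2; assigning a PSU that falls exactly on a cut point to the higher group is an assumption of this sketch, since the exhibit does not say which side a boundary value belongs to.

```python
from bisect import bisect_right

# Upper boundaries (percent) of density groups 1-3, taken from Exhibit M-2.
CUT_POINTS = {
    ("B", "U"): [22, 34, 56],
    ("B", "R"): [18, 34, 58],
    ("H", "U"): [22, 34, 45],
    ("H", "R"): [22, 44, 66],
}

def stratum_code(pct_hispanic, pct_black, in_large_msa):
    """Return a first-stage stratum code such as 'BU3' or 'HR1'."""
    # Rule 1: predominant minority (ties go to black, per "otherwise").
    minority = "H" if pct_hispanic > pct_black else "B"
    # Rule 2: urban if the PSU lies in one of the 54 largest MSAs.
    location = "U" if in_large_msa else "R"
    # Rules 3-4: density group from the predominant minority's percentage.
    pct = pct_hispanic if minority == "H" else pct_black
    group = bisect_right(CUT_POINTS[(minority, location)], pct) + 1
    return f"{minority}{location}{group}"

print(stratum_code(pct_hispanic=10, pct_black=40, in_large_msa=True))
```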

Exhibit M-2. First-Stage Strata and Frame PSU Distribution

Predominant Minority (a)	Urban/Rural (b)	Density Group Number (c)	Boundaries (d)	Stratum Code (e)	Total Number PSU (f)
Black	Urban	1	0% - 22%	BU1	91
		2	22% - 34%	BU2	25
		3	34% - 56%	BU3	12
		4	56% - 100%	BU4	8
	Rural	1	0% - 18%	BR1	373
		2	18% - 34%	BR2	100
		3	34% - 58%	BR3	94
		4	58% - 100%	BR4	26
Hispanic	Urban	1	0% - 22%	HU1	60
		2	22% - 34%	HU2	13
		3	34% - 45%	HU3	10
		4	45% - 100%	HU4	4
	Rural	1	0% - 22%	HR1	373
		2	22% - 44%	HR2	44
		3	44% - 66%	HR3	19
		4	- 100%	HR4	13
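
The cumulative square root of f rule used to set boundaries like those in Exhibit M-2 can be sketched as follows. This is a textbook version of the Dalenius-Hodges procedure, not the production code used for the YRBS; the histogram of PSU counts by percent minority is hypothetical.

```python
import math

def cum_sqrt_f_boundaries(bin_edges, freqs, n_strata):
    """Approximate optimum stratum boundaries by the cumulative sqrt(f) rule.

    bin_edges : edges of narrow classes of the stratification variable
                (length len(freqs) + 1)
    freqs     : number of frame units falling in each class
    n_strata  : number of strata to form
    Returns the n_strata - 1 interior boundaries.
    """
    cumulative = []
    running = 0.0
    for f in freqs:
        running += math.sqrt(f)
        cumulative.append(running)
    boundaries = []
    for h in range(1, n_strata):
        target = h * running / n_strata  # equal steps on the cum sqrt(f) scale
        # Pick the class whose cumulative value is closest to the target.
        idx = min(range(len(cumulative)), key=lambda i: abs(cumulative[i] - target))
        boundaries.append(bin_edges[idx + 1])
    return boundaries

# Hypothetical frame: PSU counts in 10-point classes of percent minority.
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
counts = [100, 64, 36, 25, 16, 9, 9, 4, 4, 1]
print(cum_sqrt_f_boundaries(edges, counts, n_strata=4))
```

Because the rule works on the frequency distribution, the boundaries shift whenever the distribution of minority enrollment across PSUs shifts, which is why they are recomputed each cycle.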

M.2.4.2 Allocation of the PSU sample

Precision requirements dictate a first-stage sample of at least 55 sample PSUs. As in the two previous cycles, we will design and select a sample of 57 sample PSUs. In order to stay as close as possible to maximum sample efficiency in terms of precision, the initial allocation will be made proportional to student enrollment. Then, so as to meet design requirements in terms of minority student yields, we will make adjustments to the initial allocation. This is similar to the basic process used in prior studies. Given shifts in student populations, we expect the resulting allocation to be closer to proportional than prior allocations, continuing a trend we have observed over the past few cycles of the YRBS. Adjustments to the initial, base allocation will be evaluated using sample simulations. Response rates from prior cycles will be used to inform the yield computations in the simulations.
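
The initial, proportional step of the allocation can be sketched as below. The largest-remainder rounding is an assumption of this sketch (the plan does not state how fractional allocations are rounded), the three strata and their sizes are hypothetical, and the later adjustments for minority yield are not shown.

```python
def proportional_allocation(stratum_sizes, n_total):
    """Allocate n_total PSUs to strata in proportion to stratum size.

    stratum_sizes : dict mapping stratum code to its enrollment-based size
    n_total       : total number of PSUs to allocate
    Fractions are resolved by the largest-remainder rule (an assumption).
    """
    total = sum(stratum_sizes.values())
    quotas = {s: n_total * size / total for s, size in stratum_sizes.items()}
    allocation = {s: int(q) for s, q in quotas.items()}
    leftover = n_total - sum(allocation.values())
    # Give the remaining PSUs to the strata with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - allocation[s], reverse=True)
    for s in by_remainder[:leftover]:
        allocation[s] += 1
    return allocation

# Hypothetical stratum sizes, for illustration only.
print(proportional_allocation({"A": 500, "B": 300, "C": 200}, n_total=57))
```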

M.2.4.3 Selection of PSUs

Within each first-stage stratum, the PSUs will be sorted by five-digit zip code to attain a form of implicit geographic stratification. Implicit stratification, coupled with the probability proportional to size (PPS) sampling method described below, will ensure geographic sample representation. With PPS sampling, the selection probability for each PSU is proportional to the PSU’s measure of size. The following systematic sampling procedures, similar to those adopted in previous YRBS cycles, will be applied to the stratified frame to select a PPS sample of PSUs.

Select the 57 PSUs using systematic random sampling within each stratum. Within each stratum, the method applies a sampling interval computed as the sum of the measures of size for the PSUs in the stratum divided by the number of PSUs to be selected in the stratum.

Subsample at random 15 of the 57 sample PSUs for the small school sampling.
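
The systematic PPS selection within one stratum can be sketched as follows. The frame is assumed to be already sorted (for example by ZIP Code), and no PSU is assumed to have a measure of size larger than the sampling interval, consistent with the splitting of very large counties described earlier. The PSU identifiers and sizes are hypothetical.

```python
import random

def pps_systematic(units, n_select, rng=random):
    """Systematic PPS selection from a sorted frame.

    units    : list of (unit_id, measure_of_size) in frame sort order
    n_select : number of units to select in this stratum
    rng      : object with a random() method returning a value in [0, 1)
    """
    total = sum(size for _, size in units)
    interval = total / n_select            # sampling interval
    start = rng.random() * interval        # random start in [0, interval)
    targets = [start + i * interval for i in range(n_select)]
    selected = []
    cumulative = 0.0
    t = 0
    for unit_id, size in units:
        cumulative += size
        # A unit is hit when a target falls inside its cumulative size range.
        while t < n_select and targets[t] < cumulative:
            selected.append(unit_id)
            t += 1
    return selected

# Hypothetical stratum frame of five PSUs.
frame = [("psu_a", 10), ("psu_b", 20), ("psu_c", 30), ("psu_d", 40), ("psu_e", 50)]
print(pps_systematic(frame, n_select=2, rng=random.Random(1)))
```

Sorting the frame geographically before applying the interval is what produces the implicit stratification: selections are spread along the sorted list rather than clustered in one area.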

M.2.5 Selection of Schools

Schools in selected PSUs will be classified as “large” if they have 25 or more students per grade in all eligible grades; otherwise they will be classified as “small”. The following procedures will be used to select large schools in each stratum:

Schools will be classified as "whole" if they have all high-school grades 9-12. Otherwise, they will be considered a "fragment" school. Fragment schools will be linked with other schools (fragment or whole) to form a cluster school that has all four grades. We will link schools before sampling using an algorithm, used in previous cycles, that links geographically proximate schools. Cluster schools are treated as a single school during sampling with selection performed at the grade level as described below.

The weighted high school per-grade average enrollment will be computed for each school, to be used as the measure of size. The estimate of enrollment will be developed by averaging the enrollment at each eligible grade in the school. When enrollment by grade is not available, we will divide total school enrollment by the number of grades taught in the school.

Three large schools, or linked school clusters, will be selected in each PSU with probability proportional to their measures of size.
Fifteen small schools will be drawn in the 2009 YRBS to represent the small percent of students attending small schools (less than six percent nationwide). As in the 2007 YRBS, the sample small schools will be selected in 15 subsample PSUs, with one school selected per PSU. All students in eligible grades will be selected per school, averaging an expected draw of 63 students per school. Within each subsampled PSU, small schools will be drawn PPS using the same weighted measure of size used in selecting large schools. This approach minimizes the linking of schools to create linked sampling units that span all grades and have a required minimum grade size for selection.
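
The large/small classification and the per-grade enrollment average described above can be sketched as below. The function names and the example school are hypothetical, and the race/ethnicity weighting of the measure of size is omitted here for brevity.

```python
def per_grade_enrollment(grade_enrollments=None, total_enrollment=None, n_grades=None):
    """Average enrollment per eligible grade.

    Uses grade-level counts when available; otherwise falls back to total
    school enrollment divided by the number of grades taught.
    """
    if grade_enrollments:
        return sum(grade_enrollments.values()) / len(grade_enrollments)
    return total_enrollment / n_grades

def is_large(grade_enrollments, min_per_grade=25):
    """A school is 'large' if every eligible grade has at least 25 students."""
    return all(n >= min_per_grade for n in grade_enrollments.values())

# Hypothetical school with grade-level data.
school = {9: 100, 10: 90, 11: 80, 12: 70}
print(per_grade_enrollment(school), is_large(school))
```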

M.2.6 Grade Selection

Except for cluster schools, all eligible grades are included in the class selection in each school. In school clusters, grade samples are selected independently with one component school being selected for each grade.

M.2.7 Selection of Classes

The method of selecting students will vary from school to school, depending upon the organization of that school and whether a cluster of schools is involved. The key element of the school sampling strategy is to identify a structure that partitions the students into mutually exclusive, collectively exhaustive groupings that are of approximately equal sizes and that are accessible. Beyond that basic requirement, we will do the partitioning to result in groups in which both genders and students of all ability levels are represented. In selecting classes, we will generally give preference to selecting from mandatory courses such as English. Another option is to select from all classes that meet during a particular time of day such as all second or third period classes.

We will not use special procedures to sample for minorities at the school building level for two reasons:

Schools do not maintain student rosters that identify students by racial/ethnic affiliation.

We feel this would be viewed by many schools as an offensive practice.

We plan to select one or two classes at each grade level from each participating school. Two classes per grade are selected in those schools with very high percentages of black or Hispanic students. In the case of school clusters, we will conduct our sampling on a grade by grade basis. At each grade we will determine the identity of all schools in the cluster with students in that grade. If each school has enough students in the grade, then we will pick randomly one of the schools with probability proportional to grade enrollment and then select all of the classes from that school. If one of the schools does not have enough students, then its students will be combined with a class of another school in the cluster. If that class is picked, then students are surveyed in both schools.

A "class" will be defined by our sampling team so that it meets size and composition requirements before the sampling is done. For example, two small classes may be combined and treated as one for sampling purposes. Or, boys' and girls' physical education classes may be combined. This approach is an efficient method of data collection in schools that also has the advantage of using the classroom teacher to distribute consent forms and to "leverage" student participation; hence, it tends to yield higher student participation rates. The disadvantage of this approach is its tendency to make the sampling design less efficient because students within a class section tend to be more homogeneous than the student population at large within a school. The effect of this inefficiency has been accounted for in our estimates of the design effect of the study.

M.2.8 Replacement of Schools/School Systems

We will not replace refusing school districts, schools, classes, or students. We have allowed for school and student nonresponse by oversampling, i.e., by inflating the number of selections to account for the expected levels of nonresponse.

M.2.9 Selection of Students

All students in a selected classroom will be surveyed.

M.3 WEIGHTING AND VARIANCE ESTIMATION

This section describes the procedures used to weight the data. From a sampling perspective, these include:

Sampling Weights
Nonresponse Adjustments and Weight Trimming
Post-stratification to National Estimates of Racial Percentages and Student Enrollment by Grade
Estimators and Variance Estimators

M.3.1 Weighting

Although the sample was designed to be self‑weighting under certain idealized conditions, it will be necessary to compute weights to produce unbiased estimates. The basic weights, or sampling weights, will be computed on a case‑by‑case basis as the reciprocal of the probability of selection of that case. Below is a simple presentation of the basic steps in weighting including a) Sampling weight computation, b) Nonresponse adjustments, and c) Post-stratification adjustments.

a. Sampling Weights

If K_i is the number of PSUs to be selected from stratum i, N_i is the size of stratum i, and N_ij is the size of PSU j in stratum i (in all cases "size" refers to our proposed measure of size), then the probability of selection of PSU j is K_i N_ij/N_i. Assuming three large schools are to be selected in each sample PSU, and N_ijk is the size of school k in PSU j in stratum i, the conditional probability of selection of the school given the selection of the PSU is 3N_ijk/N_ij. The derivation is similar for small schools, with an extra factor to account for the PSU subsampling probability (15/57).

If C_ijk is the number of classes in school ijk then the conditional probability of selection of a class is just 1/C_ijk (or 2/C_ijk if two classes are taken). Since all students are selected, the conditional probability of selection of a student given the selection of the class is unity.

The overall probability of selection of a student is the product of the conditional probabilities of selection. The probabilities of selection will be the same for all students in a given school, regardless of their ethnicity, but will vary among schools depending upon the racial/ethnic mix of the schools and their surrounding regions.

Sampling weights assigned to each student record are the reciprocal of the overall probabilities of selection for each student.
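
The chain of conditional probabilities and the resulting base weight can be sketched as follows. The parameter names and all numeric inputs are hypothetical; only the structure of the product follows the description above.

```python
def student_selection_probability(n_psus, stratum_size, psu_size, school_size,
                                  n_classes, classes_taken=1,
                                  schools_per_psu=3, psu_subsample_prob=1.0):
    """Overall probability of selection for a student.

    n_psus             : number of PSUs selected from the stratum
    stratum_size       : measure of size of the stratum
    psu_size           : measure of size of the PSU
    school_size        : measure of size of the school
    n_classes          : number of classes in the student's grade in the school
    classes_taken      : 1, or 2 in high-minority schools
    schools_per_psu    : 3 for large schools (1 for small schools)
    psu_subsample_prob : 1.0 for large schools (15/57 for small schools)
    """
    p_psu = n_psus * psu_size / stratum_size
    p_school = schools_per_psu * school_size / psu_size
    p_class = classes_taken / n_classes
    p_student = 1.0  # all students in a selected class are surveyed
    return p_psu * psu_subsample_prob * p_school * p_class * p_student

# Hypothetical large school, for illustration only.
prob = student_selection_probability(n_psus=4, stratum_size=100000,
                                     psu_size=5000, school_size=250,
                                     n_classes=10)
weight = 1.0 / prob  # sampling weight is the reciprocal of the probability
print(f"probability: {prob:.4f}, sampling weight: {weight:.1f}")
```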

b. Nonresponse Adjustments and Weight Trimming

Several adjustments are planned to account for student and school nonresponse patterns. An adjustment for student nonresponse will be made using gender and grade within school. With this adjustment, the sum of the student weights over participating students within a school matches the total enrollment by grade in the school. This adjustment factor will be capped in extreme situations, such as when only one or two students respond in a school, to limit the potential effects of extreme weights (i.e., unequal weighting effects on survey variances).

The weights of students in participating schools will be adjusted to account for nonparticipation by other schools. The adjustment uses the ratio of the weighted sum of measures of size over all selected schools in the stratum (numerator of adjustment factor), and over the subset of participating schools in a stratum (denominator of adjustment factor). The adjustment factor will be computed and applied to small and large schools separately.
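
The two nonresponse adjustments can be sketched as simple ratios. The record layout, the cap value, and the example figures are all hypothetical; the plan does not state the cap that will be used.

```python
def school_nonresponse_factor(schools):
    """Ratio adjustment for school nonparticipation within one stratum.

    schools : list of dicts with keys 'weight' (school sampling weight),
              'mos' (measure of size), and 'participated' (bool).
    Numerator sums over all selected schools, denominator over participants.
    """
    selected = sum(s["weight"] * s["mos"] for s in schools)
    participating = sum(s["weight"] * s["mos"] for s in schools if s["participated"])
    return selected / participating

def student_nonresponse_factor(enrolled, respondent_weight_sum, cap=None):
    """Ratio that scales respondent weights in a cell up to cell enrollment.

    cap : optional upper limit on the factor (hypothetical value in the example).
    """
    factor = enrolled / respondent_weight_sum
    return min(factor, cap) if cap is not None else factor

# Hypothetical stratum with one nonparticipating school.
stratum = [
    {"weight": 10, "mos": 100, "participated": True},
    {"weight": 10, "mos": 200, "participated": False},
    {"weight": 20, "mos": 150, "participated": True},
]
print(school_nonresponse_factor(stratum))
print(student_nonresponse_factor(enrolled=120, respondent_weight_sum=10, cap=5.0))
```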

Extreme variation in sampling weights can inflate sampling variances and offset the precision gained from a well-designed sampling plan. One strategy to compensate for these potential effects is to trim extreme weights and distribute the trimmed weight among the untrimmed weights. The trimming method that we will use, outlined in Potter,²,³ for example, is based on procedures first developed for the National Assessment of Educational Progress (NAEP).

The trimming is an iterative procedure. In each iteration an optimal weight, W_o, is calculated from the sum of the squared weights in the sample. Then, each weight W_i is marked and trimmed if it exceeds that optimal weight. The trimmed weight is summed within grade and spread out proportionally over the unmarked cases in the grade. This process is repeated until little or no weight is being trimmed. Weight trimming is done within stratum.

Typically, 3 to 4 percent of the total sample weight is trimmed and redistributed under the weight trimming procedure.
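
The iterative trimming can be sketched as below for the cases in a single grade. The cutoff formula W_o = sqrt(c × Σw² / n) and the constant c = 10 are assumptions based on the general form of the NAEP-style procedure; the plan itself says only that W_o is calculated from the sum of the squared weights. The example weights are hypothetical.

```python
import math

def trim_weights(weights, c=10.0, max_iter=50, tol=1e-9):
    """Iteratively trim extreme weights and redistribute the excess.

    weights : weights for the cases in one grade (within one stratum)
    c       : multiplier in the assumed cutoff sqrt(c * sum(w^2) / n)
    Trimmed weight is spread proportionally over cases never marked.
    """
    w = list(weights)
    n = len(w)
    marked = [False] * n
    for _ in range(max_iter):
        cutoff = math.sqrt(c * sum(x * x for x in w) / n)
        excess = 0.0
        for i, x in enumerate(w):
            if x > cutoff:
                excess += x - cutoff
                w[i] = cutoff
                marked[i] = True
        if excess <= tol:
            break  # little or no weight is being trimmed
        unmarked_total = sum(x for x, m in zip(w, marked) if not m)
        if unmarked_total == 0:
            break  # nothing left to absorb the trimmed weight
        scale = 1.0 + excess / unmarked_total
        w = [x if m else x * scale for x, m in zip(w, marked)]
    return w

# Hypothetical grade cell: 99 ordinary weights and one extreme weight.
original = [1.0] * 99 + [100.0]
trimmed = trim_weights(original)
print(round(sum(trimmed), 6), round(max(trimmed), 3))
```

In this example the total weight is preserved while the single extreme weight is pulled down sharply, which is the intended trade of a small bias for a reduction in variance.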

c. Post-stratification to National Estimates of Racial Percentages and Student Enrollment by Grade

National estimates of racial/ethnic percentages were obtained from two sources. Private school enrollments by grade and five racial/ethnic groups were obtained from the Private School Universe Survey (PSS), and public school enrollments by grade, gender, and five racial/ethnic categories were obtained from the Common Core of Data (CCD), both produced by the National Center for Education Statistics (NCES). These databases were combined to produce the enrollments for all schools, and to develop population percentages to use as controls in the post-stratification step. For post-stratification purposes, a unique race/ethnicity is assigned to respondents with missing data on race/ethnicity, those with an “Other” classification, and those reporting multiple races.

Given a national estimate of R_a and a weighted population estimate of P_a for race category a in some grade, the simple poststratification factor would be the ratio of R_a to P_a for each race and grade.
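
The post-stratification factor can be sketched as a ratio computed cell by cell. The cells and all control and sample totals below are hypothetical.

```python
def poststratification_factors(national, weighted_sample):
    """Return R_a / P_a for each (grade, race) cell.

    national        : dict of control totals R_a by cell
    weighted_sample : dict of weighted sample estimates P_a by cell
    """
    return {cell: national[cell] / weighted_sample[cell] for cell in national}

# Hypothetical control totals and weighted sample estimates.
controls = {(9, "black"): 600000, (9, "Hispanic"): 700000}
estimates = {(9, "black"): 500000, (9, "Hispanic"): 800000}
factors = poststratification_factors(controls, estimates)
print(factors)
```

Each respondent's weight is then multiplied by the factor for that respondent's cell, so that weighted totals match the national controls.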

M.3.2 Estimators and Variance Estimators

If w_i is the weight of case i (the inverse of the probability of selection adjusted for nonresponse and poststratification adjustments) and x_i is a characteristic of case i (e.g., x_i=1 if student i smokes, but is zero otherwise), then the mean of characteristic x will be (Σ w_ix_i)/(Σ w_i). A population total would be computed similarly as (Σ w_ix_i). Weighted population estimates will be computed with the Statistical Analysis System (SAS) and SUDAAN software.
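
The two estimators can be written directly from their definitions. This is only an illustration of the formulas; production estimates will come from SAS and SUDAAN as stated above, and the four cases shown are hypothetical.

```python
def weighted_total(weights, values):
    """Population total: sum of w_i * x_i."""
    return sum(w * x for w, x in zip(weights, values))

def weighted_mean(weights, values):
    """Weighted mean: (sum of w_i * x_i) / (sum of w_i)."""
    return weighted_total(weights, values) / sum(weights)

# Hypothetical cases: x_i = 1 if the student reports the behavior, else 0.
w = [100, 200, 300, 400]
x = [1, 0, 1, 0]
print(weighted_total(w, x), weighted_mean(w, x))
```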

These estimates will be accompanied by measures of sampling variability, or sampling error, such as variances and standard errors, that account for the complex sampling design. These measures will support the construction of confidence intervals and other statistical inference such as statistical testing (e.g., subgroup comparisons or trends over successive YRBS cycles).

Sampling variances will be estimated using the method of general linearized estimators⁴ as implemented in the SUDAAN⁵ or SAS survey procedures. These software packages must be used since they permit estimation of sampling variances for multistage stratified sampling designs and account for unequal weighting and for sample clustering and stratification.
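
To show the idea behind the linearized estimator, the following is a minimal sketch of the Taylor-series variance of a weighted mean under a stratified design with PSUs as clusters. It uses the common with-replacement approximation at the first stage and is not a substitute for the SUDAAN or SAS procedures; the record layout and data are hypothetical.

```python
from collections import defaultdict

def linearized_variance_of_mean(records):
    """Taylor-series variance of a weighted mean.

    records : iterable of (stratum, psu, weight, x) tuples
    Returns (mean, variance). Strata with a single PSU are skipped here,
    which is a simplification of this sketch.
    """
    records = list(records)
    sum_w = sum(w for _, _, w, _ in records)
    mean = sum(w * x for _, _, w, x in records) / sum_w
    # Sum the linearized scores w_i * (x_i - mean) / sum_w within each PSU.
    psu_scores = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, x in records:
        psu_scores[stratum][psu] += w * (x - mean) / sum_w
    variance = 0.0
    for psus in psu_scores.values():
        n_h = len(psus)
        if n_h < 2:
            continue
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, variance

# Hypothetical data: two strata, two PSUs each, two students per PSU.
data = [
    ("A", 1, 1, 1), ("A", 1, 1, 0), ("A", 2, 1, 1), ("A", 2, 1, 1),
    ("B", 1, 1, 0), ("B", 1, 1, 0), ("B", 2, 1, 1), ("B", 2, 1, 0),
]
mean, var = linearized_variance_of_mean(data)
print(mean, var, var ** 0.5)
```

Because the variance is built from differences between PSU-level totals within strata, it reflects clustering, stratification, and unequal weights together, which a simple random sampling formula would not.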

1. Errecart, M.T. Issues in Sampling African-Americans and Hispanics in School-Based Surveys. Centers for Disease Control, October 5, 1990.

2. Potter F. "Survey of Procedures to Control Extreme Sampling Weights," in Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 453-458, 1988.

3. Potter F. "A Study of Procedures to Identify and Trim Extreme Sampling Weights," in Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 225-230, 1990.

4. Skinner CJ, Holt D, and Smith TMF. Analysis of Complex Surveys. John Wiley & Sons, New York, 1989, p. 50.

5. Shah BV, Barnwell GG, Bieler GS. SUDAAN: software for the statistical analysis of correlated data, release 7.5, 1997 [user’s manual]. Research Triangle Park, NC: Research Triangle Institute; 1997.

File Type	application/msword
File Title	APPENDIX F
Author	kflint
Last Modified By	arp5
File Modified	2008-04-09
File Created	2008-01-16