2021 NYTS Sampling Plan
This document provides a draft sampling plan for the 2021 NYTS. Sampling procedures follow closely those developed by the ICF statistical team and adopted in repeated cycles over the past decade.
The NYTS employs a stratified, three-stage cluster sample design to produce a nationally representative sample of middle school and high school students in the United States. Sampling is probabilistic at every stage and entails selection of 1) Primary Sampling Units (PSUs) (defined as a county, a group of small counties, or part of a very large county) within each stratum; 2) Secondary Sampling Units (SSUs) (defined as schools or linked schools) within each selected PSU; and 3) students within each selected school.1 Participating students complete the anonymous, voluntary survey using a self-administered questionnaire.
The objective of the NYTS sampling design is to support estimation of tobacco-related knowledge, attitudes, and behaviors in a national population of public and private school students enrolled in grades 6 through 12 in the United States. More specifically, the study is designed to produce national estimates at a 95% confidence level by school level (middle and high school), by grade (6, 7, 8, 9, 10, 11, and 12), by sex (male and female), and by race-ethnicity (non-Hispanic White, non-Hispanic Black, and Hispanic). Additional estimates also are supported for subgroups defined by grade, by sex, and by race-ethnicity, each within school-level domains; however, precision levels vary considerably according to differences in subpopulation sizes. The NYTS employs a repeat cross-sectional design.
The universe for the study consists of all public and private school students enrolled in regular middle schools and high schools in grades 6 through 12 in the 50 U.S. states and the District of Columbia. Alternative schools, special education schools, Department of Defense–operated schools, Bureau of Indian Affairs schools, vocational schools that serve only pull-out populations, and students enrolled in regular schools who are unable to complete the questionnaire without special assistance are excluded.
The 2021 NYTS will be a continuation of the NYTS cycles that have taken place since 1999, employing the general sampling design framework used in the previous cycles. The number of participating students, in excess of 24,000 as required, is larger than the numbers in recent cycles to generate approximately equivalent effective sample sizes and precision levels overall.
The frame will be constructed from separate sources obtained from the National Center for Education Statistics (NCES) and from a commercial vendor, Market Data Retrieval, Inc. (MDR). The NCES files will be the Common Core of Data (CCD) for public schools and Private School Survey (PSS) for private schools.
The reason for moving to a frame built from multiple data sources is to increase the coverage of schools nationally. This dual-source frame build method was implemented for the 2014 NYTS survey for the first time. Including schools sourced from the two NCES files resulted in a coverage increase among all public and non-public schools of 11.3%.
A cut-off in school size was also added in the 2014 survey to ensure anonymity and the presence of all grades. Eligible schools need an enrollment of at least 40 students across the eligible grades. To improve coverage, we will investigate the impact of lowering the threshold to 35 students.
To illustrate the population sizes, Exhibit 1 presents the number of schools and students in the 2021 NYTS frame by school level.
Exhibit 1. Number of Schools and Students by School Level in the School Frame (NYTS 2021 Frame)
School Level | Schools | Students |
High Schools | 28,342 | 16,560,926 |
Middle Schools | 42,832 | 12,320,865 |
The three-stage cluster sample will be stratified by racial/ethnic composition and urban versus nonurban status at the first (primary) stage. PSUs will be classified as “urban” if they are in one of the 54 largest metropolitan statistical areas (MSAs) in the U.S. using 2018 American Community Survey (ACS) data from the U.S. Census Bureau. Otherwise, they will be classified as “nonurban.” Additionally, implicit stratification will be imposed by geography by sorting the PSU frame by state and by five-digit ZIP Code (within state). The implicit stratification will be extended to ensure regional stratification using the four U.S. Census regions (i.e., we will sort by region first, then by state and ZIP Code). Within each stratum, PSUs will be randomly sampled without replacement at the first stage.
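The effect of implicit stratification can be illustrated with a minimal sketch. It assumes, which the plan does not state explicitly, that PSUs within a stratum are drawn by systematic PPS selection from the sorted frame. The field names `region`, `state`, `zip`, and `mos` are hypothetical.

```python
import random

def systematic_pps(frame, n, rng):
    """Select n units by systematic PPS from a geographically sorted frame."""
    # Sorting by region, then state, then ZIP Code spreads the systematic
    # draw across geography (implicit stratification).
    ordered = sorted(frame, key=lambda u: (u["region"], u["state"], u["zip"]))
    total = sum(u["mos"] for u in ordered)
    interval = total / n
    # Assumes certainty units were split beforehand, so no unit spans an interval.
    if any(u["mos"] >= interval for u in ordered):
        raise ValueError("certainty unit present; split it before sampling")
    start = rng.random() * interval
    targets = [start + k * interval for k in range(n)]
    selected, cumulative, t = [], 0, 0
    for unit in ordered:
        cumulative += unit["mos"]
        # A unit is selected when a target point falls inside its size range.
        while t < n and targets[t] < cumulative:
            selected.append(unit)
            t += 1
    return selected
```

Because the selection points are evenly spaced along the sorted list, neighboring PSUs (same region, state, and nearby ZIP Codes) are unlikely to be selected together, which is the purpose of the sort.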
In subsequent sampling stages, a probabilistic selection of schools and students will be made from the sample PSUs. The NYTS is designed to balance the yields across grades; therefore, the PSU subsampling is simplified to vary across school sizes but not between school-level categories.
The sampling stages may be summarized as follows, with additional details provided below:
Selection of PSUs: One hundred PSUs will be selected from 16 strata, with probability proportional to the total number of eligible students enrolled in all eligible schools located within a PSU.
Selection of Schools: At the second sampling stage, a total of 240 large schools, or SSUs, will be selected from the sample PSUs. Two large schools will be selected per sample PSU, one per level (middle or high). An additional large school for each level will be selected in a subsample of 20 PSUs. An additional 50 medium SSUs and 30 small SSUs will be selected from subsample PSUs, for a total of 320 sample SSUs (320 = 240 + 50 + 30). The PSU subsample will be drawn as a simple random sample, and the schools drawn with probability proportional to the total number of eligible students enrolled in a school.
Selection of Students: Students will be selected via whole classes, whereby all students enrolled in any one selected class will be by default chosen for participation. Classes will be selected from course schedules provided by each school that agreed to participate. Schedules will be constructed such that all eligible students are represented one time only.
Schools will be stratified into large, medium, and small schools based on their ability to support two, one, or fewer than one class selection per grade. Double class sampling will take place in a subset of half of the large schools.
The sampling approach uses probability proportional to size (PPS) sampling methods. In PPS sampling, when the measure of size (MOS) is defined as the count of final-stage sampling units, and a fixed number of units is selected in the final stage, the result is an equal probability of selection for all members of the universe (an “epsem” design). For the NYTS, we approximate these conditions and thus obtain a roughly self-weighting sample. Self-weighting samples, and “epsem” designs, are more efficient statistically in the sense of minimizing variances.
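The epsem property can be checked with a short worked example. This is a simplified two-stage sketch rather than the three-stage NYTS design, and the cluster sizes and sample sizes are hypothetical.

```python
from fractions import Fraction

def overall_probability(n_clusters, cluster_size, total_size, take_per_cluster):
    """Overall selection probability of one student under PPS plus a fixed take."""
    # Stage 1: the cluster is selected with probability proportional to its size.
    first_stage = Fraction(n_clusters * cluster_size, total_size)
    # Final stage: a fixed number of students is taken from the selected cluster.
    final_stage = Fraction(take_per_cluster, cluster_size)
    # The cluster size cancels, leaving n_clusters * take_per_cluster / total_size.
    return first_stage * final_stage

# Hypothetical cluster sizes (the MOS is the student count).
sizes = [200, 500, 1300]
probabilities = {overall_probability(1, s, sum(sizes), 25) for s in sizes}
```

Every student has the same overall probability regardless of cluster size, because the larger first-stage probability of a big cluster is offset by the smaller within-cluster sampling fraction.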
The MOS also is used to compute stratum sizes and PSU sizes. Because each PSU is assigned an aggregate measure of size, the PSU sample is allocated in proportion to the student population. Exhibit 2 presents a high-level summary of the key sampling design features that will be described in detail in the next sections.
Exhibit 2. Key Sampling Design Features
Stage | Sampling Units | Stratification | Measure of Size | Designed Sample Size | Projected Sample Size |
1 | Counties, portions of a county, or groups of counties | Urban versus Nonurban (two strata); Minority concentration (eight strata) | Aggregate school size in target grades | 100 Counties, portions of a county, or groups of counties | 100 Counties, portions of a county, or groups of counties |
2 | Schools | Small, medium and large; High school versus middle school | Eligible enrollment | 320 SSU (school) selections*: 240 large schools, 50 medium schools, and 30 small schools | 320 SSUs |
3 | Classes/students | | | 1 or 2 classes per grade (two per grade in large, high-minority schools); 37,527 students2 sampled | 24,000 student participants |
*Denotes virtual schools containing all grades of interest at a given school level. Note that the actual number of physical schools will be closer to 345-375 after the disaggregation of SSUs into physical buildings.
To facilitate accurate prevalence estimates among racial/ethnic minority groups, prior cycles of the NYTS have employed multiple strategies to increase the number of non-Hispanic Black and Hispanic students included in the sample. The sampling design always seeks to balance increasing yields for minority students with overall precision, as oversampling leads to larger variances for overall estimates. These approaches have included over-sampling PSUs in strata with a high proportion of racial/ethnic minority students, the use of a weighted MOS, and double class selection in large schools that contained a sufficient proportion of minority students.
The only oversampling that remains in the more efficient design of the last couple of cycles of the NYTS is double class sampling. Double class sampling is focused on a subset of large schools.
The new design has been shown to reduce design effects for survey estimates. The design effect is defined as the variance of an estimate under the actual design divided by the variance under a simple random sample of the same size; it is a common and useful measure of the precision of survey estimates. The ICF team has developed a simulation program that calibrates the coefficients of the MOS to ensure the required yields while balancing the precision for minority group estimates and for overall estimates. While the allocation to strata will continue to be proportional, we will continue to implement double class selection in a subset of large schools.
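As a small illustration of the design effect, the sketch below computes it from two variances and converts a nominal sample size into an effective sample size (the nominal size divided by the design effect). The variance values are hypothetical and are not NYTS results.

```python
def design_effect(var_design, var_srs):
    """Variance under the actual design divided by the variance under SRS."""
    return var_design / var_srs

def effective_sample_size(n, deff):
    """Size of a simple random sample that would give the same precision."""
    return n / deff

# Hypothetical variances for a single prevalence estimate.
deff = design_effect(var_design=0.0004, var_srs=0.0002)
n_eff = effective_sample_size(24_000, deff)
```

A lower design effect means that the same number of participating students yields a larger effective sample size.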
ICF has historically conducted simulation studies to investigate the impact of various weighting functions on the numbers and percentages of racial/ethnic minority students. These simulations have been updated with each cycle of the NYTS to ensure that the minimum amount of oversampling is used while still achieving adequate representation of non-Hispanic Black and Hispanic students.
This section describes the steps necessary for the selection of the primary and secondary sampling units (PSUs and schools): organizing PSUs; linking schools into SSUs; and implementing the stratification and allocation methods at each of these stages.
Defining a PSU
In general, PSUs are geographic areas defined as counties or groupings of counties. In defining a PSU, several issues are considered:
Each PSU should be large enough to contain the requisite numbers of schools and students by grade, yet not so large as to be selected with near certainty.
Each PSU should be compact geographically so that field staff can go from school to school easily.
Recent data should be available to characterize each PSU.
Each PSU should contain at least four middle and five high schools.
Generally, counties are equivalent to PSUs with two exceptions:
Low population counties are combined to provide sufficient numbers of schools and of students; and
Counties that are very large may be split to avoid becoming certainty or near-certainty PSUs.
Certainty PSUs are those whose size is large enough to ensure selection with probability of one (1.0) with a PPS sampling design that selects larger PSUs with larger probabilities. As certainty PSUs lead to inefficiencies in the design, they are split so that the new smaller units are selected with a probability smaller than one. Near-certainty units also are split to build in a safety buffer in the PSU sizes. County population figures are aggregated from school enrollment data for the grades of interest.
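A minimal sketch of how certainty and near-certainty PSUs might be flagged before splitting. The `buffer` threshold and the toy sizes are hypothetical; the plan does not state the cutoff it uses.

```python
def flag_certainty_units(sizes, n_sample, buffer=0.8):
    """Return units whose PPS selection probability reaches the buffer threshold."""
    total = sum(sizes.values())
    flagged = {}
    for name, mos in sizes.items():
        # Under PPS, the uncapped selection probability is n * MOS / total MOS.
        # A value of 1 or more marks a certainty unit.
        probability = n_sample * mos / total
        if probability >= buffer:
            flagged[name] = probability
    return flagged

# Hypothetical PSU sizes (eligible enrollment).
toy_sizes = {"A": 600, "B": 300, "C": 100}
to_split = flag_certainty_units(toy_sizes, n_sample=2)
```

Flagged units would then be split into smaller PSUs and the check repeated until no unit reaches the threshold.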
Stratification of PSUs
The PSUs will be organized into 16 strata, on the basis of urban/rural location (as defined above) and racial/ethnic minority enrollment of non-Hispanic Blacks and Hispanics. In the traditional stratification used by the NYTS, the classification of PSUs into the two racial/ethnic minority strata, non-Hispanic Black and Hispanic, is based on the predominant minority in the PSU. This classification is coupled with the density distribution of non-Hispanic Blacks and Hispanics to subdivide each of the four primary strata into four substrata, indexed by one through four according to this density. The approach for computing stratum boundaries follows the cumulative square root of “f” method developed by Dalenius and Hodges.3 The boundaries or cutoffs change as the frequency distribution (“f”) for the racial groupings changes from one survey cycle to the next. These rules are summarized below.
If the PSU is within one of the 54 largest MSAs in the U.S., it is classified as “urban”; otherwise it is classified as nonurban.
If the percentage of Hispanic students in the PSU exceeds the percentage of non-Hispanic Black students, then the PSU is classified as Hispanic. Otherwise it is classified as non-Hispanic Black.
Hispanic urban and Hispanic nonurban PSUs are classified into four density groupings, depending upon the percentages of Hispanics in the PSU.
Non-Hispanic Black urban and non-Hispanic Black nonurban PSUs also are classified into four groupings, depending upon the percentages of non-Hispanic Blacks in the PSU.
We will develop the cutoffs used in defining the substrata by concentrations of Black and Hispanic students in each of the four primary strata using the most recent frame data.
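The cumulative square root of “f” rule can be sketched as follows. This is a generic illustration of the rule, not the ICF implementation: the equal-width binning and the `n_bins` parameter are assumptions.

```python
import math

def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Stratum boundaries from the cumulative square root of frequency rule."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_bins
    # Frequency distribution f over equal-width bins.
    freq = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        freq[idx] += 1
    # Running total of sqrt(f) across bins.
    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)
    # Place a boundary at the upper edge of the first bin where the running
    # total reaches each equal step. Coarse bins can repeat a boundary;
    # increase n_bins if that happens.
    step = running / n_strata
    boundaries = []
    for k in range(1, n_strata):
        i = next(j for j, c in enumerate(cum) if c >= k * step)
        boundaries.append(lo + (i + 1) * width)
    return boundaries
```

Applied to the percentage of Hispanic (or non-Hispanic Black) students across the PSUs of one primary stratum with `n_strata=4`, the function returns three cutoffs that define four density substrata.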
Allocation of the PSU Sample
We will select a sample of 100 PSUs allocated in proportion to student enrollment to maximize overall precision. The initial proportional allocation will be adjusted to ensure that racial/ethnic minority targets will be met. The adjustments will ensure that each stratum has at least two sampled PSUs and add balance to the distribution across strata. Exhibit 3 displays the allocation planned for the 2021 sample.
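A minimal sketch of proportional allocation with a minimum of two PSUs per stratum. The adjustment rule shown here is an assumption for illustration; the plan applies further adjustments for racial/ethnic minority targets, so this sketch is not expected to reproduce Exhibit 3. The stratum sizes are hypothetical.

```python
def allocate_proportional(sizes, n_total, minimum=2):
    """Allocate n_total PSUs to strata in proportion to size, with a floor."""
    if n_total < minimum * len(sizes):
        raise ValueError("n_total is too small to honor the minimum")
    total = sum(sizes.values())
    raw = {s: n_total * size / total for s, size in sizes.items()}
    alloc = {s: max(minimum, int(r)) for s, r in raw.items()}
    # Add PSUs where the allocation falls furthest below the proportional share.
    while sum(alloc.values()) < n_total:
        s = max(alloc, key=lambda k: raw[k] - alloc[k])
        alloc[s] += 1
    # Remove PSUs where the allocation most exceeds the share, never going
    # below the minimum.
    while sum(alloc.values()) > n_total:
        candidates = [k for k in alloc if alloc[k] > minimum]
        s = min(candidates, key=lambda k: raw[k] - alloc[k])
        alloc[s] -= 1
    return alloc

# Hypothetical stratum enrollments.
toy_strata = {"A": 5000, "B": 3000, "C": 1500, "D": 500}
toy_allocation = allocate_proportional(toy_strata, n_total=20)
```

In this toy example the smallest stratum would receive one PSU under strict proportionality, so it is raised to two and the extra PSU is taken from a larger stratum.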
Exhibit 3: PSU Allocation
Stratum | # of Schools | # of Students | # of PSUs Allocated |
BR1 | 3,121 | 1,511,813 | 9 |
BR2 | 1,281 | 768,040 | 5 |
BR3 | 1,111 | 545,667 | 4 |
BR4 | 608 | 273,748 | 2 |
BU1 | 1,815 | 1,270,183 | 8 |
BU2 | 1,374 | 890,576 | 5 |
BU3 | 415 | 270,683 | 2 |
BU4 | 514 | 290,375 | 2 |
HR1 | 7,128 | 2,971,743 | 16 |
HR2 | 1,641 | 821,571 | 5 |
HR3 | 1,081 | 577,091 | 4 |
HR4 | 575 | 378,629 | 3 |
HU1 | 2,577 | 1,946,984 | 11 |
HU2 | 1,920 | 1,475,655 | 9 |
HU3 | 1,785 | 1,413,611 | 8 |
HU4 | 1,386 | 1,114,118 | 7 |
Linking into Secondary Sampling Units
Schools will be classified as “whole” for high schools if they have all high school grades (9th through 12th), and as “whole” for middle schools if they have all grades six through eight. Otherwise, they will be considered “fragment” schools. Fragment schools will be linked with other schools (fragment or whole) to form a linked school that has all grades present for a given level. We will link schools before sampling using an algorithm that links geographically proximate schools. Linked schools are treated as SSUs with selection performed at the grade level, as described below.
Stratification
SSUs will be stratified by school level (middle and high) and by size. Middle schools are those that contain any of grades 6-8 and high schools are those that contain any of grades 9-12. Schools that contain a mix of high and middle school grades will be split into two sampling units, one for each level.
SSUs also will be stratified by school size into small, medium, and large strata on the basis of their ability to support fewer than one, one, or two class selections per grade. Operationally, large SSUs contain at least 56 students at each grade level, medium SSUs contain between 28 and 55 students per grade, and small SSUs contain fewer than 28 students at any grade level.
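The size rule can be expressed as a short function. This sketch assumes the classification is driven by the smallest grade-level enrollment in the SSU, which is one reading of the thresholds above; the function name is hypothetical.

```python
def classify_ssu(grade_enrollments):
    """Classify an SSU as large, medium, or small from its per-grade enrollments."""
    smallest = min(grade_enrollments)
    if smallest >= 56:
        return "large"   # can support two class selections per grade
    if smallest >= 28:
        return "medium"  # can support one class selection per grade
    return "small"       # fewer than 28 students in at least one grade
```

For example, a middle school SSU with 60, 58, and 30 students in grades 6 through 8 would be classified as medium because grade 8 cannot support two class selections.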
This section provides the derivation of the NYTS sample sizes driven by target precision requirements overall and in key subgroups. The required student yields, or numbers of participating students, are translated into the necessary numbers of sample schools, and sample PSUs, using historical participation rates.
The NYTS is designed to produce accurate estimation within a margin of error (MOE) of 5% at a 95% confidence level for the following key subgroup estimates:
Middle and high school (school level): middle school students in total (grades 6–8 combined) and high school students in total (grades 9–12 combined);
Grade: individual grades 6, 7, 8, 9, 10, 11, and 12;
Sex: males and females in total, by school level (e.g., male middle school students, female high school students), and by individual grade (e.g., sixth-grade males, sixth-grade females);
Race-Ethnicity: in total and by school level (e.g., Hispanic middle school students).
The sample sizes are developed to support analysis by individual grade and by sex without any special considerations in the sampling plan. Design effects will be relatively small for subgroups that cut across schools; therefore, estimates by sex have better precision than other subgroups, with confidence intervals within ± 3%. Because the design is expected to yield a greater number of completed surveys from high school students than from middle school students, overall estimates are anticipated to be more precise at the high school level than those at the middle school level.
The 2021 NYTS sampling design will aim to balance student yields by grade, with target sample sizes of approximately 3,428 participating students per grade. These targets also ensure the precision of estimates by individual grade (e.g., sex-by-grade subgroup estimates are based on about 1,700 students).
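To relate these sample sizes to the 5% margin of error target, the sketch below applies the usual large-sample formula for a proportion, inflated by a design effect. The design effect of 2.0 is a hypothetical value chosen for illustration, not a figure from this plan.

```python
import math

def margin_of_error(p, n, deff=1.0, z=1.96):
    """Half-width of a 95% confidence interval for a proportion (z = 1.96)."""
    return z * math.sqrt(deff * p * (1 - p) / n)

# Worst case p = 0.5 with a hypothetical design effect of 2.0.
moe_grade = margin_of_error(0.5, 3428, deff=2.0)          # one grade
moe_sex_by_grade = margin_of_error(0.5, 1700, deff=2.0)   # sex by grade
```

Under these assumptions both margins fall below 5%; larger design effects or smaller subgroups would widen them.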
Across the 14 previous cycles of the NYTS, school participation has averaged 83.3%, and student participation has averaged 89.9%. Overall response rates have averaged 75.5%. Historical participation rates at both school and student levels, which guide the sampling design and sample sizes, are summarized in Exhibit 4.
Exhibit 4. Historical Summary of NYTS Participation Rates
Year | School Participation | Student Participation | Overall |
1999 | 90.30% | 93.20% | 84.20% |
2000 | 90.00% | 93.40% | 84.10% |
2002 | 83.10% | 90.08% | 74.85% |
2004 | 92.70% | 87.90% | 81.50% |
2006 | 91.60% | 87.60% | 80.20% |
2009 | 92.30% | 91.90% | 84.80% |
2011 | 83.20% | 88.00% | 73.20% |
2012 | 80.30% | 91.70% | 73.60% |
2013 | 75.40% | 90.70% | 68.40% |
2014 | 80.20% | 91.40% | 73.30% |
2015 | 72.50% | 87.40% | 63.40% |
2016 | 81.00% | 88.00% | 71.00% |
2017 | 76.76% | 88.73% | 68.10% |
2018 | 76.77% | 88.82% | 68.19% |
2019 | 77.23% | 85.85% | 66.30% |
Average over all previous cycles | 83.30% | 89.92% | 75.43% |
In calculating the sample sizes for the 2021 NYTS, we made our approach more robust by assuming a conservative combined rate (student x school) of 63.75%, substantially lower than the historical overall response rate. These numbers are closer to the more recent experience at both levels. The main reason, discussed below, is that the student participation rate needs to be adjusted to account for a growing number of ineligible students. This number needs to be subtracted from the net number of students available for selection in the participating schools.
Schools will be classified by size on the basis of grade-level enrollments. This ensures that a sampled school of a given size classification is able to support the student sample sizes summarized in Exhibit 5 below.
For this sampling plan, the NYTS sample size calculations were based on the following assumptions:
The main structure of the sampling design is consistent with the design used to draw the sample for prior cycles of the NYTS.
The selection of a minimum of one SSU at the high school level and one SSU at the middle school level within each PSU. In addition, we will select 20 additional large schools per level (one middle school and one high school in each PSU of a 20-PSU subsample), as described below.
SSUs with at least 56 students per grade are considered large, and those among the others with at least 28 students per grade are considered medium; otherwise, they are considered small.
On average, each selected class includes 25 students (on the basis of historical averages) pre-attrition.
For half (50%) of the SSUs classified as large, we will sample double the number of students by sampling eight classes instead of four in high schools, and six classes instead of three in middle schools.
A 63.75% overall response rate (based on historical averages) calculated as the product of the school response rate (75%) and student response rate (85%).
Note that the double sampling will occur for half of the large schools. Note also that the assumed student response rate is lower than in previous cycles to reflect a growing number of ineligible students that also need to be subtracted off from the net numbers. On the other hand, the school response rate will remain high especially as we have retained a relatively large number of large sample schools which tend to participate at higher rates than smaller (often non-public) schools.
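The response rate assumption translates into a required number of sampled students through simple arithmetic, sketched here. The variable names are hypothetical; the rates and the target are those stated in the assumptions.

```python
import math

SCHOOL_RATE = 0.75            # assumed school response rate
STUDENT_RATE = 0.85           # assumed student response rate
TARGET_PARTICIPANTS = 24_000  # required number of participating students

# The combined rate is the product of the two stage-level rates.
combined_rate = SCHOOL_RATE * STUDENT_RATE

# Number of students to sample so that the expected yield meets the target.
required_sampled = math.ceil(TARGET_PARTICIPANTS / combined_rate)
```

Any designed sample at or above this size is expected to meet the target of 24,000 participants if the assumed rates hold.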
Based on these assumptions, 100 PSUs will be selected at the first stage. Subsamples of these sample PSUs will be selected to supply additional SSUs as described next: a) small (30 SSUs from 15 subsample PSUs), b) medium (50 SSUs from 25 subsample PSUs), and c) large (40 SSUs in 20 subsample PSUs). The PSU subsamples will be drawn as simple random samples, and the schools drawn with probability proportional to the total number of eligible students enrolled in a school. The sampling and subsampling of PSUs are described in more detail next.
Within each of the 100 sample PSUs, two large schools will be drawn, one at the middle school level to supply students in grades 6 through 8, and one at the high school level to supply students in grades 9 through 12. An additional large school for each level will be selected in a subsample of 20 PSUs.
An additional 50 medium SSUs and 30 small SSUs will be selected from subsample PSUs, for a total of 320 sample SSUs (320 = 240 + 50 + 30).
Specifically, 25 PSUs will be independently subsampled to supply medium SSUs (two per subsample PSU, one at each level), and 15 PSUs will be subsampled to supply small SSUs (two per subsample PSU, one at each level).
Exhibit 5 provides a detailed calculation of designed sample sizes across school level and school size categories. A larger school sample is selected from a larger number of PSUs to limit clustering effects. These sample sizes are larger than recent cycles of the NYTS to accommodate the larger student sample size desired by CDC. These sample sizes are designed to support estimates for smaller subgroups formed by cross classifications of race/ethnicity by grade or sex.
Exhibit 5. Summary of Expected Sample Sizes for the 2021 NYTS
PSU | Size | # of SSUs | Number of Schools Sampled | Number of Classes per School | Number of Students per Class | Number of Sampled Students Prior to Attrition | Combined School and Student 63.75% Response Rate |
100 | Large High School | 120 | Double classes: 60 | 8 | 25 | 12,000 | 7,650 |
 | | | Single classes: 60 | 4 | 25 | 6,000 | 3,825 |
 | Large Middle School | 120 | Double classes: 60 | 6 | 25 | 9,000 | 5,738 |
 | | | Single classes: 60 | 3 | 25 | 4,500 | 2,869 |
 | Large Total | 240 | | | | 31,500 | 20,081 |
25 (subsample) | Medium High School | 25 | | 4 | 25 | 2,500 | 1,594 |
 | Medium Middle School | 25 | | 3 | 25 | 1,875 | 1,195 |
 | Medium Total | 50 | | | | 4,375 | 2,789 |
15 (subsample) | Small High School | 15 | | 4 | 25 | 1,500 | 956 |
 | Small Middle School | 15 | | 3 | 25 | 1,125 | 717 |
 | Small Total | 30 | | | | 2,625 | 1,673 |
 | Overall Total | 320 | | | | 38,500 | 24,544 |
* In this exhibit, the schools are SSUs or “virtual schools” created by combining actual, physical schools so that each virtual school unit has a complete set of grades for the level. The virtual schools are expanded to physical schools. The number of physical schools in the sample is expected to range from 345 to 375.
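The arithmetic behind Exhibit 5 can be reproduced with a short script. This is a sketch for checking the totals; the row labels and variable names are hypothetical, while the counts, classes per school, class size, and response rate are taken from the exhibit.

```python
from fractions import Fraction

RESPONSE_RATE = Fraction(6375, 10000)  # combined school and student rate
STUDENTS_PER_CLASS = 25

# (label, number of SSUs, classes per SSU), following Exhibit 5.
ROWS = [
    ("Large high, double classes", 60, 8),
    ("Large high, single classes", 60, 4),
    ("Large middle, double classes", 60, 6),
    ("Large middle, single classes", 60, 3),
    ("Medium high", 25, 4),
    ("Medium middle", 25, 3),
    ("Small high", 15, 4),
    ("Small middle", 15, 3),
]

def sampled_students(n_ssus, classes_per_ssu):
    """Students sampled prior to attrition for one row of the exhibit."""
    return n_ssus * classes_per_ssu * STUDENTS_PER_CLASS

total_ssus = sum(n for _, n, _ in ROWS)
total_sampled = sum(sampled_students(n, c) for _, n, c in ROWS)
# The response rate is applied to the total and then rounded.
expected_participants = round(total_sampled * RESPONSE_RATE)
```

The script yields 320 SSUs, 38,500 sampled students, and 24,544 expected participants, matching the Overall Total row.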
1 Sampling is conducted without replacement at all stages in the sense that sampling units are not thrown back into the pool after being selected. (Note that the sampling design concept of “replacement” is not related to substitution of ineligible schools, a step that is undertaken prior to the data collection process.)
2 The total sample size of 37,527 was derived in our simulations; it is also approximately the ratio of the target number of participants, 24,000, divided by the anticipated response rate overall, 63.75%.
3 Dalenius, T., & Hodges, J. L. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.
File Type | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
Author | Iachan, Ronaldo |
File Created | 2022-04-07 |