Imputation Procedures for the FY 2018
Higher Education Research and Development Survey
July 2019
prepared for
National Science Foundation
National Center for Science and Engineering Statistics
by
ICF • 530 GAITHER ROAD • ROCKVILLE, MD 20850
Contents
Introduction
Overview
General Procedures
  Determining Imputation Factors
  Imputing Key Variables
  Imputing Non-Key Variables
Procedures by Survey Question
  Expenditures by Source of Funds (Question 1)
  Federal Expenditures by Field of R&D and Agency (HERD Question 9 and Short Form Question 2, Column 1)
  Nonfederal Expenditures by Field of R&D and Source of Funds (HERD Question 11 and Short Form Question 2, Column 2)
  Equipment Expenditures by Field of R&D (Question 14)
  Funds Received as a Subrecipient (HERD Question 7 and Short Form Question 3)
  Expenditures Passed Through to Other Institutions (HERD Question 8 and Short Form Question 4)
  Foreign Funding for R&D (Question 2)
  R&D Contracts and Grants (Question 3)
  R&D Expenditures at Medical School (Question 4)
  Clinical Trial Expenditures (Question 5)
  Type of R&D (Basic, Applied, or Experimental Development) (Question 6)
  Cost Elements of R&D (Question 12)
  Headcount for R&D Personnel (Question 15)
Retro-imputation
  Imputing Back to a Reported Year
  Retro-imputing When There Is No Previously Reported Year
Introduction
This document details the procedures used for the imputation of missing values for the National
Science Foundation (NSF) FY 2018 Higher Education Research and Development (HERD)
Survey.
Overview
In 2010, the HERD Survey replaced NSF’s survey of the R&D effort in the academic sector, the
Survey of Research and Development Expenditures at Universities and Colleges (Academic R&D
Expenditures Survey), which had been conducted annually since 1972. The FY 2018 survey was
the ninth collection cycle completed with the redesigned survey. Questions included in the HERD
Survey can be broadly divided into two groups: those similar in content to items in the Academic
R&D Expenditures Survey and those that are new to the survey and were not asked of institutions
prior to 2010.
Many of the data requested as part of the HERD Survey were identical to those requested by the
FY 2009 Academic R&D Expenditures Survey but were included in questions that were expanded
or restructured. The biggest change to most questions was the inclusion of non-science and
engineering (S&E) fields in all R&D categories; most items in the Academic R&D Expenditures
Survey asked for expenditures in only S&E fields. For example, Question 1 of the HERD Survey
was very similar to Item 1 of the Academic R&D Expenditures Survey. Both asked for R&D
expenditures by source of funds, but the FY 2010 survey asked for expenditures from S&E and
non-S&E fields. The Academic R&D Expenditures Survey included one item that asked about
non-S&E expenditures by field and source of funding (federal vs. nonfederal).
During the FY 2012 cycle, NSF introduced the HERD Short Form survey. This survey is sent to
institutions in the HERD Survey population that reported less than $1 million in R&D expenditures
in the previous fiscal year. The goal of the new instrument was to reduce the burden on smaller
R&D-performing institutions that frequently had little or no expenditures in some categories. All
variables in the Short Form HERD questionnaire are included in the standard HERD questionnaire.
When applicable, data from both surveys were used to inform the imputation of a particular
variable.
The FY 2018 survey was the seventh collection cycle completed with the inclusion of the HERD
Short Form survey. Each year, there are institutions that move from the Short Form population in
the previous year to the HERD standard form population in the current year. Procedures were
added to address the imputation of missing HERD Survey data for an institution that completed
the Short Form in the previous year. Variables that were included in both surveys were imputed
using the existing methodologies. Variables that were not included in the HERD Short Form
survey were imputed in one of two ways: using the most recent standard form survey data, whether
FY 2011 or FY 2016, or using peer institution data. Throughout this document, we highlight how
imputation procedures were altered to address missing FY 2018 Short Form data and missing FY
2018 standard form data when only the previous year’s Short Form data were available. When not
specified, it should be assumed that we are referring to the imputation of data for the standard
HERD Survey.
Prior to the start of the imputation process, the submitted data underwent a recoding process
designed to address issues of logical imputation. Within the HERD Survey, the amount that can
be reported for one question often is logically restricted by values reported for another question.
For example, if Question 1, row a (federal R&D expenditures) was reported as zero and other
questions asking for amounts that are a subset of federal R&D were left blank, the missing values
were recoded as 0. If there were no federal expenditures reported in Question 1, federal
expenditures were not imputed for any other part of the survey. During this recoding process, some
values were changed to accurately reflect partial data provided by the institution. For example,
respondents were asked to report total expenditures from federal sources in Questions 1, 6, and 9.
For Question 1, they reported the total amount. For Question 6, they reported the amounts of
federal expenditures for basic research, applied research, and experimental development, and the
sum of these three values had to equal the value reported in Question 1, row a. For Question 9,
respondents reported federally funded expenditures for R&D by agency and field of R&D. Again,
the grand total for this question had to equal Question 1, row a. If a respondent could not report
complete data for Question 6 or 9 (e.g., reported basic research expenditures but could not report
applied research or experimental development), the total for the question, which was calculated
automatically on the survey website, did not equal Question 1, row a. As part of the recoding effort,
the value for total federal R&D expenditures for the questions with partial data was replaced with
the correct value from Question 1. An additional logical imputation technique was implemented
before the imputation of expenditures by field on Questions 9, 11, and 14. This is described in the
procedures for Question 9.
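The zero-recoding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the field names (`q6_basic`, `q9_total`) are hypothetical and are not taken from the survey's production programs, and a blank response is represented here as `None`.

```python
from typing import Dict, Optional


def recode_federal_subsets(
    q1_federal: Optional[float],
    subset_values: Dict[str, Optional[float]],
) -> Dict[str, Optional[float]]:
    """Recode blank federal subset items to 0 when Question 1, row a is 0.

    q1_federal    -- reported federal R&D expenditures (Question 1, row a)
    subset_values -- hypothetical mapping of items that are subsets of
                     federal R&D; None marks a blank response
    """
    if q1_federal == 0:
        # No federal expenditures were reported, so every blank subset is 0.
        return {name: (0 if value is None else value)
                for name, value in subset_values.items()}
    # Otherwise blanks stay blank and are handled by later imputation steps.
    return dict(subset_values)


recoded = recode_federal_subsets(0, {"q6_basic": None, "q9_total": None})
untouched = recode_federal_subsets(500, {"q9_total": None})
```

Here `recoded` holds zeros for both items, while `untouched` keeps its blank because federal expenditures were reported as nonzero.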
Unless noted otherwise, the order of imputation described in this document depicts the order of
imputation programming. At the end of the imputation process, all imputed data cells are flagged
with an “i” in the database and in published tables.
General Procedures
Imputation techniques for variables can be broadly divided into two steps:
1. Using inflator/deflator factors to impute key variables based on the previous year’s data for
   each nonresponding institution. Key variables are values identified as having high correlations
   across years and high correlations with other, smaller values within the current-year survey
   responses.
2. Using the relative percentages that were last reported by that institution or by peer institutions
   in the current year as a reference for the distribution of the key variables across detail fields.
Imputed amounts were based on a mean value or mean proportion of a value within a group of
institutions with similar characteristics, referred to as an imputation class.
In some circumstances, there was an intervening step. For questions for which the previous year’s
data could not be used as a basis for imputation, logistic regression was used to identify values that
should be zero. There was a high prevalence of zero values for many variables in some questions
(i.e., Questions 2, 4, 5, 6, 12, 15, 16). For these types of variables, it was efficient to first determine
whether the variable should take on a zero value before attempting to impute a nonzero value.
Logistic models (SAS PROC LOGISTIC) were run for several variables. If the predicted value ($\hat{p}$)
was less than 0.5, the variable in question was imputed with 0. Specific information about
predictors and class variables is included in the descriptions below.
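The zero-determination step can be illustrated with a short Python sketch. It assumes that the model's predicted value is the probability that the variable is nonzero, which is consistent with imputing 0 when the prediction falls below 0.5 but is not stated explicitly in this document. The intercept, coefficient, and predictor values are made up for illustration; the actual models were fit in SAS.

```python
import math
from typing import Optional, Sequence


def predicted_probability(
    intercept: float,
    coefficients: Sequence[float],
    predictors: Sequence[float],
) -> float:
    """Logistic prediction: p = 1 / (1 + exp(-(b0 + sum(b_k * x_k))))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear))


def zero_step(p_hat: float, threshold: float = 0.5) -> Optional[int]:
    """Return 0 if the variable is imputed as zero, else None.

    None means the variable is passed on to the later step that
    imputes a nonzero amount.
    """
    return 0 if p_hat < threshold else None


# Hypothetical fitted model with a single predictor.
p_hat = predicted_probability(-2.0, [1.0], [1.0])
decision = zero_step(p_hat)
```

With these made-up numbers the predicted value is below 0.5, so `decision` is 0 and no nonzero amount would be imputed for that variable.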
Because much current-year imputation is based on an institution’s R&D expenditures from the
previous year, alternative procedures were adopted for institutions that did not have FY 2017 data.
Three short form institutions did not submit data for the FY 2018 survey and had no FY 2017
HERD Survey data. For these institutions, total R&D expenditures were set at the baseline for the
short form survey ($150,000).
Questions 1.1, 10, and 13 were not imputed.
Determining Imputation Factors
The imputation process involves first determining imputation factors for certain key variables.
Imputation factors are the ratio of current-year data to previous-year data for institutions that
responded in both years (i.e., matched, clean data). These factors, when applied to institutions in a
predefined group, reflect the average annual growth or decline in expenditures for reporting
institutions in that group.
Imputation factors were derived for different groups of institutions based on the highest degree
offered (HDO) and type of control (TOC). Factors were calculated separately for each key variable
for each combination of HDO (PhD or no PhD) and TOC (public or private). These combinations
are referred to as imputation classes.
All institutions in both the short form and standard form populations, including those that reported
less than $150,000 in total R&D, could contribute to the imputation factors. Table 1 shows the
number of institutions from the FY 2018 survey in each imputation class, including those that did
not have matched, clean data for total R&D expenditures and were not used to derive imputation
factors.
Table 1. Number of Institutions in the Population by Highest Degree Offered and Type of Control

              TOC
HDO        Public   Private
PhD           354       204
No PhD        185       206
The imputation classes were further divided based on quartiles of total R&D expenditures within
each class for some questions. This is noted in the description of each question.
Imputing Key Variables
Key variables are values identified as having high correlations across years and high correlations
with other, smaller values within the current-year survey responses. Specific key variables are
discussed, as applicable, for each survey question. All key variables were imputed for unit
nonresponders; only missing key variables were imputed for partially nonrespondent institutions.
The imputation technique used to calculate key variables is called ratio imputation and takes the
following mathematical form:
Equation 1a:

$$\hat{y}_{ik}^{t} = \hat{B}_{k}^{t} \, y_{ik}^{t-1}$$

where $\hat{y}_{ik}^{t}$ is the imputed value of key variable $y_k$ for institution $i$ for year $t$,
and $\hat{B}_{k}^{t}$ is the inflator/deflator factor for key variable $y_k$, defined as

Equation 1b:

$$\hat{B}_{k}^{t} = \frac{\sum_{j=1}^{r} y_{jk}^{t}}{\sum_{j=1}^{r} y_{jk}^{t-1}}$$

where $y_{jk}^{t-1}$ is the value of key variable $y_k$ for institution $j$ for year $t-1$, and $r$ is
the set of institutions in the same degree level and institutional control peer group as institution
$i$ that provided key variable $y_k$ in both years $t$ and $t-1$.
If a key variable was imputed in the previous year, the factor was applied to the imputed value to
derive the current year’s value.
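As a minimal sketch of equations 1a and 1b, the Python below computes an inflator/deflator factor per imputation class as a ratio of sums over matched institutions and then applies it to a nonrespondent's previous-year value. The input layout, function names, and dollar amounts are hypothetical; only the arithmetic follows the equations.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ImputationClass = Tuple[str, str]  # (HDO, TOC), e.g. ("PhD", "Public")


def inflator_factors(
    matched: Iterable[Tuple[ImputationClass, float, float]],
) -> Dict[ImputationClass, float]:
    """Equation 1b: sum of current-year values over sum of previous-year values.

    matched -- (class, current_value, previous_value) for institutions that
               reported the key variable in both years
    """
    current = defaultdict(float)
    previous = defaultdict(float)
    for cls, cur, prev in matched:
        current[cls] += cur
        previous[cls] += prev
    # A class whose previous-year sum is zero has no defined factor.
    return {cls: current[cls] / previous[cls]
            for cls in current if previous[cls] != 0}


def impute_key_variable(
    previous_value: float,
    cls: ImputationClass,
    factors: Dict[ImputationClass, float],
) -> float:
    """Equation 1a: class factor times the institution's previous-year value."""
    return factors[cls] * previous_value


matched = [
    (("PhD", "Public"), 110.0, 100.0),
    (("PhD", "Public"), 210.0, 200.0),
    (("No PhD", "Private"), 45.0, 50.0),
]
factors = inflator_factors(matched)
imputed = impute_key_variable(1000.0, ("No PhD", "Private"), factors)
```

Note that the factor is a ratio of sums rather than a mean of institution-level ratios, so institutions with larger expenditures carry more weight in the class factor.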
In some cases, the specific key variable from the past year was missing and not imputed. In these
situations, a ratio of the missing key variable to a non-missing key variable for peer institutions
that provided both values was used:
Equation 2a:

$$\hat{y}_{ik}^{t} = \hat{R}_{kl}^{t} \, y_{il}^{t}$$

where $\hat{y}_{ik}^{t}$ is the imputed value of key variable $y_k$ for institution $i$ for year $t$,
and $\hat{R}_{kl}^{t}$ is the ratio of key variables $y_k$ to $y_l$, defined as

Equation 2b:

$$\hat{R}_{kl}^{t} = \frac{\sum_{j=1}^{r} y_{jk}^{t}}{\sum_{j=1}^{r} y_{jl}^{t}}$$

where $y_{jk}^{t}$ is the value of key variable $y_k$ for institution $j$ for year $t$, and where $r$
is the set of institutions in the same imputation class as institution $i$ that provided key
variables $y_k$ and $y_l$ both in year $t$.
For example, where there is no previous year’s value for R&D equipment expenditures, the
imputed value would be the product of total R&D expenditures (imputed or reported) and the ratio
of R&D equipment expenditures to total R&D expenditures for the imputation class.
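The equipment example can be worked through in a short Python sketch of equations 2a and 2b. The peer amounts and the institution's total R&D are invented for illustration, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def class_ratio(peers: Iterable[Tuple[float, float]]) -> float:
    """Equation 2b: ratio of class sums for two key variables.

    peers -- (missing_key_value, known_key_value) pairs for institutions in
             the same imputation class that reported both in the current year
    """
    peers = list(peers)
    numerator = sum(target for target, _ in peers)
    denominator = sum(known for _, known in peers)
    if denominator == 0:
        raise ValueError("class has no expenditures in the known key variable")
    return numerator / denominator


def impute_from_known(known_value: float,
                      peers: Iterable[Tuple[float, float]]) -> float:
    """Equation 2a: class ratio times the institution's known key variable."""
    return class_ratio(peers) * known_value


# Hypothetical peers: (R&D equipment, total R&D), in thousands of dollars.
peers = [(50.0, 1000.0), (30.0, 1000.0)]
equipment = impute_from_known(5000.0, peers)
```

In this made-up class, equipment is 4% of total R&D, so an institution with 5,000 in total R&D is assigned 200 in equipment expenditures.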
Imputing Non-Key Variables
The ratio imputation technique described above was used to impute key variables. However, many
HERD Survey variables are hierarchical, and each key variable has a number of lower-level,
non-key detail variables associated with it. For example, the key variable Federally Funded R&D
Expenditures has 326 lower-level, non-key variables associated with it in the standard form survey,
such as federally funded R&D expenditures in astronomy, R&D expenditures funded by the
U.S. Department of Health and Human Services (HHS), and R&D expenditures in chemistry
funded by NSF. For nonresponding institutions, key variables (imputed or reported) were
distributed across the associated non-key variables using the same relative percentages that were
last reported by that institution. If some non-key fields were reported, the difference between the
key variable and the reported non-key fields was distributed to the missing detailed fields using
the same relative percentages last reported by that institution.
Non-key variables were derived from their associated key variables or higher-level, non-key
variables using the following relation:
Equation 3:

$$\hat{y}_{in}^{t} = \hat{y}_{ik}^{t} \left( \frac{y_{in}^{t-1}}{y_{ik}^{t-1}} \right)$$

where $\hat{y}_{in}^{t}$ is the imputed value of non-key variable $y_n$ for institution $i$ for year
$t$, $\hat{y}_{ik}^{t}$ is the imputed value of key variable $y_k$ for institution $i$ for year $t$,
$y_{in}^{t-1}$ is the value of non-key variable $y_n$ for institution $i$ for year $t-1$, and
$y_{ik}^{t-1}$ is the value of key variable $y_k$ for institution $i$ for year $t-1$.
The same non-key variable imputation approach was used for both unit nonresponders and
institutions that did not respond to individual non-key items. For example, if an institution
reported federal R&D expenditures but did not provide the breakdown of those expenditures by
field of study, the non-key values were imputed the same way; however, rather than using the
imputed value of the key variable ($\hat{y}_{ik}^{t}$), the reported value of $y_k$ was used.
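The distribution in equation 3, including the case where some detail fields were reported, might look like the following Python sketch. It rests on two assumptions that the document does not spell out: the previous year's detail fields sum to the previous year's key variable, and when some fields are reported, the remainder is split among the missing fields in proportion to their previous-year values. All names and amounts are hypothetical.

```python
from typing import Dict, Optional


def distribute_by_prior_shares(
    key_value: float,
    prior_details: Dict[str, float],
    reported: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Spread a key variable across detail fields using prior-year shares.

    key_value     -- current-year key variable (imputed or reported)
    prior_details -- the institution's previous-year value for each detail field
    reported      -- current-year detail fields that were reported (kept as-is)
    """
    reported = dict(reported or {})
    missing = [field for field in prior_details if field not in reported]
    remainder = key_value - sum(reported.values())
    prior_total = sum(prior_details[field] for field in missing)
    if missing and prior_total == 0:
        raise ValueError("no prior-year amounts to base the distribution on")
    result = dict(reported)
    for field in missing:
        # Each missing field receives its prior-year share of the remainder.
        result[field] = remainder * prior_details[field] / prior_total
    return result


prior = {"engineering": 60.0, "life_sciences": 30.0, "physical_sciences": 10.0}
full = distribute_by_prior_shares(1000.0, prior)
partial = distribute_by_prior_shares(1000.0, prior, reported={"engineering": 500.0})
```

In `full`, the 1,000 is split 600/300/100. In `partial`, the reported 500 is kept and the remaining 500 is split 375/125 between the two missing fields.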
If lower-level, non-key data were not available for a particular institution for the previous cycle,
the key variables were distributed across detail fields based on the relative percentages for the
institution’s class. Non-key variables were derived from their associated key variables using the
following relation:
Equation 4a:

$$\hat{y}_{in}^{t} = \hat{R}_{nk}^{t} \, y_{ik}^{t}$$

where $\hat{y}_{in}^{t}$ is the imputed value of non-key variable $y_n$ for institution $i$ for year
$t$, and $\hat{R}_{nk}^{t}$ is the ratio of non-key variable $y_n$ to key variable $y_k$, defined as

Equation 4b:

$$\hat{R}_{nk}^{t} = \frac{\sum_{j=1}^{r} y_{jn}^{t}}{\sum_{j=1}^{r} y_{jk}^{t}}$$

where $y_{jk}^{t}$ is the value of key variable $y_k$ for institution $j$ for year $t$, and where $r$
is the set of institutions in the same imputation class as institution $i$ that provided variables
$y_n$ and $y_k$ both in year $t$.
Procedures by Survey Question
Expenditures by Source of Funds (Question 1)
The imputation of missing values in Question 1 was completed only for unit nonresponders, which
were defined as institutions in the population that did not report any data for FY 2018. The
imputation of values for individual missing fields would necessarily impact the total R&D reported
by the institution for Question 1, and it was decided that the total R&D reported by an institution
would not be altered through imputation.
Question 1 Key Variables
There were two key variables imputed for Question 1: Federal R&D Expenditures and Total R&D
Expenditures. Imputation factors for both key variables for each imputation class are listed in
tables 2 and 3.
Table 2. Imputation Factors for Federal Expenditures by Class

HDO      TOC          n   Federal R&D
PhD      Public     341        1.0446
PhD      Private    182        1.0394
No PhD   Public     144        0.9314
No PhD   Private    178        0.9613

n = number of institutions used to create the factor
Table 3. Imputation Factors for Total Expenditures by Class

HDO      TOC          n   Total R&D
PhD      Public     341      1.0580
PhD      Private    182      1.0492
No PhD   Public     144      0.9484
No PhD   Private    178      0.9596

n = number of institutions used to create the factor
If an institution was missing a key variable from the previous year and that value was not imputed,
the current-year value was based on the proportion for peer institutions of that key variable to a
known value:
Equation 5:

$$\hat{y}_{ik}^{t} = y_{il}^{t} \, \frac{\sum_{j=1}^{r} y_{jk}^{t}}{\sum_{j=1}^{r} y_{jl}^{t}}$$

where $\hat{y}_{ik}^{t}$ is the imputed value of federal or total R&D for institution $i$ for year
$t$, $y_{il}^{t}$ is the value of federal or total R&D for institution $i$ for year $t$, and $r$ is
the set of institutions in the same imputation class as institution $i$ that provided variables
$y_k$ and $y_l$ both in year $t$.
Question 1 Non-Key Variables
There are three hierarchical steps for the imputation of non-key variables in Question 1:
1. Nonfederal R&D: Total R&D minus Federal R&D
2. Nonfederal Sources: Nonfederal R&D expenditures were distributed across the associated
   nonfederal source variables (i.e., state and local government, business, nonprofit, institutional,
   and other) using the same relative percentages that were last reported by that institution.
3. Institutional Sources: The imputed value of institutionally funded expenditures was
   distributed across the three types of institution funds (institutionally financed organized
   research, cost sharing, and unrecovered indirect costs) using the same relative percentages that
   were last reported by that institution.
For each step in the imputation process, if the imputed details did not add to the total, the details
were adjusted by adding 1 progressively until they totaled correctly. On the rare occasion that the
sum of the details was more than the reported total, the analyst reduced the amount reported for
the details by 1 until the values were equal. This same process was implemented for each stage of
imputation of non-key variables for every question.
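The reconciliation of rounded details to their total can be sketched as follows. The document does not say which detail cells received the adjustment, so the order used here (cycling through the cells from first to last, and skipping zero cells when subtracting) is an assumption made only for illustration, as is the function name.

```python
from typing import List


def reconcile_details(details: List[int], total: int) -> List[int]:
    """Adjust whole-number details by 1 at a time until they sum to total."""
    if not details:
        raise ValueError("details must not be empty")
    if not isinstance(total, int) or any(not isinstance(d, int) for d in details):
        raise TypeError("amounts must be whole numbers")
    if total < 0 or any(d < 0 for d in details):
        raise ValueError("amounts must be non-negative")

    adjusted = list(details)
    gap = total - sum(adjusted)      # positive: details fall short of total
    step = 1 if gap > 0 else -1
    position = 0
    while gap != 0:
        index = position % len(adjusted)
        # When subtracting, skip cells already at zero (assumed rule).
        if step > 0 or adjusted[index] > 0:
            adjusted[index] += step
            gap -= step
        position += 1
    return adjusted


short = reconcile_details([333, 333, 333], 1000)
over = reconcile_details([334, 334, 334], 1000)
```

In the first call one unit is added to the first cell; in the second, one unit is removed from each of the first two cells, so both results sum to 1,000.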
If a value in Question 1 from FY 2017 was missing and not imputed, which would happen only if
the institution partially responded to Question 1 in 2017, it was considered unavailable in FY 2018.
The alternative was to impute the value as zero, but we consider that a misrepresentation of the
previous year’s data, which form the basis of current-year imputation.
Tables 4 and 5 provide summary data on imputed amounts and rates by imputation class and for
each Question 1 variable.
Table 4. Imputed and Aggregate Amounts for Total and Federal R&D by Class
(amounts are dollars in thousands)

                  Federal R&D                               Total R&D
HDO     TOC        n  Imputed       Total  % Imputed     n  Imputed       Total  % Imputed
PhD     Public     4   41,516  25,703,791      0.16%     4   53,572  51,743,972      0.10%
PhD     Private    6   26,659  15,903,159      0.17%     6   39,205  26,922,442      0.15%
No PhD  Public     7   16,784     263,624      6.37%     7   33,502     457,922      7.32%
No PhD  Private   11    2,599     148,841      1.75%    11    7,074     314,658      2.25%

n = number of institutions with imputed values
Table 5. Imputed and Aggregate Amounts for Sources of Funds
(amounts are dollars in thousands)

Funding Source               n   Imputed        Total  % Imputed
Federal                     28    87,558   42,019,415      0.21%
State/Local                 28     8,302    4,321,480      0.19%
Business                    28     4,375    4,723,897      0.09%
Nonprofit                   28     6,730    5,452,898      0.12%
All Inst Funds              27    22,315   20,438,289      0.11%
  Inst Financed Research    28    12,163   13,310,779      0.09%
  Cost Sharing              28     1,350    1,589,047      0.08%
  Unrecovered               26     8,802    5,538,463      0.16%
Other                       28     4,073    2,483,015      0.16%
Total                       28   133,353   79,438,994      0.17%

n = number of institutions with imputed values
Federal Expenditures by Field of R&D and Agency (HERD Question 9 and Short
Form Question 2, Column 1)
As with Question 1, if an institution reported partial data for Question 9 of the HERD Survey or
Question 2, column 1 of the HERD Short Form, and if the imputation of missing data would
necessarily impact the federal R&D expenditures reported by the institution, it was decided that
the federal R&D amount would not be altered and values would not be imputed for Question 9 for
that institution. However, in most cases where some values in Question 9 were missing, all federal
expenditures were reported on the survey, but the institution could not provide the level of detail
required. For example, some institutions entered all engineering under Other Engineering and
indicated that they could not break out these expenditures across the many detailed fields of
engineering requested on the survey. In cases such as these, missing values were imputed.
Question 9 Key Variables
The Federal R&D Expenditures key variable was already imputed during the imputation process
for Question 1.
Question 9 Non-Key Variables
For institutions where all or some of the information for Question 9 (or Question 2, column 1 of
the short form) was missing and there were no reported past year’s data for the missing values to
refer to, an additional logical imputation technique was employed before proceeding with the
imputation of non-key variables. Data collection staff reviewed the websites of institutions to
determine which fields of R&D should be imputed as zero. The assumption was that if there were
no degrees granted in an area or related area, no R&D was likely being performed. This approach
was thought to be better than imputation based solely on imputation class, which typically resulted
in expenditures being imputed in every field. For example, based solely on imputation class, a
liberal arts college that specializes in social sciences and non-S&E programs would have
expenditures imputed in engineering. By reviewing institution websites, we could avoid some of
these obvious issues. The same imputation logic was applied to Question 11.
Non-key variables in Question 9 were imputed in three hierarchical steps (see below). For HERD
Short Form institutions, imputation for Question 2, column 1 ended after the first step. In each
step, the target value was computed based on the ratio of the lower-level variable to the
higher-level variable in the previous year’s survey.
1. Major Fields of R&D (e.g., engineering, physical sciences, life sciences): Referring to
   equation 3, the total for each major field was a $y_{in}$ variable, and Federal R&D Expenditures
   was the $y_{ik}$ variable.
2. Minor Fields of R&D (e.g., health sciences, economics, chemical engineering): The
   detailed fields of R&D that contribute to subtotals were $y_{in}$ variables. The major fields of
   R&D that are broken down into more detailed fields were $y_{ik}$ variables. For institutions that
   were in the standard form population in FY 2018 but were in the short form population in FY 2017
   and had standard form data for any years prior to FY 2017, the most recent data were distributed
   across detailed fields.
3. Expenditures by Agency (e.g., NSF-funded expenditures in chemical engineering,
   HHS-funded expenditures in health sciences): Each agency by lowest level of R&D field
   variable was a $y_{in}$ variable, and the total federal expenditures for the corresponding fields
   were $y_{ik}$ variables.
Detailed data were summed to provide the major field by agency total when major field subtotals
by agency were needed.
If the past year’s data were not available, key variables were distributed across associated non-key
variables using the relative percentages reported by institutions in the same imputation class
(equations 4a and 4b). If this was the case for major fields, standard form and short form
institutions were used to derive relative percentages per class. Table 6 lists the imputed amount for
federal R&D in each field and includes amounts for both the short form and the standard form. For
this reason, the n for major fields is larger than for detailed fields.
Table 6. Imputed and Aggregate Amounts for Federal Expenditures by Field
(amounts are dollars in thousands)

Field of R&D                                           n   Imputed        Total % Imputed
Computer and Information Sciences                     34     2,335    1,635,198     0.14%
Engineering                                           34     7,763    7,099,651     0.11%
  Aerospace, Aeronautical, and Astronautical          20         8      678,087     0.00%
  Bioengineering and Biomedical                       20       346      787,000     0.04%
  Chemical                                            20       185      461,674     0.04%
  Civil                                               20       890      592,396     0.15%
  Electrical, Electronic, Communications              20     2,297    1,981,799     0.12%
  Industrial and Manufacturing                        20         5      306,696     0.00%
  Mechanical                                          20     1,199      993,683     0.12%
  Metallurgical and Materials                         20       513      464,981     0.11%
  Other                                               20     2,151      826,975     0.26%
Geosciences, Atmospheric, and Ocean sciences          34       423    2,054,549     0.02%
  Atmospheric Sciences and Meteorology                20       192      485,737     0.04%
  Geological and Earth Sciences                       20       175      699,305     0.03%
  Ocean Sciences and Marine Sciences                  20         0      648,156     0.00%
  Other                                               20         0      215,383     0.00%
Life Sciences                                         34    74,637   23,978,544     0.31%
  Agricultural Sciences                               20     9,820      956,060     1.03%
  Biological and Biomedical Sciences                  20    30,676    8,589,048     0.36%
  Health Sciences                                     20    25,742   13,453,146     0.19%
  Natural Resources and Conservation                  21     2,684      314,948     0.85%
  Other                                               21    29,487      633,813     4.65%
Mathematics and Statistics                            34     2,985      459,454     0.65%
Physical Sciences                                     34     6,241    3,483,381     0.18%
  Astronomy and Astrophysics                          20       149      454,394     0.03%
  Chemistry                                           20     2,011    1,136,317     0.18%
  Materials Science                                   20         0      162,048     0.00%
  Physics                                             20     2,805    1,564,866     0.18%
  Other                                               20       835      154,035     0.54%
Psychology                                            34     1,135      764,434     0.15%
Social Sciences                                       34     7,310      947,919     0.77%
  Anthropology                                        20       151       43,407     0.35%
  Economics                                           20        87      102,053     0.09%
  Political science and Government                    20       129       98,089     0.13%
  Sociology, Demography, and Population Studies       20       380      285,375     0.13%
  Other                                               20     6,499      416,554     1.56%
Other Sciences                                        34       203      350,624     0.06%
Non-S&E Fields                                        34     3,642    1,245,661     0.29%
  Business Management and Business Administration    20       412       71,706     0.57%
  Communication and Communications Technologies       20        34       34,279     0.10%
  Education                                           20     1,057      673,774     0.16%
  Humanities                                          20        68       49,997     0.14%
  Law                                                 20       258       51,134     0.50%
  Social work                                         20       107      114,073     0.09%
  Visual and Performing Arts                          20        16       10,829     0.15%
  Other                                               20     1,431      234,242     0.61%

n = number of institutions with imputed values
Table 7 lists the imputed amount of federal R&D for each agency. Federal expenditures by agency
are not collected on the short form; therefore, these amounts are for the standard form only.
Table 7. Imputed and Aggregate Amounts for Federal Expenditures by Agency
(amounts are dollars in thousands)

Agency      n   Imputed        Total  % Imputed
USDA       20    20,005    1,185,986      1.69%
DoD        20     8,240    5,900,829      0.14%
Energy     20     2,661    1,819,663      0.15%
HHS        20    51,987   22,922,192      0.23%
NASA       20     2,898    1,516,983      0.19%
NSF        20    11,310    5,273,511      0.21%
Other      20     6,635    3,325,918      0.20%

n = number of institutions with imputed values
Nonfederal Expenditures by Field of R&D and Source of Funds (HERD Question 11
and Short Form Question 2, Column 2)
Question 11 Key Variables
The key variable Nonfederal R&D Expenditures was already imputed during the imputation
process for Question 1.
Question 11 Non-Key Variables
Non-key variables in Question 11 were imputed in three hierarchical steps (see below). For HERD
Short Form institutions, imputation for Question 2, column 2 ended after the first step. In each
step, the target value was computed based on the ratio of the lower-level variable to the
higher-level variable in the previous year’s survey.
1. Major Fields of R&D (e.g., engineering, physical sciences, life sciences): Referring to
   equation 3, the total for each major field was a $y_{in}$ variable, and Nonfederal R&D
   Expenditures was the $y_{ik}$ variable.
2. Minor Fields of R&D (e.g., health sciences, economics, chemical engineering): The
   detailed fields of R&D that contribute to subtotals were $y_{in}$ variables. The major fields of
   R&D that were broken down into more detailed fields were $y_{ik}$ variables. For institutions
   that were in the standard form population in FY 2018 but were in the short form population in
   FY 2017 and had standard form data for any years prior to FY 2017, the most recent data were
   distributed across detailed fields.
3. Expenditures by Source (e.g., expenditures in chemical engineering sponsored by
   businesses, expenditures in health sciences funded by institutional funds): Because total
   R&D funded by different nonfederal sources was already imputed for Question 1, there was
   no need to reference past-year or peer data to impute values for source by field cells. Each
   value was imputed as follows:
Q11rowXcolumnY = (column Y total / Total Nonfederal) * row X total
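The cell formula above amounts to spreading each field's nonfederal total across sources in proportion to the institution's overall source mix. A minimal Python sketch, with hypothetical field names, source names, and amounts:

```python
from typing import Dict


def source_by_field(
    field_totals: Dict[str, float],
    source_totals: Dict[str, float],
) -> Dict[str, Dict[str, float]]:
    """cell[field][source] = (source total / total nonfederal) * field total.

    field_totals  -- nonfederal R&D by field (the row totals)
    source_totals -- nonfederal R&D by source from Question 1 (column totals)
    """
    total_nonfederal = sum(source_totals.values())
    if total_nonfederal == 0:
        raise ValueError("total nonfederal R&D is zero; nothing to distribute")
    return {
        field: {
            source: (source_total / total_nonfederal) * field_total
            for source, source_total in source_totals.items()
        }
        for field, field_total in field_totals.items()
    }


cells = source_by_field(
    {"engineering": 600.0, "life_sciences": 400.0},
    {"business": 250.0, "institutional": 750.0},
)
```

Because every row uses the same source shares, the cells add back to both the row totals and the column totals whenever the field totals and the source totals sum to the same nonfederal amount, as they do in this example.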
If the amount for a nonfederal source was missing in Question 1 and was not imputed because it
would alter the reported total R&D expenditures, expenditures for R&D fields funded by that
source also remained missing and un-imputed. Table 8 lists the imputed amount for nonfederal
R&D in each field and includes amounts for both the short form and the standard form. For this
reason, the n for major fields is larger than for detailed fields.
Table 8. Imputed and Aggregate Amounts for Nonfederal Expenditures by Field
(amounts are dollars in thousands)

Field of R&D                                           n   Imputed        Total % Imputed
Computer and Information Sciences                     41    21,805      772,598     2.82%
Engineering                                           40    70,767    5,287,268     1.34%
  Aerospace, Aeronautical, and Astronautical          25     5,574      333,724     1.67%
  Bioengineering and Biomedical                       25     2,684      552,652     0.49%
  Chemical                                            25     9,611      471,849     2.04%
  Civil                                               25     4,017      767,778     0.52%
  Electrical, Electronic, Communications              25    22,000      864,801     2.54%
  Industrial and Manufacturing                        25     3,281      208,338     1.57%
  Mechanical                                          25    11,001      635,666     1.73%
  Metallurgical and Materials                         25     3,943      298,686     1.32%
  Other                                               25     8,364    1,147,058     0.73%
Geosciences, Atmospheric, and Ocean sciences          41     8,819    1,117,398     0.79%
  Atmospheric Sciences and Meteorology                25     3,972      122,342     3.25%
  Geological and Earth Sciences                       25     3,839      435,123     0.88%
  Ocean Sciences and Marine Sciences                  25       540      410,500     0.13%
  Other                                               25       374      145,094     0.26%
Life Sciences                                         41   244,822   21,922,366     1.12%
  Agricultural Sciences                               25     3,242    2,364,887     0.14%
  Biological and Biomedical Sciences                  25    36,108    5,965,052     0.61%
  Health Sciences                                     25   190,900   12,485,116     1.53%
  Natural Resources and Conservation                  26    11,910      454,428     2.62%
  Other                                               26     8,740      627,047     1.39%
Mathematics and Statistics                            41     6,144      298,315     2.06%
Physical Sciences                                     41    24,025    1,773,027     1.36%
  Astronomy and Astrophysics                          25     4,997      212,709     2.35%
  Chemistry                                           25     8,030      739,811     1.09%
  Materials Science                                   25         0       93,882     0.00%
  Physics                                             25     7,482      638,866     1.17%
  Other                                               25     2,988       80,628     3.71%
Psychology                                            41     6,279      503,099     1.25%
Social Sciences                                       41    10,926    1,807,875     0.60%
  Anthropology                                        25       424       77,605     0.55%
  Economics                                           25       463      362,924     0.13%
  Political science and Government                    25       757      345,083     0.22%
  Sociology, Demography, and Population Studies       25     6,674      321,908     2.07%
  Other                                               25     2,215      693,567     0.32%
Other Sciences                                        41     1,307      540,229     0.24%
Non-S&E Fields                                        41    24,311    3,397,404     0.72%
  Business Management and Business Administration    25     2,454      714,350     0.34%
  Communication and Communications Technologies       25     1,087      137,035     0.79%
  Education                                           25     2,045      813,614     0.25%
  Humanities                                          25     1,638      463,088     0.35%
  Law                                                 25     1,071      217,196     0.49%
  Social work                                         25       746      137,294     0.54%
  Visual and Performing Arts                          25       741      126,535     0.59%
  Other                                               25    13,464      769,311     1.75%

n = number of institutions with imputed values
Equipment Expenditures by Field of R&D (Question 14)
Question 14 Key Variables
The Total R&D Equipment key variable was calculated in the same way as other key variables
(equations 1a and 1b). The imputation factors for each class are listed in table 9. If there was no
value for Total R&D Equipment in the previous year, a ratio imputation technique was used
(equations 2a and 2b). This was the procedure for institutions that were in the FY 2018 standard
form population but had been in the FY 2017 short form population.
Table 9. Imputation Factors for Total Equipment Expenditures by Class
HDO/TOC | n | Total Equipment
PhD, Public | 307 | 0.9804
PhD, Private | 132 | 0.9744
No PhD, Public | 70 | 0.8370
No PhD, Private | 81 | 0.9000
n = number of institutions used to create the factor
Question 14 Non-Key Variables
Non-key variables in Question 14 were imputed in three hierarchical steps:
1. Federal and Nonfederal: Total equipment expenditures were distributed based on the ratio of
the current year’s total federal to total nonfederal expenditures.
2. Major Fields of R&D (e.g., engineering, physical sciences, life sciences, education): Again,
the ratios of field to total for federal expenditures or nonfederal expenditures were used to
distribute equipment expenditures by major field.
3. Minor Fields of R&D (e.g., health sciences, economics, chemical engineering): The same
process was used as for imputing major fields.
Table 10 provides summary data on imputed amounts and rates for each field of study included in
Question 14.
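The three hierarchical steps for the Question 14 non-key variables can be sketched as repeated proportional allocation. The helper name, the field keys, and all amounts below are hypothetical and only illustrate the arithmetic under the assumption that each step reuses the current-year R&D expenditure ratios as weights.

```python
def distribute(amount, weights):
    """Split `amount` across the keys of `weights` in proportion to their values."""
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        return {key: 0.0 for key in weights}
    return {key: amount * value / weight_sum for key, value in weights.items()}


# Hypothetical institution, amounts in thousands of dollars.
equipment_total = 1000.0

# Step 1: federal vs. nonfederal, weighted by current-year total R&D by source.
by_source = distribute(equipment_total, {"federal": 6000.0, "nonfederal": 4000.0})

# Step 2: major fields, weighted by federal R&D expenditures in each major field.
federal_major = distribute(
    by_source["federal"], {"engineering": 3000.0, "life_sciences": 3000.0}
)

# Step 3: minor fields, weighted by federal R&D expenditures within the major field.
federal_engineering = distribute(
    federal_major["engineering"], {"chemical": 1000.0, "civil": 2000.0}
)
```

Each level sums to the amount passed down from the level above, so the detailed equipment values remain consistent with Total R&D Equipment.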
Table 10. Imputed and Aggregate Amounts for Equipment Expenditures by Field
(amounts are dollars in thousands)
Field of R&D | n | Imputed | Total | % Imputed
Computer and Information Sciences | 27 | 7,209 | 89,849 | 8.02%
Engineering | 27 | 36,627 | 594,041 | 6.17%
  Aerospace, Aeronautical, and Astronautical | 27 | 4,140 | 31,210 | 13.26%
  Bioengineering and Biomedical | 27 | 713 | 67,052 | 1.06%
  Chemical | 27 | 332 | 42,024 | 0.79%
  Civil | 27 | 248 | 32,581 | 0.76%
  Electrical, Electronic, Communications | 27 | 19,158 | 125,712 | 15.24%
  Industrial and Manufacturing | 27 | 3,178 | 23,557 | 13.49%
  Mechanical | 27 | 5,943 | 85,668 | 6.94%
  Metallurgical and Materials | 27 | 510 | 51,129 | 1.00%
  Other | 27 | 2,405 | 135,108 | 1.78%
Geosciences, Atmospheric, and Ocean sciences | 27 | 1,636 | 95,496 | 1.71%
  Atmospheric Sciences and Meteorology | 27 | 430 | 17,113 | 2.51%
  Geological and Earth Sciences | 27 | 460 | 39,840 | 1.15%
  Ocean Sciences and Marine Sciences | 27 | 451 | 33,037 | 1.37%
  Other | 27 | 295 | 5,506 | 5.36%
Life Sciences | 27 | 8,356 | 874,888 | 0.96%
  Agricultural Sciences | 27 | 132 | 79,387 | 0.17%
  Biological and Biomedical Sciences | 27 | 6,340 | 394,038 | 1.61%
  Health Sciences | 27 | 1,783 | 357,822 | 0.50%
  Natural Resources and Conservation | 27 | 28 | 13,783 | 0.20%
  Other | 27 | 73 | 29,858 | 0.24%
Mathematics and Statistics | 27 | 4,022 | 9,342 | 43.05%
Physical Sciences | 27 | 10,389 | 383,982 | 2.71%
  Astronomy and Astrophysics | 27 | 1,674 | 31,098 | 5.38%
  Chemistry | 27 | 762 | 120,695 | 0.63%
  Materials Science | 27 | 0 | 17,263 | 0.00%
  Physics | 27 | 5,308 | 191,497 | 2.77%
  Other | 27 | 2,645 | 23,429 | 11.29%
Psychology | 27 | 67 | 16,325 | 0.41%
Social Sciences | 27 | 360 | 13,059 | 2.76%
  Anthropology | 27 | 2 | 1,886 | 0.11%
  Economics | 27 | 18 | 3,443 | 0.52%
  Political science and Government | 27 | 52 | 905 | 5.75%
  Sociology, Demography, and Population Studies | 27 | 196 | 1,302 | 15.05%
  Other | 27 | 92 | 5,523 | 1.67%
Other Sciences | 27 | 1,416 | 26,892 | 5.27%
Non-S&E Fields | 27 | 578 | 41,721 | 1.39%
  Business Management and Business Administration | 27 | 59 | 5,962 | 0.99%
  Communication and Communications Technologies | 27 | 4 | 4,020 | 0.10%
  Education | 27 | 251 | 6,448 | 3.89%
  Humanities | 27 | 55 | 6,263 | 0.88%
  Law | 27 | 34 | 326 | 10.43%
  Social work | 27 | 8 | 217 | 3.69%
  Visual and Performing Arts | 27 | 6 | 1,195 | 0.50%
  Other | 27 | 161 | 17,290 | 0.93%
n = number of institutions with imputed values
Funds Received as a Subrecipient (HERD Question 7 and Short Form Question 3)
Question 7 Key Variables
Because of the inclusion of the short form survey, which requests subrecipient funds received only
from higher education entities, it was necessary to have two key variables (i.e., Sub From Higher
Education and Sub From Non-Higher Education). Institutions from the short form and standard form
populations were used to calculate imputation factors for Sub From Higher Education, but only
standard form institutions were included in the calculation of Sub From Non-Higher Education. In
FY 2018 one institution, Roger Williams University, reported an $829,000 decrease in its total
R&D expenditures received from non-higher education pass-through entities. This change was
unusually large compared with other institutions in the same imputation class (NoPhD/Private),
which reported changes between $1,000 and $169,000 in the non-higher education data element.
Including Roger Williams University in the calculation of the factor for Sub From Non-Higher
Education would have produced an unusually low factor, so the institution was excluded as an
outlier from that calculation. A similar decrease was not reported for the higher education data
element, so the institution was included in the calculation of that factor. If there was no value
for either key variable in the previous year, a ratio imputation technique was used (equations 2a
and 2b). The imputation factors for each class and key variable are listed in table 11.
Table 11. Imputation Factors for Total Subrecipient Expenditures by Class
HDO/TOC | n | Sub From Higher Education | n | Sub From Non-Higher Education
PhD, Public | 297 | 1.0496 | 263 | 1.0460
PhD, Private | 171 | 1.0899 | 126 | 1.0073
No PhD, Public | 142 | 0.9751 | 69 | 0.8337
No PhD, Private | 175 | 1.0050 | 79 | 0.8495
n = number of institutions used to create the factor
Question 7 Non-Key Variables
Sub From Higher Education was imputed in one hierarchical step, so step 1 below was the only
step that applied to both short form and standard form institutions. Sub From Non-Higher Education
was imputed in two hierarchical steps:
1. Source of Funds (federal or nonfederal)
2. Other Pass-Through Institutions: Standard form institutions were asked to divide non-higher
education pass-through sources into business, nonprofit, and other. If they were unable to
report the non-higher education sources at this level of detail, they were asked to classify all
expenditures as other and indicate that amounts from business and nonprofit sources were
unavailable.
Distribution across categories was based on last year’s response (equation 3) unless last year’s data
were missing, in which case distribution was based on current-year peer institutions (equations 4a
and 4b).
Tables 12 and 13 provide summary data on federal and total imputed amounts and rates by
imputation class and pass-through entity. Short form and standard form institutions are included
in the summaries for Table 13, but only standard form institutions are included in the summaries for
Table 12.
Table 12. Imputed and Aggregate Amounts for Total and Federal R&D Received as a
Subrecipient by Class
(amounts are dollars in thousands)
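HDO/TOC | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
PhD, Public | 9 | 42,462 | 4,175,889 | 1.02% | 45 | 956,430 | 5,078,949 | 18.83%
PhD, Private | 9 | 20,836 | 1,998,498 | 1.04% | 16 | 245,197 | 2,287,378 | 10.72%
No PhD, Public | 4 | 166 | 30,875 | 0.54% | 5 | 1,645 | 36,855 | 4.46%
No PhD, Private | 5 | 418 | 22,024 | 1.90% | 5 | 480 | 23,313 | 2.06%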
n = number of institutions with imputed values
Table 13. Imputed and Aggregate Amounts for Total and Federal R&D Received as a
Subrecipient by Pass-Through Entity
(amounts are dollars in thousands)
Pass-Through Entity | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
Higher Ed | 41 | 21,938 | 3,192,875 | 0.69% | 82 | 431,591 | 3,559,095 | 12.13%
Business | 30 | 12,681 | 916,451 | 1.38% | 72 | 202,871 | 1,180,215 | 17.19%
Nonprofit | 30 | 15,200 | 1,182,214 | 1.29% | 72 | 249,918 | 1,533,277 | 16.30%
Other | 30 | 14,607 | 951,133 | 1.54% | 72 | 195,633 | 1,170,563 | 16.71%
n = number of institutions with imputed values
Expenditures Passed Through to Other Institutions (HERD Question 8 and Short
Form Question 4)
Question 8 Key Variables
Because of the inclusion of the short form survey, which requests subrecipient funds passed
through only to higher education entities, it was necessary to have two key variables (i.e., Passed
to Higher Education and Passed to Non-Higher Education). Institutions from the short form and
standard form populations were used to calculate imputation factors for Passed to Higher
Education, but only standard form institutions were included in the calculation of Passed to Non-
Higher Education. In FY 2018 three institutions (California Polytechnic State University, San Luis
Obispo; CUNY, Queens College; and CUNY, John Jay College of Criminal Justice) reported
increases over $1,000,000 in their total R&D expenditures passed through to higher education
entities. These changes were unusually large compared with other institutions in the same
imputation class (NoPhD/Public), which reported changes between $1,000 and $522,000 in the
higher education data element. Including these institutions in the calculation of the factor for
Passed to Higher Education would have produced an unusually high factor, so they were excluded
as outliers from that calculation. Similarly, CUNY, Queens College and two other institutions
reported large decreases in their total R&D expenditures passed through to non-higher education
entities and were excluded as outliers from that calculation. CUNY, Queens College and Humboldt
State University reported decreases over $1,000,000, while other institutions in the same
imputation class (NoPhD/Public) reported changes between $1,000 and $201,000. Charles R. Drew
University of Medicine and Science reported a decrease over $1,000,000, while other institutions
in its imputation class (NoPhD/Private) reported changes between $1,000 and $408,000. If there
was no value for either key variable in the previous year, a ratio imputation technique was used
(equations 2a and 2b). The imputation factors for each class and key variable are listed in
table 14.
The Total Pass-Through variable was reported in Question 12 as well as Question 8 on the standard
form, and it was possible for the variable to be missing in one of the questions but reported in the
other. There were three scenarios related to the imputation of Total Pass-Through:
1. If all variables in Question 8 were missing but Total Pass-Through was reported in
Question 12, the pass-through value reported in Question 12 was used to impute detail values
for Question 8.
2. If Total Pass-Through was missing in both questions but some partial data were included in
Question 12, the variable was not imputed for either question and was left missing. The total
value in Question 12 was Total R&D Expenditures, and it equated to the total in Question 1.
As with Question 1, imputing an individual missing value in Question 12 would necessarily
alter the value for Total R&D Expenditures reported by the institution.
3. When all Question 8 and Question 12 values were missing, the key variables Passed to Higher
Education and Passed to Non-Higher Education were calculated with Total Pass-Through
calculated as the sum of the two.
Table 14. Imputation Factors for Total Pass-Through Expenditures by Class
HDO/TOC | n | Passed to Higher Education | n | Passed to Non-Higher Education
PhD, Public | 317 | 1.0633 | 279 | 1.1136
PhD, Private | 175 | 1.0471 | 132 | 1.0850
No PhD, Public | 139 | 1.0630 | 66 | 1.0567
No PhD, Private | 174 | 0.9251 | 79 | 0.9328
n = number of institutions used to create the factor
Question 8 Non-Key Variables
Passed to Higher Education was imputed in one hierarchical step, so step 1 was the only step that
applied to both short form and standard form institutions. Passed to Non-Higher Education was
imputed in two hierarchical steps:
1. Source of Funds (federal or nonfederal)
2. Other Subrecipient Institutions: Standard form institutions were asked to divide non-higher
education pass-through into business, nonprofit, and other. If they were unable to report the
non-higher education recipients at this level of detail, they were asked to classify all
expenditures as other and indicate that amounts from business and nonprofit sources were
unavailable.
Distribution across categories was based on last year’s response (equation 3) unless last year’s data
were missing, in which case distribution was based on current-year peer institutions (equations 4a
and 4b).
Tables 15 and 16 provide summary data on federal and total imputed amounts and rates by
imputation class and subrecipient entity. Short form and standard form institutions are included in
the summaries for Table 16, but only standard form institutions are included in the summaries for
Table 15.
Table 15. Imputed and Aggregate Amounts for Total and Federal R&D Passed Through to
a Subrecipient by Class
(amounts are dollars in thousands)
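HDO/TOC | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
PhD, Public | 7 | 3,134 | 3,310,924 | 0.09% | 7 | 3,354 | 3,919,750 | 0.09%
PhD, Private | 9 | 17,466 | 1,956,436 | 0.89% | 8 | 17,779 | 2,394,249 | 0.74%
No PhD, Public | 4 | 1,254 | 15,380 | 8.15% | 4 | 1,368 | 19,076 | 7.17%
No PhD, Private | 4 | 83 | 9,967 | 0.83% | 4 | 90 | 10,930 | 0.82%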
n = number of institutions with imputed values
Table 16. Imputed and Aggregate Amounts for Total and Federal R&D Passed Through by
Subrecipient Entity
(amounts are dollars in thousands)
Subrecipient Entity | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
Higher Ed | 59 | 428,308 | 3,084,474 | 13.89% | 59 | 526,242 | 3,540,969 | 14.86%
Business | 52 | 144,520 | 825,140 | 17.51% | 52 | 184,696 | 1,059,764 | 17.43%
Nonprofit | 52 | 116,593 | 914,798 | 12.75% | 52 | 150,423 | 1,102,866 | 13.64%
Other | 52 | 93,694 | 470,707 | 19.90% | 52 | 130,045 | 643,297 | 20.22%
n = number of institutions with imputed values
Foreign Funding for R&D (Question 2)
Prior to FY 2016 Question 2 had only one value to impute: Total R&D Funded by Foreign Sources
(foreign_tot). In FY 2016 new variables were added to this question: Foreign Funding Received
From Foreign Governments (foreign_gov), Foreign Funding Received from Foreign Businesses
(foreign_bus), Foreign Funding Received from Foreign Nonprofit Organizations (foreign_np),
Foreign Funding Received from Foreign Higher Education Institutions (foreign_ed), and Foreign
Funding Received from Other Foreign Sources (foreign_oth).
Imputation of the Question 2 total expenditure value was performed first using the same
methodology that has been applied in previous years and is described below. Imputation of the
source categories was performed next based on last year’s response, unless last year’s data were
missing, in which case distribution was based on current year peer institutions.
Total Foreign Funding
By definition, total expenditures from foreign sources must be equal to or less than the total
expenditures from external, nongovernmental sources as reported in Question 1 (i.e., business
sources + nonprofit sources + other sources). For the purposes of this calculation, external,
nongovernmental funding is referred to as T. The value T was calculated during the recoding
process prior to other imputation. If T was 0 or missing, Question 2 was imputed as 0.
If Question 2 was reported last year, this year’s value was calculated by applying the same
proportion reported last year (foreign_tot / T) to this year’s reported or imputed value of T. For
institutions that moved from the short form population to the standard form population, the
proportion from the most recent standard form data (FYs 2011–16) was used, if reported.
If there were no reported data from last year, a logistic regression model was employed to identify
cases in which Question 2 should be imputed as zero. PROC LOGISTIC was run separately for
public and private institutions using the following predictors: the continuous variable T, HDO, and
MedS. MedS is a variable indicating the inclusion of a medical school, derived from Question 4.
If the predicted value ($\hat{p}$) was less than 0.5, the value for Question 2 was imputed as 0.
The next step was the imputation of the nonzero values for foreign-funded expenditures. For this
step, the mean proportion of T ($\bar{p}$ = foreign_tot / T) was calculated for the nonzero values in
imputation classes determined by TOC, HDO, and the quartiles of Total R&D Expenditures. The
imputed value of Question 2 was then calculated as T * $\bar{p}$.
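The decision sequence for Total Foreign Funding can be summarized in a short Python sketch. The function and parameter names are hypothetical. The logistic regression itself is not reproduced here: the predicted value is simply passed in as `p_hat`, and the class mean proportion is assumed to have been computed beforehand.

```python
def impute_foreign_total(T, prior_ratio=None, p_hat=None, class_mean_ratio=None):
    """Impute foreign_tot from T (external, nongovernmental funding).

    prior_ratio: last year's foreign_tot / T, if Question 2 was reported (else None).
    p_hat: predicted value from a separately fitted logistic model (else None).
    class_mean_ratio: mean foreign_tot / T among nonzero cases in the imputation class.
    """
    if not T:  # T is 0 or missing (None)
        return 0.0
    if prior_ratio is not None:  # carry last year's proportion forward
        return T * prior_ratio
    if p_hat is not None and p_hat < 0.5:  # model indicates a zero value
        return 0.0
    if class_mean_ratio is None:
        raise ValueError("class_mean_ratio is required when no prior-year ratio exists")
    return T * class_mean_ratio
```

For example, with T = 1,000 and a prior-year proportion of 0.25, the imputed value is 250; with no prior-year data and a predicted value below 0.5, the imputed value is 0.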
Foreign Funding by Source
If total foreign-funded expenditures were imputed as zero in the first step, then all sources were
imputed as zero as well. The next step was the imputation of cases where nonzero data were
reported or imputed for the total value.
As with the total, expenditures from foreign businesses must be equal to or less than the total
expenditures reported from businesses in Question 1 (source_bus), expenditures from foreign
nonprofit organizations must be equal to or less than the total expenditures reported from
nonprofit organizations in Question 1 (source_np), and the total of expenditures from foreign
governments, foreign higher education, and other foreign sources must be equal to or less than the
total expenditures from all other sources in Question 1 (source_oth). This required the use of
Question 1 variables when calculating proportions and means rather than using a simple ratio of
each foreign source to the overall total. To accomplish this for those institutions that reported this
distribution last year, last year’s proportion of each source to the corresponding Question 1 source
was calculated. For those where last year’s distribution was not reported, the mean proportion of
each source to the corresponding Question 1 source was calculated in the same imputation classes
used for total foreign expenditures. For expenditures from foreign businesses and foreign nonprofit
organizations, the proportion was applied to the institution’s corresponding Question 1 data:
source_bus ($\bar{p}$ = foreign_bus / source_bus) and source_np ($\bar{p}$ = foreign_np / source_np).
A multiple step approach had to be used for the three foreign sources reported under all other
sources in Question 1 (foreign_gov, foreign_ed, and foreign_oth). For the purposes of this
calculation the sum of those three foreign sources is referred to as O.
For those institutions where last year’s distribution was reported:
1. Last year’s proportion of O to source_oth was calculated (O / source_oth).
2. Last year’s proportion of each of those foreign sources to O was calculated:
• foreign_gov / O
• foreign_ed / O
• foreign_oth / O
3. The proportions for each of the sources were applied to the current-year value of O.
For those institutions where last year’s distribution was not reported:
1. The mean proportion of O to source_oth was calculated ($\bar{p}_O$ = O / source_oth).
2. The mean proportion of each of those foreign sources to O was calculated:
• $\bar{p}_f$ = foreign_gov / O
• $\bar{p}_e$ = foreign_ed / O
• $\bar{p}_t$ = foreign_oth / O
3. A total was computed as the sum of the three means calculated in step 2.
4. A percentage of the total was computed for each variable (mean / total mean).
5. That percentage for each of the sources was applied to O.
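Both variants of the multiple-step approach reduce to the same arithmetic, sketched below in Python. The function name and sample values are hypothetical, and the sketch assumes that the current-year value of O is obtained by applying the O-to-source_oth proportion to the current-year source_oth. Rescaling the shares by their sum corresponds to steps 3 and 4 of the class-mean variant and leaves prior-year shares, which already sum to 1, unchanged.

```python
def split_other_foreign(source_oth, o_ratio, shares):
    """Impute foreign_gov, foreign_ed, and foreign_oth from source_oth.

    o_ratio: proportion of O to source_oth (prior-year value or class mean).
    shares: proportion of each foreign source to O (prior-year values or class means).
    """
    o_value = source_oth * o_ratio  # assumed current-year value of O
    share_total = sum(shares.values())
    if share_total == 0:
        return {name: 0.0 for name in shares}
    # Rescale so the three shares sum to 1, then apply them to O.
    return {name: o_value * share / share_total for name, share in shares.items()}


# Hypothetical class means that do not sum exactly to 1.
imputed = split_other_foreign(
    2000.0, 0.1, {"foreign_gov": 0.5, "foreign_ed": 0.3, "foreign_oth": 0.1}
)
```

In this example O is 200, and the three imputed sources sum to 200 even though the class means sum to 0.9.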
For this question, there was an additional normalization step in the imputation procedures. The
normalization step ensures that the five detail variables sum to the previously imputed or reported
total.
• A total of the detail source data was calculated.
• A percentage of the summed total was computed for each variable (detail foreign source / sum
of those imputed values).
• That percentage was applied to the previously imputed or reported total foreign expenditures
to compute the imputed value.
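A brief worked example may help show why normalization is needed: preliminary detail values derived from different Question 1 bases need not add up to the total. The Python sketch below uses a hypothetical function name and made-up amounts.

```python
def normalize_to_total(details, total):
    """Rescale preliminary detail values so that they sum to `total`."""
    detail_sum = sum(details.values())
    if detail_sum == 0:
        return {name: 0.0 for name in details}
    return {name: total * value / detail_sum for name, value in details.items()}


# Hypothetical preliminary values summing to 120 against a foreign_tot of 100.
preliminary = {
    "foreign_gov": 30.0,
    "foreign_bus": 60.0,
    "foreign_np": 20.0,
    "foreign_ed": 5.0,
    "foreign_oth": 5.0,
}
normalized = normalize_to_total(preliminary, 100.0)
```

Each source keeps its relative share (foreign businesses remain half of the detail), while the five values now sum to the reported or imputed total.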
Table 17 lists summary data on foreign-funded imputed amounts and rates by foreign source.
Table 17. Imputed and Aggregate Amounts for Total R&D Funded by Foreign Sources by
Foreign Source
(amounts are dollars in thousands)
Foreign Funding Source | n | Imputed | Total | % Imputed
Foreign Governments | 46 | 11,885 | 253,140 | 4.70%
Foreign Businesses | 46 | 67,544 | 546,291 | 12.36%
Foreign Nonprofit Organizations | 46 | 19,889 | 273,180 | 7.28%
Foreign Higher Education Institutions | 46 | 13,150 | 117,876 | 11.16%
All Other Foreign Sources | 46 | 9,306 | 67,907 | 13.70%
Total | 26 | 62,434 | 1,258,394 | 4.96%
n = number of institutions with imputed values
R&D Contracts and Grants (Question 3)
Question 3 included three values: External Funding Received Through Contracts
(external_contracts), External Funding Received Through Grants and Other Agreements
(external_grants), and Total External Funding (external_tot). Total external funding was a known
amount from Question 1, equivalent to total R&D (source_tot) minus institutionally funded
expenditures (source_inst_tot). If external_tot was 0, contract and grant values were imputed as 0.
If Question 3 was reported last year, this year’s value for contracts was calculated by applying the
same proportion reported last year (external_contracts / (source_tot – source_inst_tot)) to this
year’s reported or imputed value of Total External Funding. For institutions that moved from the
short form population to the standard form population, the proportion from the most recent
standard form data (FYs 2011–16) was used, if reported.
If there were no reported data from last year, the mean proportion of external_grants / external_tot
was calculated for the non-missing values within imputation classes determined by TOC, HDO,
the quartiles of Total R&D Expenditures, and the median value of Federal R&D Expenditures.
The imputed values were calculated by applying the mean proportions to Total External Funding,
either reported or imputed.
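The Question 3 split can be sketched as follows. The function name is hypothetical, and the sketch assumes that the second component is taken as the remainder so that contracts and grants sum to Total External Funding; when only a mean grants proportion is available, its complement would be passed as the contracts share.

```python
def split_external_funding(external_tot, contracts_share):
    """Return (external_contracts, external_grants) for a known external_tot.

    contracts_share: last year's external_contracts / external_tot, or one minus
    the class mean of external_grants / external_tot (assumption of this sketch).
    """
    if external_tot == 0:
        return 0.0, 0.0
    contracts = external_tot * contracts_share
    grants = external_tot - contracts  # remainder keeps the two values summing to the total
    return contracts, grants
```

For example, an institution with 1,000 in external funding and a prior-year contracts proportion of 0.25 would receive imputed values of 250 for contracts and 750 for grants and other agreements.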
Table 18 lists summary data on externally funded imputed amounts and rates by type of agreement.
Table 18. Imputed and Aggregate Amounts for Total Externally Funded R&D
Expenditures by Type of Agreement
(amounts are dollars in thousands)
Type of Agreement | n | Imputed | Total | % Imputed
Contracts | 31 | 543,762 | 13,550,998 | 4.01%
Grants and Other Agreements | 31 | 2,041,993 | 45,339,443 | 4.50%
Total | 16 | 106,990 | 58,890,441 | 0.18%
n = number of institutions with imputed values
R&D Expenditures at Medical School (Question 4)
Question 4 included one expenditure amount, R&D Expenditures Within the Medical School
(med_sch_tot), and a flag variable indicating that the institution did not have a medical school.
The existence of a medical school was researched using online data sources.
If the institution was determined to have a medical school and if Question 4 was reported last year,
this year’s value was calculated by applying the same proportion reported last year (med_sch_tot
/ Total R&D Expenditures) to this year’s reported or imputed value of Total R&D Expenditures.
For institutions that moved from the short form population to the standard form population, the
proportion from the most recent standard form data (FYs 2011–16) was used, if reported.
If there were no reported data from last year, a mean expenditure amount by imputation class was
calculated for institutions reporting medical schools. Imputation class was determined by TOC,
HDO, the quartiles of Total R&D Expenditures, and the median value of Federal R&D
Expenditures. The imputed value was the calculated mean if the mean of that imputation class was
less than the total reported in Total R&D Expenditures for that institution. If the calculated mean
for the imputation class was greater than Total R&D Expenditures, the imputed value was assigned
the value of the total.
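The two Question 4 cases can be expressed compactly in Python. The function and parameter names are hypothetical; the sketch covers only institutions already determined to have a medical school.

```python
def impute_med_school(total_rd, prior_share=None, class_mean=None):
    """Impute med_sch_tot for an institution with a medical school.

    prior_share: last year's med_sch_tot / Total R&D Expenditures, if reported.
    class_mean: mean med_sch_tot in the institution's imputation class.
    """
    if prior_share is not None:
        return total_rd * prior_share
    if class_mean is None:
        raise ValueError("class_mean is required when no prior-year share exists")
    # The class mean is capped so the imputed value never exceeds Total R&D Expenditures.
    return min(class_mean, total_rd)
```

With a class mean of 2,500, an institution reporting 5,000 in total R&D would be imputed 2,500, while one reporting 1,000 would be imputed 1,000.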
Table 19 provides summary data on medical school imputed amount and rate.
Table 19. Imputed and Aggregate Amounts for R&D Expenditures Within a Medical
School
(amounts are dollars in thousands)
Variable | n | Imputed | Total | % Imputed
R&D Expenditures at Medical School | 17 | 34,611 | 27,851,411 | 0.12%
n = number of institutions with imputed values
Clinical Trial Expenditures (Question 5)
Question 5 included three expenditure amounts, Federal Expenditures for Clinical Trials
(trials_fed), Nonfederal Expenditures for Clinical Trials (trials_nonfed), and Total Expenditures
for Clinical Trials (trials_tot), and a flag variable indicating that the institution did not conduct
clinical trials.
If Question 5 was reported last year, even partially, this year’s value was calculated by applying
the same proportion reported last year (trials_tot / source_tot) to this year’s reported or imputed
value of Total R&D Expenditures. The imputed amount for Total Expenditures for Clinical Trials
was distributed across details based on the relative proportions reported last year. For institutions
that moved from the short form population to the standard form population, the proportion from
the most recent standard form data (FYs 2011–16) was used, if reported.
If there were no reported data from last year, a mean expenditure amount by imputation class was
calculated for institutions reporting clinical trials. Imputation class was determined by TOC, HDO,
the quartiles of Total R&D Expenditures, and MedS. This value was used to impute total clinical
trials (trials_tot).
Federal and nonfederal amounts were then imputed using a mean proportion ($\bar{p}$). The imputed
proportion was for expenditures for federal clinical trials (p1), while 1 - p1 was the proportion for
nonfederal clinical trials (p2). The mean proportion of trials_fed / trials_tot was calculated for the
non-missing values within imputation classes determined by TOC, HDO, the quartiles of Total
R&D Expenditures, and MedS. The imputed values were calculated as trials_tot * p1 for federal
clinical trials and trials_tot * p2 for nonfederal clinical trials.
Table 20 lists summary data on total and federally financed clinical trial imputed amounts and
rates.
Table 20. Imputed and Aggregate Amounts for Clinical Trial Expenditures
(amounts are dollars in thousands)
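Variable | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
Clinical Trial Expenditures | 22 | 43,920 | 1,047,345 | 4.19% | 21 | 97,505 | 2,975,165 | 3.28%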
n = number of institutions with imputed values
Type of R&D (Basic, Applied, or Experimental Development) (Question 6)
Question 6 included 12 expenditure values: federal, nonfederal, and total amounts for basic
research, applied research, experimental development, and overall R&D. Two cycles of imputation
were performed for Question 6, one for the federal column and one for the nonfederal column. The
totals for each column, Federal R&D Expenditures and Nonfederal R&D Expenditures, were
known amounts from Question 1. If the total of either column was 0, the contributing values were
imputed as 0.
Imputation was based on last year’s data only for FY 2018 unit nonresponders that had reported
data for Question 6 in FY 2017. In this case, the proportions of federal and nonfederal expenditures
considered basic research, applied research, and experimental development were based on the
relative proportions reported in FY 2017.
For institutions that were partial responders in FY 2018, after logical imputations were completed
a logistic regression model was employed to identify cases where values should be imputed as
zero. Logistic models were run for each of the Question 6 variables. PROC LOGISTIC was run
separately for public and private institutions using the continuous variables Federal R&D
Expenditures or Nonfederal R&D Expenditures, HDO, and MedS. If the predicted value ($\hat{p}$) was
less than 0.5, the variable in question was imputed as 0.
The next step was the imputation of the nonzero values for basic, applied, and experimental
development expenditures. For each variable, the mean expenditure was calculated for the non-
missing values within each imputation class determined by TOC, HDO, and the quartiles of Total
R&D Expenditures.
For this question, there was an additional normalization step in the imputation procedures. The
normalization step ensures that the three variables in each column sum to the known column total.
If all three variables were missing:
• A total was computed as a sum of the class means for each variable.
• A percentage of the total was computed for each variable (mean / total).
• That percentage was applied to the known total (Federal R&D Expenditures or Nonfederal
R&D Expenditures) to compute the imputed value.
If only two variables were missing (e.g., applied and experimental development):
• A total was computed as a sum of the class means for each variable.
• A percentage of the total was computed for each variable (mean / total).
• That percentage was applied to the known total (Federal R&D Expenditures or Nonfederal
R&D Expenditures minus the reported value, usually basic research expenditures) to compute
the imputed value.
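Both normalization cases can be handled by one routine, sketched below in Python. The function name, keys, and amounts are hypothetical. The sketch assumes that, when one value is reported, the class means are summed over the missing variables only, so that the imputed values exactly fill the remainder of the column total.

```python
def normalize_missing(column_total, reported, class_means):
    """Impute the missing type-of-R&D values so that the column sums to its known total.

    reported: values the institution supplied (may be empty).
    class_means: class mean expenditure for every variable in the column.
    """
    remaining = column_total - sum(reported.values())
    # Assumption: only the means of the missing variables enter the denominator.
    missing = {name: mean for name, mean in class_means.items() if name not in reported}
    mean_total = sum(missing.values())
    if mean_total == 0:
        return {name: 0.0 for name in missing}
    return {name: remaining * mean / mean_total for name, mean in missing.items()}


class_means = {"basic": 500.0, "applied": 300.0, "experimental": 100.0}
all_missing = normalize_missing(900.0, {}, class_means)
two_missing = normalize_missing(1000.0, {"basic": 400.0}, class_means)
```

In the second call, the 600 left after the reported basic research value is split 3:1 between applied research and experimental development.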
Table 21 lists summary data on federal and total imputed amounts and rates by type of R&D
conducted.
Table 21. Imputed and Aggregate Amounts for Total and Federal R&D by Type of R&D
Conducted
(amounts are dollars in thousands)
Type of R&D | Federal R&D n | Federal R&D Imputed | Federal R&D Total | Federal R&D % Imputed | Total R&D n | Total R&D Imputed | Total R&D Total | Total R&D % Imputed
Basic Research | 69 | 6,114,888 | 26,799,164 | 22.82% | 70 | 9,635,138 | 49,391,250 | 19.51%
Applied Research | 70 | 2,577,707 | 11,963,001 | 21.55% | 71 | 4,294,099 | 22,200,867 | 19.34%
Experimental Development | 67 | 635,313 | 3,182,917 | 19.96% | 67 | 1,236,897 | 7,693,749 | 16.08%
n = number of institutions with imputed values
Cost Elements of R&D (Question 12)
Question 12 had eight different variables that sum to the known value of total R&D expenditures
(source_tot). In addition to total value, three of the variables were known from other questions:
Unrecovered Indirect Cost (Question 1), Total Pass-Through (Question 8), and Total Capitalized
Equipment (Question 14). If all of Question 12 was missing, the values for these three variables
were taken from the corresponding variables in the other questions. In many cases, those values
were also missing. For example, if unrecovered indirect cost from Question 1 was missing, it must
also be a missing value for Question 12.
As with Question 1, values in Question 12 can only be imputed if the entire question is missing.
The imputation of values for individual missing fields would necessarily impact the total R&D
reported by the institution, and it was decided that the total R&D reported by an institution would
not be altered through imputation.
If an institution reported data for Question 12 in FY 2017, imputation was based on last year’s
data. Values that were not already imputed as part of other questions were based on the relative
proportion of Total R&D Expenditures reported in FY 2017. For institutions that moved from the
short form population to the standard form population, the proportion from the most recent
standard form data (FYs 2011–16) was used, if reported.
If there were no reported data from last year, a logistic regression model was employed to identify
cases where values should be imputed as zero. Logistic models were run for each of the unknown
Question 12 variables. PROC LOGISTIC was run separately for public and private institutions
using the continuous variables Federal R&D Expenditures, HDO, and MedS. If the predicted value
($\hat{p}$) was less than 0.5, the variable in question was imputed as 0.
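The zero-imputation rule can be sketched as follows. This is a minimal illustration in Python, not the SAS code used for the survey. The coefficient values and function names are hypothetical, and the sketch assumes that $\hat{p}$ is the fitted probability that the variable is nonzero, which is the reading the 0.5 cutoff implies.

```python
import math

def predicted_probability(intercept, coefs, predictors):
    """Logistic-model probability for one institution.

    intercept and coefs stand in for fitted PROC LOGISTIC estimates;
    the values used below are made up for illustration.
    """
    linear = intercept + sum(b * x for b, x in zip(coefs, predictors))
    return 1.0 / (1.0 + math.exp(-linear))

def impute_as_zero(p_hat, cutoff=0.5):
    """Rule from the text: a predicted value below the cutoff means impute 0."""
    return p_hat < cutoff

# Hypothetical coefficients and predictor values, in the order
# (Federal R&D Expenditures, HDO, MedS).
p_hat = predicted_probability(-2.0, [0.00001, 0.5, 1.0], [50_000, 2, 0])
zero_flag = impute_as_zero(p_hat)
```

In practice one such model would be fitted for each unknown Question 12 variable, separately for public and private institutions, as described above.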
The next step was the imputation of the nonzero values for unknown expenditures. For each
variable, the mean expenditure was calculated for the non-missing values within each imputation
class determined by TOC, HDO, and the quartiles of Total R&D Expenditures.
For this question, there was an additional normalization step in the imputation procedures (see
below). The normalization step ensures that the variables in each column sum to the known total.
• A total was computed as a sum of the class means for each variable plus the values of the
known variables.
• A percentage of the total was computed for each variable being imputed from the class mean
(i.e., not the known values) (mean / total).
• That percentage was applied to the known total minus the known values to compute the
imputed value.
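The normalization steps above can be sketched in Python. The description of the denominator is open to more than one reading; this sketch takes each share relative to the sum of the class means of the variables being imputed, which is the reading under which the imputed and known values add up exactly to the known total. The function name, variable names, and dollar figures are illustrative assumptions, not survey values.

```python
def normalize_imputed(class_means, known_values, known_total):
    """Scale class means so imputed plus known values equal known_total.

    class_means:  {variable: class mean} for the variables being imputed
    known_values: {variable: value} for variables known from other questions
    known_total:  the institution's total R&D expenditures (source_tot)
    """
    mean_sum = sum(class_means.values())
    if mean_sum == 0:
        raise ValueError("class means sum to zero; shares are undefined")
    # Amount left to distribute once the known variables are accounted for.
    remainder = known_total - sum(known_values.values())
    return {name: remainder * mean / mean_sum
            for name, mean in class_means.items()}

# Made-up figures (dollars in thousands).
class_means = {"salaries": 1200.0, "other_direct": 600.0,
               "recovered_indirect": 200.0}
known_values = {"unrecovered_indirect": 50.0, "passed_through": 150.0}
imputed = normalize_imputed(class_means, known_values, known_total=1200.0)
```

Here the class means sum to 2,000 but only 1,000 remains after the known values, so each class mean is scaled by one half.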
Table 22 lists summary data on total imputed amounts and rates by type of cost.
Table 22. Imputed and Aggregate Amounts for Total R&D by Type of Cost
(amounts are dollars in thousands)
| Type of Cost | n | Imputed | Total | % Imputed |
| --- | --- | --- | --- | --- |
| Wages, Salaries, Fringe Benefits | 27 | 344,725 | 34,766,504 | 0.99% |
| Noncapitalized Software | 27 | 3,143 | 111,024 | 2.83% |
| Capitalized Software | 28 | 244 | 11,451 | 2.13% |
| Capitalized Equipment | 25 | 7,574 | 2,145,595 | 0.35% |
| Passed through | 23 | 22,591 | 6,344,005 | 0.36% |
| Other Direct Costs | 28 | 360,561 | 17,607,097 | 2.05% |
| Recovered Indirect | 27 | 112,711 | 12,764,810 | 0.88% |
| Unrecovered Indirect | 14 | 8,671 | 5,535,380 | 0.16% |
| Total Indirect | 27 | 137,299 | 18,300,190 | 0.75% |
n = number of institutions with imputed values
Headcount for R&D Personnel (Question 15)
Question 15 had three different variables: R&D Principal Investigators (personnel_pi_count),
Other R&D Personnel (personnel_oth_count), and Total Personnel (personnel_tot_count).
Question 15 is the only item in the survey that does not request expenditures. Alternative
procedures were developed because the procedures applied to the imputation of expenditure values
could not be used accurately here.
If values for this question were reported last year, the same values were pulled forward and flagged
as imputed for FY 2018. If there were no reported data from last year, the imputations of
personnel_pi_count, personnel_oth_count, and personnel_tot_count were performed in a stepwise
manner. We first imputed personnel_pi_count and personnel_oth_count, then personnel_tot_count
was computed from the two imputed values.
For personnel_pi_count (principal investigators), we developed regression models separately for
public and private institutions using PROC REG with the independent variables Total R&D
Expenditures, HDO, and q12blank (a dichotomous variable based on the completion of Question
12). Predicted values were applied as follows to impute missing personnel_pi_count: if the
predicted value is less than 0, personnel_pi_count = 0; otherwise, personnel_pi_count = predicted
value rounded to the nearest integer.
Following the imputation of personnel_pi_count, we then modeled personnel_oth_count (other
personnel) using the independent variables Total R&D Expenditures, HDO, q12blank, and
personnel_pi_count.
The final steps consisted of rounding each component and summing them to obtain
personnel_tot_count.
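The rounding and summing steps can be sketched in Python. The inputs are assumed to be the regression predictions already produced by the two models. Two details are assumptions of this sketch rather than statements from the procedure: halves are rounded up, and the floor at zero stated for personnel_pi_count is also applied to personnel_oth_count. The function names are illustrative.

```python
import math

def to_headcount(predicted):
    """Negative predictions become 0; others round to the nearest integer."""
    if predicted < 0:
        return 0
    # Half values round up here; the tie-breaking rule is an assumption.
    return math.floor(predicted + 0.5)

def impute_personnel(pi_predicted, oth_predicted):
    """Return (personnel_pi_count, personnel_oth_count, personnel_tot_count)."""
    pi_count = to_headcount(pi_predicted)
    oth_count = to_headcount(oth_predicted)
    # The total is computed from the two rounded components, not modeled.
    return pi_count, oth_count, pi_count + oth_count

# Made-up predicted values for two institutions.
first = impute_personnel(12.4, 30.6)
second = impute_personnel(-3.2, 41.5)
```

Computing the total from the rounded components guarantees that the three reported counts are internally consistent.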
Table 23 lists summary data on total imputed amounts and rates by personnel type.
Table 23. Imputed and Aggregate Personnel Headcounts by Personnel Type
| Personnel Type | n | Imputed | Total | % Imputed |
| --- | --- | --- | --- | --- |
| PIs | 40 | 9,191 | 163,638 | 5.62% |
| Other Personnel | 55 | 65,296 | 784,005 | 8.33% |
| Total | 55 | 79,729 | 947,643 | 8.41% |
n = number of institutions with imputed values
Retro-imputation
The last step in the imputation process is backcasting, or retro-imputation, of
previous years’ imputed data. If an institution reports expenditures after 1 year or more of
nonresponse, the current year’s data are used to re-impute previous years’ data. Retro-imputation
is conducted for both unit and item nonresponses. Beginning with the FY 2013 cycle, data for years
prior to FY 2010 were no longer retro-imputed. (It was determined that the possible changes to any imputed
values prior to FY 2010 would be too minor to justify the additional effort.) Although values
imputed prior to FY 2010 were no longer retro-imputed in FY 2013, reported values from those
cycles continued to be used to retro-impute imputed values for FYs 2010–12. Beginning with the
FY 2014 cycle, reported values from survey cycles prior to FY 2010 were no longer used during
retro-imputation in any way. All institutions that have been part of the population since FY 2009
have reported more recent data.
During the recoding process occurring prior to imputation, some institutions or their imputed data
were removed from past-year records based on additional information collected during the current
cycle. The most likely source of this information was the population review. Institutions are sent
a screener asking about their R&D expenditures in the previous fiscal year. The FY 2018
population review screener asked institutions to categorize their FY 2017 R&D expenditures as
one of the following: no expenditures, less than $150,000, between $150,000 and $999,999, or
$1 million or more. Four institutions that had been imputed as unit nonresponders during the FY
2017 cycle responded to the screener sent prior to the FY 2018 cycle to say that their FY 2017
expenditures were less than $150,000. Because this new information negated the numbers imputed
in FY 2017, the FY 2017 imputed values were removed, and the institutions were excluded from
the FY 2017 totals and population counts.
Imputing Back to a Reported Year
Retro-imputation is applied when data are reported following a period of nonresponse. For
example, if data were reported for FY 2010 and FY 2018 but not for the intervening years, the
difference between the reported figures for each item total would be calculated and evenly
distributed across the intervening years (FYs 2011–17) as follows:
Equation 6: $\hat{\hat{y}}_i^v = y_i^u + \frac{v-u}{t-u}\,(y_i^t - y_i^u)$
where $\hat{\hat{y}}_i^v$ is the calculated value of imputed variable $\hat{y}_i^v$ for year v,
$y_i^u$ is the reported value for variable $y_i$ for earlier year u,
$y_i^t$ is the reported value for variable $y_i$ for current year t,
and t > v > u.
The highest-level value for each question, which is typically a key value, is imputed for missing
years. The new figures are then spread across the lower-level detail figures on the basis of the most
recent reporting pattern. This is similar to equation 3, except that the ratio of detail data to key data
for the current year is being used to impute past years.
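Equation 6 and the subsequent spreading step can be sketched in Python. The function names and figures are illustrative, and the sketch assumes that the current year's detail values sum to the current year's key value.

```python
def retro_impute_between(y_u, y_t, u, t):
    """Equation 6: key value for every year v with u < v < t."""
    return {v: y_u + (v - u) / (t - u) * (y_t - y_u)
            for v in range(u + 1, t)}

def spread_to_detail(key_value, current_detail):
    """Split a retro-imputed key value by the current year's detail shares."""
    current_key = sum(current_detail.values())
    if current_key == 0:
        raise ValueError("current-year key value is zero; shares are undefined")
    return {name: key_value * amount / current_key
            for name, amount in current_detail.items()}

# Made-up example: reported in FY 2010 and FY 2018, missing in between.
totals = retro_impute_between(y_u=1000.0, y_t=1800.0, u=2010, t=2018)
detail_2014 = spread_to_detail(totals[2014],
                               {"federal": 900.0, "nonfederal": 900.0})
```

With these figures the difference of 800 is distributed in equal steps of 100 across FYs 2011–17, and each retro-imputed total is then split using the FY 2018 reporting pattern.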
Retro-imputing When There Is No Previously Reported Year
If an institution reports after a period of nonresponse but there was no previous reported year, we
apply the reverse of the relevant imputation factor for that variable and year:
Equation 7: $\hat{y}_{ii}^{t-1} = (1 - \hat{B}_i^t)\, y_{ii}^t$
where $\hat{y}_{ii}^{t-1}$ is the imputed value of key variable $y_i$ for institution i for year t-1,
and $\hat{B}_i^t$ is the inflator/deflator factor for key variable $y_i$ in year t (see equation 1b).
This approach applies only to key variables, the ones imputed based on imputation factors. To
retro-impute lower-level values, we apply the ratio of detail data to key data for the current year.
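Equation 7 can be sketched in Python as a single backward step. Chaining the step across several missing years, each with its own year's factor, is an assumption of this sketch about how multiple years would be handled. The factor values and function names are made up for illustration.

```python
def retro_impute_previous_year(y_t, b_t):
    """Equation 7: prior-year key value as (1 - b_t) * y_t."""
    return (1.0 - b_t) * y_t

def retro_impute_chain(y_reported, reported_year, factors, first_year):
    """Step back one year at a time from reported_year to first_year.

    factors maps each year t to the factor for the key variable in year t.
    Chaining across years is an assumption of this sketch.
    """
    values = {}
    current = y_reported
    for year in range(reported_year, first_year, -1):
        current = retro_impute_previous_year(current, factors[year])
        values[year - 1] = current
    return values

# Hypothetical factors for one key variable.
factors = {2018: 0.04, 2017: 0.05}
back = retro_impute_chain(1000.0, 2018, factors, first_year=2016)
```

Each retro-imputed key value would then be distributed to the lower-level items using the current year's ratio of detail data to key data.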
All questions are retro-imputed except those that are not imputed at all: Questions 1.1, 10, and 13,
the question asking for ARRA expenditures (removed during the FY 2015 cycle), and the one asking
for a headcount of postdocs (removed during the FY 2016 cycle). Question 15, which was not reported
at the institution level prior to FY 2012, is retro-imputed back to FY 2012 only. Because Question 15 is
not imputed using inflator/deflator factors or as a proportion of a reported expenditure amount,
past-year values are retro-imputed with the values reported in the current year.
File Type | application/pdf |
Author | Gibbons, Michael |
File Modified | 2020-12-08 |
File Created | 2020-12-08 |