
Study of Teacher Preparation in Early Reading Instruction

OMB: 1850-0817

AMERICAN INSTITUTES FOR RESEARCH
MEASURING TEACHER KNOWLEDGE OF THE NRP: AN INSTRUMENT AND PILOT TEST RESULTS
December 30, 2005
Prepared by:
American Institutes for Research
1000 Thomas Jefferson St. NW
Washington, DC 20007-3835

Prepared for:
U.S. Department of Education
Tracy Rimdzius
555 New Jersey Ave., NW, Room 500K
Washington, DC 20208-5500



Table of Contents

Introduction
Method
   Participants
   Test Versions
   Procedure
Scoring & Analysis Plan
   Item Scoring
   Analysis Procedures
Results
   Missing Data
   Item Analysis Results
   Effect of Teaching Experience
Summary
Appendix A. Multiple-Choice Descriptive Data
Appendix B. Multiple-Choice Item-Total Correlations
Appendix C. Constructed-Response Descriptive Data
Appendix D. Constructed-Response Item-Total Correlations
Appendix E. Item Difficulties by Teacher Experience


Introduction
This report describes a pilot test of a set of items designed to assess pre-service
teacher knowledge of the five critical components of early reading instruction as defined
by the National Reading Panel (NRP). This report is intended to supplement the Revised
Study Design document, which describes the development and selection of these items.
Therefore, this report will not review the item development process. It focuses
specifically on the properties of the items we propose to use for assessing pre-service
teacher knowledge.
Although item development will not be revisited here, it is important to note that
the items we propose to use (108 multiple-choice and 24 constructed-response items)
were part of a larger pool of items that were pilot tested. Table 1 lists the total number of
items pilot tested, the different item types, and the components that each item was
developed to measure. The four components listed in Table 1 comprise the teacher knowledge of student content engagement (TK-SCE) framework, which was described in the Revised Study Plan.

Table 1. Pilot Test Items, by Type and Component

Item Type                   Component 1:     Component 2:    Component 3:     Component 4:   Cross-       Total
                            Subject Matter   Occasion for    Physiological    Motivation     Component
                            Content Level    Processing      Readiness
Likert Scale                      28              23              15               62            –          128
Multiple-Choice                   72              48              21               31            –          172
Constructed-Response              25              15               5                9            –           54
Situational Judgment              15              24              14               22            –           75
Q-Sort / Checklist                10               8               –                –            –           18
High Fidelity Simulation           –               2               2                –            2            6

The 108 multiple-choice and 24 constructed-response items that are proposed for the teacher knowledge assessment were aligned with the five NRP components during item development and are viewed as subcomponents of the Subject Matter Content Level and Occasions for Processing constructs (refer to Table 2). Q-sort and Likert items were deemed inappropriate for the pre-service teacher assessment because these items are not objectively scored, while the high-fidelity simulation and situational judgment items are too experimental in nature to be considered at this point.


Table 2. Number of Items Aligned with the NRP Components

NRP Component         Multiple-Choice   Constructed-Response   TOTAL
Phonemic Awareness          16                    3               19
Phonics                     22                    4               26
Fluency                     10                    5               15
Vocabulary                  33                    8               41
Comprehension               27                    4               31
TOTAL                      108                   24              132

The main objective of the pilot test, therefore, was to collect preliminary data on
the items presented in Table 1. For this report, we will only examine data on the items
presented in Table 2. The goal of these analyses will be to answer the following
questions:
♦ Are the items appropriately difficult?
♦ Are the multiple-choice distractors functioning correctly?
♦ Are the items reliable?
♦ Does item difficulty vary by level of experience?

The answers to these questions will inform the technical working group (TWG)
and Department of Education (ED) regarding the properties of the proposed assessment
so that an informed decision can be made about exercising the optional tasks associated
with this project.

Method
Participants
Participants in the pilot test were selected to be reflective of teachers who might
take an operational version of the TK-SCE survey, not a teacher knowledge assessment to
be administered to pre-service teachers. Therefore, all participants except for two had
experience teaching in the primary grades. Basic demographics on these participants are
described below.
Teacher Recruitment
Current or recent primary grade teachers were recruited to participate in the pilot
test. All teachers who had taught kindergarten, 1st, or 2nd grade in a public school in the
last three years were eligible. The only additional requirement was that teachers be willing and available to participate in the pilot test for the entire four-hour period.


Five locations were chosen for pilot testing in order to have representation from
different parts of the country. Teachers from each area were eligible for participation.
The five areas were:
♦ Raleigh / Durham, North Carolina
♦ Chicago, Illinois
♦ St. Louis, Missouri
♦ Dallas, Texas
♦ San Diego, California

A letter, fact sheet, and description of the study were sent to all district
superintendents in the selected sites. About a week after the mailing, telephone interviewers began contacting superintendents to recruit and schedule districts. Recruitment was slow at first: interviewers faxed and re-mailed the materials as requested and made numerous callbacks before receiving answers about participation in the pilot study. Some districts were eager to participate and saw the teacher incentive as an opportunity for their teachers to earn extra money; other districts were not interested and were not motivated by the incentive.
As sessions were scheduled, the date, time, location, and contact person’s information were entered into the receipt control system. Teachers’ names, emails, and
phone numbers were also recorded so that checks could be requested in advance of the
sessions. This information was provided to the interview teams in advance of the sessions
so that they could send reminder emails to teachers.
Teacher Demographics
A total of 589 teachers participated in the pilot test. Participating teachers were
distributed across each of the five geographic regions as reflected in Table 3.

Table 3. Number of Teachers per Region

Region              State   Number of Sessions   Number of Teachers
San Diego            CA            11                    50
Dallas               TX            19                   173
St. Louis            MO            13                   125
Chicago              IL            12                   100
Raleigh / Durham     NC            17                   141

The vast majority of the participants in the pilot test were female (98%), which is
representative of elementary school teachers in the US. In providing demographic data,
10% of participants identified themselves as Black or African American; about 6%
identified themselves as Hispanic; and about 81% identified themselves as White.
Regarding age, 35% were 26-35, 22% were 36-45, and 25% were 46-55. The vast
majority of participants reported having at least a Bachelor’s degree, and many stated that
they had a Master’s degree. Most of the participants majored or minored in Elementary
Education. Individuals in the sample also reported having significant teaching experience
in early elementary (82% with four or more years) and upper elementary grades (82% with four or more years). Finally, virtually all of the participants had some sort of
teaching certificate, including a few who were working toward (4%) or had attained (3%)
their National Board certification.

Test Versions
For the purpose of the pilot test, the pool of items was split and two alternate
versions of the survey were created (Version 1 and 2). We had to create separate
versions due to the total number of items written and the desire to pilot as many items of
this pool as possible.
Although creating alternative versions of the survey allowed us to collect data on
as many items as possible, it did create some challenges. For example, because no individual completed items on both Version 1 and Version 2, items from the two versions could not be correlated with each other. The 108 multiple-choice and 24 constructed-response items that are the focus of the current report were distributed so that 56 multiple-choice and 12 constructed-response items appeared on Version 1 and 52 multiple-choice and 12 constructed-response items appeared on Version 2. Thus, we pilot tested two shorter versions of the teacher knowledge assessment (i.e., alternate forms).
In addition to creating two alternate versions of the survey, we counter-balanced
sections of the survey within each version to guard against order and fatigue effects. For
example, we did not want any particular item type to always appear last and hence not be
reached. This counterbalancing process produced five differently ordered forms of each
version of the survey (i.e., 10 unique forms in total). Table 4 presents the number of
teachers who received each form.

Table 4. Number of Teachers per Form

Form   Version   Number of Teachers
  1       1             62
  2       2             64
  3       1             66
  4       2             77
  5       1             63
  6       2             66
  7       1             56
  8       2             55
  9       1             36
 10       2             44

Similar to the creation of Version 1 and 2 of the survey, the counterbalancing of
items within each version had advantages and disadvantages. On the positive side, we
guarded against order and fatigue effects, which are common with long assessments.
Also, we were able to collect some data from some respondents on each item that was
developed. On the negative side, splitting the item pool in half and counterbalancing
produced smaller than desirable numbers of respondents for some items. Sample sizes per item and our approach to dealing with this issue are discussed in the Results section.


Procedure
Two individuals administered the survey in each location; thus, a total of ten
administrators were used. All of these administrators had previous experience with
various data collection projects.
Survey administrators completed a two-day comprehensive training course on
how to conduct the pilot test. During this training session, the project was introduced and
the procedures were described in detail. In addition, much of the training involved
familiarizing the administrators with the computers and the application that was used for
data collection. Administrators spent time practicing the computer set-up process and the
data saving procedures. The Administrator Guide that was used in training is available
upon request.
Data collection occurred between September and November, 2005. All forms of
the survey were administered to participating teachers on laptop computers. Computers
were not connected to the Internet or to a network, but operated as independent machines
with the software containing the items resident on each laptop. Teacher responses were
directly saved to the hard drive on the laptop. Upon completion of the survey, the
administrators saved the results to blank CDs via the CD-ROM drive built into all of the computers.
In most cases, data collection occurred at local schools that volunteered to provide
meeting space. Up to ten participants were scheduled for each session and given
instructions about the project. The two administrators were scheduled to arrive one hour
before data collection was to begin. During this time, they introduced themselves to
school personnel and set up the meeting room for data collection. This mainly involved
setting up the laptops in the room. Most sessions occurred either after school or on the
weekends; because of this and the time requirements, food was provided for participants.
Each pilot test session lasted four full hours and was broken into five smaller, time-limited test sections. At the scheduled start time, the
administrators commenced the check-in procedures and gave an overview of the project.
Then, participants started the survey. The first section for everyone was the Opinion
(Likert) items. Upon completing that section, participants started Section 2, the content of which varied by the form each individual was completing. Sections 1 and 2 took a combined 80 minutes, after which there was a ten-minute break. Section 3 lasted 50
minutes, which was followed by a ten-minute break. Section 4 also took 50 minutes, and
was directly followed by Section 5, the background section and check-out, which lasted
20 minutes. Because of the large number of items allocated to each section, very few
participating teachers were able to answer all of the items of a given form.


Scoring & Analysis Plan
This section describes the analyses that were conducted. Prior to analyzing the data, the items needed to be scored. The scoring process for the multiple-choice and
constructed-response items is described next.

Item Scoring
Multiple-Choice Items
We designed the multiple-choice items to have one clear, best response.
Participants received credit for selecting the right choice out of the alternatives provided
(A, B, C, or D). Participants were not instructed that they would be penalized for
skipping or failing to complete a certain number of items due to time. As a result,
respondents varied significantly in the number of multiple-choice items they actually
completed.
Constructed-Response Items
The constructed-response items required the participants to respond in writing to
open-ended questions. While this item format measures a unique type of knowledge that
is different from that measured by multiple-choice items, it brings with it some clear
challenges when scoring the items. The primary challenge involves having raters score
the responses in a standardized, reliable, and valid manner. In response to this challenge,
we devised an approach that utilizes specific scoring protocols, multiple raters, and expert
judges. During the development of the constructed-response items, item writers created
scoring rubrics, or standardized scoring keys, that describe how each item should be
scored. Following data collection, raters scored the items using these rubrics, which
defined correct and incorrect answers. Thus, raters were to make judgments as to
whether the response was deemed correct (2 points), partially correct (1 point), or
incorrect (0 points or no credit). There were nine raters: three were subject matter experts in elementary school teacher education or early reading instruction, and six were research assistants working on the project.
All raters were trained to use the rubrics and the scoring program. During the
training, raters were provided with several items to score and examples of acceptable and
unacceptable responses. The raters scored all the items independently and then convened
to discuss their scores and the rationale for their decisions. Through discussion, raters
began successfully reaching consensus on ratings. The process was repeated and the
raters made progress in their observations and rationales. The goal of the training was to
improve judgments and accuracy by teaching the raters to share similar schemas of
correct and incorrect responses. After training, preliminary reliability and accuracy checks were conducted prior to commencing the actual constructed-response scoring. Intra-class correlations, percent agreement among raters, correlations among raters, and agreement indices between the six research assistants and the three subject matter experts were calculated.


The results suggest that the raters were reliable and accurate, which increased our
confidence in the quality of the item scoring. Table 5 presents the intra-class correlation
obtained after rater training. Conventions based on past research suggest that intra-class
correlations less than .40 are considered “poor,” between .40 and .59 are considered
“fair,” .60 to .74 are considered “good” and intra-class correlations above .74 are
considered “excellent.” The results show that most of the obtained intra-class
correlations were in the good or excellent categories.

Table 5. Intra-Class Correlations among Raters Obtained After Rater Training

Rater Pair                Intra-Class Correlation
Rater 1 and Rater 2                 0.92
Rater 1 and Rater 3                 0.90
Rater 4 and Rater 2                 0.89
Rater 4 and Rater 3                 0.85
Rater 5 and Rater 4                 0.74
Rater 5 and Rater 6                 0.70
Rater 7 and Rater 1                 0.69
Rater 6 and Rater 8                 0.68
Rater 6 and Rater 3                 0.67
Rater 9 and Rater 2                 0.66
Rater 9 and Rater 7                 0.65
Rater 5 and Rater 8                 0.61
Rater 4 and Rater 7                 0.61
Rater 6 and Rater 7                 0.61
Rater 6 and Rater 2                 0.65
Rater 1 and Rater 8                 0.56
Rater 9 and Rater 3                 0.56
Rater 5 and Rater 3                 0.53
Rater 5 and Rater 9                 0.53
Rater 5 and Rater 2                 0.49
Rater 5 and Rater 1                 0.44
Rater 4 and Rater 8                 0.25
Average ICC                         0.64

Table 6 presents the percent agreement among raters after training. These data show that, on average, raters agreed more than 83% of the time. These findings further demonstrate the effectiveness of rater training and indicate that the constructed-response items can be reliably scored.
Even though the raters were found to be reliable, all of the constructed-response
items were scored by at least two raters as a final consistency check. Scores for each
participant were calculated by computing the average score of the two ratings. Scores for
each item ranged from 0 to 2.


Table 6. Average Percent Agreement

Item ID     Average Percent Agreement
sam_01               79.6
sao_14               82.4
sao_17               87.4
sap_01               86.4
sas_04               76.6
sas_05               82.6
sas_08               92.6
sas_09               87.2
sas_11               82.5
sas_15               73.1
sas_16               88.0
All Items            83.5
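
To make the agreement calculation concrete, the sketch below shows one way item-level percent agreement between two raters could be computed from double-scored responses. This is an illustration only, not the scoring program used for the pilot; the function name and the toy data are hypothetical, and exact agreement on the 0/1/2 rubric is assumed as the criterion.

```python
from collections import defaultdict

def percent_agreement(ratings):
    """ratings: iterable of (item_id, rater_a_score, rater_b_score) tuples,
    where each score is 0, 1, or 2. Returns {item_id: percent exact agreement}."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for item_id, a, b in ratings:
        total[item_id] += 1
        if a == b:  # exact agreement on the 0 / 1 / 2 rubric
            agree[item_id] += 1
    return {item: 100.0 * agree[item] / total[item] for item in total}

# Hypothetical example: three double-scored responses to one item
print(percent_agreement([("sas_08", 2, 2), ("sas_08", 1, 1), ("sas_08", 2, 1)]))
# -> agreement on two of three responses, i.e., about 66.7 percent
```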

Analysis Procedures
Based on the goals of this pilot test, we analyzed all of the items for difficulty and
discrimination and examined the extent to which the items possessed internal
consistency. However, the multiple-choice and constructed-response items required
somewhat different approaches for meeting these goals. Below we describe our approach
to analysis for each item type. This section is followed by the results of the pilot test.

Multiple-Choice Items
Descriptive statistics and reliability analyses were conducted on the multiple-choice items. Regarding descriptive statistics, the percentage of respondents who
answered each multiple-choice item correctly was calculated (i.e., item difficulty) and the
number of respondents who selected each response option was determined to assess the
quality of the distractors. Regarding reliability, alpha was calculated for each set of
multiple-choice items (alpha for the items that appeared on Version 1 and an alpha for the
items that appeared on Version 2) that we propose to measure pre-service teacher
knowledge of the NRP. We also estimated alpha if all the multiple-choice items were
included on a single version of the assessment using Spearman-Brown. Finally, item-total
correlations were calculated as part of the reliability analysis (an indicator of item
discrimination) and the impact on reliability was determined if an item was removed
from the assessment.
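
To illustrate the descriptive and reliability calculations described above, the sketch below computes item difficulty, Cronbach's alpha, and corrected item-total correlations from a matrix of 0/1 item scores. It is a minimal sketch under stated assumptions (respondents in rows, items in columns, and missing responses already scored as incorrect, as described in the Results section); it is not the code used for the pilot analyses.

```python
import numpy as np

def item_analysis(scores):
    """scores: 2-D array of 0/1 item scores with shape (respondents, items)."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]

    # Item difficulty: proportion of respondents answering each item correctly
    difficulty = scores.mean(axis=0)

    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)
    item_var = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_var.sum() / total_var)

    # Corrected item-total correlation: each item vs. the sum of the remaining items
    item_total = np.array([
        np.corrcoef(scores[:, j], np.delete(scores, j, axis=1).sum(axis=1))[0, 1]
        for j in range(n_items)
    ])

    return difficulty, alpha, item_total
```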

Constructed-Response Items
Analyses for the constructed-response items included calculating the difficulty of
each item and examining reliability of these items. Reliability analyses were conducted
for constructed-response items in a fashion similar to the multiple-choice items. Item-total correlations were calculated for each item, and reliability was estimated should the item be removed from the assessment. Likewise, we estimated reliability for the complete set of items if all constructed-response questions were included on a single assessment.


Results
Below we present the results of our item analyses. However, we first discuss the
issue of missing data and how we addressed this issue.

Missing Data
A major challenge in all research, particularly when developing measures, is
missing data. It may be recalled that when designing the pilot test and the different forms
that were administered to participants, we made some significant choices. First, we
decided to include every item from the item pool in the pilot under the assumption that
having some data on every item was more important than having no data on a significant
number of items. Second, we counterbalanced sections of the test to guard against order
and fatigue effects. Although these strategies were well thought out and had certain
advantages, they did contribute to the amount of missing data.
In examining the raw responses, we identified two types of missing data. One
occurred when an item was near the end of a section, and the participant simply did not
have enough time to answer the item (Type A: Did Not See). The second type occurred
when a participant could have answered the item but intentionally skipped the item for
whatever reason (Type B: Skipped). Note that missing data were not systematically
tracked; Types A and B were determined by exploring missing data trends. We
presumed that the steady decrease of responses toward the end of each survey section
signified that teachers were running out of time and were unable to complete their items.
The multiple-choice items suffered more than any other item type from the two
forms of missing data described above. This outcome was unanticipated because of the
safeguards we employed when designing the items and structuring the pilot test. For
example, to determine the amount of time the multiple-choice items would take, we
enlisted several research assistants to complete the items as they would for an actual
examination. We timed them during this process and discovered that it required 0.5
minutes per item. To be conservative and account for the range of computer skills among
respondents, we estimated that each multiple-choice item on the actual survey would take approximately 1 minute. Based on these calculations, and the test time allocated to the
multiple-choice items (80 minutes for 88 and 83 items), we expected that most teachers
would complete the items in these sections. However, data suggested that a large number
of teachers ran out of time. Figure 1 below illustrates the downward trend in responses as
teachers approached the end of the multiple-choice items. Notice how each section
begins with over 200 teachers completing the items. However, more than half of the teachers did not reach the end of the section, with only approximately 60 respondents completing the last item.


[Figure 1. Multiple-choice item response rates by item order. Chart titled "Multiple Choice - Completed Items": the number of teachers completing each multiple-choice item, plotted in presentation order, with the vertical axis running from 0 to 250 teachers.]

One possible explanation for the large number of skipped items was the lack of a
“valued” incentive. Teachers who volunteered to participate in this pilot study were paid
a reasonable amount, but not given any explicit enticements to respond to all items and
there were no penalties for running out of time before completing each section.
Alternatively, teachers may have been skipping items that they did not know how to
answer. Given the nature of these items and the fact that they cover a wide range of
testable content, many of the participants may have recognized that some items were
beyond the scope of their knowledge. Thus, participants who knew they were not familiar with a particular domain may have seen no reason to expend effort on those items. This sizeable amount of unanticipated missing data on the multiple-choice items posed a significant challenge when examining the reliability of a set of items, which relies on complete data on each item in the analysis. Our strategy for dealing with this situation is discussed when we report the reliability results.

Item Analysis Results
The following sections present the results of our item analyses for the 108
multiple-choice and 24 constructed-response items that we propose for the pre-service
teacher knowledge assessment. For each item type, descriptive data are presented first
followed by our reliability analysis. Once these results have been reviewed, we examine
the extent to which the difficulty of the items varied as a function of participant
experience. Because the assessment will be administered to pre-service teachers, we
wanted to determine if performance on the items was a function of experience in the
classroom.


Multiple Choice Items
Item Difficulty. Overall the difficulty analysis showed that multiple-choice items
were moderate to high in difficulty (percent of respondents answering an item correctly).
The average item difficulty across all 108 items was p=.53, with difficulty ranging from a
low of p=.01, for one of the items designed to measure vocabulary, to p=.97, for one of
the comprehension items. Table 7 presents additional information on item difficulty for
the items broken down by each of the five NRP components. Referring to Table 7, item
difficulty was similar across components.

Table 7. Summary of Multiple-Choice Item Characteristics

NRP Component         Number of Items    N Range      Average Difficulty   Difficulty Range
Comprehension               27           52 - 258            0.55              .16 - .97
Fluency                     10           95 - 229            0.46              .06 - .95
Phonemic Awareness          16          139 - 273            0.51              .09 - .85
Phonics                     22           90 - 254            0.57              .10 - .91
Vocabulary                  33           53 - 238            0.53              .01 - .93
Total                      108

Appendix A presents complete descriptive data for each of the multiple-choice
items including the number of respondents, the item’s difficulty, the answer key, and the
distribution of responses across each item’s response options. Referring to the appendix,
there were 11 items that appeared to be miskeyed or had problems with the distractors. For example, for one of the fluency items (mcs_66), d was the correct answer, but 84% of the respondents selected b as the correct alternative. In such cases, the keys were checked
and verified to ensure the data were coded correctly. For the pre-service teacher
assessment, it might be best not to include such items since they seem to be either too
difficult or too confusing for the respondents.
Reliability. As described earlier, our reliability analysis focused on the internal
consistency of items. Our expectation was that items developed to measure knowledge of
the NRP should relate to one another. In other words, the 56 items on Version 1 should
be internally consistent with one another, as should the 52 items on Version 2.
To determine reliability, we calculated alpha and item-total correlations for each
version of the assessment. However, it will be recalled that missing data were most
prevalent for the multiple-choice items. Furthermore, missing data are most problematic when determining reliability because this analysis requires complete cases on the items of interest. Therefore, we treated missing data as incorrect for this round of our analyses. Although not completely desirable, this approach has been employed in other AIR high-stakes testing projects. Furthermore, treating missing data as wrong should only slightly enhance the item-total correlations and the alphas as opposed to significantly overestimating these values (AIR staff has conducted Monte Carlo studies testing this
assumption). Nonetheless, because we had to employ this approach, we view these
results as preliminary estimates.


Table 8 presents information on the range of item-total correlations and alpha for
each version of the assessment. Appendix B presents item-total correlations for all of the
multiple-choice items.

Table 8. Reliability and Item-Total Correlations for the Multiple-Choice Items on Each Version of the Assessment

             Number of Items     N     Alpha   Item-Total Range
Version 1          56           283    0.73      -.06 - .51
Version 2          52           306    0.75      -.04 - .43

Referring to Table 8, the alphas for each version of the assessment exceed .7, which is reasonably high in magnitude. Because reliability is directly related to test length, and the reliabilities in Table 8 are essentially based on half
the number of items we would administer to assess pre-service teacher knowledge, we
estimated the reliability of the proposed assessment by applying the Spearman-Brown
formula to the reliability estimate for Version 1 (the lower value). This analysis indicated
that the reliability for the entire set of multiple-choice items would be .84.
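
For reference, the Spearman-Brown projection takes the form below; the worked values simply check the reported estimate, assuming the test is lengthened from the 56 Version 1 items to the full 108-item set.

```latex
\rho_{\text{new}} = \frac{k\,\rho_{\text{old}}}{1 + (k - 1)\,\rho_{\text{old}}},
\qquad k = \tfrac{108}{56} \approx 1.93,
\qquad \rho_{\text{new}} \approx \frac{1.93 \times .73}{1 + 0.93 \times .73} \approx .84
```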

Constructed-Response Item Analysis
Item Difficulty. For dichotomously scored items, item difficulty is calculated as the proportion of individuals who answered the item correctly. However, since the constructed-response items were scored using 2, 1, and
0, for full credit, partial credit, and no credit, the conventional index of item difficulty
was not used. Therefore, we created an index of item difficulty based on a procedure
developed by the University of Iowa. This procedure yields an index of item difficulty
that ranges from 0 (extremely hard) to 1 (extremely easy), which allows for comparison
of difficulty levels to other item types.
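
The report does not reproduce the Iowa procedure itself. As a hedged illustration, an index of the form below, in which the average item score is divided by the maximum possible score (2 points), is consistent with the difficulty values reported in Appendix C; whether this is exactly the procedure that was applied is an assumption.

```latex
p_j = \frac{\bar{x}_j}{x_{\max}} = \frac{\text{average score on item } j}{2},
\qquad 0 \le p_j \le 1
```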

Table 9. Summary of Constructed-Response Item Characteristics

NRP Component         Number of Items    N Range      Average Difficulty   Difficulty Range
Comprehension                4           95 - 177            0.54              .38 - .76
Fluency                      5           90 - 126            0.60              .43 - .92
Phonemic Awareness           3           49 - 189            0.42              .25 - .61
Phonics                      4          122 - 194            0.47              .32 - .70
Vocabulary                   8           64 - 190            0.66              .51 - .94
Total                       24

Overall the difficulty analysis showed that the constructed-response items were
moderate to high in difficulty. The average item difficulty across all 24 items was p=.57,
with difficulty ranging from a low of p=.25, for one of the items designed to measure
phonemic awareness, to p=.94, for one of the vocabulary items. Table 9 presents
additional information on item difficulty for the items broken down by each of the five
NRP components. Appendix C presents complete descriptive data for each of the
constructed-response items including the number of respondents, the item’s difficulty, the
average item score, and the percent agreement among raters.


Reliability. Similar to the multiple-choice items, missing data on the constructed-response questions were treated as incorrect responses, and item-total correlations and
alphas for each version of the assessment were calculated. Table 10 reports the range of
item-total correlations for each version of the survey while Appendix D reports complete
item-level data and results.

Table 10. Reliability and Item-Total Correlations for the Constructed-Response Items on Each Version of the Assessment

             Number of Items     N     Alpha   Item-Total Range
Version 1          12           283    0.54      -.06 - .49
Version 2          12           306    0.53      -.06 - .44

Referring to Table 10, the reliabilities for the constructed-response items were
somewhat lower than the desired level (.7 or above). Similar to the multiple-choice items, we estimated the reliability for all 24 items using the Spearman-Brown formula. This analysis
indicated that the reliability for the entire set of constructed-response items would be .69.
One way to improve reliability of the constructed-response items is to remove the
items with low item-total correlations. To test the effects of this strategy, we removed
four items from Version 1 and three items from Version 2 whose item-total correlations
were approximately zero. This analysis improved the alphas for the constructed-response
items to .73 and .69 for Versions 1 and 2, respectively. Applying the Spearman-Brown
formula to the lower alpha (Version 2) produced a reliability estimate of .82 for the
remaining 17 constructed-response items.

Effect of Teaching Experience
One important question that we wanted to answer is whether or not teaching
experience affects performance on the items we propose for the teacher knowledge test.
However, because the teachers in the pilot test were currently practicing teachers and the
teacher assessment will be administered to pre-service teachers, we could only indirectly
answer this question. We were interested in this question because we had tried to develop
items that assessed both declarative and procedural knowledge, and we reasoned that
teachers who could draw on extensive experience in classroom settings might score
higher on the test.
To explore the effects of experience, we created four groups of teachers based on
the demographic data collected. The first group consisted of teachers who had three
years or less teaching experience (N= 99), the second group consisted of teachers who
had four to six years experience (N=114), the third group consisted of teachers who had
seven to nine years experience (N=103), and the fourth group consisted of teachers who
had 10 or more years experience (N=265). Next, we calculated the difficulty of the items
for each of these experience subgroups to see if the item difficulty varied by subgroup.
Table 11 reports the mean difficulty of the multiple-choice and constructed-response
questions by subgroup. For the multiple-choice items, we also calculated these difficulty estimates by NRP component. Appendix E presents difficulty estimates for each individual item by subgroup. Referring to the table and the appendix, item difficulty
varied little as a function of teaching experience.

Table 11. Mean Item Difficulty by Experience

                        Less than 3 years   4 to 6 years   7 to 9 years   10 or more years
Multiple Choice
  Comprehension               .55               .58            .56              .53
  Fluency                     .46               .47            .46              .45
  Phonemic Awareness          .51               .49            .53              .52
  Phonics                     .53               .55            .59              .58
  Vocabulary                  .53               .52            .52              .53
Constructed-Response          .55               .57            .58              .56

To further explore whether or not experience affected performance on the items,
we correlated the difficulty estimates across items for those teachers with three or fewer
years of experience with each of the more experienced groups. While the averages in Table 11 indicate whether overall difficulty was similar across groups, the correlations presented in Table 12 indicate whether the items rank consistently with respect to difficulty across the different experience groups. In other words, the results in Table 12 show that the items that were easiest and hardest were largely the same regardless of experience.

Table 12. Correlations of Item Difficulty Estimates for Teachers with Less than Three Years Experience with the More Experienced Groups

                        4 to 6 years   7 to 9 years   10 or more years
Multiple Choice
  Comprehension             .87             .92              .90
  Fluency                   .87             .95              .94
  Phonemic Awareness        .88             .90              .92
  Phonics                   .94             .90              .90
  Vocabulary                .84             .93              .92
Constructed-Response        .87             .88              .91
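
As an illustration of this subgroup analysis, the sketch below computes per-group item difficulties and their Pearson correlations with the least experienced group. It assumes a 0/1 score matrix and a vector of experience-group labels; the group labels and function names are illustrative, not the code used to produce Tables 11 and 12.

```python
import numpy as np

def difficulty_by_group(scores, groups):
    """scores: (respondents, items) array of 0/1 scores; groups: one label per respondent.
    Returns {group label: vector of per-item difficulties}."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    return {g: scores[groups == g].mean(axis=0) for g in np.unique(groups)}

def correlate_with_reference(difficulties, reference):
    """Pearson correlation of each group's item difficulties with the reference group's."""
    ref = difficulties[reference]
    return {g: float(np.corrcoef(ref, p)[0, 1])
            for g, p in difficulties.items() if g != reference}

# Hypothetical usage with made-up group labels:
# diffs = difficulty_by_group(scores, experience)   # labels such as "0-3", "4-6", "7-9", "10+"
# correlate_with_reference(diffs, "0-3")
```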

Combined, although only a proxy, the results presented here indicate that years of
teaching experience is not related to performance on these items. However, we recognize
that no pre-service teachers actually completed the items and, therefore, the items may perform differently with a highly inexperienced sample. We will rely on the TWG
to provide us guidance as to whether or not, based on the data presented and the items
themselves, the items are likely to perform differently when used to assess pre-service
teacher knowledge of the NRP.


Summary
In summary, this report presents data on the performance of a set of multiple-choice and constructed-response items that were designed to assess teacher knowledge of the NRP. Data were collected on these items from 589 teachers as part of a larger item set that was pilot tested under the Instructional Processes Research and Development project sponsored by the National Center for Education Statistics. Although not the ideal approach for determining the performance of the items presented here, the pilot test results do provide a first look at how effective these items would be at assessing pre-service teacher knowledge of the NRP. Moreover, the results should aid the TWG in
making a determination of the viability of using these items for the pre-service
assessment.
To continue to develop a better understanding of the characteristics of these items,
we are exploring the possibility of collecting additional data. Currently, the items are
being pilot tested as part of AIR’s Professional Development Impact project.
Approximately 80 completed cases on these items should be available in January 2006
(though, as with this effort, these items are embedded in a larger pilot). We would also
consider administering these items in a single assessment to a group of pre-service
teachers, should the TWG be able to assist us in identifying such a sample. Although we
are confident in our proposed assessment as is, we believe that it is important to explore
any additional options for verifying the reliability and validity of this assessment.


Appendix A. Multiple-Choice Descriptive Data


Response
Percentages
Item No.
mcs_17
mcs_31
mcs_44
mcs_45
mcs_46
mcs_47
mcs_48
mco_01
mco_02
mco_03
mco_14
mco_15
mco_16
mco_17
mco_18
mco_19
mco_27
mco_29
mco_30
mco_31
mco_32
mco_33
mco_34
mco_35
mco_45
mco_46
mco_47

NRP Component
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension

Version
2
1
1
2
1
2
2
2
1
2
2
1
2
1
2
1
1
1
1
2
2
1
1
2
1
2
1

N
258
206
137
173
152
197
190
63
52
64
109
87
129
119
135
120
157
164
161
174
187
140
159
157
216
238
225

Response
Rate
95%
83%
55%
63%
61%
72%
70%
23%
21%
23%
40%
35%
47%
48%
49%
48%
63%
66%
65%
64%
68%
56%
64%
58%
87%
87%
90%

Difficulty
0.47
0.48
0.97
0.53
0.86
0.45
0.66
0.43
0.60
0.41
0.82
0.26
0.64
0.54
0.77
0.60
0.54
0.87
0.54
0.21
0.62
0.50
0.83
0.20
0.16
0.38
0.60

Correct
Answer
1
4
4
1
3
3
2
3
4
2
4
4
2
3
4
4
3
2
4
1
1
3
1
4
2
3
4

1
47
15
1
53
5
3
11
13
2
19
0
21
14
1
13
5
24
9
1
21
62
8
83
21
25
30
28

2
5
19
2
3
7
45
66
6
31
41
7
13
64
28
5
4
8
87
3
47
3
38
15
50
16
27
3

3
2
18
1
11
86
45
10
43
8
30
11
40
3
54
4
31
54
3
42
1
24
50
0
9
20
38
8

4
45
48
97
33
2
7
13
38
60
11
82
26
19
18
77
60
15
1
54
31
12
4
2
20
39
5
60


Response
Percentages
Item No.
mcs_32
mcs_34
mcs_35
mcs_36
mcs_50
mcs_64
mcs_65
mcs_66
mco_20
mco_36

NRP Component
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency


Version
2
1
2
1
1
2
2
1
2
2

N
229
204
220
193
158
141
110
95
160
215

Response
Rate
84%
82%
81%
78%
63%
52%
40%
38%
59%
79%

Difficulty
0.66
0.37
0.31
0.40
0.54
0.14
0.55
0.06
0.95
0.62

Correct
Answer
2
2
3
4
2
2
4
4
3
1

1
29
61
33
1
40
25
10
8
1
62

2
66
37
31
21
54
14
9
84
3
8

3
6
1
31
39
2
10
26
2
95
9

4
0
2
6
40
4
52
55
6
1
21


Response
Percentages
Item No.
mcs_01
mcs_02
mcs_03
mcs_04
mcs_05
mcs_13
mcs_14
mcs_15
mcs_16
mcs_55
mcs_56
mco_23
mco_24
mco_25
mco_26
mco_43

NRP Component
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness
Phonemic Awareness


Version
1
2
1
2
1
2
1
2
1
2
1
1
2
1
2
1

N
249
273
192
270
204
166
237
249
239
183
139
139
161
149
193
219

Response
Rate
100%
100%
77%
99%
82%
61%
95%
91%
96%
67%
56%
56%
59%
60%
71%
88%

Difficulty
0.43
0.74
0.20
0.48
0.35
0.85
0.33
0.79
0.09
0.81
0.54
0.25
0.67
0.63
0.73
0.34

Correct
Answer
1
3
4
3
1
3
4
2
2
3
3
4
3
3
4
4

1
43
9
28
49
35
7
52
6
6
2
30
1
12
7
8
32

2
35
16
27
1
6
7
14
79
9
2
6
68
11
16
1
16

3
10
74
26
48
33
85
0
14
84
81
54
6
67
63
19
18

4
12
1
20
3
26
1
33
1
1
16
11
25
11
13
73
34


Response
Percentages
Item No.
mcs_06
mcs_18
mcs_19
mcs_20
mcs_21
mcs_22
mcs_24
mcs_25
mcs_26
mcs_37
mcs_38
mcs_39
mcs_51
mcs_52
mcs_68
mcs_69
mcs_70
mco_37
mco_38
mco_39
mco_40
mco_48

NRP Component
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics


Version
2
1
2
1
2
1
2
1
2
2
1
2
1
1
1
2
1
1
2
1
2
2

N
180
235
253
232
230
227
252
208
239
216
185
213
158
162
93
98
90
193
210
184
205
254

Response
Rate
66%
94%
93%
93%
84%
91%
92%
84%
88%
79%
74%
78%
63%
65%
37%
36%
36%
78%
77%
74%
75%
93%

Difficulty
0.75
0.84
0.78
0.89
0.39
0.53
0.90
0.59
0.74
0.71
0.21
0.10
0.76
0.53
0.91
0.37
0.36
0.78
0.47
0.25
0.33
0.31

Correct
Answer
4
2
4
2
1
2
3
2
1
3
2
4
1
3
4
3
4
4
1
1
3
3

1
24
1
0
3
39
12
2
16
74
4
64
16
76
19
3
32
30
6
47
25
7
39

2
0
84
0
89
31
53
4
59
2
12
21
32
19
4
4
16
8
16
9
33
25
7

3
1
11
22
6
16
32
90
18
8
71
10
43
4
53
1
37
27
1
28
5
33
31

4
75
4
78
2
14
4
5
7
16
13
5
10
1
24
91
15
36
78
17
36
36
23


Response
Percentages
Item No.
mcs_07
mcs_08
mcs_09
mcs_10
mcs_11
mcs_27
mcs_29
mcs_30
mcs_40
mcs_41
mcs_42
mcs_43
mcs_53
mcs_54
mcs_71
mcs_72
mcs_73
mcs_74
mcs_75
mcs_76
mco_04
mco_05
mco_06
mco_07
mco_08
mco_09
mco_10

NRP Component
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary


Version
1
2
1
2
1
1
1
2
1
2
1
2
2
1
2
1
2
1
2
1
1
2
1
2
1
2
1

N
205
165
212
168
187
216
217
238
132
169
146
150
184
149
97
79
94
64
72
66
53
66
61
73
62
96
88

Response
Rate
82%
60%
85%
62%
75%
87%
87%
87%
53%
62%
59%
55%
67%
60%
36%
32%
34%
26%
26%
27%
21%
24%
24%
27%
25%
35%
35%

Difficulty
0.49
0.01
0.20
0.78
0.82
0.60
0.83
0.93
0.72
0.53
0.32
0.89
0.75
0.34
0.47
0.24
0.13
0.48
0.69
0.77
0.15
0.67
0.46
0.52
0.23
0.55
0.53

Correct
Answer
1
1
4
3
2
3
2
4
2
1
3
3
2
3
3
4
2
4
3
2
4
3
2
4
3
3
1

1
49
1
60
9
2
7
6
1
1
53
32
7
2
9
13
10
15
17
1
2
13
20
36
4
5
25
53

2
6
7
8
13
82
4
83
0
72
6
26
1
75
11
34
24
13
31
26
77
30
3
46
11
16
2
10

3
11
90
12
78
2
60
2
6
21
29
32
89
2
34
47
42
70
3
69
21
42
67
13
33
23
55
32

4
34
2
20
0
14
29
10
93
6
12
10
3
21
54
5
24
2
48
3
0
15
11
5
52
57
18
5


Response
Percentages
Item No.
mco_11
mco_12
mco_21
mco_22
mco_41
mco_42

NRP Component
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary


Version
2
1
1
2
1
2

N
94
83
128
180
214
237

Response
Rate
34%
33%
51%
66%
86%
87%

Difficulty
0.27
0.40
0.48
0.82
0.67
0.73

Correct
Answer
3
4
1
1
3
3

1
17
28
48
82
16
13

2
27
18
9
13
4
3

3
27
15
41
2
67
73

4
30
40
2
3
13
11


Appendix B. Multiple-Choice Item-Total Correlations


Multiple Choice Item-Total Statistics for Version 1

mco_02_cscore
mco_04_cscore
mco_06_cscore
mco_08_cscore
mco_10_cscore
mco_12_cscore
mco_15_cscore
mco_17_cscore
mco_19_cscore
mco_21_cscore
mco_23_cscore
mco_25_cscore
mco_27_cscore
mco_29_cscore
mco_30_cscore
mco_33_cscore
mco_34_cscore
mco_37_cscore
mco_39_cscore
mco_41_cscore
mco_43_cscore
mco_45_cscore
mco_47_cscore
mcs_01_cscore
mcs_03_cscore
mcs_05_cscore
mcs_07_cscore
mcs_09_cscore
mcs_11_cscore
mcs_14_cscore
mcs_16_cscore
mcs_18_cscore
mcs_20_cscore
mcs_22_cscore
mcs_25_cscore
mcs_27_cscore
mcs_29_cscore
mcs_31_cscore
mcs_34_cscore
mcs_36_cscore
mcs_38_cscore
mcs_40_cscore
mcs_42_cscore
mcs_44_cscore
mcs_46_cscore

Scale Mean if
Item Deleted
15.7915
15.8728
15.7986
15.8481
15.7350
15.7845
15.8163
15.6714
15.6466
15.6855
15.7774
15.5689
15.6007
15.3922
15.5936
15.6502
15.4311
15.3640
15.7350
15.3958
15.6360
15.7739
15.4205
15.5159
15.7597
15.6466
15.5406
15.7456
15.3569
15.6219
15.8233
15.2014
15.1661
15.4735
15.4664
15.4452
15.2650
15.5512
15.6325
15.6219
15.7633
15.5618
15.7314
15.4276
15.4346


Scale Variance
if Item Deleted
34.698
35.083
34.771
35.101
34.593
34.914
34.739
34.321
33.875
34.060
34.464
33.544
32.702
32.934
32.852
33.427
32.126
32.849
33.777
33.587
33.949
34.743
33.819
33.790
34.722
34.017
34.469
34.871
34.081
33.775
34.451
33.793
33.990
34.179
34.328
34.312
33.408
33.596
34.056
34.016
34.344
33.226
34.162
32.182
31.885

Item-Total
Correlation
0.057
-0.058
0.039
-0.058
0.062
-0.004
0.058
0.103
0.185
0.162
0.112
0.227
0.399
0.316
0.367
0.277
0.463
0.332
0.253
0.200
0.167
0.038
0.160
0.172
0.039
0.157
0.053
0.000
0.115
0.197
0.156
0.186
0.157
0.099
0.073
0.075
0.243
0.213
0.145
0.150
0.135
0.284
0.160
0.453
0.508

Alpha if Item
Deleted
0.734
0.735
0.735
0.736
0.735
0.736
0.734
0.734
0.730
0.731
0.733
0.728
0.720
0.724
0.722
0.726
0.716
0.723
0.728
0.730
0.731
0.735
0.732
0.731
0.735
0.731
0.736
0.737
0.734
0.730
0.731
0.730
0.731
0.734
0.736
0.736
0.728
0.729
0.732
0.732
0.732
0.726
0.731
0.717
0.714


Multiple Choice Item-Total Statistics for Version 1

mcs_50_cscore
mcs_51_cscore
mcs_52_cscore
mcs_54_cscore
mcs_56_cscore
mcs_66_cscore
mcs_68_cscore
mcs_70_cscore
mcs_72_cscore
mcs_74_cscore
mcs_76_cscore

Scale Mean if
Item Deleted
15.5972
15.4735
15.5972
15.7244
15.6325
15.8763
15.5972
15.7845
15.8304
15.7915
15.7173


Scale Variance
if Item Deleted
32.568
32.371
33.163
33.548
33.340
34.960
34.348
34.695
34.872
34.627
34.714

Item-Total
Correlation
0.424
0.423
0.307
0.298
0.287
0.011
0.082
0.055
0.022
0.077
0.030

Alpha if Item
Deleted
0.719
0.719
0.725
0.726
0.726
0.734
0.735
0.734
0.735
0.734
0.736


Multiple Choice Item-Total Statistics for Version 2

mco_01_cscore
mco_03_cscore
mco_05_cscore
mco_07_cscore
mco_09_cscore
mco_11_cscore
mco_14_cscore
mco_16_cscore
mco_18_cscore
mco_20_cscore
mco_22_cscore
mco_24_cscore
mco_26_cscore
mco_31_cscore
mco_32_cscore
mco_35_cscore
mco_36_cscore
mco_38_cscore
mco_40_cscore
mco_42_cscore
mco_46_cscore
mco_48_cscore
mcs_02_cscore
mcs_04_cscore
mcs_06_cscore
mcs_08_cscore
mcs_10_cscore
mcs_13_cscore
mcs_15_cscore
mcs_17_cscore
mcs_19_cscore
mcs_21_cscore
mcs_24_cscore
mcs_26_cscore
mcs_30_cscore
mcs_32_cscore
mcs_35_cscore
mcs_37_cscore
mcs_39_cscore
mcs_41_cscore
mcs_43_cscore
mcs_45_cscore
mcs_47_cscore
mcs_48_cscore

Scale Mean if
Item Deleted
17.1013
17.1046
17.0458
17.0654
17.0163
17.1078
16.8987
16.9216
16.8497
16.6928
16.7059
16.8399
16.7320
17.0719
16.8137
17.0850
16.7549
16.8693
16.9706
16.6209
16.8922
16.9314
16.5294
16.7680
16.7484
17.1830
16.7614
16.7288
16.5458
16.7908
16.5490
16.8987
16.4510
16.6111
16.4641
16.6993
16.9706
16.6895
17.1176
16.8954
16.7516
16.8889
16.9020
16.7778


Scale Variance
if Item Deleted
35.560
35.484
34.982
35.471
34.777
35.218
34.445
34.669
34.115
33.702
33.487
33.761
33.856
35.910
34.736
35.691
34.573
34.763
36.061
34.512
35.690
35.881
35.699
35.202
35.710
36.176
35.035
35.352
34.983
35.386
35.199
35.954
35.114
34.907
35.030
35.372
35.202
34.949
36.301
35.425
34.974
34.447
34.948
33.996

Item-Total
Correlation
0.180
0.207
0.276
0.171
0.298
0.295
0.300
0.265
0.345
0.395
0.434
0.408
0.370
0.062
0.224
0.128
0.245
0.231
0.004
0.256
0.065
0.034
0.058
0.137
0.050
0.072
0.166
0.110
0.183
0.107
0.144
0.017
0.181
0.188
0.193
0.106
0.179
0.178
-0.037
0.114
0.175
0.296
0.205
0.350

Alpha if Item
Deleted
0.745
0.744
0.741
0.745
0.740
0.742
0.740
0.741
0.737
0.735
0.733
0.734
0.736
0.748
0.743
0.746
0.742
0.742
0.751
0.741
0.749
0.750
0.750
0.747
0.751
0.747
0.745
0.748
0.745
0.748
0.746
0.751
0.744
0.744
0.744
0.748
0.745
0.745
0.750
0.747
0.745
0.740
0.744
0.737


Multiple Choice Item-Total Statistics for Version 2

mcs_53_cscore
mcs_55_cscore
mcs_64_cscore
mcs_65_cscore
mcs_69_cscore
mcs_71_cscore
mcs_73_cscore
mcs_75_cscore

Scale Mean if
Item Deleted
16.7386
16.7059
17.1275
16.9935
17.0719
17.0392
17.1503
17.0261


Scale Variance
if Item Deleted
33.676
33.579
36.000
35.056
35.234
35.330
35.840
34.885

Item-Total
Correlation
0.403
0.418
0.067
0.221
0.239
0.187
0.161
0.281

Alpha if Item
Deleted
0.734
0.734
0.747
0.743
0.743
0.744
0.746
0.741


Appendix C. Constructed-Response Descriptive Data


Item No.

NRP
Component

sao_11

Comprehension

sao_13

Comprehension

sao_06

Comprehension

sas_05_01

Comprehension

sao_08

Fluency

sas_14

Fluency
Fluency

sas_22
sas_13_01
sas_15

Fluency

sas_24

Fluency
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness

sao_17

Phonics

sas_09

Phonics

sas_16

Phonics

sao_18_02
sas_04

sas_08

Phonics

sao_09

Vocabulary

sas_18

Vocabulary

sas_20

Vocabulary

sas_26

Vocabulary

sas_01_01

Vocabulary

sas_11

Vocabulary

sas_19

Vocabulary

sas_21

Vocabulary


Version

N

Difficulty

Average
score

1
1
2
2
1
1
1
2
2

118
107
95
177
93
114
91
90
126

0.38
0.45
0.76
0.57
0.52
0.55
0.92
0.58
0.43

0.76
0.90
1.52
1.14
1.03
1.09
1.84
1.17
0.87

Percent
agreement
among
raters
80.93
82.24
91.58
86.35
85.48
82.46
93.41
86.11
75.66

2

120

0.61

1.21

90.00

1

189

0.25

0.51

78.04

2
1
1
1
2
2
1
1
1
2
2
2
2

49

0.40
0.32
0.52
0.70
0.32
0.64
0.67
0.52
0.55
0.85
0.60
0.51
0.94

0.80

78.57

0.65
1.04
1.41
0.64
1.28
1.34
1.05
1.09
1.69
1.20
1.03
1.88

89.86
90.26
87.04
93.44
84.33
90.54
88.37
79.69
90.00
84.04
79.51
93.59

194
190
135
122
134
111
86
64
190
166
122
117


Appendix D. Constructed-Response Item-Total
Correlations


Item No.

NRP
Component

sao_11

Comprehension

sao_13

Comprehension

sao_08

Fluency

sas_14*
sas_22

Fluency
Fluency

sas_04*

Phonemic
Awareness

sao_17*

Phonics

sas_09*

Phonics

sas_16

Phonics

sas_18

Vocabulary

Version

N

Item-Total
Correlation

Alpha if Item
Deleted

1
1
1
1
1

283
283
283
283
283

.37
.21
.37
.02
.41

.48
.51
.48
.56
.45

1
1
1
1
1

283

-.07
.06
-.06
.11
.41

.56

283
283
283
283

.55
.58
.54
.46

Note: (*) indicates items that were deleted for subsequent reliability analysis.

Item No.

NRP
Component

sao_06

Comprehension

sas_05_01

Comprehension

sas_13_01
sas_15
sao_18_02*
sas_24

Fluency
Fluency
Phonemic
Awareness
Phonemic
Awareness

sas_08*

Phonics

sao_09

Vocabulary

sas_01_01*

Vocabulary

sas_11

Vocabulary

sas_19

Vocabulary

sas_21

Vocabulary

Version

N

Item-Total
Correlation

Alpha if Item
Deleted

2
2
2
2

306
306
306
306

.44
.28
.28
.12

.44
.49
.49
.53

2

306

-.06

.58

2
2
2
2
2
2
2

306

.36
-.06
.36
-.05
.10
.47
.44

.49

306
306
306
306
306
306

.56
.47
.59
.54
.45
.43

Note: (*) indicates items that were deleted for subsequent reliability analysis.


Appendix E. Item Difficulties by Teacher Experience


Multiple –
Choice Item No.
mco_02
mco_15
mco_17
mco_19
mco_27
mco_29
mco_30
mco_33
mco_34
mco_45
mco_47
mcs_31
mcs_44
mcs_46
mco_01
mco_03
mco_14
mco_16
mco_18
mco_31
mco_32
mco_35
mco_46
mcs_17
mcs_45
mcs_47
mcs_48

NRP Component
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension
Comprehension


Version
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2

Full
Sample
0.60
0.26
0.54
0.60
0.54
0.87
0.54
0.50
0.83
0.16
0.60
0.48
0.97
0.86
0.43
0.41
0.82
0.64
0.77
0.21
0.62
0.20
0.38
0.47
0.53
0.45
0.66

3 years or
fewer
0.44
0.29
0.57
0.55
0.52
0.93
0.55
0.58
0.74
0.15
0.73
0.45
1.00
0.81
0.43
0.29
0.85
0.78
0.85
0.24
0.59
0.16
0.34
0.33
0.56
0.38
0.68

4 to 6
years
0.70
0.13
0.65
0.68
0.64
0.81
0.49
0.59
0.86
0.25
0.49
0.43
0.97
0.88
0.56
0.53
0.89
0.70
0.71
0.26
0.61
0.14
0.30
0.40
0.56
0.56
0.76

7 to 9
years
0.50
0.39
0.58
0.60
0.46
0.88
0.56
0.57
0.84
0.14
0.60
0.44
1.00
0.92
0.33
0.47
0.75
0.66
0.80
0.18
0.63
0.30
0.42
0.54
0.58
0.49
0.59

10 or
more
years
0.65
0.25
0.45
0.58
0.52
0.88
0.55
0.40
0.85
0.14
0.61
0.51
0.95
0.85
0.37
0.33
0.79
0.52
0.74
0.18
0.62
0.21
0.42
0.54
0.48
0.40
0.64


Multiple –
Choice Item No.
mcs_34
mcs_36
mcs_50
mcs_66
mco_20
mco_36
mcs_32
mcs_35
mcs_64
mcs_65

NRP Component
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency
Fluency


Version
1
1
1
1
2
2
2
2
2
2

Full
Sample
0.37
0.40
0.54
0.06
0.95
0.62
0.66
0.31
0.14
0.55

3 years or
fewer
0.27
0.44
0.47
0.06
1.00
0.62
0.58
0.37
0.11
0.68

4 to 6
years
0.40
0.39
0.64
0.17
0.97
0.53
0.74
0.33
0.13
0.39

7 to 9
years
0.42
0.27
0.52
0.05
0.97
0.59
0.60
0.28
0.19
0.71

10 or
more
years
0.37
0.44
0.53
0.03
0.90
0.68
0.68
0.28
0.12
0.49


Multiple –
Choice Item No.
mco_23
mco_25
mco_43
mcs_01
mcs_03
mcs_05
mcs_14
mcs_16
mcs_56
mco_24
mco_26
mcs_02
mcs_04
mcs_13
mcs_15
mcs_55

NRP Component
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness


Version

Full
Sample

3 years or
fewer

4 to 6
years

7 to 9
years

10 or
more
years

1

0.25

0.39

0.17

0.40

0.16

1

0.63

0.60

0.48

0.76

0.66

1

0.34

0.32

0.40

0.30

0.33

1

0.43

0.44

0.39

0.35

0.47

1

0.20

0.13

0.16

0.25

0.23

1

0.35

0.38

0.26

0.27

0.41

1

0.33

0.26

0.35

0.31

0.36

1

0.09

0.03

0.08

0.13

0.10

1

0.54

0.54

0.43

0.65

0.54

2

0.67

0.61

0.84

0.69

0.60

2

0.73

0.90

0.72

0.68

0.66

2

0.74

0.76

0.75

0.69

0.75

2

0.48

0.42

0.56

0.49

0.45

2

0.85

0.77

0.83

0.92

0.86

2

0.79

0.76

0.76

0.82

0.80

2

0.81

0.84

0.72

0.70

0.90


Multiple –
Choice Item No.
mco_37
mco_39
mcs_18
mcs_20
mcs_22
mcs_25
mcs_38
mcs_51
mcs_52
mcs_68
mcs_70
mco_38
mco_40
mco_48
mcs_06
mcs_19
mcs_21
mcs_24
mcs_26
mcs_37
mcs_39
mcs_69

NRP Component
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics
Phonics


Version
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2

Full
Sample
0.78
0.25
0.84
0.89
0.53
0.59
0.21
0.76
0.53
0.91
0.36
0.47
0.33
0.31
0.75
0.78
0.39
0.90
0.74
0.71
0.10
0.37

3 years or
fewer
0.84
0.26
0.74
0.82
0.31
0.50
0.27
0.81
0.55
0.89
0.33
0.51
0.39
0.22
0.77
0.64
0.39
0.86
0.59
0.65
0.13
0.21

4 to 6
years
0.83
0.30
0.81
0.87
0.59
0.54
0.15
0.81
0.59
0.94
0.24
0.49
0.29
0.37
0.75
0.67
0.30
0.88
0.67
0.69
0.07
0.30

7 to 9
years
0.88
0.35
0.81
0.97
0.57
0.57
0.17
0.80
0.45
0.95
0.35
0.43
0.27
0.33
0.75
0.91
0.42
0.85
0.78
0.74
0.10
0.48

10 or
more
years
0.72
0.20
0.89
0.91
0.57
0.63
0.22
0.70
0.52
0.89
0.43
0.46
0.35
0.31
0.74
0.81
0.41
0.94
0.82
0.73
0.11
0.43


Multiple –
Choice Item No.
mco_04
mco_06
mco_08
mco_10
mco_12
mco_21
mco_41
mcs_07
mcs_09
mcs_11
mcs_27
mcs_29
mcs_40
mcs_42
mcs_54
mcs_72
mcs_74
mcs_76
mco_05
mco_07
mco_09
mco_11
mco_22
mco_42
mcs_08
mcs_10
mcs_30
mcs_41

NRP Component
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary


Version
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2

Full
Sample
0.15
0.46
0.23
0.53
0.40
0.48
0.67
0.49
0.20
0.82
0.60
0.83
0.72
0.32
0.34
0.24
0.48
0.77
0.67
0.52
0.55
0.27
0.82
0.73
0.01
0.78
0.93
0.53

3 years or
fewer
0.09
0.50
0.23
0.50
0.41
0.36
0.61
0.38
0.11
0.83
0.67
0.69
0.81
0.30
0.30
0.29
0.54
0.86
0.58
0.64
0.50
0.11
0.82
0.68
0.00
0.77
0.88
0.66

4 to 6
years
0.11
0.15
0.18
0.61
0.36
0.46
0.84
0.44
0.18
0.86
0.61
0.76
0.67
0.34
0.44
0.20
0.21
0.75
0.82
0.56
0.57
0.41
0.87
0.72
0.06
0.85
0.93
0.38

7 to 9
years
0.11
0.64
0.23
0.50
0.44
0.38
0.59
0.45
0.14
0.81
0.55
0.86
0.72
0.24
0.30
0.29
0.58
0.73
0.65
0.41
0.59
0.23
0.74
0.81
0.00
0.74
0.94
0.57

10 or
more
years
0.21
0.52
0.24
0.53
0.38
0.57
0.64
0.56
0.26
0.80
0.58
0.88
0.71
0.34
0.32
0.21
0.56
0.76
0.60
0.50
0.54
0.28
0.85
0.72
0.00
0.77
0.95
0.51


Multiple –
Choice Item No.
mcs_43
mcs_53
mcs_71
mcs_73
mcs_75

NRP Component
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary


Version
2
2
2
2
2

Full
Sample
0.89
0.75
0.47
0.13
0.69

3 years or
fewer
0.97
0.74
0.67
0.17
0.71

4 to 6
years
0.84
0.79
0.43
0.11
0.67

7 to 9
years
0.97
0.70
0.38
0.14
0.78

10 or more
years
0.84
0.75
0.46
0.11
0.64


ConstructedResponse
Item No.
sao_11
sao_13
sas_05_01
sao_06
sas_14
sas_22
sao_08
sas_13_01
sas_15
sas_04
sas_24
sao_18_02
sas_09
sas_16
sao_17
sas_08
sas_18
sas_20
sas_26
sas_01_01
sas_11
sas_19
sas_21
sao_09

NRP Component

Version

Full
Sample
Difficulty

Comprehension
Comprehension
Comprehension
Comprehension
Fluency
Fluency
Fluency
Fluency
Fluency
Phonemic
Awareness
Phonemic
Awareness
Phonemic
Awareness
Phonics
Phonics
Phonics
Phonics
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary
Vocabulary

1
1
2
2
1
1
1
2
2


0.38
0.45
0.57
0.76
0.55
0.92
0.52
0.58
0.43

3 years
or
fewer
0.39
0.52
0.55
0.81
0.54
0.83
0.56
0.47
0.33

4 to 6
years
0.41
0.50
0.61
0.79
0.38
0.92
0.47
0.56
0.48

7 to 9
years
0.40
0.50
0.62
0.70
0.61
0.98
0.55
0.56
0.44

10 or
more
years
0.35
0.38
0.52
0.73
0.59
0.92
0.49
0.66
0.45

1

0.25

0.21

0.27

0.18

0.28

2

0.40

0.32

0.43

0.48

0.37

2
1
1
1
2
1
1
1
2
2
2
2
2

0.61
0.52
0.70
0.32
0.32
0.67
0.52
0.55
0.85
0.60
0.51
0.94
0.64

0.55
0.51
0.67
0.32
0.24
0.62
0.40
0.62
0.89
0.69
0.62
0.93
0.62

0.58
0.49
0.84
0.32
0.31
0.73
0.61
0.57
0.79
0.56
0.49
0.94
0.66

0.61
0.57
0.76
0.26
0.31
0.66
0.67
0.51
0.87
0.59
0.48
0.93
0.66

0.63
0.53
0.65
0.35
0.35
0.66
0.47
0.52
0.85
0.60
0.49
0.95
0.63


