10. CHAPTER 10 PERFORMING A FULL ANALYSIS OF FINAL TEST ADMINISTRATION DATA
Use the CYCLE1 sample data set to carry out this exercise.
The answer key for this test is in the Excel workbook,
ItemDataAllTests.xls, in the sheet named CYCLE1.
Continuing the scenario introduced in the previous chapter, the national
assessment team has produced and administered a test to a national
sample of students.
The final test includes
50 items, representing five content areas (number knowledge, shape and space, relations, problem solving, and uncertainty) in proportions determined by the test specifications.
The final sample design is a stratified cluster sample, with schools as the primary sampling unit and a target sample of 30 students from each school. The sample includes
79 schools, selected to be representative of five national regions and stratified by rural status and language of instruction. The total number of students in the sample is 2,242, representing a population of approximately 86,000.
This walkthrough follows
the same steps as the analysis of pilot test data. However,
because the final test is concerned primarily
with producing and interpreting scores, the item analysis is typically performed
without the exploratory emphasis present in the analysis of pilot test data. Accordingly, this walkthrough will focus on the unique aspects of final test data analysis
that distinguish it from analysis
of pilot test data. Where
the steps of analysis are identical to those discussed
in the previous chapter, refer to the information presented there.
Begin the analysis
by clicking “Response data analysis” on the IATA main
menu.
10.1. Step 1: SETTING UP THE ANALYSIS
The procedures for setting up the analysis
are similar to those in the previous chapter. You must first load a response file, then load an item data file, and then specify the analysis. If you do not know how to perform these steps, refer to Steps 1 to 3 in the previous chapter for detailed instructions on how to perform each task. Referring to the contents of the IATA sample data folder:
- The response data file for this chapter is CYCLE1.xls. This file has 2,242 records and 58 variables.
- The item
data file is in the Excel file named “ItemDataAllTests.xls” in the table named
“CYCLE1”. Ensure that the correct table name is selected in the item data loading
interface. The CYCLE1 item data has 50 records and 4 variables.
The items in this national
assessment test are a subset of the pilot test items analysed in the previous chapter.
The specifications for this analysis are slightly different
from the pilot test data analysis, primarily
resulting from the use of scientific sampling
in the full administration of the national
assessment. The first difference is the identification variable, which is named “CYCLE1STDID”. The second difference, which will have an effect on the results of the analysis, is the presence of a sample design weight, named “CYCLE2weight”. These variable
specifications must be selected from the drop-down menus. In these data, the value of 9 represents
missing responses that will be treated as incorrect. The completed specifications should look like Figure 10.1.
Figure 10.1 Analysis specifications for CYCLE1 data
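For readers who want to verify the scoring rule outside IATA, the following Python sketch illustrates, under stated assumptions, how responses could be scored against the answer key with the missing-response code 9 treated as incorrect, and how the design weight enters a weighted summary. The column names “Name” and “Key” in the item sheet are assumptions about the workbook layout; IATA performs the equivalent step internally.

```python
import pandas as pd

# Illustrative sketch only (not IATA's internal code). Assumes the response file
# has one column per item plus "CYCLE1STDID" and the design weight "CYCLE2weight",
# and that the item sheet has "Name" and "Key" columns (assumed names).
responses = pd.read_excel("CYCLE1.xls")
key = (pd.read_excel("ItemDataAllTests.xls", sheet_name="CYCLE1")
         .set_index("Name")["Key"])

scored = pd.DataFrame(index=responses.index)
for item, correct in key.items():
    raw = responses[item]
    # The missing-response code 9 is scored as incorrect (0), as specified above.
    scored[item] = ((raw == correct) & (raw != 9)).astype(int)

# Weighted percent correct per item, using the sample design weight.
weights = responses["CYCLE2weight"]
pct_correct = scored.mul(weights, axis=0).sum() / weights.sum()
print(pct_correct.head())
```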
Note that the item data for the final assessment also include data in the “Level” field in the third column of the table on the left. These data are natural numbers (1 or greater) that represent the expected level of performance or proficiency that the curriculum content specialists assigned
to each test item: Level 1 represents the lowest level of performance (i.e., minimum competence) and Level 4 represents the highest level. Although
every item is assigned a level, it is possible
that students will not achieve even the lowest level.
After verifying that the specifications and data are correct, click the “Next>>” button to continue.
The analysis will begin automatically, updating the interface
with the progress periodically. With larger data sets or slower computers, the analysis may appear to hang on the “Estimating parameters” stage, which is the most time consuming. Do not close the program;
IATA will continue to run and will provide an update when the analysis is complete.
10.2. Step 2: BASIC ANALYSIS RESULTS
Because
problematic items were identified and removed during the analysis of the pilot test data, there are no remaining problematic items in these full test data. You should confirm
that the items are behaving
appropriately by reviewing
1) the item analysis and 2) the test dimensionality results. If you do not know how to perform these steps, refer to Steps 4 and 5 in the previous chapter for detailed instructions on how to perform these tasks. Note that all of the items have green circles with the exception of MATHC1046, which we identified in the previous chapter as being somewhat
problematic but which we left in the test. Proceed
to the differential item functioning
interface when you have finished.
10.3. Step 3: ANALYSIS OF DIFFERENTIAL ITEM FUNCTIONING
Although DIF analysis was performed
on the pilot test data, the results of DIF analyses tend to be sensitive to sampling errors, so it is good practice to replicate these analyses with the full sample.
Another reason to perform DIF analysis is that there may be additional variables available in the full sample that were not available in the pilot sample, or that the full sample may provide a sufficient number of cases to perform a DIF analysis that was not possible with the smaller pilot sample. For example, in the pilot data analysed in Chapter 9, all students
in the sample were from urban areas, whereas the full sample contains students
from both rural and urban areas.
Replication of the DIF analyses from the previous chapter is left as an independent exercise. For this example,
we will perform a DIF analysis using the variable “rural”.
We wish to see if rural students
are disadvantaged, relative
to their urban counterparts. For the CYCLE1 data, a value of “1” for this indicator
means that a student is attending
a rural school. In order to specify this analysis
and review the results, perform the following
steps:
1. From the drop-down menu on the left, select the “rural” variable.
When you do so, the table beneath will be populated with the values “0.00” and “1.00”, with values of 56% for “0.00” and 44% for “1.00,” indicating that 44% of the students (unweighted) in the sample attend rural schools.
2. In the table of values, click on the value “1.00” –
this will cause the value of 1.00 (representing rural students) to be entered as
the Focus group in the text box beneath.
3. In the table of values, click
on the value “0.00” – this will cause the value of 0.00 (representing urban students)
to be entered as the Reference group in the text box beneath.
4. Click the “Calculate” button and wait for the calculation
to complete.
When the calculation is complete, in the item list, click
on the header of the “S-DIF” column to sort all the items by the value of the
S-DIF statistic.
When you have completed these steps, the interface will appear as illustrated in Figure 10.2. Compared to the results presented in the previous chapter, the items show much more stability in the empirical IRFs than was seen in the PILOT1 data. If you were to replicate the analyses presented in the previous chapter with the current data, you would see fewer differences between groups and, generally, much smaller U-DIF statistics. The increased stability is largely the result of the increased sample size. Reviewing each of the items, you will see that the majority of both S-DIF and U-DIF statistics are less than 5, indicating
that, after controlling for differences in proficiency between rural and urban students,
the differences in item performance
between rural and urban students
tend to be negligible.
Figure 10.2 DIF analysis results for CYCLE1 data by rural status,
item MATHC1008
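The S-DIF and U-DIF statistics summarize signed and unsigned differences between the focal and reference groups' empirical item response functions. The sketch below is a simplified analogue of that idea rather than IATA's exact computation: it compares each group's percent correct on an item within bands of comparable overall proficiency and aggregates the signed and absolute differences.

```python
import numpy as np
import pandas as pd

def signed_unsigned_dif(score, correct, group, n_bins=10):
    """Simplified analogue of S-DIF/U-DIF (not IATA's exact algorithm).
    score:   overall proficiency or total score for each student
    correct: 0/1 score on the item of interest
    group:   0 = reference (urban), 1 = focal (rural)"""
    df = pd.DataFrame({"score": score, "correct": correct, "group": group})
    df["band"] = pd.qcut(df["score"], n_bins, labels=False, duplicates="drop")
    by_band = df.groupby(["band", "group"])["correct"].mean().unstack("group")
    share = df.groupby("band").size() / len(df)        # share of students per band
    diff = (by_band[1] - by_band[0]) * 100.0           # focal minus reference, in percent
    s_dif = float((diff * share).sum())                # signed: differences may cancel out
    u_dif = float((diff.abs() * share).sum())          # unsigned: they may not
    return s_dif, u_dif
```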
The purpose of performing DIF analysis
at the final test stage of a national assessment
is to determine if an item should be made ineligible for calculating student scores. At this stage of the analysis, it would be appropriate to share the statistical analysis results with the national
assessment steering committee, who would determine
if the potentially problematic items should be removed or retained. If an item is removed, the analysis
may be rerun either by deleting the item’s answer key in the analysis specifications interface or by unchecking the item in the item analysis interface. For the current example, we shall assume that all items will be retained.
When you have finished reviewing
all items, click the “Next>>” button to continue.
10.4. Step 4: SCALING
The default scale used to calculate the results for the IRT scale scores is the standard or
Z scale, which has a mean of 0 and a standard
deviation of 1. Scores expressed on this scale can appear to be problematic to many stakeholders, because half the students will have ‘negative’ scores.
Similarly, scores bounded by 0 and 100 also have communication challenges: most audiences tend to assume that a score of 50 represents a passing score, which may not be true, depending
on the test specifications.
For communication
purposes, it may be undesirable to report test results with an average
score less than 50 percent or below 0. Journalists, policy-makers and other commentators may not appreciate the statistical nature of negative
values and incorrectly infer that half the population is below or above standard
(or, even worse, that half the population
has ‘negative’ proficiency). Some large-scale
assessments transform their calculated scores into scales which have a mean of 500, 100, or 50 and standard deviations
of 100, 20 and 10, respectively. Each national
assessment team should
select the type of score that is most likely to facilitate
effective communication of results.
There
are two types of scaling
that may be performed in IATA: setting the scale and rescaling. Setting the scale allows you to specify the desired mean and standard deviation of the scale scores. Rescaling
allows you to apply a simple linear transformation to the IRT scores, which is useful if the scale scores must be compared to
a scale that has been established from a previous
analysis. In this case, item parameters from the previous cycle can be used to estimate test scores or equate results from the student data in the new cycle so that the IRT scores that IATA calculates
are comparable to the previous cycle’s calculated IRT scores. The calculated results can then be rescaled
using the rescale function so that they are comparable with the reported scale from the previous cycle.
In either case, the new scale score is created by entering
the name of the new score and
specifying the standard deviation and mean in the appropriate boxes. When you click the “Calculate” button, IATA will produce the new scale scores and display the distribution and summary statistics.
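The two options correspond to two simple linear transformations of the IRT scores. The sketch below illustrates the distinction under stated assumptions; in particular, whether IATA uses weighted or unweighted sample moments when setting the scale is an assumption here.

```python
import numpy as np

def set_scale(irt, mean=500.0, sd=100.0, w=None):
    """'Set scale': standardize the sample of IRT scores first, so the new scores
    have exactly the requested mean and standard deviation in this sample.
    (Whether IATA uses weighted or unweighted moments here is an assumption.)"""
    irt = np.asarray(irt, dtype=float)
    w = np.ones_like(irt) if w is None else np.asarray(w, dtype=float)
    m = np.average(irt, weights=w)
    s = np.sqrt(np.average((irt - m) ** 2, weights=w))
    return mean + sd * (irt - m) / s

def rescale(irt, mean=500.0, sd=100.0):
    """'Rescale': apply the linear transformation directly to the IRT scores,
    which preserves comparability with a previously established scale."""
    return mean + sd * np.asarray(irt, dtype=float)

# Example: NAMscore = set_scale(irt_scores, mean=500, sd=100, w=design_weights)
```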
Because the primary function
of the analysis of national
assessment test data is to produce scores that may be interpreted and analysed, the scaling interface
receives more attention
with analyses of full test data than it does with analyses
of pilot test data. This walkthrough makes two main uses of the scaling interface: first, reviewing the distribution of proficiency relative to the distribution of test information will inform the quality of inferences that may be made about different ranges of proficiency; and second, creating a reporting scale for the test results will establish a metric for communicating results to stakeholders.
To review the distribution of IRT scores, select “IRTscore” from the drop-down menu at the upper left of the interface.
The interface will update with descriptive details
about the IRT scores and the test information, as shown in Figure 10.3. The mean of the IRTscore distribution is -0.02 and the standard deviation is 1.04. These values are not meaningful in themselves, as they represent
the arbitrary scale on which the items were
calibrated. The graph indicates that test information, illustrated by the solid black
line, is slightly wider than the distribution of proficiency; this result is statistically ideal in that it minimizes the average standard
error of measurement at all levels of proficiency for the given distribution (see Chapter 15, page 185). The frequency spike at the left-hand side of the graph, at approximately -3 on the proficiency scale, corresponds to students who did not answer any items correctly
on the test. The test does not have sufficient information to determine these students’
proficiency with accuracy,
because the test does not have many very easy items; as a result, these students are assigned the same arbitrarily low score.
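The connection between the test information curve and measurement precision can be made explicit. The sketch below assumes a three-parameter logistic (3PL) model with the conventional scaling constant D = 1.7 and arrays of item parameters a, b, and c; IATA's internal computation may differ in detail.

```python
import numpy as np

D = 1.7  # conventional logistic scaling constant; an assumption about the metric

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response to each item at proficiency theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_information(theta, a, b, c):
    """Test information: the sum of 3PL item information values at theta."""
    p = p_3pl(theta, a, b, c)
    item_info = (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p
    return item_info.sum()

def sem(theta, a, b, c):
    """Standard error of measurement at theta: 1 / sqrt(test information)."""
    return 1.0 / np.sqrt(test_information(theta, a, b, c))

# With few very easy items, test_information(-3.0, a, b, c) is small, so sem(-3.0, a, b, c)
# is large; this is why all zero-score students receive the same arbitrarily low score.
```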
These results also indicate that the test was
relatively difficult for students. The peak of
the information function tends to be located at the region of proficiency where students are most likely to score 50%.
In Figure 10.3, this peak is slightly above the
mean score of -0.02, indicating that above-average students tended to
score only 50% correct. While this
result provides good statistical accuracy, the results may be disappointing to stakeholders who are
used to interpreting any result less than 50% as
a failure.
Figure 10.3 Distribution of proficiency (IRT score) and test information, CYCLE1
data
To produce a more useful reporting scale based on the IRT score, use the “Add New Scale
Score” functions at the bottom right of the interface. For this example,
let us assume that the National Steering Committee has requested a new scale with a mean of 500 and a standard deviation of 100. This scale will be set in the first cycle of the national assessment and used in subsequent cycles to report on progress over time. The name of this score will be “NAMscore” (National
Assessment of Mathematics score). To provide these specifications, perform the following
steps:
1. Type “NAMscore” in the text box beneath the “Add New Scale Score” label.
2. Enter a value of “100” for
the St. Deviation.
3. Enter a value of “500” for
the Mean.
4. Make sure the “Set scale”
option is selected. This will ensure that the scale score produced will have a mean
exactly equal to 500 and a standard deviation exactly equal to 100 for the sample
(the Rescale option will simply adjust the existing IRT score by the specified mean
and standard deviation).
5. Click the “Calculate”
button.
When IATA is finished
processing the request, it will update the interface
with the summary graph and statistics for the newly-created scale score, shown in Figure 10.4.
Figure 10.4 Distribution and summary statistics for new scale score (NAMscore), CYCLE1 data
There
are relatively few limitations in selecting a derived scale score. You can use any valid name for the derived scale score so long as it is not already used in the response data (see Chapter
8 for naming conventions and restricted variable
names). The mean can be any real number, and the standard
deviation can be any real number greater than 0. However, it is important to ensure that the lowest reported student scores are not less than 0. Since the lowest score is usually around three to four standard deviations
below the mean, it is good practice
to set the mean to be at least 4 standard deviations
above 0. The IEA, for instance, usually reports achievement
results using a mean of 500 and a standard
deviation of 100. The choice of a reporting scale should be discussed
with the national
assessment steering
committee at the initial planning
stages so that all stakeholders understand how to interpret the reported results.
After the new scale score has been created, click the “Next>>” button to continue.
10.5. Step 5: SELECTING TEST ITEMS
The CYCLE1 data represent the initial cycle of a national assessment program. Looking to the future, it will be necessary
in subsequent cycles to alter the test and maintain a linkage to the initial cycle’s results.
To do this, you will need to select a subset of items that are both accurate
and representative of the continuum of proficiency.
A reasonable practice
for maintaining a strong linkage between tests is to keep approximately 50% of the items common between adjacent assessments, also known as anchor items. To facilitate the process of selecting anchor items, you can use the item selection functionality of IATA to produce a table of items ranked by their suitability for maximizing accuracy
across the proficiency range. To perform this selection, complete the following
steps:
1. Type the name “ItemRanks” into the name of item selection field.
2. Type the number 50 in the
number of items field to select all items.
3. Leave the lower and upper bounds at their default values
of 2 and 98.
4. Click the “Select Items”
button.
The complete
results are shown in Figure 10.5. All of the available items have been selected and categorized by content and
cognitive levels from their original specifications.
The table of results produced by these specifications, stored as an IATA item data table, ranks each item according to its suitability for inclusion in the set
of common items. This table should be provided to the test developers responsible for modifying the
cycle 2 (or next) national assessment so that they can select a set of common items, taking into account information
about the content and psychometric
value of each test item used in the cycle 1 (or first) national assessment. Ideally, a set of anchor items should contain 20 to 50 percent of the number of items in the complete test, and the items should represent the content and cognitive test specifications in the same proportions
as the complete test. A pragmatic method of selecting
items would be to begin with the most desirable items and allocate items to the cells of the new test specifications
according to their content and cognitive levels
until the desired number is reached in each cell or the list of items is
exhausted.
Figure 10.5 Selecting items,
CYCLE1 data
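The pragmatic allocation method described above can be sketched in a few lines. The column names (“Name”, “Content”, “Level”, “Rank”) are assumptions about the exported ItemRanks table, and the cell targets come from the new test specifications.

```python
import pandas as pd

def pick_anchor_items(item_ranks: pd.DataFrame, cell_targets: dict) -> pd.DataFrame:
    """Greedy sketch of anchor-item selection: walk down the suitability ranking
    and allocate each item to its (content, level) cell until that cell's target
    is reached or the list of items is exhausted.

    item_ranks:   exported table with columns "Name", "Content", "Level", "Rank"
                  (assumed names)
    cell_targets: e.g. {("Number knowledge", 1): 3, ("Shape and space", 2): 2, ...}
    """
    remaining = dict(cell_targets)
    chosen = []
    for _, item in item_ranks.sort_values("Rank").iterrows():
        cell = (item["Content"], item["Level"])
        if remaining.get(cell, 0) > 0:          # this cell still needs items
            chosen.append(item)
            remaining[cell] -= 1
    return pd.DataFrame(chosen)
```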
10.6. Step 6: SETTING PERFORMANCE STANDARDS
In the first cycle of a national assessment, it is important
to lay the groundwork for interpreting the scores produced
by the assessment. Most modern assessments report results in terms of levels.
International assessments such as
PIRLS, PISA and TIMSS, as well as many national
assessments such as NAEP, publish student achievement scores in terms of performance or benchmark levels
(see Greaney and Kellaghan, 2008; Kellaghan, Greaney,
and Murray, 2009). TIMSS, for example, reported scores using four benchmarks: “low”, “intermediate”, “high”, and “advanced”
(Martin, Mullis, and Foy, 2008). It is important that the performance standards are meaningful, rather than arbitrary
statistical thresholds like percentiles, because
they are the primary tool used to summarize and report student performance. The process of
defining meaningful performance standards is known as standard setting.
IATA facilitates standard setting procedures by first setting specific response probabilities (RP) of correct
response for each item, then calculating the proficiency values (RP values) associated with the specified RP. For example, if a response probability (RP) is set at 50%, then the RP value for an item would be the proficiency level associated with a 50% chance of responding correctly. A wide variety of response probabilities (RPs) are used by different assessments, typically ranging from 50% to 80%; the most common practice is to use 67%, which tends to be statistically
optimal at the item level. However, the choice of RP should also be informed by normative definitions of what probability of success constitutes sufficient mastery and knowledge of the consequences of how the standards will be used. For example, in an educational context, where the consequences of reporting failure tend to be greater
than those of reporting success,
lower RPs may be preferred.
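For a given item, the RP value can be obtained by inverting the item response function at the chosen response probability. The sketch below assumes a 3PL model with scaling constant D = 1.7; it is an illustration of the definition, not IATA's documented computation.

```python
import numpy as np

def rp_value(a, b, c, rp=0.67, D=1.7):
    """Proficiency at which the probability of a correct response equals rp.
    Obtained by solving c + (1 - c) / (1 + exp(-D * a * (theta - b))) = rp for theta.
    Only defined when rp is above the guessing floor c."""
    return b + np.log((rp - c) / (1.0 - rp)) / (D * a)

# Example: an item with a = 1.0, b = 0.5, c = 0.2 has
# rp_value(1.0, 0.5, 0.2, rp=0.67) = 0.5 + ln(0.47 / 0.33) / 1.7, about 0.71.
```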
Prior to analyzing
the data, a panel of stakeholders including curriculum and teaching experts, in consultation with the national
assessment steering committee, should decide on the number of proficiency levels to be used. Some national assessments
simply choose two levels such as “acceptable” and “not acceptable”; others choose three levels such as “poor”, “adequate”, and “advanced”; still others, such as TIMSS and PISA, use four or more. If the stakeholder panel decides on more than two levels, each proficiency level except the lowest level should be defined by a set of items that are considered “answerable” by students displaying that level of performance. Generally, unless there are hundreds of items included
in the assessment (requiring a rotated booklet design), there will not be enough items to adequately define more than three
or four levels.
The interface for performing this analysis is shown in Figure 10.6. On the left, a drop-down menu allows you to select a source of items for item selection. As with the item selection interface, you have the option of selecting
any of the item data sources available in the current workflow. For the current analyses, only the “Items1”
table is available[1]. The items from the selected
source are listed in the table beneath the drop-down menu. The values in the “Level” column may be edited directly in each row. To estimate statistically optimal thresholds based on the current item classification, move the vertical slider in the center of the interface to the desired RP. When the interface is opened, the default RP is 67%, indicating that the criterion used to rank items or estimate
optimal thresholds is a 67% probability of a correct
response on each item.
When you click on the vertical slider or adjust its value, IATA will update the optimal thresholds and produce the results on the right hand side in the graph window and the table of results at the bottom. The graph illustrates the position of each threshold
with vertical lines relative to the distribution of proficiency and the test information function. This information illustrates the usefulness of the levels. For example,
if there are very few respondents in a level, then any summary statistics describing students in that level will be based on too few cases to be stable or interpretable. Similarly, if the test is not accurate
at the threshold of a level, then the classification of students into that level will be inaccurate.
The table beneath the graph window describes
the items representing each level with the mean and standard
deviation of item b-parameters. The right-most column in the table contains the threshold
that was estimated for each level. In Figure 10.6, the mean and standard deviation
of the b-parameters for Level 4 are 0.77 and 0.38, respectively. The RP67 threshold
for Level 4 is 1.08. These statistics are useful in determining if the assignment
of items is reasonable. For example, if the standard deviation of items in a level is larger than the distance between the means or thresholds of adjacent levels, the statistical basis for defining the levels may be weak. For
these results, the standard deviation
within levels is approximately 0.35, and the distance between adjacent levels is approximately 0.4, indicating that the levels are well-defined.
Figure 10.6 Default performance standards
interface, CYCLE1
data
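The diagnostic described above can be reproduced from any exported item table that contains the b-parameters and level assignments (the column names “b” and “Level” are assumptions matching the bookmark data described later in this chapter). A minimal sketch:

```python
import pandas as pd

def level_summary(items: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of b-parameters for each assigned level,
    plus a rough check of whether adjacent levels are statistically separated."""
    summary = items.groupby("Level")["b"].agg(["mean", "std"])
    # The spread of b within a level should be smaller than the distance
    # between the mean b-values of adjacent levels.
    summary["gap_to_next"] = summary["mean"].shift(-1) - summary["mean"]
    summary["well_separated"] = summary["gap_to_next"] > summary["std"]
    return summary
```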
IATA uses pre-assignment of items to levels to develop the thresholds
that separate the groups of items on the IRT proficiency scale.
Items are typically
assigned to a level (or some type of cognitive
hierarchy, such as Bloom’s taxonomy)
during the item and test development process. However, the process of defining levels should be flexible
and iterative. IATA allows the items to be assigned and reassigned to different proficiency levels during the analysis. Experience
has shown that the cognitive processes test developers and curriculum experts assume students
use to answer a question
are not necessarily the ones students actually
use. Experts should use the item statistics produced by IATA to
verify their initial
item classifications or to reclassify
items. A common method for doing this is known as the Bookmark procedure.
With the Bookmark
procedure, items are sorted by their RP value and are usually arranged in a booklet, with one item per page. The stakeholder panel responsible for setting standards
can review items in order of RP value and identify boundaries between cognitively distinct
groups of items where items represent a higher performance standard. The proficiency scores associated with these boundaries can be used to classify both test items and student test results, and the process can be replicated with different
RP values for validation. The item classifications may also be updated in IATA in
the “Level” column and used to statistically
estimate thresholds for classifying students.
Consider a scenario where the stakeholder panel has decided to use an RP of 50% to validate the initial classification of items. To provide the evidence required
to perform this validation and reclassification, complete
the following steps:
1. Set the RP to 50% by clicking and dragging the slider as shown in Figure 10.7.
2. Click
the “Save Bookmark Data” button. IATA will produce a confirmation dialogue to
notify you that the data have been saved.
3. Click
the “Next>>” button to navigate to the results viewing screen.
4. Select
the “BookmarkData” table from the drop-down menu.
Figure 10.7 Performance
standards interface with RP=50%, CYCLE1
data
The results of the Bookmark
data creation are shown in Figure 10.8. The data include the item name (Name), IRT parameters (a, b, and c), the existing level classification (Level), the source file of the item statistics (Source), and the RP values (RP50) for each item. In this case, there is only a single RP value column, but a bookmark
data table may include several RP value columns. The selected table of results should be exported and provided to the stakeholder panel responsible for setting standards. When sorted by the “RP50” column,
the data can inform the Bookmark method of classifying items into proficiency levels and, alternatively, defining the cut-points for those levels. Using the Bookmark procedure, stakeholders review each of the items in order of their RP value. When the reviewers
encounter an item that they
feel represents a higher standard of performance, they add a “bookmark” at that location. The RP values immediately prior to the bookmark locations
represent the proposed thresholds for the proficiency levels.
A combination of group discussion and statistical averaging
is typically used to combine the different
thresholds produced by the different
reviewers to produce final thresholds. In order to develop qualitative
descriptions of each proficiency level, the items are classified
by the final thresholds, and the levels are described in terms of the competencies required by their component items.
Figure 10.8 Bookmark data for CYCLE1 data, RP=50%
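The arithmetic of turning reviewers' bookmark placements into proposed thresholds can be sketched as follows. This is a simplified illustration of the averaging step described above, not a prescribed procedure: each bookmark position is converted to the RP value of the item immediately before it, and the resulting thresholds are averaged across reviewers.

```python
import numpy as np
import pandas as pd

def bookmark_thresholds(bookmark_data: pd.DataFrame, reviewer_marks: list) -> np.ndarray:
    """bookmark_data:  exported IATA table containing an "RP50" column.
    reviewer_marks: one list of bookmark positions per reviewer; position k means
    the reviewer judged item k (1-based, in RP50 order) to be the first item of a
    higher level, so the proposed threshold is the RP value of item k - 1."""
    rp_sorted = bookmark_data.sort_values("RP50")["RP50"].to_numpy()
    per_reviewer = np.array([[rp_sorted[k - 2] for k in marks]
                             for marks in reviewer_marks])
    return per_reviewer.mean(axis=0)   # simple average across reviewers

# Example with three reviewers placing two bookmarks each (hypothetical positions):
# bookmark_thresholds(bookmark_table, [[12, 31], [14, 30], [13, 33]])
```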
In practice, a wide variety of information, including the item specifications,
curriculum references, and normative
definitions of what students know and can do at each proficiency level, should be provided simultaneously to the panel of stakeholders
responsible for standard
setting. The stakeholders must reconcile the different sources of information and determine the most useful cut-points and assignments of test items to levels. At
their discretion, the reviewers may also decide to use the item classifications defined ahead of time by the item developers instead of reclassifying
items based on the results of the Bookmark procedure. In either case, the thresholds
calculated by IATA represent the statistically optimal
thresholds for the specified item
classifications.
The optimal thresholds
recommended by IATA should be interpreted as suggestions and should be further adjusted manually
for communication purposes.
You can manually change the threshold level by editing
the thresholds directly in the table of results.
After you change the value(s),
the graph is automatically updated.
The most common adjustments performed
include making the thresholds equally-spaced or assigning thresholds
that will, after applying scaling constants, occur at whole increments (e.g., 5 or 10).
Professional judgment should be exercised
when reconciling the evidence from the statistical and content analysis
with the need to communicate results to lay audiences. Simplicity should be balanced
with accurately communicating meaningful differences in student performance.
For the current example, assume that the stakeholder panel, after using the data illustrated in Figure 10.8 to facilitate
the item-by-item review in a Bookmark procedure, proposes the following
set of cut-points: -0.85, -0.25, 0.35, and 0.95 to define the different levels. Students with scores falling below -0.85 would be classified as falling below Level 1. These thresholds are only rough approximations of the statistically optimal values shown in Figure 10.7, but most stakeholders tend to favour round numbers and even increments because they feel intuitive, even if they are not statistically optimal.
Click
the “<<Back” button to return to the performance standards interface, where you can record these cut-points
in the results data file and assign students to the appropriate levels.
Perform the following steps:
1. Enter the recommended values produced by the committee
of stakeholders into the appropriate rows in the column labelled “Threshold”. Press Enter after the final entry to ensure IATA updates the interface
correctly.
2. Click the “Add Levels” button. IATA will
assign students to their appropriate level based on their IRT scores.
Figure
10.9 illustrates the assignment of the thresholds
for the performance levels. The levels are equally spaced, and each level contains a reasonable proportion of students. Although there is no mathematical reason for the equal spacing of the thresholds, common practice in most national and international assessments is to use equally spaced
thresholds because they appear more intuitive to lay audiences,
who are the primary audience
for proficiency-level
summaries. In addition, the amount of information at each threshold
is at least two-thirds of the maximum test information, which indicates that the test is sufficiently accurate
at each threshold to make interpretive decisions.
Figure 10.9 Performance
standards interface with manually-set thresholds for CYCLE1
data
In the “Scored”
data table, which can be viewed on the final screen of the analysis workflow, the record for each student will also contain a variable named “Level.”
This variable contains the level of performance standard to which each student is assigned
based on the thresholds shown in Figure 10.9.
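The classification applied when you click “Add Levels” amounts to a simple cut-point rule on the IRT scores. A sketch using the cut-points proposed above:

```python
import numpy as np

# Cut-points proposed by the stakeholder panel, on the IRT score scale.
thresholds = [-0.85, -0.25, 0.35, 0.95]

def assign_levels(irt_scores):
    """Scores below -0.85 fall below Level 1 (coded 0); scores from -0.85 up to
    -0.25 are Level 1; and so on up to Level 4 for scores of 0.95 or above."""
    return np.digitize(irt_scores, thresholds)

# Example: assign_levels([-1.2, -0.5, 0.1, 1.3]) returns array([0, 1, 2, 4])
```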
When you have finished setting
the performance standard thresholds and applying them to the student scores, click the “Next>>” button to continue to the interface for viewing and saving results.
10.7. Step 7: SAVING RESULTS
On the results viewing and saving interface,
you can view the results produced by the current
example walkthrough. All tables should be saved, both for project documentation and to facilitate test linking with subsequent cycles of data. For reference, the item data results of this analysis walkthrough are included in the ItemDataAllTests.xls file, in the worksheet named “ReferenceC1.”
10.8. SUMMARY
In this chapter,
you reviewed the main data analysis functions
in the first IATA workflow. In addition to the analyses that are also performed on pilot test data, the analysis of full test data made use of the scaling interface and the development of performance standards.
In the walkthrough in the following
chapter, you will build on the techniques
used in these examples. Two new methods will be introduced for analysing data and specifying analyses: balanced rotated
booklets and partial-credit
test items.
[1] For analysis workflows that make use of linking, the “Items2” and “Merged”
tables are also available.