12. COMPARING ASSESSMENTS TO THE RESULTS OF OTHER ASSESSMENTS
12.1. Overview
Use the CYCLE2 sample data set to carry out this exercise.
The answer key for this test is in the Excel workbook
ItemDataAllTests.xls, in the sheet named CYCLE2.
Governments and others are often interested in finding out if students’
achievement levels have risen, fallen or remained
constant over time. Interest in changing standards
is particularly important in times of curriculum change,
or when there has been substantial
reform in the system including
a change in funding levels. Governments may also be interested in the effects on student achievement of a rapid increase in enrolment due to the implementation of programs such as Education
for All (EFA) or the Fast Track Initiative (FTI), now known as the Global Partnership for Education. With a strong linkage,
scores from a national assessment may be compared
with those from an assessment conducted earlier, allowing you to state whether or not results have changed
in the intervening period.
Continuing with the national assessment scenario
introduced in previous chapters, this walkthrough introduces the methods required to implement
a follow-up cycle in a national assessment program. Once the national
assessment program has been launched, you will need to follow this workflow
to ensure that the interpretations stakeholders make using the results of a new assessment cycle are consistent with and comparable to those made in the first cycle. The CYCLE2 data in this example represent
the second cycle of a national assessment, following the first cycle, CYCLE1, which was analysed in Chapter 10. A link between these two assessments is possible because
the CYCLE2 test contains several anchor items that are also on the CYCLE1 test. The linkage enables the national
assessment program
to monitor changes
in student performance over time. For more details on the statistical
basis for linking,
refer to Chapter 15 (page 197).
As in the previous chapter,
this chapter focuses on the interfaces and specifications that are specific to this workflow.
Review the previous chapters for more detailed explanations of the common workflow interfaces.
From the main menu, click the first menu option,
“Response data analysis
with linking”, to enter the analysis
workflow, as shown in Figure 12.1. This workflow
requires response data, item data (answer keys) for the response data being analysed,
and a reference item data file that will be used to anchor the linked results.
Figure 12.1 Select the “Response data analysis with linking” workflow
12.2. Step 1: ANALYSIS SETUP
1. In the first interface
in the workflow, you must load the response data from the file named “CYCLE2.xls” in the IATA sample data folder. These response data include 2484
records and 61 variables. (The first case
has the following values: SchoolID = 2, Sex = 2, School Size = 21 and Rural = 0.)
2. In the second interface, you
must load the corresponding item data.
When the
ItemDataAllTests.xls file is opened, ensure that the “CYCLE2” table has been selected.
There are 53 records and 4 variables in the CYCLE2 table, including three partial
credit items. (Item MATHC2047 has the following values: Key = C, Level = 1.00, and
it is a Number Knowledge item).
In the third step of the workflow,
you will use a new data loading interface that has not been used in the previous walkthroughs. This interface requests
a file containing reference item data. The reference
item data contain IRT parameters (a, b and, optionally, c) that have been estimated
from a reference sample, such as a previous cycle of a national
or international assessment.
The reference item data file must contain data for at least some of the items that are included in the current national assessment. For this example,
you will use the results produced from the analysis
of the CYCLE1 data file. These results are provided
in the IATA sample data folder in the file named ItemDataAllTests.xls in the sheet named “ReferenceC1” (you may have also saved the results from the exercises in Chapter 10). When you open this file, you must ensure that the selected
table is the one named
“ReferenceC1.” This table, which contains 50 records and 11 variables, holds the statistical output describing all the items from the first cycle of the national assessment. These reference data are illustrated in Figure 12.2. In the current example, the CYCLE2 test includes
25 items that have item parameters in the CYCLE1 reference
item data file. It is important to keep the names of items consistent across all data files, because IATA matches
items in the linking procedure
using item names.
Figure 12.2 Reference item data from CYCLE1 used for linking CYCLE2 data
Note that this file also includes several data fields that were calculated
during the analysis of CYCLE1 data in addition to the a, b and c variables
(e.g., Level, Content, Discr, PVal, PBis and Loading). These variables may be left in the data file, but they are not used by IATA in the linking analysis. Similarly,
although the reference
item data contain information for all 50 items on the CYCLE1 test, only the information
from the 25 items common to the CYCLE2 test will be used to estimate the linkage.
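If you want to verify outside IATA which items can serve as anchors, a quick check is to match item names between the two tables. The following is a minimal sketch in Python; the workbook and sheet names come from this walkthrough, but the item-name column header (“Name”) is an assumption about the workbook layout:

import pandas as pd

# Load the CYCLE2 item data and the CYCLE1 reference item data from the sample workbook.
new_items = pd.read_excel("ItemDataAllTests.xls", sheet_name="CYCLE2")
ref_items = pd.read_excel("ItemDataAllTests.xls", sheet_name="ReferenceC1")

# IATA matches items by name, so the anchor set is simply the intersection of the name columns.
# "Name" is an assumed column header; substitute the actual item-identifier column in your file.
anchors = sorted(set(new_items["Name"]) & set(ref_items["Name"]))
print(len(anchors), "items are common to both tables")  # 25 in this example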
After loading all three data files, you should continue
to the analysis specifications interface (IATA Page 4/12). The analysis specifications are similar to those for the CYCLE1
data. Enter or select the following
details on the analysis specifications
interface and click the “Next>>” button:
1. The student identification variable
is CYCLE2STDID.
2. The
weight variable is CYCLE2Weight.
3. The
code of ‘9’ is treated as incorrect.
Proceeding to the item analysis interface
(IATA Page 5/12) will automatically begin the analysis.
There are no problematic items present in the data. On reviewing
each of the items, note that, although partial credit items have multiple
scores, there may still be ‘easy’ partial credit items, where the majority of respondents achieve the highest score category (such as MATHSA004), and ‘hard’ partial credit items, where the majority of respondents achieve the lowest score (such as MATHSA005, shown in Figure
12.3).
Figure 12.3 Item analysis results
for CYCLE2 data, item MATHSA005, score=1
You may continue through the workflow to review the test dimensionality results and perform any DIF analyses
that may be of interest
on the available demographic variables (rural, sex, language) by following
the same procedures described in previous chapters.
Although many of the items have warning symbols for one or more DIF analyses, we will assume for the purposes of this example that all the items are okay. After you have finished
reviewing the DIF analysis,
click the “Next>>” button to proceed
to the linking interface.
12.3. Step 2: COMMON ITEM LINKING
The linking interface
is shown in Figure 12.4. On the left side of the interface
is a button labelled “Calculate” and a table listing all 25 items that are common to both the reference item data and the “new” data from the current national assessment. In the table of items, the first column, “Use,” specifies
whether or not you wish to include the item in the estimation of the linking constants
(by default, all items that appear in the reference data and new data are included). Column “L” contains a summary diagnostic symbol for each item; the default caution symbol (the yellow diamond)
is updated after IATA calculates the linking results.
The most effective way to use this interface is to first calculate the results with all items, then examine the diagnostic
information to identify
and remove any items with anomalous results. Repeat these two steps until the linkage is stable.
Figure 12.4 Common item linking results, CYCLE2
to CYCLE1
When you click the “Calculate” button,
IATA will estimate the linking constants
and evaluate the statistical quality of the linkage. When the calculation is finished, IATA displays a summary of the quality of the linkage in the graph on the right and updates the
summary diagnostic symbols in the item table, as shown in Figure 12.4. The graph displays three lines: a solid line, a dashed line, and a dotted line. The solid and dashed
lines display test characteristic curves (TCCs)[1]. The TCC of a test summarizes the statistical behaviour
of the entire set of items, providing
similar information as an IRF, but for many items simultaneously. Ideally,
the linked and reference TCCs should be identical (if only one line appears to be visible,
it is likely that the two lines are completely overlapping), indicating that differences in the magnitude
and variability between the link scale and reference scale are accounted
for across the displayed range of proficiency. The dotted line displays the absolute difference between the two TCCs, expressed as a proportion of the total test score. The value of the difference will typically vary across the range of proficiency, indicating that the linkage may not be stable for all ranges of scores.
For score ranges with large differences, the linked results will not be on the same scale as the reference
data and, hence, not be comparable. However,
if the average difference is small (e.g., <0.01), the error may be considered negligible.
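To make the curves in the graph concrete, the sketch below shows how a TCC and the error curve can be computed from a set of IRT item parameters. This is a generic illustration under an assumed 3PL model with the common 1.7 scaling constant, not IATA’s internal code, and the parameter values are placeholders:

import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    # Item response function under an assumed 3PL parameterization.
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, a, b, c):
    # Test characteristic curve: the expected score averaged over items,
    # i.e., the expected proportion of the total test score at each proficiency.
    return np.mean([irf_3pl(theta, ai, bi, ci) for ai, bi, ci in zip(a, b, c)], axis=0)

theta = np.linspace(-4.0, 4.0, 161)  # displayed range of proficiency

# Placeholder parameters for the reference (Target) items and the linked items.
a_ref, b_ref, c_ref = [1.0, 0.8, 1.2], [-0.5, 0.7, 0.1], [0.2, 0.2, 0.2]
a_lnk, b_lnk, c_lnk = [1.0, 0.8, 1.2], [-0.5, 0.7, 0.1], [0.2, 0.2, 0.2]

error = np.abs(tcc(theta, a_ref, b_ref, c_ref) - tcc(theta, a_lnk, b_lnk, c_lnk))
print("average error:", error.mean())  # values near zero (e.g., < 0.01) indicate a stable linkage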
In Figure 12.4, the Target curve (solid black) represents the CYCLE1 test items, and
the Linked curve (dashed red) represents
the CYCLE2 test items after the application of the linkage.
It is difficult to see both curves in Figure 12.4, because the Target and Linked test characteristic curves are near-identical; this is also indicated by the Error curve, which has a constant value of approximately 0 across the displayed range of proficiency[2].
Beneath
the graph, the estimated linking constants are displayed in the two textboxes.
The location constant adjusts for differences in the magnitude
of the original scales of the new data and the reference data, and the scale constant
adjusts for differences in the variability between the scales (see page 197). Generally speaking,
the two constants can be interpreted together
such that dividing any value expressed on the raw CYCLE2 IRT scale (e.g., a student’s
IRT score
from the current analysis results)
by the scale constant and adding the location constant will render the linked result directly comparable to scores on the CYCLE1 IRT scale. This comparability means
that, after the
scale linkage has been applied,
any remaining differences between the CYCLE1
results and the transformed CYCLE2 results represent
differences in test performance, rather than differences in the tests themselves. Further details on the principles and calculations of IRT linking are presented
in Chapter 15 (see page 197).
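The transformation itself is simple arithmetic. A minimal sketch, using illustrative constants rather than the actual values shown in the textboxes:

def link_score(raw_theta, scale_constant, location_constant):
    # Divide a value on the raw CYCLE2 IRT scale by the scale constant and add the
    # location constant to express it on the CYCLE1 reference scale.
    return raw_theta / scale_constant + location_constant

# Hypothetical constants for illustration only; use the values IATA displays.
print(link_score(0.40, scale_constant=1.05, location_constant=0.12))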
In the item table, the diagnostic
symbols are updated after calculation to indicate any potentially problematic linkages at the item level. A linked item is problematic if its linked IRF is very different
from the reference IRF. If you click on any item in the item list, you will be able to view the results of the linking function applied to each test item. As with the overall TCC comparisons, the linked IRF should be similar
to the target IRF[3]. Even if the results for the overall test appear very good, the linking function may not work well for some items. However, there is more sampling error at the item level, so differences between IRFs are typically problematic only if the error between the linked and reference
IRFs is greater than 0.05[4].
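The item-level check can be thought of in the same way as the test-level check, applied to a single IRF. A rough sketch of the rule of thumb described here and in footnote 4, using an assumed 3PL IRF and illustrative parameters (IATA’s exact diagnostic may differ):

import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-4.0, 4.0, 161)
step = theta[1] - theta[0]

# Illustrative parameters for one item: reference (CYCLE1) versus linked (CYCLE2) estimates.
diff = np.abs(irf_3pl(theta, 1.1, 0.20, 0.2) - irf_3pl(theta, 1.0, 0.35, 0.2))

# Flag the item only if the difference exceeds 0.05 over a wide range of proficiency
# (an excursion above 0.05 spanning less than about 0.4 proficiency points is tolerable).
flag = (diff > 0.05).sum() * step > 0.4
print("flag item:", flag, "| max difference:", round(diff.max(), 3))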
A common example of a situation that would cause an individual test item to show idiosyncratic behaviour in a linking analysis
occurs when a specific content area measured by a link item is used as the basis of instructional interventions (such as emphasis on using a particular form of graph in Mathematics or on an aspect of grammar in Language) between
the two testing periods, while the content measured by the other linking items is not. Because performance on that specific
test item is likely to improve in an idiosyncratic way, the linking constants estimated
from all items together
will not account for the item-specific changes
between the first and second administrations.
A mild example of this phenomenon is present in the existing
data, with item MATHC1052, flagged
with a caution symbol. Selecting
this item illustrates the results for the item, shown in Figure 12.5. Although the linking constants appear to have successfully adjusted
for the difference in location of the item (i.e., the difficulty of the item relative
to the given sample), there are some gaps, particularly at higher proficiency levels. The Target IRF and the Linked IRF are distinct from each other, and the dotted line at the bottom, which expresses the difference
between the two IRFs, ranges up to 0.08 but is generally smaller than 0.05. These differences are inconsequential, and, in most practical situations, this amount of error is not problematic.
Figure 12.5 Common item linking results, CYCLE2
to CYCLE1, item MATHC1052
In the event that the differences between the target and linked IRFs are large enough to be problematic (e.g., consistently greater than 0.05 across a wide range of proficiency), the offending item should be removed by unchecking the box next to the item name and then clicking the “Calculate” button.
Although one or two items may be removed without introducing validity
issues, if many items are removed from the estimation of linking functions, the validity of the linkage may become weak, because the anchor items may not adequately reflect the intended
balance of content. In general, if the statistical analysis of results suggests some items should be removed from the linkage, the recommendation should
be brought to the national
assessment steering committee
so that the consequences may be weighed before choosing
whether to include or remove the results. Typically,
the fewer items that are common between the two assessments being linked, the weaker the linkage is and the more problematic items will be identified. However, for the current example, the results indicate a very stable linkage. As a result, the national
assessment team can be confident that the test used for the current national assessment (CYCLE2) is appropriate for monitoring changes in student achievement levels since the previous national assessment.
Beneath
the linking constants
are two controls – a dropdown menu and a button labelled “Update” – that allow you to apply the linking constants directly to the results of the current analysis. To apply the constants to the items parameters, ensure that the “Items1” table is selected
in the drop-down menu and click the “Update” button. To update the estimated IRT scores in the current results, select “Scored”
from the drop-down menu and click the “Update” button. IATA will notify you when the results have been updated.
If you updated the “Items1” table, linked a and b parameters will appear in the “Items1”
data table, with the suffix “link” to indicate that they are now expressed on the reference
scale. If you select the “Scored” table to update, the updated
IRT scores will appear in the “Scored”
data table, with the “_link” suffix.
When you have completed updating
both the Items1 and Scored results, click the “Next>>” button to continue.
12.4. Step 3: RESCALING LINKED RESULTS
If you have updated
the Scored results with the linking constants
in the previous interface, IATA’s Scale Review and Scale Setting
interface (IATA Page 9/12) will include the name “LinkedScore” in the drop-down menu at the upper left. Select “LinkedScore” from the drop-down menu to display the graphical
and statistical summaries for the linked CYCLE2 results, which are expressed on the scale established by the CYCLE1 reference item data. The LinkedScore has a mean of 0.10 and a standard deviation
of 1.07.
To convert the linked IRT score to a scale score that can be compared to the NAMscore variable that was produced during the analysis
of CYCLE1 data, perform the following
steps:
1. Enter the name “NAMscore” in the textbox beneath the “Add New Scale Score” label.
2. Enter “100” for the St. Deviation, the value originally set for the CYCLE1 data.
3. Enter “500” for the Mean, the value originally set for the CYCLE1 data.
4. Select the Rescale option.
This option will ensure that the new scale score will retain the linkage that was
estimated in the previous interface.
5. Click the “Calculate” button. IATA will create the new scale
score and display the distribution and descriptive statistics as shown in
Figure 12.6.
Figure 12.6 Scale score results, CYCLE2 test scores expressed
on the reporting scale of CYCLE1 (NAMscore)
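For intuition, the Rescale option can be thought of as a linear transformation of the linked IRT score onto the CYCLE1 reporting metric. A minimal sketch, under the assumption that the scale score is computed as mean plus standard deviation times the linked IRT score (this is an assumption for illustration; IATA performs the calculation for you):

def to_namscore(linked_theta, mean=500.0, sd=100.0):
    # Assumed rescaling that preserves the linkage: map the linked IRT score
    # onto the CYCLE1 reporting scale with mean 500 and standard deviation 100.
    return mean + sd * linked_theta

# A student at the linked-score mean of 0.10 would be reported at about 510 under this assumption.
print(to_namscore(0.10))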
Because the current workflow is specific
to linking, IATA automatically produces
the new scale score using the linked IRT score. Due to the existence
of appropriate link items, IATA’s linking procedure
has been able to produce NAMscores for two separate assessments that can be compared
directly because they are on a common scale.
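Because the two sets of NAMscores are on a common scale, the change in mean performance between cycles can be summarised directly once each cycle’s “Scored” table has been saved. A hypothetical sketch follows; the file names and weight-column names are assumptions, and in practice you would apply the full survey design (weights and standard errors) before interpreting any difference:

import pandas as pd

# Hypothetical exports of the "Scored" data tables from each cycle's analysis.
cycle1 = pd.read_excel("CYCLE1_scored.xls")
cycle2 = pd.read_excel("CYCLE2_scored.xls")

# Weighted mean NAMscore per cycle (weight variable names are assumed from the walkthroughs).
mean_c1 = (cycle1["NAMscore"] * cycle1["CYCLE1Weight"]).sum() / cycle1["CYCLE1Weight"].sum()
mean_c2 = (cycle2["NAMscore"] * cycle2["CYCLE2Weight"]).sum() / cycle2["CYCLE2Weight"].sum()
print("change in mean NAMscore between cycles:", mean_c2 - mean_c1)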
When you have finished adding the new scale score to the results, click the “Next>>”
button to continue.
12.5. Step 4: ASSIGNING PERFORMANCE STANDARDS
The majority of the tasks in the “Response data analysis with linking” workflow
are specified in the same manner as they are in the previous workflows. For simplicity, these exercises are left to you to complete independently. However, performance standards are treated differently. After the first national assessment,
standard setting should only be performed as a validation exercise. It is useful to review the thresholds periodically to determine if new performance standards need to be set (e.g., if quality of education
is rapidly improving), but establishing new thresholds for proficiency levels
should typically coincide with major policy changes, such as curriculum reform.
You should not estimate new thresholds in the current workflow, because performance standards were initially
set and assigned in the CYCLE1 data analysis. The thresholds used for this national
assessment (that were established in the first national assessment) are:
· Level 4: 0.95
· Level 3: 0.35
· Level 2: -0.25
· Level 1: -0.85
You can apply performance standards to the CYCLE2 data by manually entering
the thresholds in the table in the “Threshold” column as shown in Figure 12.7 and clicking the “Add Levels”
button. To update these thresholds, first adjust the RP using the slider; this will cause IATA to generate a
table with default values, which you can replace
with the values assigned from the CYCLE1 walkthrough. The “Level” variable will then be added to the
“Scored” student data table. The proportions of
students assigned to each performance standard will be comparable to the
proportions that may have been
estimated using the CYCLE1 results. Although the specifications of the performance standards do not change based on the workflow,
this interface, like the scaling interface, will recognize that you are using a linking workflow and will assign
students to levels based on the linked IRT score rather than the raw IRT score.
Figure 12.7 Assigning performance standards, CYCLE2 data
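The level assignment itself amounts to comparing each student’s linked IRT score with the thresholds listed above. A minimal sketch of that logic, for illustration only (IATA performs the assignment when you click the “Add Levels” button; the handling of scores below the lowest threshold follows the convention assumed here):

import numpy as np

# Thresholds established during the CYCLE1 standard setting (lower bounds of Levels 1-4).
thresholds = [-0.85, -0.25, 0.35, 0.95]

def assign_level(linked_score):
    # Returns 0 for scores below the Level 1 threshold, otherwise the highest level reached.
    return int(np.searchsorted(thresholds, linked_score, side="right"))

print([assign_level(s) for s in (-1.2, -0.5, 0.0, 0.5, 1.3)])  # -> [0, 1, 2, 3, 4]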
For reference, the item data results of this analysis walkthrough are included in the ItemDataAllTests.xls file, in the worksheet named “ReferenceC2.”
12.6. SUMMARY
In this chapter you performed
an analysis of a follow-up administration of a national
assessment. You used common items to link the IRT and public reporting scales of the CYCLE2 data so that they are comparable to the respective
scales used in the CYCLE1 data.
Using the linked results, you created scale scores and applied performance standards.
Linking
of test results is also useful in many different situations. In addition to comparing results across different
cycles of a national assessment, the following scenarios are also common:
- A country with a number of educational jurisdictions may create a test for each jurisdiction that contains curricular
content that is specific to each jurisdiction. If the tests for the different jurisdictions share some common items, the scores on the different
tests could be used for comparing performance across jurisdictions.
- Test linking can be used to
compare the results of national and international assessments, if the national assessment
includes items previously used and calibrated in an international survey, such as
TIMSS. The parameters that were estimated from the international survey will be
used for linking – however, care should be taken to ensure that the international
test items are suitable in terms of content to the national assessment test
specifications.
In the following
chapter, you will be introduced to some specialised uses of existing
tasks that you have already performed in IATA, including
the use of anchored item parameters with response data to estimate scores, analyse DIF, and link test results.
[1] For a detailed
explanation of test characteristic curves, see Chapter
15, page 206.
[2] For more ‘interesting’ results
with greater error, you can replicate
this analysis with PILOT1 and PILOT2
results as an independent exercise; keep in mind that the goal should be to minimize errors in linkages, so linked tests should
have a sufficient number of anchor items and sample size to produce accurate
statistics.
[3] If you have access to response
data from both assessments, you may perform a more sensitive
analysis by analysing
DIF in the “Response
data analysis” workflow on the combined response
data files using the source
data identifier as the DIF variable. The IRT scores that are produced
will be automatically linked,
but they will not be interpretable on the scale of either test, unless item parameters were anchored (see Chapter 15, page 122).
[4] There is one exception, which occurs
only with highly-discriminating items. Although large differences (e.g., where the error line is greater
than 0.05) between the IRFs typically indicate problems, if these differences occur only over a small range of proficiency (e.g., across a range of less than 0.4 proficiency points -- two tick marks on the graph’s
default x-axis), the differences will not adversely affect the quality
of the linkage.