12. COMPARING ASSESSMENTS TO THE RESULTS OF OTHER ASSESSMENTS
12.1. Overview
Use the CYCLE2 sample data set to carry out this exercise.
The answer key for this test is in the Excel workbook
ItemDataAllTests.xls, in the sheet named CYCLE2.
Governments and others are often interested in finding out if students’
achievement levels have risen, fallen or remained
constant over time. Interest in changing standards
is particularly important in times of curriculum change,
or when there has been substantial
reform in the system including
a change in funding levels. Governments may also be interested in the effects on student achievement of a rapid increase in enrolment due to the implementation of programs such as Education
for All (EFA) or the Fast Track Initiative (FTI), now known as the Global Partnership for Education. With a strong linkage,
scores from a national assessment may be compared
with those from an assessment conducted earlier, allowing you to state whether or not results have changed
in the intervening period.
Continuing with the national assessment scenario
introduced in previous chapters, this walkthrough introduces the methods required to implement
a follow-up cycle in a national assessment program. Once the national
assessment program has been launched, you will need to follow this workflow
to ensure that the interpretations stakeholders make using the results of a new assessment cycle are consistent with and comparable to those made in the first cycle. The CYCLE2 data in this example represent
the second cycle of a national assessment, following the first cycle, CYCLE1, which was analysed in Chapter 10. A link between these two assessments is possible because
the CYCLE2 test contains several anchor items that are also on the CYCLE1 test. The linkage enables the national
assessment program
to monitor changes
in student performance over time. For more details on the statistical
basis for linking,
refer to Chapter 15 (page 197).
As in the previous chapter,
this chapter focuses on the interfaces and specifications that are specific to this workflow.
Review the previous chapters for more detailed explanations of the common workflow interfaces.
From the main menu, click the first menu option,
“Response data analysis
with linking”, to enter the analysis
workflow, as shown in Figure 12.1. This workflow
requires response data, item data (answer keys) for the response data being analysed,
and a reference item data file that will be used to anchor the linked results.
Figure 12.1 Select the “Response data analysis with linking” workflow
12.2. Step 1: ANALYSIS SETUP
1. In the first interface
in the workflow, you must load the response data from the file named “CYCLE2.xls” in the IATA sample data folder. These response data include 2484
records and 61 variables. (The first case
has the following values: SchoolID = 2, Sex = 2, School Size = 21 and Rural = 0.)
2. In the second interface, you
must load the corresponding item data.
When the
ItemDataAllTests.xls file is opened, ensure that the “CYCLE2” table has been selected.
There are 53 records and 4 variables in the CYCLE2 table, including three partial
credit items. (Item MATHC2047 has the following values: Key = C, Level = 1.00, and
it is a Number Knowledge item).
In the third step of the workflow,
you will use a new data loading interface that has not been used in the previous walkthroughs. This interface requests
a file containing reference item data. The reference
item data contain IRT parameters (a, b and, optionally, c) that have been estimated
from a reference sample, such as a previous cycle of a national
or international assessment.
The reference item data file must contain data for at least some of the items that are included in the current national assessment. For this example,
you will use the results produced from the analysis
of the CYCLE1 data file. These results are provided
in the IATA sample data folder in the file named ItemDataAllTests.xls in the sheet named “ReferenceC1” (you may have also saved the results from the exercises in Chapter 10). When you open this file, you must ensure that the selected
table is the one named
“ReferenceC1.” This table, which contains 50 records and 11 variables, holds the statistical output describing all the items from the first cycle of the national assessment. These reference data are illustrated in Figure 12.2. In the current example, the CYCLE2 test includes
25 items that have item parameters in the CYCLE1 reference
item data file. It is important to keep the names of items consistent across all data files, because IATA matches
items in the linking procedure
using item names.
Figure 12.2 Reference item data from CYCLE1 used for linking CYCLE2 data
Note that this file also includes several data fields that were calculated
during the analysis of CYCLE1 data in addition to the a, b and c variables
(e.g., Level, Content, Discr, PVal, PBis and Loading). These variables may be left in the data file, but they are not used by IATA in the linking analysis. Similarly,
although the reference
item data contain information for all 50 items on the CYCLE1 test, only the information
from the 25 items common to the CYCLE2 test will be used to estimate the linkage.
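If you want to verify outside IATA which items can serve as anchors, a quick check is to match item names between the two tables. The following is a minimal sketch in Python; the workbook and sheet names come from this walkthrough, but the item-name column header (“Name”) is an assumption about the workbook layout:

import pandas as pd

# Load the CYCLE2 item data and the CYCLE1 reference item data from the sample workbook.
new_items = pd.read_excel("ItemDataAllTests.xls", sheet_name="CYCLE2")
ref_items = pd.read_excel("ItemDataAllTests.xls", sheet_name="ReferenceC1")

# IATA matches items by name, so the anchor set is simply the intersection of the name columns.
# "Name" is an assumed column header; substitute the actual item-identifier column in your file.
anchors = sorted(set(new_items["Name"]) & set(ref_items["Name"]))
print(len(anchors), "items are common to both tables")  # 25 in this example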
After loading all three data files, you should continue
to the analysis specifications interface (IATA Page 4/12). The analysis specifications are similar to those for the CYCLE1
data. Enter or select the following
details on the analysis specifications
interface and click the “Next>>” button:
1. The student identification variable
is CYCLE2STDID.
2. The
weight variable is CYCLE2Weight.
3. The
code of ‘9’ is treated as incorrect.
Proceeding to the item analysis interface
(IATA Page 5/12) will automatically begin the analysis.
There are no problematic items present in the data. On reviewing
each of the items, note that, although partial credit items have multiple
scores, there may still be ‘easy’ partial credit items, where the majority of respondents achieve the highest score category (such as MATHSA004), and ‘hard’ partial credit items, where the majority of respondents achieve the lowest score (such as MATHSA005, shown in Figure
12.3).
Figure 12.3 Item analysis results
for CYCLE2 data, item MATHSA005, score=1
You may continue through the workflow to review the test dimensionality results and perform any DIF analyses
that may be of interest
on the available demographic variables (rural, sex, language) by following
the same procedures described in previous chapters.
Although many of the items have warning symbols for one or more DIF analyses, we will assume for the purposes of this example that all the items are okay. After you have finished
reviewing the DIF analysis,
click the “Next>>” button to proceed
to the linking interface.
12.3. Step 2: COMMON ITEM LINKING
The linking interface
is shown in Figure 12.4. On the left side of the interface
is a button labelled “Calculate” and a table listing all 25 items that are common to both the reference item data and the “new” data from the current national assessment. In the table of items, the first column, “Use,” specifies
whether or not you wish to include the item in the estimation of the linking constants
(by default, all items that appear in the reference data and new data are included). Column “L” contains a summary diagnostic symbol for each item; the default caution symbol (the yellow diamond)
is updated after IATA calculates the linking results.
The most effective way to use this interface is to first calculate the results with all items, then examine the diagnostic
information to identify
and remove any items with anomalous results. Repeat these two steps until the linkage is stable.
Figure 12.4 Common item linking results, CYCLE2
to CYCLE1
When you click the “Calculate” button,
IATA will estimate the linking constants
and evaluate the statistical quality of the linkage. When the calculation is finished, IATA displays a summary of the quality of the linkage in the graph on the right and updates the
summary diagnostic symbols in the item table, as shown in Figure 12.4. The graph displays three lines: a solid line, a dashed line, and a dotted line. The solid and dashed
lines display test characteristic curves (TCCs)[1]. The TCC of a test summarizes the statistical behaviour
of the entire set of items, providing
similar information as an IRF, but for many items simultaneously. Ideally,
the linked and reference TCCs should be identical (if only one line appears to be visible,
it is likely that the two lines are completely overlapping), indicating that differences in the magnitude
and variability between the link scale and reference scale are accounted
for across the displayed range of proficiency. The dotted line displays the absolute difference between the two TCCs, expressed as a proportion of the total test score. The value of the difference will typically vary across the range of proficiency, indicating that the linkage may not be stable for all ranges of scores.
For score ranges with large differences, the linked results will not be on the same scale as the reference
data and, hence, not be comparable. However,
if the average difference is small (e.g., <0.01), the error may be considered negligible.
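To make the curves in the graph concrete, the sketch below shows how a TCC and the error curve can be computed from a set of IRT item parameters. This is a generic illustration under an assumed 3PL model with the common 1.7 scaling constant, not IATA’s internal code, and the parameter values are placeholders:

import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    # Item response function under an assumed 3PL parameterization.
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, a, b, c):
    # Test characteristic curve: the expected score averaged over items,
    # i.e., the expected proportion of the total test score at each proficiency.
    return np.mean([irf_3pl(theta, ai, bi, ci) for ai, bi, ci in zip(a, b, c)], axis=0)

theta = np.linspace(-4.0, 4.0, 161)  # displayed range of proficiency

# Placeholder parameters for the reference (Target) items and the linked items.
a_ref, b_ref, c_ref = [1.0, 0.8, 1.2], [-0.5, 0.7, 0.1], [0.2, 0.2, 0.2]
a_lnk, b_lnk, c_lnk = [1.0, 0.8, 1.2], [-0.5, 0.7, 0.1], [0.2, 0.2, 0.2]

error = np.abs(tcc(theta, a_ref, b_ref, c_ref) - tcc(theta, a_lnk, b_lnk, c_lnk))
print("average error:", error.mean())  # values near zero (e.g., < 0.01) indicate a stable linkage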
In Figure 12.4, the Target curve (solid black) represents the CYCLE1 test items, and
the Linked curve (dashed red) represents
the CYCLE2 test items after the application of the linkage.
It is difficult to see both curves in Figure 12.4, because the Target and Linked test characteristic curves are near-identical; this is also indicated by the Error curve, which has a constant value of approximately 0 across the displayed range of proficiency[2].
Beneath
the graph, the estimated linking constants are displayed in the two textboxes.
The location constant adjusts for differences in the magnitude
of the original scales of the new data and the reference data, and the scale constant
adjusts for differences in the variability between the scales (see page 197). Generally speaking,
the two constants can be interpreted together
such that dividing any value expressed on the raw CYCLE2 IRT scale (e.g., a student’s
IRT score
from the current analysis results)
by the scale constant and adding the location constant will render the linked result directly comparable to scores on the CYCLE1 IRT scale. This comparability means
that, after the
scale linkage has been applied,
any remaining differences between the CYCLE1
results and the transformed CYCLE2 results represent
differences in test performance, rather than differences in the tests themselves. Further details on the principles and calculations of IRT linking are presented
in Chapter 15 (see page 197).
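The transformation itself is simple arithmetic. A minimal sketch, using illustrative constants rather than the actual values shown in the textboxes:

def link_score(raw_theta, scale_constant, location_constant):
    # Divide a value on the raw CYCLE2 IRT scale by the scale constant and add the
    # location constant to express it on the CYCLE1 reference scale.
    return raw_theta / scale_constant + location_constant

# Hypothetical constants for illustration only; use the values IATA displays.
print(link_score(0.40, scale_constant=1.05, location_constant=0.12))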
In the item table, the diagnostic
symbols are updated after calculation to indicate any potentially problematic linkages at the item level. A linked item is problematic if its linked IRF is very different
from the reference IRF. If you click on any item in the item list, you will be able to view the results of the linking function applied to each test item. As with the overall TCC comparisons, the linked IRF should be similar
to the target IRF[3]. Even if the results for the overall test appear very good, the linking function may not work well for some items. However, there is more sampling error at the item level, so differences between IRFs are typically problematic only if the error between the linked and reference
IRFs is greater than 0.05[4].
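The item-level check can be thought of in the same way as the test-level check, applied to a single IRF. A rough sketch of the rule of thumb described here and in footnote 4, using an assumed 3PL IRF and illustrative parameters (IATA’s exact diagnostic may differ):

import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

theta = np.linspace(-4.0, 4.0, 161)
step = theta[1] - theta[0]

# Illustrative parameters for one item: reference (CYCLE1) versus linked (CYCLE2) estimates.
diff = np.abs(irf_3pl(theta, 1.1, 0.20, 0.2) - irf_3pl(theta, 1.0, 0.35, 0.2))

# Flag the item only if the difference exceeds 0.05 over a wide range of proficiency
# (an excursion above 0.05 spanning less than about 0.4 proficiency points is tolerable).
flag = (diff > 0.05).sum() * step > 0.4
print("flag item:", flag, "| max difference:", round(diff.max(), 3))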
A common example of a situation that would cause an individual test item to show idiosyncratic behaviour in a linking analysis
occurs when a specific content area measured by a link item is used as the basis of instructional interventions (such as emphasis on using a particular form of graph in Mathematics or on an aspect of grammar in Language) between
the two testing periods, while the content measured by the other linking items is not. Because performance on that specific
test item is likely to improve in an idiosyncratic way, the linking constants estimated
from all items together
will not account for the item-specific changes
between the first and second administrations.
A mild example of this phenomenon is present in the existing
data, with item MATHC1052, flagged
with a caution symbol. Selecting
this item illustrates the results for the item, shown in Figure 12.5. Although the linking constants appear to have successfully adjusted
for the difference in location of the item (i.e., the difficulty of the item relative
to the given sample), there are some gaps, particularly at higher proficiency levels. The Target IRF and the Linked IRF are distinct from each other, and the dotted line at the bottom, which expresses the difference
between the two IRFs, ranges up to 0.08 but is generally smaller than 0.05. These differences are inconsequential, and, in most practical situations, this amount of error is not problematic.
Figure 12.5 Common item linking results, CYCLE2
to CYCLE1, item MATHC1052
In the event that the differences between the target and linked IRFs are large enough to be problematic (e.g., consistently greater than 0.05 across a wide range of proficiency), the offending item should be removed by unchecking the box next to the item name and then clicking the “Calculate” button.
Although one or two items may be removed without introducing validity
issues, if many items are removed from the estimation of linking functions, the validity of the linkage may become weak, because the anchor items may not adequately reflect the intended
balance of content. In general, if the statistical analysis of results suggests some items should be removed from the linkage, the recommendation should
be brought to the national
assessment steering committee
so that the consequences may be weighed before choosing
whether to include or remove the results. Typically,
the fewer items that are common between the two assessments being linked, the weaker the linkage is and the more problematic items will be identified. However, for the current example, the results indicate a very stable linkage. As a result, the national
assessment team can be confident that the test used for the current national assessment (CYCLE2) is appropriate for monitoring changes in student achievement levels since the previous national assessment.
Beneath
the linking constants
are two controls – a dropdown menu and a button labelled “Update” – that allow you to apply the linking constants directly to the results of the current analysis. To apply the constants to the items parameters, ensure that the “Items1” table is selected
in the drop-down menu and click the “Update” button. To update the estimated IRT scores in the current results, select “Scored”
from the drop-down menu and click the “Update” button. IATA will notify you when the results have been updated.
If you updated the “Items1” table, linked a and b parameters will appear in the “Items1”
data table, with the suffix “link” to indicate that they are now expressed on the reference
scale. If you select the “Scored” table to update, the updated
IRT scores will appear in the “Scored”
data table, with the “_link” suffix.
When you have completed updating
both the Items1 and Scored results, click the “Next>>” button to continue.
12.4. Step 3: RESCALING LINKED RESULTS
If you have updated
the Scored results with the linking constants
in the previous interface, IATA’s Scale Review and Scale Setting
interface (IATA Page 9/12) will include the name “LinkedScore” in the drop-down menu at the upper left. Select “LinkedScore” from the drop-down menu to display the graphical
and statistical summaries for the linked CYCLE2 results, which are expressed on the scale established by the CYCLE1 reference item data. The LinkedScore has a mean of 0.10 and a standard deviation
of 1.07.
To convert the linked IRT score to a scale score that can be compared to the NAMscore variable that was produced during the analysis
of CYCLE1 data, perform the following
steps:
1. Enter the name “NAMscore” in the textbox beneath the “Add New Scale Score” label.
2. Enter “100” for the St. Deviation, the value originally set for the CYCLE1 data.
3. Enter “500” for the Mean, the value originally set for the CYCLE1 data.
4. Select the Rescale option.
This option will ensure that the new scale score will retain the linkage that was
estimated in the previous interface.
5. Click the “Calculate” button. IATA will create the new scale
score and display the distribution and descriptive statistics as shown in
Figure 12.6.
Figure 12.6 Scale score results, CYCLE2 test scores expressed
on the reporting scale of CYCLE1 (NAMscore)
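For intuition, the Rescale option can be thought of as a linear transformation of the linked IRT score onto the CYCLE1 reporting metric. A minimal sketch, under the assumption that the scale score is computed as mean plus standard deviation times the linked IRT score (this is an assumption for illustration; IATA performs the calculation for you):

def to_namscore(linked_theta, mean=500.0, sd=100.0):
    # Assumed rescaling that preserves the linkage: map the linked IRT score
    # onto the CYCLE1 reporting scale with mean 500 and standard deviation 100.
    return mean + sd * linked_theta

# A student at the linked-score mean of 0.10 would be reported at about 510 under this assumption.
print(to_namscore(0.10))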
Because the current workflow is specific
to linking, IATA automatically produces
the new scale score using the linked IRT score. Due to the existence
of appropriate link items, IATA’s linking procedure
has been able to produce NAMscores for two separate assessments that can be compared
directly because they are on a common scale.
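Because the two sets of NAMscores are on a common scale, the change in mean performance between cycles can be summarised directly once each cycle’s “Scored” table has been saved. A hypothetical sketch follows; the file names and weight-column names are assumptions, and in practice you would apply the full survey design (weights and standard errors) before interpreting any difference:

import pandas as pd

# Hypothetical exports of the "Scored" data tables from each cycle's analysis.
cycle1 = pd.read_excel("CYCLE1_scored.xls")
cycle2 = pd.read_excel("CYCLE2_scored.xls")

# Weighted mean NAMscore per cycle (weight variable names are assumed from the walkthroughs).
mean_c1 = (cycle1["NAMscore"] * cycle1["CYCLE1Weight"]).sum() / cycle1["CYCLE1Weight"].sum()
mean_c2 = (cycle2["NAMscore"] * cycle2["CYCLE2Weight"]).sum() / cycle2["CYCLE2Weight"].sum()
print("change in mean NAMscore between cycles:", mean_c2 - mean_c1)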
When you have finished adding the new scale score to the results, click the “Next>>”
button to continue.
12.5. Step 4: ASSIGNING PERFORMANCE STANDARDS
The majority of the tasks in the “Response data analysis with linking” workflow
are specified in the same manner as they are in the previous workflows. For simplicity, these exercises are left to you to complete independently. However, performance standards are treated differently. After the first national assessment,
standard setting should only be performed as a validation exercise. It is useful to review the thresholds periodically to determine if new performance standards need to be set (e.g., if quality of education
is rapidly improving), but establishing new thresholds for proficiency levels
should typically coincide with major policy changes, such as curriculum reform.
You should not estimate new thresholds in the current workflow, because performance standards were initially
set and assigned in the CYCLE1 data analysis. The thresholds used for this national
assessment (that were established in the first national assessment) are:
· Level 4: 0.95
· Level 3: 0.35
· Level 2: -0.25
· Level 1: -0.85
You can apply performance standards to the CYCLE2 data by manually entering
the thresholds in the table in the “Threshold” column as shown in Figure 12.7 and clicking the “Add Levels”
button. To update these thresholds, first adjust the RP using the slider; this will cause IATA to generate a
table with default values, which you can replace
with the values assigned from the CYCLE1 walkthrough. The “Level” variable will then be added to the
“Scored” student data table. The proportions of
students assigned to each performance standard will be comparable to the
proportions that may have been
estimated using the CYCLE1 results. Although the specifications of the performance standards do not change based on the workflow,
this interface, like the scaling interface, will recognize that you are using a linking workflow and will assign
students to levels based on the linked IRT score rather than the raw IRT score.
Figure 12.7 Assigning performance standards, CYCLE2 data
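The level assignment itself amounts to comparing each student’s linked IRT score with the thresholds listed above. A minimal sketch of that logic, for illustration only (IATA performs the assignment when you click the “Add Levels” button; the handling of scores below the lowest threshold follows the convention assumed here):

import numpy as np

# Thresholds established during the CYCLE1 standard setting (lower bounds of Levels 1-4).
thresholds = [-0.85, -0.25, 0.35, 0.95]

def assign_level(linked_score):
    # Returns 0 for scores below the Level 1 threshold, otherwise the highest level reached.
    return int(np.searchsorted(thresholds, linked_score, side="right"))

print([assign_level(s) for s in (-1.2, -0.5, 0.0, 0.5, 1.3)])  # -> [0, 1, 2, 3, 4]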
For reference, the item data results of this analysis walkthrough are included in the ItemDataAllTests.xls file, in the worksheet named “ReferenceC2.”
12.6. SUMMARY
In this chapter you performed
an analysis of a follow-up administration of a national
assessment. You used common items to link the IRT and public reporting scales of the CYCLE2 data so that they are comparable to the respective
scales used in the CYCLE1 data.
Using the linked results, you created scale scores and applied performance standards.
Linking
of test results is also useful in many different situations. In addition to comparing results across different
cycles of a national assessment, the following scenarios are also common:
- A country with a number of educational jurisdictions may create a test for each jurisdiction that contains curricular
content that is specific to each jurisdiction. If the tests for the different jurisdictions share some common items, the scores on the different
tests could be used for comparing performance across jurisdictions.
- Test linking can be used to
compare the results of national and international assessments, if the national assessment
includes items previously used and calibrated in an international survey, such as
TIMSS. The parameters that were estimated from the international survey will be
used for linking – however, care should be taken to ensure that the international
test items are suitable in terms of content to the national assessment test
specifications.
In the following
chapter, you will be introduced to some specialised uses of existing
tasks that you have already performed in IATA, including
the use of anchored item parameters with response data to estimate scores, analyse DIF, and link test results.
[1] For a detailed
explanation of test characteristic curves, see Chapter
15, page 206.
[2] For more ‘interesting’ results
with greater error, you can replicate
this analysis with PILOT1 and PILOT2
results as an independent exercise; keep in mind that the goal should be to minimize errors in linkages, so linked tests should
have a sufficient number of anchor items and sample size to produce accurate
statistics.
[3] If you have access to response
data from both assessments, you may perform a more sensitive
analysis by analysing
DIF in the “Response
data analysis” workflow on the combined response
data files using the source
data identifier as the DIF variable. The IRT scores that are produced
will be automatically linked,
but they will not be interpretable on the scale of either test, unless item parameters were anchored (see Chapter 15, page 122).
[4] There is one exception, which occurs
only with highly-discriminating items. Although large differences (e.g., where the error line is greater
than 0.05) between the IRFs typically indicate problems, if these differences occur only over a small range of proficiency (e.g., across a range of less than 0.4 proficiency points -- two tick marks on the graph’s
default x-axis), the differences will not adversely affect the quality
of the linkage.