10. CHAPTER 10 PERFORMING A FULL ANALYSIS OF FINAL TEST ADMINISTRATION DATA
Use the CYCLE1 sample data set to carry out this exercise.
The answer key for this test is in the Excel workbook,
ItemDataAllTests.xls, in the sheet named CYCLE1.
Continuing the scenario introduced in the previous chapter, the national
assessment team has produced and administered a test to a national
sample of students.
The final test includes
50 items, representing five content areas (number knowledge, shape and space, relations, problem solving, and uncertainty) in proportions determined by the test specifications.
The final sample design is a stratified cluster sample, with schools as the primary sampling unit and a target sample of 30 students from each school. The sample includes
79 schools, selected to be representative of five national regions and stratified by rural status and language of instruction. The total number of students in the sample is 2,242, representing a population of approximately 86,000.
This walkthrough follows
the same steps as the analysis of pilot test data. However,
because the final test is concerned primarily
with producing and interpreting scores, the item analysis is typically performed
without the exploratory emphasis present in the analysis of pilot test data. Accordingly, this walkthrough will focus on the unique aspects of final test data analysis
that distinguish it from analysis
of pilot test data. Where
the steps of analysis are identical to those discussed
in the previous chapter, refer to the information presented there.
Begin the analysis
by clicking “Response data analysis” on the IATA main
menu.
10.1. Step 1: SETTING UP THE ANALYSIS
The procedures for setting up the analysis
are similar to those in the previous chapter. You must first load a response file, then load an item data file, and then specify the analysis. If you do not know how to perform these steps, refer to Steps 1 to 3 in the previous chapter for detailed instructions on how to perform each task. Referring to the contents of the IATA sample data folder:
- The response data file for this chapter is CYCLE1.xls. This file has 2,242 records and 58 variables.
- The item
data file is in the Excel file named “ItemDataAllTests.xls” in the table named
“CYCLE1”. Ensure that the correct table name is selected in the item data loading
interface. The CYCLE1 item data has 50 records and 4 variables.
The items in this national
assessment test are a subset of the pilot test items analysed in the previous chapter.
The specifications for this analysis are slightly different
from the pilot test data analysis, primarily
resulting from the use of scientific sampling
in the full administration of the national
assessment. The first difference is the identification variable, which is named “CYCLE1STDID”. The second difference, which will have an effect on the results of the analysis, is the presence of a sample design weight, named “CYCLE2weight”. These variable
specifications must be selected from the drop-down menus. In these data, the value of 9 represents
missing responses that will be treated as incorrect. The completed specifications should look like Figure 10.1.
Figure 10.1 Analysis specifications for CYCLE1 data
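For readers who want to verify the scoring rule outside IATA, the following Python sketch illustrates, under stated assumptions, how responses could be scored against the answer key with the missing-response code 9 treated as incorrect, and how the design weight enters a weighted summary. The column names “Name” and “Key” in the item sheet are assumptions about the workbook layout; IATA performs the equivalent step internally.

```python
import pandas as pd

# Illustrative sketch only (not IATA's internal code). Assumes the response file
# has one column per item plus "CYCLE1STDID" and the design weight "CYCLE2weight",
# and that the item sheet has "Name" and "Key" columns (assumed names).
responses = pd.read_excel("CYCLE1.xls")
key = (pd.read_excel("ItemDataAllTests.xls", sheet_name="CYCLE1")
         .set_index("Name")["Key"])

scored = pd.DataFrame(index=responses.index)
for item, correct in key.items():
    raw = responses[item]
    # The missing-response code 9 is scored as incorrect (0), as specified above.
    scored[item] = ((raw == correct) & (raw != 9)).astype(int)

# Weighted percent correct per item, using the sample design weight.
weights = responses["CYCLE2weight"]
pct_correct = scored.mul(weights, axis=0).sum() / weights.sum()
print(pct_correct.head())
```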
Note that the item data for the final assessment also include data in the “Level” field in the third column of the table on the left. These data are natural numbers (1 or greater) that represent the expected level of performance or proficiency that the curriculum content specialists assigned
to each test item: Level 1 represents the lowest level of performance (i.e., minimum competence) and Level 4 represents the highest level. Although
every item is assigned a level, it is possible
that students will not achieve even the lowest level.
After verifying that the specifications and data are correct, click the “Next>>” button to continue.
The analysis will begin automatically, updating the interface
with the progress periodically. With larger data sets or slower computers, the analysis may appear to hang on the “Estimating parameters” stage, which is the most time consuming. Do not close the program;
IATA will continue to run and will provide an update when the analysis is complete.
10.2. Step 2: BASIC ANALYSIS RESULTS
Because
problematic items were identified and removed during the analysis of the pilot test data, there are no remaining problematic items in these full test data. You should confirm
that the items are behaving
appropriately by reviewing
1) the item analysis and 2) the test dimensionality results. If you do not know how to perform these steps, refer to Steps 4 and 5 in the previous chapter for detailed instructions on how to perform these tasks. Note that all of the items have green circles with the exception of MATHC1046, which we identified in the previous chapter as being somewhat
problematic but which we left in the test. Proceed
to the differential item functioning
interface when you have finished.
10.3. Step 3: ANALYSIS OF DIFFERENTIAL ITEM FUNCTIONING
Although DIF analysis was performed
on the pilot test data, the results of DIF analyses tend to be sensitive to sampling errors, so it is good practice to replicate these analyses with the full sample.
Another reason to perform DIF analysis is that there may be additional variables available in the full sample that were not available in the pilot sample, or that the full sample may provide a sufficient number of cases to perform a DIF analysis that was not possible with the smaller pilot sample. For example, in the pilot data analysed in Chapter 9, all students
in the sample were from urban areas, whereas the full sample contains students
from both rural and urban areas.
Replication of the DIF analyses from the previous chapter is left as an independent exercise. For this example,
we will perform a DIF analysis using the variable “rural”.
We wish to see if rural students
are disadvantaged, relative
to their urban counterparts. For the CYCLE1 data, a value of “1” for this indicator
means that a student is attending
a rural school. In order to specify this analysis
and review the results, perform the following
steps:
1. From the drop-down menu on the left, select the “rural” variable.
When you do so, the table beneath will be populated with the values “0.00” and “1.00”, with values of 56% for “0.00” and 44% for “1.00,” indicating that 44% of the students (unweighted) in the sample attend rural schools.
2. In the table of values, click on the value “1.00” –
this will cause the value of 1.00 (representing rural students) to be entered as
the Focus group in the text box beneath.
3. In the table of values, click
on the value “0.00” – this will cause the value of 0.00 (representing urban students)
to be entered as the Reference group in the text box beneath.
4. Click the “Calculate” button and wait for the calculation
to complete.
When the calculation is complete, in the item list, click
on the header of the “S-DIF” column to sort all the items by the value of the
S-DIF statistic.
When you have completed these steps, the interface will appear as illustrated in Figure 10.2. Compared to the results presented in the previous chapter, the items show much more stability in the empirical IRFs than was seen in the PILOT1 data. If you were to replicate the analyses presented in the previous chapter with the current data, you would see fewer differences between groups and, generally, much smaller U-DIF statistics. The increased stability is largely the result of the increased sample size. Reviewing each of the items, you will see that the majority of both S-DIF and U-DIF statistics are less than 5, indicating
that, after controlling for differences in proficiency between rural and urban students,
the differences in item performance
between rural and urban students
tend to be negligible.
Figure 10.2 DIF analysis results for CYCLE1 data by rural status,
item MATHC1008
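The S-DIF and U-DIF statistics summarize signed and unsigned differences between the focal and reference groups' empirical item response functions. The sketch below is a simplified analogue of that idea rather than IATA's exact computation: it compares each group's percent correct on an item within bands of comparable overall proficiency and aggregates the signed and absolute differences.

```python
import numpy as np
import pandas as pd

def signed_unsigned_dif(score, correct, group, n_bins=10):
    """Simplified analogue of S-DIF/U-DIF (not IATA's exact algorithm).
    score:   overall proficiency or total score for each student
    correct: 0/1 score on the item of interest
    group:   0 = reference (urban), 1 = focal (rural)"""
    df = pd.DataFrame({"score": score, "correct": correct, "group": group})
    df["band"] = pd.qcut(df["score"], n_bins, labels=False, duplicates="drop")
    by_band = df.groupby(["band", "group"])["correct"].mean().unstack("group")
    share = df.groupby("band").size() / len(df)        # share of students per band
    diff = (by_band[1] - by_band[0]) * 100.0           # focal minus reference, in percent
    s_dif = float((diff * share).sum())                # signed: differences may cancel out
    u_dif = float((diff.abs() * share).sum())          # unsigned: they may not
    return s_dif, u_dif
```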
The purpose of performing DIF analysis
at the final test stage of a national assessment
is to determine if an item should be made ineligible for calculating student scores. At this stage of the analysis, it would be appropriate to share the statistical analysis results with the national
assessment steering committee, who would determine
if the potentially problematic items should be removed or retained. If an item is removed, the analysis
may be rerun either by deleting the item’s answer key in the analysis specifications interface or by unchecking the item in the item analysis interface. For the current example, we shall assume that all items will be retained.
When you have finished reviewing
all items, click the “Next>>” button to continue.
10.4. Step 4: SCALING
The default scale used to calculate the results for the IRT scale scores is the standard or
Z scale, which has a mean of 0 and a standard
deviation of 1. Scores expressed on this scale can appear to be problematic to many stakeholders, because half the students will have ‘negative’ scores.
Similarly, scores bounded by 0 and 100 also have communication challenges: most audiences tend to assume that a score of 50 represents a passing score, which may not be true, depending
on the test specifications.
For communication
purposes, it may be undesirable to report test results with an average
score less than 50 percent or below 0. Journalists, policy-makers and other commentators may not appreciate the statistical nature of negative
values and incorrectly infer that half the population is below or above standard
(or, even worse, that half the population
has ‘negative’ proficiency). Some large-scale
assessments transform their calculated scores into scales which have a mean of 500, 100, or 50 and standard deviations
of 100, 20 and 10, respectively. Each national
assessment team should
select the type of score that is most likely to facilitate
effective communication of results.
There
are two types of scaling
that may be performed in IATA: setting the scale and rescaling. Setting the scale allows you to specify the desired mean and standard deviation of the scale scores. Rescaling
allows you to apply a simple linear transformation to the IRT scores, which is useful if the scale scores must be compared to
a scale that has been established from a previous
analysis. In this case, item parameters from the previous cycle can be used to estimate test scores or equate results from the student data in the new cycle so that the IRT scores that IATA calculates
are comparable to the previous cycle’s calculated IRT scores. The calculated results can then be rescaled
using the rescale function so that they are comparable with the reported scale from the previous cycle.
In either case, the new scale score is created by entering
the name of the new score and
specifying the standard deviation and mean in the appropriate boxes. When you click the “Calculate” button, IATA will produce the new scale scores and display the distribution and summary statistics.
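The two options correspond to two simple linear transformations of the IRT scores. The sketch below illustrates the distinction under stated assumptions; in particular, whether IATA uses weighted or unweighted sample moments when setting the scale is an assumption here.

```python
import numpy as np

def set_scale(irt, mean=500.0, sd=100.0, w=None):
    """'Set scale': standardize the sample of IRT scores first, so the new scores
    have exactly the requested mean and standard deviation in this sample.
    (Whether IATA uses weighted or unweighted moments here is an assumption.)"""
    irt = np.asarray(irt, dtype=float)
    w = np.ones_like(irt) if w is None else np.asarray(w, dtype=float)
    m = np.average(irt, weights=w)
    s = np.sqrt(np.average((irt - m) ** 2, weights=w))
    return mean + sd * (irt - m) / s

def rescale(irt, mean=500.0, sd=100.0):
    """'Rescale': apply the linear transformation directly to the IRT scores,
    which preserves comparability with a previously established scale."""
    return mean + sd * np.asarray(irt, dtype=float)

# Example: NAMscore = set_scale(irt_scores, mean=500, sd=100, w=design_weights)
```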
Because the primary function
of the analysis of national
assessment test data is to produce scores that may be interpreted and analysed, the scaling interface
receives more attention
with analyses of full test data than it does with analyses
of pilot test data. This walkthrough makes two main uses of the scaling interface: first, reviewing the distribution of proficiency relative to the distribution of test information will inform the quality of inferences that may be made about different ranges of proficiency; and second, creating a reporting scale for the test results will establish a metric for communicating results to stakeholders.
To review the distribution of IRT scores, select “IRTscore” from the drop-down menu at the upper left of the interface.
The interface will update with descriptive details
about the IRT scores and the test information, as shown in Figure 10.3. The mean of the IRTscore distribution is -0.02 and the standard deviation is 1.04. These values are not meaningful in themselves, as they represent
the arbitrary scale on which the items were
calibrated. The graph indicates that test information, illustrated by the solid black
line, is slightly wider than the distribution of proficiency; this result is statistically ideal in that it minimizes the average standard
error of measurement at all levels of proficiency for the given distribution (see Chapter 15, page 185). The frequency spike at the left-hand side of the graph, at approximately -3 on the proficiency scale, corresponds to students who did not answer any items correctly
on the test. The test does not have sufficient information to determine these students’
proficiency with accuracy,
because the test does not have many very easy items; as a result, these students are assigned the same arbitrarily low score.
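The connection between the test information curve and measurement precision can be made explicit. The sketch below assumes a three-parameter logistic (3PL) model with the conventional scaling constant D = 1.7 and arrays of item parameters a, b, and c; IATA's internal computation may differ in detail.

```python
import numpy as np

D = 1.7  # conventional logistic scaling constant; an assumption about the metric

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response to each item at proficiency theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_information(theta, a, b, c):
    """Test information: the sum of 3PL item information values at theta."""
    p = p_3pl(theta, a, b, c)
    item_info = (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p
    return item_info.sum()

def sem(theta, a, b, c):
    """Standard error of measurement at theta: 1 / sqrt(test information)."""
    return 1.0 / np.sqrt(test_information(theta, a, b, c))

# With few very easy items, test_information(-3.0, a, b, c) is small, so sem(-3.0, a, b, c)
# is large; this is why all zero-score students receive the same arbitrarily low score.
```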
These results also indicate that the test was
relatively difficult for students. The peak of
the information function tends to be located at the region of proficiency where students are most likely to score 50%.
In Figure 10.3, this peak is slightly above the
mean score of -0.02, indicating that above-average students tended to
score only 50% correct. While this
result provides good statistical accuracy, the results may be disappointing to stakeholders who are
used to interpreting any result less than 50% as
a failure.
Figure 10.3 Distribution of proficiency (IRT score) and test information, CYCLE1
data
To produce a more useful reporting scale based on the IRT score, use the “Add New Scale
Score” functions at the bottom right of the interface. For this example,
let us assume that the National Steering Committee has requested a new scale with a mean of 500 and a standard deviation of 100. This scale will be set in the first cycle of the national assessment and used in subsequent cycles to report on progress over time. The name of this score will be “NAMscore” (National
Assessment of Mathematics score). To provide these specifications, perform the following
steps:
1. Type “NAMscore” in the text box beneath the “Add New Scale Score” label.
2. Enter a value of “100” for
the St. Deviation.
3. Enter a value of “500” for
the Mean.
4. Make sure the “Set scale”
option is selected. This will ensure that the scale score produced will have a mean
exactly equal to 500 and a standard deviation exactly equal to 100 for the sample
(the Rescale option will simply adjust the existing IRT score by the specified mean
and standard deviation).
5. Click the “Calculate”
button.
When IATA is finished
processing the request, it will update the interface
with the summary graph and statistics for the newly-created scale score, shown in Figure 10.4.
Figure 10.4 Distribution and summary statistics for new scale score (NAMscore), CYCLE1 data
There
are relatively few limitations in selecting a derived scale score. You can use any valid name for the derived scale score so long as it is not already used in the response data (see Chapter
8 for naming conventions and restricted variable
names). The mean can be any real number, and the standard
deviation can be any real number greater than 0. However, it is important to ensure that the lowest reported student scores are not less than 0. Since the lowest score is usually around three to four standard deviations
below the mean, it is good practice
to set the mean to be at least 4 standard deviations
above 0. The IEA, for instance, usually reports achievement
results using a mean of 500 and a standard
deviation of 100. The choice of a reporting scale should be discussed
with the national
assessment steering
committee at the initial planning
stages so that all stakeholders understand how to interpret the reported results.
After the new scale score has been created, click the “Next>>” button to continue.
10.5. Step 5: SELECTING TEST ITEMS
The CYCLE1 data represent the initial cycle of a national assessment program. Looking to the future, it will be necessary
in subsequent cycles to alter the test and maintain a linkage to the initial cycle’s results.
To do this, you will need to select a subset of items that are both accurate
and representative of the continuum of proficiency.
A reasonable practice
for maintaining a strong linkage between tests is to keep approximately 50% of the items common between adjacent assessments, also known as anchor items. To facilitate the process of selecting anchor items, you can use the item selection functionality of IATA to produce a table of items ranked by their suitability for maximizing accuracy
across the proficiency range. To perform this selection, complete the following
steps:
1. Type the name “ItemRanks” into the name of item selection field.
2. Type the number 50 in the
number of items field to select all items.
3. Leave the lower and upper bounds at their default values
of 2 and 98.
4. Click the “Select Items”
button.
The complete
results are shown in Figure 10.5. All of the available items have been selected and categorized by content and
cognitive levels from their original specifications.
The table of results produced by these specifications, stored as an IATA item data table, ranks each item according to its suitability for inclusion in the set
of common items. This table should be provided to the test developers responsible for modifying the
cycle 2 (or next) national assessment so that they can select a set of common items, taking into account information
about the content and psychometric
value of each test item used in the cycle 1 (or first) national assessment. Ideally, a set of anchor items should contain 20 to 50 percent of the number of items in the complete test, and the items should represent the content and cognitive test specifications in the same proportions
as the complete test. A pragmatic method of selecting
items would be to begin with the most desirable items and allocate items to the cells of the new test specifications
according to their content and cognitive levels
until the desired number is reached in each cell or the list of items is
exhausted.
Figure 10.5 Selecting items,
CYCLE1 data
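The pragmatic allocation method described above can be sketched in a few lines. The column names (“Name”, “Content”, “Level”, “Rank”) are assumptions about the exported ItemRanks table, and the cell targets come from the new test specifications.

```python
import pandas as pd

def pick_anchor_items(item_ranks: pd.DataFrame, cell_targets: dict) -> pd.DataFrame:
    """Greedy sketch of anchor-item selection: walk down the suitability ranking
    and allocate each item to its (content, level) cell until that cell's target
    is reached or the list of items is exhausted.

    item_ranks:   exported table with columns "Name", "Content", "Level", "Rank"
                  (assumed names)
    cell_targets: e.g. {("Number knowledge", 1): 3, ("Shape and space", 2): 2, ...}
    """
    remaining = dict(cell_targets)
    chosen = []
    for _, item in item_ranks.sort_values("Rank").iterrows():
        cell = (item["Content"], item["Level"])
        if remaining.get(cell, 0) > 0:          # this cell still needs items
            chosen.append(item)
            remaining[cell] -= 1
    return pd.DataFrame(chosen)
```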
10.6. Step 6: SETTING PERFORMANCE STANDARDS
In the first cycle of a national assessment, it is important
to lay the groundwork for interpreting the scores produced
by the assessment. Most modern assessments report results in terms of levels.
International assessments such as
PIRLS, PISA and TIMSS, as well as many national
assessments such as NAEP, publish student achievement scores in terms of performance or benchmark levels
(see Greaney and Kellaghan, 2008; Kellaghan, Greaney,
and Murray, 2009). TIMSS, for example, reported scores using four benchmarks: “low”, “intermediate”, “high”, and “advanced”
(Martin, Mullis, and Foy, 2008). It is important that the performance standards are meaningful, rather than arbitrary
statistical thresholds like percentiles, because
they are the primary tool used to summarize and report student performance. The process of
defining meaningful performance standards is known as standard setting.
IATA facilitates standard setting procedures by first setting specific response probabilities (RP) of correct
response for each item, then calculating the proficiency values (RP values) associated with the specified RP. For example, if a response probability (RP) is set at 50%, then the RP value for an item would be the proficiency level associated with a 50% chance of responding correctly. A wide variety of response probabilities (RPs) are used by different assessments, typically ranging from 50% to 80%; the most common practice is to use 67%, which tends to be statistically
optimal at the item level. However, the choice of RP should also be informed by normative definitions of what probability of success constitutes sufficient mastery and knowledge of the consequences of how the standards will be used. For example, in an educational context, where the consequences of reporting failure tend to be greater
than those of reporting success,
lower RPs may be preferred.
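For a given item, the RP value can be obtained by inverting the item response function at the chosen response probability. The sketch below assumes a 3PL model with scaling constant D = 1.7; it is an illustration of the definition, not IATA's documented computation.

```python
import numpy as np

def rp_value(a, b, c, rp=0.67, D=1.7):
    """Proficiency at which the probability of a correct response equals rp.
    Obtained by solving c + (1 - c) / (1 + exp(-D * a * (theta - b))) = rp for theta.
    Only defined when rp is above the guessing floor c."""
    return b + np.log((rp - c) / (1.0 - rp)) / (D * a)

# Example: an item with a = 1.0, b = 0.5, c = 0.2 has
# rp_value(1.0, 0.5, 0.2, rp=0.67) = 0.5 + ln(0.47 / 0.33) / 1.7, about 0.71.
```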
Prior to analyzing
the data, a panel of stakeholders including curriculum and teaching experts, in consultation with the national
assessment steering committee, should decide on the number of proficiency levels to be used. Some national assessments
simply choose two levels such as “acceptable” and “not acceptable”; others choose three levels such as “poor”, “adequate”, and “advanced”; still others, such as TIMSS and PISA, use four or more. If the stakeholder panel decides on more than two levels, each proficiency level except the lowest level should be defined by a set of items that are considered “answerable” by students displaying that level of performance. Generally, unless there are hundreds of items included
in the assessment (requiring a rotated booklet design), there will not be enough items to adequately define more than three
or four levels.
The interface for performing this analysis is shown in Figure 10.6. On the left, a drop-down menu allows you to select a source of items for item selection. As with the item selection interface, you have the option of selecting
any of the item data sources available in the current workflow. For the current analyses, only the “Items1”
table is available[1]. The items from the selected
source are listed in the table beneath the drop-down menu. The values in the “Level” column may be edited directly in each row. To estimate statistically optimal thresholds based on the current item classification, move the vertical slider in the center of the interface to the desired RP. When the interface is opened, the default RP is 67%, indicating that the criterion used to rank items or estimate
optimal thresholds is a 67% probability of a correct
response on each item.
When you click on the vertical slider or adjust its value, IATA will update the optimal thresholds and produce the results on the right hand side in the graph window and the table of results at the bottom. The graph illustrates the position of each threshold
with vertical lines relative to the distribution of proficiency and the test information function. This information illustrates the usefulness of the levels. For example,
if there are very few respondents in a level, then any summary statistics describing students in that level will be based on too few cases to be stable or interpretable. Similarly, if the test is not accurate
at the threshold of a level, then the classification of students into that level will be inaccurate.
The table beneath the graph window describes
the items representing each level with the mean and standard
deviation of item b-parameters. The right-most column in the table contains the threshold
that was estimated for each level. In Figure 10.6, the mean and standard deviation
of the b-parameters for Level 4 are 0.77 and 0.38, respectively. The RP67 threshold
for Level 4 is 1.08. These statistics are useful in determining if the assignment
of items is reasonable. For example, if the standard deviation of items in a level is larger than the distance between the means or thresholds of adjacent levels, the statistical basis for defining the levels may be weak. For
these results, the standard deviation
within levels is approximately 0.35, and the distance between adjacent levels is approximately 0.4, indicating that the levels are well-defined.
Figure 10.6 Default performance standards
interface, CYCLE1
data
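The diagnostic described above can be reproduced from any exported item table that contains the b-parameters and level assignments (the column names “b” and “Level” are assumptions matching the bookmark data described later in this chapter). A minimal sketch:

```python
import pandas as pd

def level_summary(items: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of b-parameters for each assigned level,
    plus a rough check of whether adjacent levels are statistically separated."""
    summary = items.groupby("Level")["b"].agg(["mean", "std"])
    # The spread of b within a level should be smaller than the distance
    # between the mean b-values of adjacent levels.
    summary["gap_to_next"] = summary["mean"].shift(-1) - summary["mean"]
    summary["well_separated"] = summary["gap_to_next"] > summary["std"]
    return summary
```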
IATA uses pre-assignment of items to levels to develop the thresholds
that separate the groups of items on the IRT proficiency scale.
Items are typically
assigned to a level (or some type of cognitive
hierarchy, such as Bloom’s taxonomy)
during the item and test development process. However, the process of defining levels should be flexible
and iterative. IATA allows the items to be assigned and reassigned to different proficiency levels during the analysis. Experience
has shown that the cognitive processes test developers and curriculum experts assume students
use to answer a question
are not necessarily the ones students actually
use. Experts should use the item statistics produced by IATA to
verify their initial
item classifications or to reclassify
items. A common method for doing this is known as the Bookmark procedure.
With the Bookmark
procedure, items are sorted by their RP value and are usually arranged in a booklet, with one item per page. The stakeholder panel responsible for setting standards
can review items in order of RP value and identify boundaries between cognitively distinct
groups of items where items represent a higher performance standard. The proficiency scores associated with these boundaries can be used to classify both test items and student test results, and the process can be replicated with different
RP values for validation. The item classifications may also be updated in IATA in
the “Level” column and used to statistically
estimate thresholds for classifying students.
Consider a scenario where the stakeholder panel has decided to use an RP of 50% to validate the initial classification of items. To provide the evidence required
to perform this validation and reclassification, complete
the following steps:
1. Set the RP to 50% by clicking and dragging the slider as shown in Figure 10.7.
2. Click
the “Save Bookmark Data” button. IATA will produce a confirmation dialogue to
notify you that the data have been saved.
3. Click
the “Next>>” button to navigate to the results viewing screen.
4. Select
the “BookmarkData” table from the drop-down menu.
Figure 10.7 Performance
standards interface with RP=50%, CYCLE1
data
The results of the Bookmark
data creation are shown in Figure 10.8. The data include the item name (Name), IRT parameters (a, b, and c), the existing level classification (Level), the source file of the item statistics (Source), and the RP values (RP50) for each item. In this case, there is only a single RP value column, but a bookmark
data table may include several RP value columns. The selected table of results should be exported and provided to the stakeholder panel responsible for setting standards. When sorted by the “RP50” column,
the data can inform the Bookmark method of classifying items into proficiency levels and, alternatively, defining the cut-points for those levels. Using the Bookmark procedure, stakeholders review each of the items in order of their RP value. When the reviewers
encounter an item that they
feel represents a higher standard of performance, they add a “bookmark” at that location. The RP values immediately prior to the bookmark locations
represent the proposed thresholds for the proficiency levels.
A combination of group discussion and statistical averaging
is typically used to combine the different
thresholds produced by the different
reviewers to produce final thresholds. In order to develop qualitative
descriptions of each proficiency level, the items are classified
by the final thresholds, and the levels are described in terms of the competencies required by their component items.
Figure 10.8 Bookmark data for CYCLE1 data, RP=50%
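The arithmetic of turning reviewers' bookmark placements into proposed thresholds can be sketched as follows. This is a simplified illustration of the averaging step described above, not a prescribed procedure: each bookmark position is converted to the RP value of the item immediately before it, and the resulting thresholds are averaged across reviewers.

```python
import numpy as np
import pandas as pd

def bookmark_thresholds(bookmark_data: pd.DataFrame, reviewer_marks: list) -> np.ndarray:
    """bookmark_data:  exported IATA table containing an "RP50" column.
    reviewer_marks: one list of bookmark positions per reviewer; position k means
    the reviewer judged item k (1-based, in RP50 order) to be the first item of a
    higher level, so the proposed threshold is the RP value of item k - 1."""
    rp_sorted = bookmark_data.sort_values("RP50")["RP50"].to_numpy()
    per_reviewer = np.array([[rp_sorted[k - 2] for k in marks]
                             for marks in reviewer_marks])
    return per_reviewer.mean(axis=0)   # simple average across reviewers

# Example with three reviewers placing two bookmarks each (hypothetical positions):
# bookmark_thresholds(bookmark_table, [[12, 31], [14, 30], [13, 33]])
```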
In practice, a wide variety of information, including the item specifications,
curriculum references, and normative
definitions of what students know and can do at each proficiency level, should be provided simultaneously to the panel of stakeholders
responsible for standard
setting. The stakeholders must reconcile the different sources of information and determine the most useful cut-points and assignments of test items to levels. At
their discretion, the reviewers may also decide to use the item classifications defined ahead of time by the item developers instead of reclassifying
items based on the results of the Bookmark procedure. In either case, the thresholds
calculated by IATA represent the statistically optimal
thresholds for the specified item
classifications.
The optimal thresholds
recommended by IATA should be interpreted as suggestions and should be further adjusted manually
for communication purposes.
You can manually change the threshold level by editing
the thresholds directly in the table of results.
After you change the value(s),
the graph is automatically updated.
The most common adjustments performed
include making the thresholds equally-spaced or assigning thresholds
that will, after applying scaling constants, occur at whole increments (e.g., 5 or 10).
Professional judgment should be exercised
when reconciling the evidence from the statistical and content analysis
with the need to communicate results to lay audiences. Simplicity should be balanced
with accurately communicating meaningful differences in student performance.
For the current example, assume that the stakeholder panel, after using the data illustrated in Figure 10.8 to facilitate
the item-by-item review in a Bookmark procedure, proposes the following
set of cut-points: -0.85, -0.25, 0.35, and 0.95 to define the different levels. Students with scores falling below -0.85 would be classified as falling below Level 1. These thresholds are only rough approximations of the statistically optimal values shown in Figure 10.7, but most stakeholders tend to favour round numbers and even increments because they feel intuitive, even if they are not statistically optimal.
Click
the “<<Back” button to return to the performance standards interface, where you can record these cut-points
in the results data file and assign students to the appropriate levels.
Perform the following steps:
1. Enter the recommended values produced by the committee
of stakeholders into the appropriate rows in the column labelled “Threshold”. Press Enter after the final entry to ensure IATA updates the interface
correctly.
2. Click the “Add Levels” button. IATA will
assign students to their appropriate level based on their IRT scores.
Figure
10.9 illustrates the assignment of the thresholds
for the performance levels. The levels are equally spaced, and each level contains a reasonable proportion of students. Although there is no mathematical reason for the equal spacing of the thresholds, common practice in most national and international assessments is to use equally spaced
thresholds because they appear more intuitive to lay audiences,
who are the primary audience
for proficiency-level
summaries. In addition, the amount of information at each threshold
is at least two-thirds of the maximum test information, which indicates that the test is sufficiently accurate
at each threshold to make interpretive decisions.
Figure 10.9 Performance
standards interface with manually-set thresholds for CYCLE1
data
In the “Scored”
data table, which can be viewed on the final screen of the analysis workflow, the record for each student will also contain a variable named “Level.”
This variable contains the level of performance standard to which each student is assigned
based on the thresholds shown in Figure 10.9.
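The classification applied when you click “Add Levels” amounts to a simple cut-point rule on the IRT scores. A sketch using the cut-points proposed above:

```python
import numpy as np

# Cut-points proposed by the stakeholder panel, on the IRT score scale.
thresholds = [-0.85, -0.25, 0.35, 0.95]

def assign_levels(irt_scores):
    """Scores below -0.85 fall below Level 1 (coded 0); scores from -0.85 up to
    -0.25 are Level 1; and so on up to Level 4 for scores of 0.95 or above."""
    return np.digitize(irt_scores, thresholds)

# Example: assign_levels([-1.2, -0.5, 0.1, 1.3]) returns array([0, 1, 2, 4])
```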
When you have finished setting
the performance standard thresholds and applying them to the student scores, click the “Next>>” button to continue to the interface for viewing and saving results.
10.7. Step 7: SAVING RESULTS
On the results viewing and saving interface,
you can view the results produced by the current
example walkthrough. All tables should be saved, both for project documentation and to facilitate test linking with subsequent cycles of data. For reference, the item data results of this analysis walkthrough are included in the ItemDataAllTests.xls file, in the worksheet named “ReferenceC1.”
10.8. SUMMARY
In this chapter,
you reviewed the main data analysis functions
in the first IATA workflow. In addition to the analyses that are also performed on pilot test data, the analysis of full test data made use of the scaling interface and the development of performance standards.
In the walkthrough in the following
chapter, you will build on the techniques
used in these examples. Two new methods will be introduced for analysing data and specifying analyses: balanced rotated
booklets and partial-credit
test items.
[1] For analysis workflows that make use of linking, the “Items2” and “Merged”
tables are also available.