9. CHAPTER 9 ANALYZING DATA FROM A PILOT TEST ADMINISTRATION
9.1. Overview
Use the PILOT1 sample data set to carry out this exercise.
The answer key for this test is in the EXCEL workbook, ItemDataAllTests in the sheet named PILOT1.
Let us consider
the following scenario. A national assessment team and its
curriculum experts have created a set of new multiple-choice items in order to evaluate
grade 10 students’
mathematics skills. These new test items were considered adequate for representing the national curriculum. The items had been created to reflect
the main content categories determined by the national
steering committee (number knowledge, shape and space, relations,
problem solving, and uncertainty). The final version of the test is meant to be administered to 10th grade students of all proficiency levels and is intended to contain 50 items.
As a first step, the national assessment team administered an 80-item test to a total of 262 students, sampled from 7 schools in each of 3 regions, with test booklets
in 2 languages. This is a larger number of items than will be included on the final test, but there are typically many items that are developed
for a test that will not function
well for a variety of reasons (e.g., too easy or too difficult, confusing
instructions). A test development process may produce
two or three times as many items as will be used in the final test. Most of these items will be rejected by review panels prior to the pretest stage. However, a national
assessment team should still pretest at least 50% more items
than are required
for the final test. This
pilot test is intended to test the operational protocols for the survey as well as to determine the composition of items in the final test, which will be administered in the national
assessment to a different sample of students. The student response data file contains each student’s multiple-choice answers to each of the 80 items as well as some school-level
variables (region identification, school identification, school type and school size) and some student- level information (sex and language).
From the main menu, click the first menu option,
“Response data analysis”, to enter the analysis
workflow, as shown in Figure 9.1. If, at any stage in the workflow, you receive an error or receive results that are different than expected, return to a previous step or begin the analysis again from the main menu.
Figure 9.1 Select the “Response data analysis” workflow
9.2. Step 1: LOADING RESPONSE DATA
Regardless of the analysis path chosen, you must direct IATA to
load previously collected or produced
data (for example,
national assessment pilot test data, or an item data file). IATA is flexible and has simple procedures and buttons for loading response data, item data, or both. Regardless of the
analysis path or type of data, you must
tell IATA which data file to
import and which data in the file to use. IATA can import data
in SPSS (*.sav), EXCEL
(*.xls/*.xlsx), tab-delimited (*.txt), and comma-
separated (*.csv) formats. Because EXCEL data files can contain several separate tables, you must specify which
table is to be imported for the analysis.
The first screen in this analysis path requires you to import a response
data file into IATA. The data-loading
interface is shown in Figure 9.2. The instructions begin with the words “EXAMINEE
RESPONSE DATA…” to indicate that you are loading data containing responses
to items and explain the general contents expected
to be in the data file. Below the instructions are two boxes: a file path summary,
and a drop-down menu
for selecting data tables in the selected
file. To the right of these boxes is the button
labelled “Open File”. The table at the bottom of the interface
displays the data for a selected data source. If there are more than 500 rows of data, only the first 500 will be displayed. If you have selected a data format that supports multiple tables, such as Excel or Access, then the name of the first table in the data file will appear in the drop-down box. Otherwise, the name of the file will appear in the drop-down box. For multi-table data files, the desired data may not be in the first table. You should verify that the appropriate data are selected
by reviewing the contents of the data table, which will appear in the large area at the bottom of the interface. If the active table does
not contain the desired data, you can select a different table by clicking
the drop-down menu.
Figure 9.2 Response
data loading interface
1. Click Open File to select a data file. In the file browser,
navigate to the folder on
your desktop that contains the IATA sample data.
2. Choose the Excel (*.xls) file format.
If you see (*.xlsx) in the box to the right of the file name field, use the dropdown
arrow and click on (*.xls).
3. Select (or type) PILOT1.xls.
4. Click Open or press the Enter key.
When the file opens, a pop-up dialog will remind you to confirm that the data you have selected
contain the correct item-response data. Click OK to continue. Confirm that the sample pilot data are correctly loaded; your interface
should look like Figure 9.2.
The data shown in Figure 9.2 are the records for each student who took the pilot test. The first seven variables from the left describe demographic and sampling information about the students:
· PILOT1STDID – unique student identification code;
· SCHOOLID – unique school identification code;
· Sex – the sex of the student (1=male,
2=female);
· SchoolSize – the total number of students in the school;
· Rural – the location
of the school (0=urban, 1=rural);
· Region – a numeric
identifier for the geographic region;
· Language – a numeric
identifier for the language of the test administration.
The first mathematics test item appears in column 8 and is labeled MATHC1019. Scroll across to see that the file contains data on 80 items; the item in the last column is labeled MATHC1041.
The item names are arbitrary and do not reflect their position on the test. Most cells have values A, B, C or D indicating
students’ choice of options. Cells
which have 9 indicate that a student did not respond to the item.
As with most pilot samples,
the students represent
a sample of convenience, rather than a scientific representation of the population.
Sample weights are only valid when they are produced
as a product of a
scientific sample design.
Accordingly, there are no sample
weights in the PILOT1 response data file.
After verifying that you have loaded the correct response
data file, click the “Next>>”
button.
9.3. Step 2: LOADING THE ANSWER KEY
You must also load the item answer keys so that IATA can perform
the analysis correctly. As with the response data, the item data are in Excel format in the IATA data folder on your desktop.
1. Click Open
File to select a data file. In the file browser, navigate to the folder on
your desktop that contains the IATA sample data.
2. Choose the Excel (*.xls) file
format.
3. Select (or type)
ItemDataAllTests.xls.
4. Click Open or press the Enter
key.
When the file opens, a pop-up dialog will remind you that IATA will estimate any missing item parameters. Click OK to continue. The selected data file contains tables for all the different
examples in this book. Ensure that you have correctly selected the table named “PILOT1” in the dropdown
menu. Confirm that the correct
item data are correctly loaded; your interface
should look like Figure 9.3. If you wish to find information
on a
specific item easily, you can sort the items by clicking on the header for the Name column.
Figure 9.3 Item data for the PILOT1 response
data
When you have confirmed that the correct item data have been loaded, click the “Next>>” button to continue.
9.4. Step 3: ANALYSIS SPECIFICATIONS
Every
workflow that uses response data requires you to provide certain specifications
that will affect the results of all subsequent analyses. These specifications include
answer key and item metadata, respondent identification variable,
sample design weighting, and treatment
of missing data codes. The interface for providing these specifications is shown in Figure 9.4. The large panel on the left contains a table of the test items in the response data file with the columns headers “Name”, “Key”, “Level” and “Content”. If an item data file has been loaded, the table will only contain variables that have been identified as test items; otherwise, the table will contain all variables. If you had skipped the loading of an item data file, you would need to manually enter the answer key specifications for each item in this table (see section 8.3.2.119).
In the center section of the interface,
there is a button labelled
“Update response value list”. You will need to click this button if you change the answer key specifications, either by manually entering
answer keys or deleting existing
answer keys. When you click this button, IATA will populate the two drop-down menus with lists of variables in the response data that have not been assigned an answer key and
list all of the response values
present for the variables identified as test items. If you have loaded an item data file, these menus will already
be populated with values.
Below the “Update response value list” button, there are several controls for providing
optional specifications: a drop-down menu for specifying the identification (ID) variable,
a drop-down menu for selecting
the weight variable,
and a table for specifying treatment of missing
value codes. Specifying an ID variable
may be necessary to merge the test results produced
by IATA with other data sources. The ID variable
should uniquely identify each student;
if you do not specify an ID variable, IATA will produce a variable named “UniqueIdentifier” to serve this purpose. The weight variable
is used to ensure that the statistics
produced during the analysis are appropriate for the sample design of the national
assessment. If no weight variable
is provided, IATA will assume that all students in the data receive the same weight, equal to 1.
Figure 9.4 Analysis
specifications for the PILOT1 data
You can inform IATA that a response value is a missing response
code by clicking one of the checkboxes
next to the value in the “Specify missing treatment” table. By default,
IATA assumes that all response values represent actual student responses.
If the box in the “Incorrect” column is checked, then IATA will treat that value as an invalid response
that will be scored as incorrect. If the box in the “Do Not Score” column is checked, then IATA will treat that value as omitted,
and the value will not affect a student’s test results. By
default, if there are any completely empty or blank cells in the response data, IATA will treat them as incorrect, unless you have manually specified
“Do Not Score” treatment.
For this walkthrough, the answer key and response
data have both been entered, so the list of items shown in Figure 9.4 contains only those variables
with answer keys in the item data. It is a good idea to review the answer key table to confirm that the keys and other data about each item are correct and complete, because any errors at this stage will
produce even more errors in subsequent tasks in the workflow. In the middle of the screen, you will need to specify the additional analysis details. Use the following specifications:
1. Use the first drop-down menu to select the PILOT1STDID variable as the ID variable.
2. These data do not have a sample
weight, so you may leave the second drop-down menu blank.
3. The value of 9 will be treated
as incorrect, so check the appropriate box in the table of values in the “Specify missing
treatment” section. Although there are no blank entries in the PILOT1 data, you
can leave the default specification of treating blank entries as incorrect.
When the specifications have been entered,
the interface should look the same as Figure 9.4.
Confirm that your specifications are
correct and click the “Next>>” button to continue.
The data will begin processing automatically. The processing stages are:
Setting up data, Scoring, Estimating
parameters, IRT scaling, Calculating True Scores, and Factor analysis. As the processing continues, the interface
will display the current stage of
processing. Depending on the speed of your computer and the size of your data, this analysis may take seconds to
minutes to complete processing. When IATA finishes
processing, it will display the results in the item analysis interface.
9.5. Step 4: ITEM ANALYSIS
When the data processing has finished, the item analysis
interface will be updated with the results, shown in Figure 9.5. Using the item analysis
interface, you can access these results as well as view and save diagnostic information about each test item.
There
are four types of results displayed in this interface:
1. Statistics
and statistical parameters describing each item (on the left);
2. A graphical illustration of the relationship between student proficiency and the probability of correctly responding to an item, also known as an Item Response
Function or IRF (at the top right);
3. A contingency table describing the proportions of students with high, medium, and low test scores who endorsed
each of the different item responses, also known as a distractor analysis (at the middle right); and
4. A plain-language
summary of the item analysis
results (at the bottom right)
Figure 9.5 Item analysis results for the PILOT1 data, item MATHC1019
The table on the left side of the item analysis
interface presents statistical information as well as a symbol describing
the overall suitability of each item (see page 23). The Name of each item is in the column to the right of the summary symbols. You can examine the detailed results for an individual item by using the arrow keys or mouse to highlight
the row in which the item appears.
You can use the checkboxes
in the “Use” column for each row to include or exclude
items from the analysis. Uncheck one of these item boxes to remove the item from the analysis.
You may then click the “Analyze” button to rerun the analysis with the reduced set of items. Return all items to their original
state by clicking
the “Reset Items” button. Note that clicking
“Reset Items” will
reset all items, so if you wish to permanently remove an item from the analysis, you should delete its answer key in the analysis
specifications interface. The “Scale” button does not re-estimate any item parameters; it simply calculates IRT scale scores for the response data using the item parameters that have already been estimated or loaded into IATA from an external
data file.
9.5.1. Item Statistics
The three columns to the right of the item name contain classical
item statistics: the item discrimination index (“Discr”), the point-biserial correlation (“PBis”), and the item facility (“PVal”), also sometimes
referred to as item difficulty, although larger values indicate an easier test item. The final three columns, which may be hidden from view and require you to scroll across the table, are estimates of item response
theory (IRT) parameters: the slope parameter (“a”), the location
parameter (“b”) and the pseudo-guessing parameter (“c”). In-depth discussions of these statistics
and parameters and how they relate to each other are presented
in Chapter 15 (page 149).
In general, the classical statistics
may be interpreted directly. The item facility (PVal) ranges between 0 and 1 and describes how easy an item is for the given sample: a value of 0 indicates
that no students responded correctly, and a value of 1 indicates all students responded
correctly. The discrimination index and point-biserial correlation provide alternate
measures of the same relationship: how strongly responses to each item are related to the overall test score. For both statistics, the value should be greater than 0.2. These guidelines should not be considered absolute,
because these indices are also affected by factors other than the discrimination of the items,
including the accuracy
of the overall test. For example, the item facility
tends to limit the absolute value of both the discrimination index and the point-biserial correlation. If the item facility differs substantially from 0.5 (e.g., less than 0.2 or greater than 0.8), the discrimination index and point-biserial correlation will underestimate the relationship between proficiency and performance of students on a test item.
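The following sketch (not IATA's internal code) shows how these three classical statistics could be computed from a matrix of scored (0/1) responses. The use of upper and lower 27 percent groups for the discrimination index is a common convention and an assumption here; IATA's exact definitions may differ.

```python
import numpy as np

def classical_item_stats(scores, top_frac=0.27):
    """Facility (PVal), discrimination index (Discr), and point-biserial
    correlation (PBis) for each column of a 0/1 scored response matrix."""
    n_students, n_items = scores.shape
    total = scores.sum(axis=1)                     # each student's raw score
    order = np.argsort(total)                      # students from low to high
    n_group = max(1, int(round(top_frac * n_students)))
    low, high = order[:n_group], order[-n_group:]

    pval = scores.mean(axis=0)                     # item facility
    discr = scores[high].mean(axis=0) - scores[low].mean(axis=0)

    pbis = np.empty(n_items)
    for j in range(n_items):
        # some implementations correlate with the total score excluding item j
        pbis[j] = np.corrcoef(scores[:, j], total)[0, 1]
    return pval, discr, pbis
```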
Although extremely easy or difficult items tend to reduce the observed relationships
with proficiency, they may also cover important curriculum
content that should be included in the test or they may (in the case of easy items for instance)
be required to sustain student motivation during testing. For these or other reasons,
it is often desirable to include a relatively small number of very easy or difficult
items.
In contrast, the IRT parameters
should not be interpreted in isolation; although
each describes a specific behaviour
of the test item, the relationship between responses to the item and overall
proficiency are the result of interactions between all three parameters as well as the proficiency level of individual students.
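As a concrete reference, the sketch below shows a common form of the three-parameter logistic (3PL) item response function that combines the a, b, and c parameters. Whether IATA includes the 1.7 scaling constant in its metric is an assumption.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response at proficiency theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Example: a shallow item (low a) barely separates low and high performers.
theta = np.linspace(-3, 3, 7)
print(irf_3pl(theta, a=0.5, b=0.0, c=0.2))
```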
Most items in the current analysis have a green circle, indicating
that they have no major problems and are relatively
satisfactory. By scrolling
down the item list on the left, you will see 13 items with diamond-shaped
caution symbols (MATHC1047,
MATHC1013, MATHC1002, MATHC1070, MATHC1034, MATHC1035,
MATHC1032, MATHC1010, MATHC1068, MATHC1046, MATHC1024, MATHC1058, and MATHC1030). One item (MATHC1075) has a triangular warning symbol and is considered
a potentially problematic item. The best practice is to examine
the results for all items, regardless of the summary symbol IATA assigns, but for this walkthrough, we will focus on a few examples.
By default, the results for the first item are displayed in the graph and table on the right. IATA has assigned this item, MATHC1019, a green circle[1]. Each of the results IATA produces
for this item is explained
in the following sections.
9.5.2. Item Response Function (IRF)
In the graphics window on the right-hand side of the item analysis interface,
IATA will display the Item Response
Function (IRF) for a selected
test item. Reviewing
the IRF is typically more intuitive than examining the IRT parameters
or item statistics to determine the relative usefulness
of different test items. A useful item will have a strong relationship with proficiency, indicated
by an IRF that has a strong S-shape, with
a narrow region in which the curve is almost vertical. The slope of the IRF for MATHC1019 is consistently positive,
but the relationship is weak, without any region with
a notably steeper slope. This shallow slope corroborates the low discrimination
index (Discr=0.36) and low point-biserial
correlation (PBis=0.35).
As with any statistical modeling method, IRT is only useful if the data fit the
theoretical model. For each item or score value, IATA produces a graphic of the theoretical IRF produced using the estimated
parameters as well as
the empirical IRF estimated directly from the proportions of correct responses
at each proficiency level. The
graphic can be used to assess the suitability of using IRT to describe each item. If the IRT model is appropriate, the red dashed line will appear to be very similar to the solid black line, where deviations
are less than 0.05, particularly in the region between -1 and 1, where there are many students.
For MATHC1019, the theoretical and empirical IRF’s are almost identical, indicating that, although the item itself may have a weak relationship with proficiency, its statistical properties
are accurately described
by the IRF.
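A minimal sketch of this model-fit check follows, assuming the empirical IRF is formed from proportions correct within bins of the proficiency score; the number of bins and the 1.7 constant are assumptions rather than IATA's actual settings.

```python
import numpy as np

def irf_fit_check(theta, item, a, b, c, n_bins=10, tol=0.05):
    """Flag proficiency bins in the range (-1, 1) where the empirical and
    theoretical IRFs for one item differ by more than tol."""
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    bin_index = np.clip(np.digitize(theta, edges) - 1, 0, n_bins - 1)
    flags = []
    for k, mid in enumerate(centers):
        in_bin = bin_index == k
        if not in_bin.any() or not (-1 <= mid <= 1):
            continue
        empirical = item[in_bin].mean()            # observed proportion correct
        theoretical = c + (1 - c) / (1 + np.exp(-1.7 * a * (mid - b)))
        if abs(empirical - theoretical) > tol:
            flags.append((round(float(mid), 2),
                          round(float(empirical - theoretical), 3)))
    return flags   # an empty list suggests the IRT model describes the item well
```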
9.5.3. Distractor Analysis
In the bottom right of the item analysis interface, IATA produces
statistics for each response value (including missing value codes and incorrect
response values) and a textual
summary of the analysis.
The statistics
are estimated separately
for groups of low, medium and high performing
students, based on their percent-correct
test score, as well as the entire sample. This table, shown in detail in Figure 9.6, is also referred
to as a distractor analysis.
Figure 9.6 Distractor
analysis for item MATHC1019, PILOT1 data
There are many reasons why an item may have a low or even a negative discrimination relationship with proficiency. These include: poor wording, confusing
instructions, sampling errors, and miskeying
or miscoding of responses. Distractor
analysis may be used to detect and remediate some of these common errors by looking
at patterns in item responses. A well-functioning item should have the following characteristics:
1. The column for the correct option, denoted
by the asterisk (*), should have a high percentage for the high group, and successively lower percentages for the medium
and low groups. MATHC1019 satisfies this condition, with values of
47.9, 19.9 and 11.4 for the high, medium and low groups, respectively.
2. For the low skilled group, the percentage choosing the
correct option should be lower than the percentage choosing any one of the other
options. All of the incorrect options (A, B and C) for MATHC1019 exhibit this
pattern.
3. Each of the columns corresponding to incorrect response
values should have approximately equal percentages in each skill level and overall
compared to the other incorrect response values. MATHC1019 violates this
pattern, because option B is endorsed by almost twice as many incorrect
respondents as either A or C.
4. For the high-skilled group, the percentage choosing the
correct option should be higher than the percentage choosing any one of the other
options. MATHC1019 satisfies this pattern: 47.9 is greater than the values for A
(14.1), B (23.9) and C (14.1).
5. For all groups, the percentage of missing value codes (denoted
by an X) should be close to 0. A substantial proportion of students had missing
responses (code 9), but the occurrence was greater in low performers than high performers,
suggesting that the decision to treat the code as incorrect (rather than omitted)
was reasonable.
6. Missing response codes that are treated as omitted (denoted
by OMIT) should have equal percentages of students in each skill level. This code
was not used for these data.
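To make the construction of such a table concrete, here is a minimal sketch that tabulates, for one item, the percentage of students in each score group endorsing each response value, so that the characteristics listed above can be checked. The raw option codes, the three-way split of the sample, and the array names are illustrative assumptions, not IATA's actual procedure.

```python
import numpy as np

def distractor_table(responses, total, cuts=(1/3, 2/3)):
    """responses: array of raw option codes (e.g., 'A'-'D', '9') for one item;
    total: each student's percent-correct test score."""
    lo_cut, hi_cut = np.quantile(total, cuts)
    groups = {"Low": total <= lo_cut,
              "Medium": (total > lo_cut) & (total < hi_cut),
              "High": total >= hi_cut,
              "All": np.ones(total.size, dtype=bool)}
    table = {}
    for opt in sorted(set(responses)):
        table[opt] = {name: 100 * np.mean(responses[mask] == opt)
                      for name, mask in groups.items()}
    return table   # e.g., table['B']['High'] -> percent of high group choosing B
```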
IATA provides a textual summary about the item performance, including
warnings if the discrimination is unacceptably low and, if so, suggests what may be done to improve
it. For example, IATA will identify
distractors that are not effective
in eliciting endorsements from respondents (or have statistical profiles similar to correct
responses)[2]. If IATA does detect any common problems in the data, a
verbal summary of the results is
displayed in the text box beneath the distractor analysis table.
Examining the results for MATHC1019,
the textual summary
on the bottom right recommends examining the response option coded as “A”. Looking at the distractor
analysis table, we can see that response “A” is endorsed by approximately the same proportion of high-performing students as low-performing students, indicating that it does not function
well as a distractor.
The distractor analysis
of national assessment data may also be useful to providers
of in-service education courses for teachers
and also to curriculum personnel. The results may help identify common misconceptions and errors made by students.
Curriculum authorities can also use the data to judge the appropriateness of specific material
for a particular grade level.
9.5.4. Comparing Different Items
Turning to the second item on the test, MATHC1027, which is shown in Figure 9.7, we find that, compared
to the previous item, it has a stronger relationship with proficiency, indicated by the steeper IRF and the larger discrimination (0.65) and point-biserial
correlation (0.53). The theoretical and empirical IRFs are almost identical, indicating that the statistical item response model is appropriate to the response data. The distractor analysis table shows that 73.2 percent of students in the “High”
group selected the correct option (C) compared
to 19.9 percent in the medium and
8.6 percent in the low group. All of the incorrect
response values (A, B and D), as well as the missing response
code (9), were more likely to be selected
by low-performing students than high-performing students.
Figure 9.7 Item analysis results for PILOT1 data, item MATHC1027
In contrast to the two items we have examined, items with triangular warning symbols are
typically poor items whose inclusion
on the test may produce
misleading or less useful results. The number of poor items that appear in a pilot test such as this one can be minimized by following item-creation
guidelines described in Volume 2 in this series (Anderson
and Morgan 2008). The only item with a warning
symbol in these data is MATHC1075, shown in Figure 9.8. By clicking
on the item, you will see that the results indicate an almost nonexistent relationship between either the correct or incorrect responses and proficiency. Although a missing
response code is still related to
proficiency, the expected
pattern was not evident. Students
in the lowest group were
not most likely to select each of
the three incorrect options, nor were students in the high group
least likely to do so (this item was particularly weak at
discriminating between medium- and low-level students). The discrimination index is low (0.14), as is the point-biserial correlation (0.16). This item may be related to proficiency, but because so few students
answered correctly (PVal=0.12), it is not possible to estimate the relationship. As responses to this item are not clearly dependent
on proficiency, including this item in the test would tend to increase the influence of random factors in
the test scores. Including
this item (and other problematic items) in the analysis may also reduce the accuracy of statistical estimates
for other test items, because the item statistics and parameters are analyzed using the test scores.
Figure 9.8 Item analysis results for PILOT1 data, item MATHC1075
Items can be removed from the analysis by clicking the check box to the left of each item
name. After removing
an item, the results should be recalculated by clicking on the “Analyze” button before removing any other items. The removal of a single item will affect the results of all other items. If
there are many problematic items, you should
remove only one at a time, because some items flagged as problematic may only appear so because
of the influence of worse items on the analysis
results. If you accidentally remove too many items, you may individually recheck
each item or click the “Reset
Items” button above the item list to reset the entire item list. For this example, we will remove MATHC1075 and rerun the analysis, producing
the results in Figure 9.9, in which the results for MATHC1075 are highlighted after removal. Note that the Discr and PBis data for this item have been replaced by NaN (meaning “not a number”) or out-of-range values; they will not affect subsequent
calculations. For removed items, the distractor
analysis table on the right does not appear, and there is a message in the textual summary to re-analyse the test data.
Because
we only removed a single item, the statistics for the remaining
items are relatively unchanged.
Figure 9.9 Item analysis results for PILOT1 data, item MATHC1075
You may continue to review all the items by clicking
on each row in the item list or by navigating with the up and down arrow keys. Note that the verbal summaries provided by IATA are based solely on statistical evidence
and are not informed by the content of items. An item that is given a poor rating by IATA may not be a poor item universally; a poor rating indicates
that the item may not provide useful information when the current test is used with the current population.
In general, the recommendations IATA provides for editing or removing items should be
considered in the context of the purpose of the test and the initial reasons for including the specific item. For example,
some items should be retained
regardless of their statistical properties
due to (a) their positive
effect on student motivation (such as easy initial items) or (b) the need to adequately
represent key aspects of the curriculum. However,
all items with negative discrimination indices should be removed or re-keyed (if the key has been entered incorrectly) before proceeding with other analyses.
Such items introduce
noise or unwanted variation into the item response data and reduce the accuracy
of estimates for other items. Removing
some apparently weak items during analysis of pilot data will help increase the accuracy of the statistical results. However,
the selection of the final set of items following the pilot or trial testing should be carried out jointly by subject matter specialists working closely with the person or team responsible for the overall quality of the national assessment test.
When you have finished reviewing
all the items, click the “Next>>” button
to continue.
9.6. Step 5: TEST DIMENSIONALITY
One of the statistical assumptions of IRT, as well as a requirement for the valid interpretation of test results,
is that performance on the test items represents a single interpretable construct or dimension. Ideally a national achievement test of a construct such as mathematics or science should measure the single construct
or dimension that it is designed to measure and should not measure other constructs or dimensions such as reading ability. The purpose of the test dimensionality interface is to detect any violations of the assumptions that: 1) there is only a single dominant dimension influencing test performance, and 2) the relationships between
performance on pairs or groups of items can be explained by this dominant
dimension. In most cases, the second assumption proceeds from the first, but for long tests (e.g., with more than 50 items), small groups of items may be locally
dependent without having a noticeable effect on the overall test dimensionality.
The analysis of test dimensionality determines the degree to which the test measures
different dimensions of proficiency and the extent to which each item relates to each dimension. The fewer number of dimensions that strongly influence
the test items, the more valid any interpretations of the test scores are. Although
this evidence is insufficient to confirm a test’s validity,
it can provide important information on the content
of specific items. Other aspects of validity,
such as content validity (which is very important in the context of a national assessment) are typically considered
more important than statistical data when determining the validity of a test or an item (see Anderson and Morgan, 2008 for a description of procedures designed
to ensure that a test has adequate
content validity).
From a statistical
perspective, the estimation of IRT parameters and scores depends on the concept of likelihood, which
assumes that the probability of an event (e.g., a correct response) is conditional on a single dimension
representing proficiency. If different
items are conditional on different dimensions, then the estimated parameters and scores will be incorrect.
When this interface
appears, the graph on the right illustrates both the scree plot for the overall
test and the squared factor loadings for the first item, MATHC1019,
shown in Figure 9.10. On the left hand side of the interface is a table similar to that in the item analysis interface. Summary symbols (explained on page 23) in the column labelled “F” next to the item “Name” column describe the overall suitability of an item in terms of its relationship to the primary dimension common to most other items on the test. To the right of the “Name” column,
the classical item facility (“PVal”)
is displayed, along with the loading of the item on the primary dimension
(“Loading”). The loading ranges from -1 to 1 and is the correlation between performance on each item
and the primary test dimension. For example, the value of 0.34 for MATHC1019 indicates that the scored responses to this item have a correlation of 0.34 with the overall
test score (percent-correct). There is no ‘ideal’ value[3], but better items are indicated by loadings closer to 1.
Figure 9.10 Test and item dimensionality for PILOT1 data, item MATHC1019
The results in the table should be interpreted together with the graphical results displayed on the right hand side of the interface. The main result displayed in the graphics window is the scree plot,
which describes the proportion of variance
explained by each potential
dimension (eigenvalue). The dashed red line, connecting
circle-shaped markers arranged
from left to right, illustrates the relative influence
of each potential dimension (eigenvalue[4]) on the overall test results,
and the solid blue line, connecting box-shaped
markers, describes the relative influence
of each potential dimension on the individual
test items (squared
loading). The magnitude
of the eigenvalues is less important than the pattern of the scree plot. The scree plot for the overall test should have a single point on the upper left of the chart (at approximately
0.30 in Figure 9.10) that connects to a near-horizontal straight
line at the bottom of the chart continuing
to the right side of the graph. This “L”-shaped pattern
with only two distinct line segments, shown in Figure 9.10, suggests
that a single common dimension is responsible for the PILOT1 test results. The greater the number of distinct line segments it takes to connect the top-left point to the near-horizontal line at the bottom, the more dimensions
are likely to be underlying test performance.
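A rough sketch of the computation behind such a scree plot follows, assuming an eigendecomposition of the Pearson correlation matrix of the scored responses; IATA's actual factor-analysis method (for example, whether it uses tetrachoric correlations) may differ.

```python
import numpy as np

def scree_values(scores):
    """Proportion of variance explained by each potential dimension for a
    0/1 scored response matrix (assumes every item has some variance)."""
    corr = np.corrcoef(scores, rowvar=False)      # item-by-item correlations
    eigvals = np.linalg.eigvalsh(corr)[::-1]      # eigenvalues, largest first
    return eigvals / eigvals.sum()

# A single dominant first value followed by a flat tail matches the
# "L"-shaped pattern described above, suggesting one dominant dimension.
```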
Selecting each item in the list on the left will display the item-specific scree plot on the right. Ideally, the scree plot for each individual item should be similar to the overall
test: the highest value in the item-specific
line should be on the far left (corresponding to the main dimension of the test). However, item-specific characteristics may introduce different patterns, and these item-specific patterns
are not necessarily problematic. For example, item MATHC1019 in Figure 9.10 is not problematic; although
there are some non-zero loadings on other dimensions, the strongest loading is on the primary dimension.
In general, the item-specific
results only need to be consulted if there is clearly more than one dimension underlying test performance (i.e., there are more than two distinct line segments making up the red line). In that case, you should identify and examine items whose item-specific
plots have squared
loading values corresponding to the same dimensions as the problematic
eigenvalues.
One caveat in the interpretation of scree plots is the effect of item facility.
In tests where most items have similar
item facilities, items with facilities much higher or lower than the other items tend to produce artificial “difficulty factors,” particularly
with non-normal distributions of percent-correct
test scores. The items with extreme facilities may appear to define a separate factor simply because
certain students (e.g., high or low performers) will generate patterns
of response that appear unusually strongly-related compared to the relationships between other tests items. However, these ‘difficulty factors’ are not inherently problematic. Reviewing the item loadings
may help determine
if secondary factors
are artefacts or actual problems. To determine
if a secondary factor is a difficulty factor, examine the item loadings
of the items with low (<0.2) or high (>0.8)
item facilities (PVal). If the item loadings
of these items
have a peak that corresponds to the position
of the secondary factor, it is most likely a difficulty factor and can be ignored.
Item Loadings
The IRT model
assumes “local independence” between items, meaning that responses
to an item should not depend on the responses
to another item. Ideally, under IRT, a test should have questions
that are independent in all dimensions except for the primary
test dimension. Significant local item dependency can result in inaccurate estimation of item parameters, test statistics and student proficiency. For example, a math test that includes a complex problem solving question might assign a set of different scores for each of the logical steps required to compute the final answer.
If the test-taker answered step 1 incorrectly, it influences the probability of correct response on each subsequent step. This set of dependent
test items would be inappropriate for IRT modeling; in this case, the set of steps should properly be treated as a single partial-credit item.
Local
dependence is typically
problematic only in items that are weakly related to the primary
dimension, so the most effective
way to use this interface
is to sort the items by the “Loading” column by clicking
on the column header once[5]
(see Figure 9.11), and comparing
the poorly loading items to identify common peaks in their item loading graphs. If many poorly-loading
items have peaks in their loading plots that correspond to the same dimension, they may have some local dependency. These statistics tend to be sensitive to sampling error, so any results from this statistical
review should be used to motivate more detailed item content review rather than make definitive decisions.
After sorting the items, the selected
item is MATHC1075; because
this item was removed from the analysis
in the previous item analysis
step, the loading for this item is NaN, and no results are shown for the item (the graph only displays the scree plot for the entire test). IATA assigns a triangular warning symbol to any item whose dimensionality may be problematic in terms of affecting the estimation of other statistics. Note that IATA has flagged only one other item with the triangular warning symbol. Figure 9.11 displays
the results for this item, MATHC1035.
Item MATHC1035 is relatively weakly related to the primary dimension and has a noticeably stronger relationship to the second dimension, which suggests it may be measuring a dimension that is distinct from that of most other items. However, these results by themselves are not conclusive evidence to warrant removal of this item from the test. Curriculum experts
and experienced teachers
should review any statistically problematic items to determine
if there is a content-related
issue that might warrant their removal or revision.
Figure 9.11 Comparison
of item dimensionality results for PILOT1 data, items MATHC1035 and MATHC1034
IATA assigns a diamond-shaped caution symbol to any item that has a stronger loading on a secondary
dimension than on the primary test dimension but whose results are likely not problematic for any subsequent calculations. A typical example is shown in Figure 9.12, for item MATHC1002. This item is related to several dimensions, but because these dimensions have so little influence on the overall test results, as indicated
by the relatively
small eigenvalues (dashed
red line) corresponding to the peaks of the strong loadings (solid blue line), determination of whether the dimensionality of the item is acceptable
or not should be a matter of test content rather than one of statistics.
Figure 9.12 Item dimensionality results for PILOT1 data,
item MATHC1002
All tests are multidimensional to some extent, because it is impossible
for all items to test the exact same thing without actually
being the exact same item. Therefore,
if the overall scree plot does not indicate any problems then it is likely that the effects of any item-level multidimensionality or codependence will be negligible.
For this
example, all items will be retained for subsequent analyses
because the overall scree plot
does not indicate
any problems.
When you have finished reviewing
the items, click the “Next>>” button to continue to
the differential item functioning analysis
interface.
9.7. Step 6: DIFFERENTIAL ITEM FUNCTIONING
The principles and rationale for analysis of Differential Item Functioning (DIF) are discussed in detail in Chapter 15 (page 192). In brief, DIF analysis examines
the extent to which the IRF of an item is stable across different
groups of students.
If the IRF is different
for two different groups, then the scores that are estimated using the IRF may be biased either universally or for students
within specific ranges of proficiency. The DIF analysis controls for differences in average group proficiency, meaning that the relative
advantages and disadvantages expressed by the DIF results are
independent of differences in the average proficiency
in the different groups.
The DIF analysis
interface is shown in Figure 9.13. On the left hand side is the set of four controls used to specify the analysis. The drop-down menu at the top allows you to select a variable from the list of variables
in the response data that are not test items.
Once you select a variable,
IATA will list the unique values of this variable in the “Possible
values” table, along with the un-weighted percentage of students
who have each value. To select the groups to compare, first click on the value that you wish to be the focus group, and then click on the value representing the reference group. The focus and reference group specification determines how the summary statistics are calculated; the estimations use the weighted
sample distribution of proficiency of the focus group to calculate
average bias and stability statistics. To change focus and reference
groups, click on different
values in the “Possible values” table; the values assigned to focus and reference groups will be updated in the text boxes at the bottom left. The statistics
are most sensitive to the focus group, so the usual practice
is to ensure that the focus group is a minority or historically disadvantaged group.
Figure 9.13 DIF analysis results for PILOT1 data by sex,
item MATHC1046
For this example,
we will perform a DIF analysis using the variable “sex”. We wish to see if female students
are disadvantaged, relative
to their male counterparts. In order to specify this analysis and review the results, perform the following steps:
1. From the drop-down menu on the left, select the “sex” variable. When you do so, the table beneath will be populated with the values “1.00” and “2.00”, with values of 50% for each value, indicating that the sample has equal numbers of males and females.
2. In the table of values, click
on the value “1.00” – this will cause the value of 1.00 (representing females) to
be entered as the Focus group in the text box beneath.
3. In the table of values, click
on the value “2.00” – this will cause the value of 2.00 (representing males) to
be entered as the Reference group in the text box beneath.
4. Click the “Calculate” button
and wait for the calculation to complete.
5. When the calculation is complete,
in the item list, click on the header of the “S-DIF” column to sort all the items
by the value of the S-DIF statistics.
When you have completed these steps, the interface will appear as illustrated in
Figure 9.13. There are 15 items in this example that IATA flags with either a warning or
caution symbol. For each item, two statistics
are calculated, S-DIF and U-DIF. S-DIF describes the average vertical
difference between the groups (focus minus reference), and U-DIF describes
the average absolute
differences between the groups. The value of the U-DIF statistic
is always positive
and at least as large in absolute value as that of S-DIF. Even if there is no systematic advantage for one group (S-DIF is close to 0), an item may have a stronger
relationship with proficiency in one group, which would
produce a larger U-DIF statistic.
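The sketch below illustrates how these two summaries could be computed from group-specific empirical IRFs, weighting each proficiency level by the number of focus-group students, as described above. The binning scheme and the expression of results in percentage points are assumptions, not IATA's documented algorithm.

```python
import numpy as np

def dif_statistics(theta, item, is_focus, n_bins=10):
    """S-DIF: weighted mean of (focus - reference) differences in proportion
    correct; U-DIF: weighted mean of their absolute values, in percent."""
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    bin_index = np.clip(np.digitize(theta, edges) - 1, 0, n_bins - 1)
    s_dif = u_dif = weight = 0.0
    for k in range(n_bins):
        focus = (bin_index == k) & is_focus
        ref = (bin_index == k) & ~is_focus
        if focus.sum() == 0 or ref.sum() == 0:
            continue
        diff = item[focus].mean() - item[ref].mean()   # focus minus reference
        w = focus.sum()                                # focus-group count
        s_dif, u_dif, weight = s_dif + w * diff, u_dif + w * abs(diff), weight + w
    return 100 * s_dif / weight, 100 * u_dif / weight
```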
An example of an item with consistent DIF, where the absolute values of S-DIF and U-DIF are identical is MATHC1035, illustrated in Figure 9.14. For this item, the female advantage
is apparent across the entire proficiency range.
The consistent difference suggests that females are more likely to perform better on this item than males,
even if they have the exact same level of proficiency. The S-DIF statistic indicates that, on average,
the probability of correct response
for females was over 23 percentage points higher than for males of comparable proficiency.
Figure 9.14 DIF analysis results for PILOT1 data by sex,
item MATHC1035
With DIF analysis, the statistics and figures tend to be very sensitive
to sampling error, which may lead to items appearing to have differences that might not be present in a larger sample.
IATA
assigns a warning
symbol when the coefficient of sampling variation[6]
for the S-DIF statistic
is less than 0.2, indicating
that the observed difference is most likely not due to sampling error, or where there are very large differences in either S-DIF or U-DIF that should be examined
even in small samples.
Because
of the sensitivity to sampling
error, occasionally the graphical results may be misleading. At the upper and lower ends of the proficiency scale, there tend to be few respondents, particularly with small samples such as the current example. Often, the responses of one or two respondents may dictate the appearance of the graphs at these extremes. As summary statistics weight the calculation by the number of focus group students at each proficiency level, they are not affected
as much by random error as the graphs. The graph of the results for MATHC1042 in Figure 9.15 provides an example of how graphical results can mislead in some instances.
Although the graph suggests a very large disadvantage for females (the lightly shaded region), the actual S-DIF statistic (-2.01) indicates a relatively weak disadvantage.
Figure 9.15 DIF analysis results for PILOT1 data by sex,
item MATHC1042
Observed evidence
of DIF might
also be found when item-specific content is not as strongly
aligned with the primary test dimension as other items. For example,
in mathematics, a common learning
objective for younger students is to recognize different measurement tools for different units (such as centimeters, kilograms,
degrees centigrade). Students
in remote or disadvantaged
areas, even if they are strong in mathematics, may not have the same exposure to these tools as students in urban areas. As a result, they may be systematically disadvantaged on test items requiring
this specific knowledge. However, this disadvantage is not a property of the test items; it is a consequence of a specific
disadvantage in proficiency. Before
reaching any conclusions about bias against specific students,
curriculum content experts who are sensitive to possible ethnic,
geographical or gender differences should
examine the test items to confirm that there is evidence of bias from a content perspective that agrees with the statistical evidence.
DIF analysis should be performed
for all demographic characteristics and groups that will be compared in major analyses
of results; presence of DIF with respect to one characteristic typically has no relation
to the presence or absence of DIF with respect
to another characteristic. Usually, the most important variables to
consider for DIF are the sampling
stratification variables (such as Region), or possibly variables from the background questionnaire. The PILOT1
data have three demographic variables: Sex, Language
and Region. As an independent exercise, you can carry out similar DIF analyses for Language and Region
by completing the same steps as for the sex DIF analysis, making sure to select the minority group as the
focus group and click Calculate to
update the results.
Figure
9.16 illustrates a common DIF result in translation situations, where errors in translation render a good test item confusing
to students in the translated
version. The results
are from a DIF analysis
for the Language variable
for item MATHC1064. This item is an extreme
example of DIF in that correct response is strongly related to proficiency in one population (in this case, language=2) and has a weak or nonexistent relationship in the other (language=1).
Figure 9.16
DIF analysis results for PILOT1 data by language, item MATHC1064
The DIF analysis in IATA can serve as a research tool to determine
if specific groups of students
have problems with specific sub-domains. DIF analysis can also facilitate
an understanding of differences that may be introduced in different language
versions of a test that have been translated. Statistical evidence of DIF can be used to help translators to correct translation errors revealed during pilot or trial testing. It can also be used to perform exploratory research
into actual performance differences that might
exist among students.
The primary purpose of DIF analysis is to prompt discussion and review
of the pilot test items and to guide the interpretation of results. For each DIF
analysis that is run, IATA saves the results to a data table[7]. These
results, and any particularly interesting graphs, should be copied[8], saved and shared with curriculum content specialists to determine possible
explanations for the pattern of differences between the focus and reference groups. If there is clear agreement that an item is biased, it should be removed
from the analysis specifications on page 2 of IATA and the previous IATA analyses should be repeated. Finally, it is worth repeating that, as the results of DIF analyses are notoriously susceptible to sampling error, any decision
about whether or not to include a particular test item in the final version of the test based on the suspicion of bias should have a strong curriculum
or content justification. We will proceed in this walkthrough without removing any of the test items.
When you have finished performing
DIF analyses and reviewing the results, click the “Next>>” button.
9.8. Step 7: SCALE REVIEW
The technique of developing a numeric metric for interpreting test performance is called scaling. IATA reports the test results using the following scale scores: PercentScore, Percentile, RawZScore, ZScore, IRTscore
and TrueScore. These scales are
explained in greater detail in Table 8.1. Performance on these default scales is either summarized
on a scale of 0 to 100 or on the standard scale, which has a mean of 0 and standard deviation of 1. You should use the scale that is most useful to the intended purpose of communicating results – different
stakeholders may prefer different types of scales. In general, the IRTscore is the most useful score across the widest range of purposes,
but it has the communication disadvantage that approximately half the students
have scores less than 0. Many stakeholders do not know how to interpret negative scale scores, so it is preferable to create a new scale so that none of the student scores have values less than 0.
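For reference, the sketch below shows how the simpler derived scores relate to the percent-correct score, using common definitions; the exact formulas IATA uses (particularly for the percentile rank) may differ, and the IRTscore and TrueScore come from the IRT model itself rather than from these transformations.

```python
import numpy as np

def simple_scales(scores):
    """scores: 0/1 matrix of shape (n_students, n_items)."""
    percent = 100 * scores.mean(axis=1)                     # PercentScore
    raw_z = (percent - percent.mean()) / percent.std()      # RawZScore
    ranks = np.argsort(np.argsort(percent))                 # ties broken arbitrarily
    percentile = 100 * (ranks + 0.5) / len(percent)         # Percentile
    return percent, raw_z, percentile
```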
The interface for reviewing the scale scores and creating
additional scale scores is shown in Figure 9.17. On the left hand side, there is a drop-
down menu and a graph window. You can select any of the scale score
types from the drop-down menu, which
will graph the distribution of the selected
scale score. Figure 9.17 presents
the graph for the selected scale score, PercentScore. On the right is a panel presenting summary statistics for the selected
score. At the bottom right is a set of controls for rescaling the IRTscore by applying
a new standard deviation and mean. The rescaling
procedure applies only to the IRTscore, which is the primary score output of IATA.
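The rescaling itself is a simple linear transformation, sketched below with an illustrative target mean of 500 and standard deviation of 100; these target values are assumptions chosen for the example, not IATA defaults.

```python
def rescale(irt_score, new_mean=500.0, new_sd=100.0):
    """Map an IRTscore (mean 0, SD 1) onto a reporting scale with the chosen
    mean and standard deviation so that scores are no longer negative."""
    return new_mean + new_sd * irt_score

# e.g., an IRTscore of -1.2 becomes 500 + 100 * (-1.2) = 380 on the new scale.
```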
Figure 9.17 The scale review and scale setting
interface
9.8.1. Test Score Distributions and Test Information
IATA displays score distributions as histograms, where each bar represents a range of scores, and the height of each bar represents the proportion of students with scores in that range. For score types that are expressed
on scales with means approximately 0 and standard deviations
approximately 1 (StandardizedZscore, RawZScore, and IRTscore),
IATA also plots the test information function as a solid line. The test information function
describes how accurate
the test is at different
proficiency levels on
the standard scale on which the items are scaled (for more information, refer to Chapter
15, page 185). The test information function is inversely
related to the
standard error of measurement; if the test information is high, the standard error of
measurement will be low. The test information function
should be interpreted in relation to the specific
testing needs or purpose of the test. For example,
if the purpose of the test is to identify low proficiency students, a test that is most accurate for high proficiency level students would be unsuitable and would not serve as an appropriate
measure for identifying low proficiency students. In general, the average error of measurement for all students will be minimized
if the information function for a test is slightly
wider than, but about the same shape and location as, the distribution of proficiency for the students
being tested. Comparing
the test information function to the distribution of test scores can illuminate whether the test design would benefit from modifying the balance of items with greater
accuracy for high or low
performers.
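A sketch of how a test information function and the corresponding standard error of measurement can be computed for a set of 3PL items is shown below. The item parameters are invented for illustration, and the 1.7 constant is again an assumption about the metric.

```python
import numpy as np

def test_information(theta, a, b, c, D=1.7):
    """Sum of 3PL item information functions at each value of theta."""
    theta = np.atleast_1d(theta)[:, None]
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))     # IRF per item
    info = (D * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p
    return info.sum(axis=1)                                  # sum over items

theta = np.linspace(-3, 3, 13)
info = test_information(theta, a=np.array([1.0, 0.8]),
                        b=np.array([-0.5, 0.5]), c=np.array([0.2, 0.25]))
sem = 1 / np.sqrt(info)   # high information means low measurement error
```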
9.8.2. Summary Statistics
1. Mean
2. Standard
deviation
3. Skewness
4. Kurtosis
5. Interquartile range
6. 25th percentile
7. Median
8. 75th percentile
9. Response
rate
10. Reliability
11. Total number of respondents
12. Number
of items in the test
13. Number
of items included
in the analysis.
The first eight statistics describe the distribution of estimated scores. Use the scrollbar on the right of the table to view the last three rows.
These
statistics help determine
the adequacy of the scale scores for various purposes (e.g., secondary
statistical analysis or reporting by quantiles). The last five statistics describe the conditions under which the analysis was conducted and provide a holistic rating of the test, which should be checked
to confirm that the analysis
was conducted on the proper data according
to correct specifications. These statistics were described in Part 1 of this volume.
Response rate describes
the average number of valid (non-missing) responses
on each of the items. Reliability is an overall
summary measure of a test’s average accuracy
for the given sample of students. Both response rate and reliability range from 0 to 1 and should be as high as possible. The total number of items included in the analysis
reflects the fact that some items may be dropped
from the analysis because they were considered inadequate due to poor wording, were confusing to students, or had other technical
inadequacies. For the current walkthrough, the number of respondents is 262, the number of items is 80, and the number of “Okay” items is 79, because item MATHC1075 was removed from the analysis.
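As a point of reference, one widely used internal-consistency estimate is Cronbach's alpha, sketched below. IATA may report a different reliability coefficient (for example, an IRT-based one), so this is only an illustration of the concept.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 0/1 matrix of shape (n_students, n_items)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```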
The scaling interface
is more useful for final assessment administrations than for pilot testing.
The unweighted pilot test sample is not representative, so the distributions of results should not be generalized to population performance. Also, because no test scores will be reported,
there is no need to generate derived scale scores, and further results
from the scaling interface are not relevant to the analysis of the PILOT1 data. The scaling interface
will be discussed in greater detail in Chapter
10. You may click the “Next>>” button to continue to the next task.
9.9. Step 8: SELECTING TEST ITEMS
Optimal selection of items using IATA is available
whenever an item data file has been loaded or created during an analysis
of response data. IATA can automatically
select items based on their statistical item characteristics in order to produce the most efficient test for a given test length and purpose. The basic principle
underlying IRT-based test construction is that the test designer
has some expectation about the degree of measurement error that a test should have at different levels of proficiency in addition to requirements about the balance of content that must be included in the test.
In general, the more items there are in a test, the more information it can generate
about the proficiency level of examinees. Unfortunately, tests with too many items are generally neither
practical nor desirable; they can be unnecessarily disruptive in school and can result in test-taker fatigue and deterioration of student motivation, resulting in less accurate results. Overly long tests are also costly to develop, administer, score, and process.
To be most efficient, a test should only include the most informative test items from the pool of available
items. IATA can help develop a test with the minimum number of test items necessary to meet the purposes of policy makers and other stakeholders.
Determining an acceptable level of standard
error depends on the purpose of the assessment. While it would be ideal to build a test with high information at all proficiency levels, this would require many items, which increases the length of time each
student spends taking the test, which in turn may lower the validity of the test results by allowing fatigue and boredom
to influence test scores. If a test is norm- referenced, then detailed information (and lower error of measurement) is required for all levels of proficiency. In contrast, if a test is criterion-referenced, then information is only required around the proficiency thresholds at which decisions are made.
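The following sketch illustrates the standard IRT relationship behind this trade-off: test information is the sum of item information, and the standard error of measurement at proficiency level theta is one over the square root of the test information at theta. The code uses the common 3PL formulation with the usual D = 1.7 scaling constant and hypothetical item parameters; it is not IATA's internal implementation. A criterion-referenced test only needs the standard error to be small near its cut scores, whereas a norm-referenced test needs it to be small across the whole proficiency range.

# Sketch (standard IRT formulas, hypothetical item parameters): test information
# and standard error of measurement for a set of 3PL items.
import numpy as np

D = 1.7  # scaling constant commonly used with the logistic IRT model

def item_information(theta, a, b, c):
    theta = np.asarray(theta, dtype=float)
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))          # probability correct
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def test_sem(theta, items):
    info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / np.sqrt(info)   # standard error of measurement at theta

# Hypothetical parameters (discrimination a, difficulty b, guessing c):
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.7, 0.15)]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, float(test_sem(theta, items)))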
However, item selection at the pilot stage should not be determined solely by the results of statistical analysis. The validity of the interpretation of results is the most important consideration in constructing national achievement tests (and indeed most other tests). The test scores should adequately and accurately represent
the domain being measured. The most important tools for maintaining test validity are the theoretical
frameworks and the table of specifications or test blueprint. A blueprint helps determine the balance of content and cognitive skill levels to be included
in a test (see Anderson and Morgan, 2008).
The interface for selecting optimal test items is shown in Figure 9.18. On the left, a drop-down menu allows you to select a source of items for item selection. In this example, the “Items1” table is available, which contains the results of the current analysis[9]. Beneath the data source selection
are fields that allow you to specify the name that will be applied to the item selection
and the total number of items to select from
the item data. The table beneath these fields contains
a list of all the calibrated items in the selected
data source, along with the proficiency level (“Level”) and content area (“Content”) associated with each item. Although the latter two data fields are typically read into IATA in an item data file, the data may also be manually edited directly in the table. The statistical selection process does not require Level and Content specifications, but having detailed information about each item will help you optimize the selection of items while maintaining the desired content representation.
Clicking the checkbox to the left of an item name will force IATA to select the item, regardless of its statistical properties.
Beneath
the item table, there are two sliding controls that allow you to specify the range
of proficiency within which you wish to maximize test accuracy.
The controls are set such that the minimum value corresponds to the 2nd percentile of proficiency and the maximum corresponds to the 98th percentile (the currently selected
value is displayed to the right of each sliding
control). You can specify a narrower range in which to maximize the information by modifying
upper and lower bounds to reflect your assessment goals. IATA will select items to produce the minimum standard error of measurement in the range of proficiency between the lower and upper bounds, assuming a normal distribution of proficiency.
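IATA's exact selection algorithm is not documented here, but the principle can be sketched as follows: weight each point in the target proficiency range by an assumed normal population density, compute each item's expected information over that range (reusing the hypothetical item_information() function from the sketch above), and rank items from most to least informative. A real implementation would also need to respect content constraints and any items you force into the selection.

# Sketch (illustrative only, not IATA's algorithm): rank calibrated items by
# expected information between two percentile bounds of a normal proficiency
# distribution; a fixed-length test would take the top-ranked items.
import numpy as np
from scipy.stats import norm

def rank_items(item_params, lower_pct=2, upper_pct=80):
    lo, hi = norm.ppf(lower_pct / 100), norm.ppf(upper_pct / 100)
    theta = np.linspace(lo, hi, 41)            # grid over the target range
    weights = norm.pdf(theta)                  # assumed normal proficiency density
    score = {name: float(np.sum(weights * item_information(theta, a, b, c)))
             for name, (a, b, c) in item_params.items()}
    return sorted(score, key=score.get, reverse=True)

# e.g., a 50-item selection would be the first 50 names in this ranking:
# selection = rank_items(calibrated_params)[:50]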
Figure 9.18 Item selection results for PILOT1 data, 50 items
The primary purpose of pilot testing
the assessment items is to determine which items will
be most useful in the final administration of the national
assessment. Because the items have been calibrated with a non-representative sample,
it may be useful to triangulate the item selection
process using several
criteria. Because the sample is entirely from urban schools, the average proficiency in the sample is likely somewhat higher than in the overall population. In other words, selecting test items to optimize accuracy for students with slightly below-average proficiency in the current sample will likely optimize accuracy for students of average proficiency in the full sample. Keeping in mind that we wish to create a 50-item final test, we can enter these specifications into IATA as follows:
1. In the “Name of item selection” box, type “50Items” (the name is arbitrary; we use the name here so that you may compare the results you produce to the results
in the IATA sample data folder).
2. In the “Total number of items”
box, enter the number 50.
3. Move the slider control for the upper bound so that it has a value of 80; this specification indicates that the item selection will not attempt to maximize accuracy above the 80th percentile of the proficiency distribution of the current sample, in order to offset the higher average proficiency of the pilot sample relative to the general population (see the short sketch following these steps).
4. Click the “Select Items”
button.
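As a rough check on what this upper bound means on the score scale, assume the proficiency scale in the pilot sample is standardized to mean 0 and standard deviation 1 (consistent with the graph in Figure 9.18, where accuracy peaks near 0); this is an assumption for illustration, not a documented IATA detail. Under that assumption, the slider percentiles translate to proficiency scores as follows:

# Sketch: converting slider percentiles to proficiency scores, assuming a
# standard normal (mean 0, SD 1) proficiency scale for the pilot sample.
from scipy.stats import norm
print(norm.ppf(0.80))   # about  0.84: the 80th-percentile upper bound used here
print(norm.ppf(0.02))   # about -2.05: the default slider minimum (2nd percentile)
print(norm.ppf(0.98))   # about +2.05: the default slider maximum (98th percentile)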
When IATA has performed
the task, your interface should appear as in Figure 9.18. On the left hand side in the items list, you can view the actual 50 items that have been selected. (The last one is MATHC1041). On the right hand side, the graph displays the collective
information and expected
error of measurement of the selected items if they were administered as a test. The results indicate that the item selection is most accurate around the proficiency score of 0 (average
proficiency in the current sample).
The table beneath the graph summarizes the distribution of selected items across different content areas and cognitive levels (for these data, all items have been assigned a default value of 1; values may be edited directly in the item table or uploaded
in the initial item data file). If the data in this table indicate that the statistically optimal selection does not adequately conform to the test blueprint, you can modify the balance
of content by manually selecting
and deleting specific
items using the checkboxes next to each item name in the table on the left. As you manually select items, the summary of the test properties on the right will be automatically updated.
The item selection
is also recorded as an item data table in IATA with the name “CustomTest50ItemsA.” As with all results produced by IATA, you can
view and export this data table by
advancing to the final interface of the workflow (see Section 9.11). The items in the table are sorted in the order of
suitability for the selection criteria,
with the most suitable items at the top.
Given the small number of items in the current analysis, a user may use IATA simply to rank all of the items in order of suitability to the desired range of proficiency (i.e., below the 80th
percentile in the current sample). The test development team may then review the item data file produced
by IATA and, when selecting
items for the final test, use a ranking of the items in terms of suitability while ensuring that the appropriate balance of the different content is maintained. To create a new item selection, perform the following
steps:
1. Click the “Clear” button to remove all previous selections
from the item list.
2. Enter a new name for the item selection, “79Items” (if you reused the previous name, the earlier results would be overwritten).
3. Enter the maximum number of
items available (79) as the total number of items. If you enter a number that is
greater than the number of available items, IATA will only select from the available
items.
4. You may leave the upper bound
at 80, since the target range of proficiency has not changed.
5. Click the “Select Items”
button.
Figure 9.19 presents some of the results of the analysis of the 79-item pilot test. A table of results (named “CustomTest79Items”) has been added to the IATA result set,
which may be viewed on the final interface
of the workflow. Test developers can use this information to help improve the quality of items to be used in the national
assessment.
Figure 9.19 Item selection results for PILOT1 data, 79 items
The process of item selection is dependent on the quality of available
items. IATA cannot introduce accuracy
to specific regions
of proficiency if there are no items with information in those regions.
The automated process can help select the best items that are available, but it cannot make the items themselves more accurate.
When you are finished reviewing
the results, click the “Next>>” button to continue.
9.10. Step 9: PERFORMANCE STANDARDS
At the pilot test stage, there is insufficient evidence to support the setting of performance standards. Although some information is available about the statistical
item properties and the specifications that were used to create the items, there is not yet any detailed
information about the distribution of proficiency in the student population. Therefore, any attempt to set performance standards at the pilot stage would be unnecessary and potentially misleading.
Because
this walkthrough example
of the analysis of pilot test data does not require any
standard setting, you can click the “Next>>” button to continue to the results viewing and saving interface.
9.11. Step 10: VIEWING AND SAVING RESULTS
For all analysis
workflows, IATA produces a number of different results in data table format. Data table results from IATA can be viewed and saved on the final interface of each workflow. The results viewing and saving interface allows you to review each of the data tables of results produced during the analysis workflow.
The interface will display the data table that is selected in the drop-down menu. To change the data source, select a different table from the drop-down menu, as shown in Figure 9.20. Table 8.5 provides a complete list and description of the available data tables produced by IATA.
Note that, although
you did not specify the creation of any performance standards, the table “PLevels” is created automatically using default specification values.
Figure 9.20 Viewing
results from the analysis of PILOT1 data
You may save these tables of results in a single output file or multiple
files by clicking the “Save Data” button. You may save a single table or all tables at once to a variety of formats.
There are two recommended
file formats for saving IATA output: Excel (*.xls/*.xlsx) and SPSS (*.sav). In general, Excel is preferable, because all data tables may be saved into a single data file. The Excel format may also be opened in free software such as OpenOffice (downloadable from http://www.openoffice.org/). However, early versions of Excel are limited to a maximum
of 255 variables. If your data file has more variables, IATA will only save the first 255 into the *.xls. To save larger data files, you must use the *.sav or *.xlsx
formats. SPSS files have the advantage that they can store larger data tables efficiently and can store metadata (if they are edited in the SPSS software package).
Note, however, that SPSS has one main limitation: each data table will be saved into a separate
file.
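The practical difference between the two formats can be illustrated outside IATA with a short Python sketch (the table names and column contents below are hypothetical examples, and the pandas and pyreadstat libraries are assumed to be available): an Excel workbook can hold every result table as a separate sheet, whereas SPSS output requires one *.sav file per table.

# Sketch (outside IATA): one Excel workbook with many sheets versus one SPSS
# (*.sav) file per table. Table names and columns are examples only.
import pandas as pd
import pyreadstat

tables = {
    "Items1": pd.DataFrame({"Name": ["MATHC1001", "MATHC1002"], "Discr": [0.45, 0.62]}),
    "PLevels": pd.DataFrame({"Level": [1, 2], "Cutoff": [-1.0, 0.0]}),
}

with pd.ExcelWriter("PILOT1_results.xlsx") as writer:    # one workbook, many sheets
    for name, df in tables.items():
        df.to_excel(writer, sheet_name=name, index=False)

for name, df in tables.items():                          # SPSS: one file per table
    pyreadstat.write_sav(df, "PILOT1_" + name + ".sav")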
A file dialog will ask you to specify the file name and location
for the results, as well as the output format. Choose the desired data format and click the “Save” button to finish
saving the table or tables[10]. The resulting files contain all of the tabular results produced during the entire analysis workflow,
providing documentation of the analysis.
For reference, the item data results of this analysis
walkthrough from the table named “Items1” are included in the ItemDataAllTests.xls file in the worksheet named
“ReferenceP1.”
For a real pilot test analysis (i.e., not using simulated data), the results tables and any graphics that you have copied and pasted during the analysis
workflow should be provided to the test developers, who would then use the information to modify the test, selecting,
ordering, and adding items, as required, in order to maximize the accuracy and usefulness of the final test form.
9.12. SUMMARY
In this chapter,
you were introduced to the analysis
of pilot test data with IATA. You used the “Response data analysis” workflow
to analyze response
data using an answer key file. The different stages in the workflow
included loading data, specifying the analysis, item analysis, dimensional analysis, analysis of differential
item functioning, and item selection. Creating scale scores and developing
performance standards were not performed, because the distribution of proficiency in the pilot sample was not representative of the population.
In the next chapter, the example continues
with the same national assessment
program, after the final test has been constructed and administered to the complete national assessment sample.
[1] See chapter 9 for a discussion of the symbols
and their meanings.
[2] For more information on common issues
identifiable with distractor analysis, see Chapter 15,
page 170.
[3] It is unreasonable to have a loading equal to 1, because
this would require each respondent to have the same score on every item. This requirement implies that the test could produce only two distinct score values, which is not very informative.
[4] The values displayed in IATA have been standardized to express the proportion of total variance accounted
for by each eigenvalue.
[5] Clicking on the header twice will sort the column in descending order.
[6] The coefficient of sampling variation is calculated as the standard error of the S-DIF statistic
divided by the absolute value of the S-DIF statistic.
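For example (hypothetical values), if an item's S-DIF statistic is −4.0 and its standard error is 1.0, the coefficient of sampling variation is 1.0 / |−4.0| = 0.25.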
[7] All results from this walkthrough are available for reference and comparison in the IATA sample data folder, in the Excel file named “ReferencePILOT1.xls.” The DIF result tables are in the worksheets with names beginning with “DIF_”.
[8] You can copy any of the DIF analysis graphs by placing the cursor on the graph and using the Copy and Paste functions from the right-click menu.
[9] For different analyses that involve linking, you may select from previously calibrated item data (“Items2”) or the set of items that are common to two item data sources (“MergedItems”).
[10] If you save all tables and select the SPSS (*.sav) output
format, each result table will be exported
as a separate *.sav data file, with the name you provide
as a prefix to all the table names.