9. CHAPTER 9 ANALYZING DATA FROM A PILOT TEST ADMINISTRATION
9.1. Overview
Use the PILOT1 sample data set to carry out this exercise.
The answer key for this test is in the EXCEL workbook, ItemDataAllTests in the sheet named PILOT1.
Let us consider
the following scenario. A national assessment team and its
curriculum experts have created a set of new multiple-choice items in order to evaluate
grade 10 students’
mathematics skills. These new test items were considered adequate for representing the national curriculum. The items had been created to reflect
the main content categories determined by the national
steering committee (number knowledge, shape and space, relations,
problem solving, and uncertainty). The final version of the test is meant to be administered to 10th grade students of all proficiency levels and is intended to contain 50 items.
As a first step, the national assessment team administered an 80-item test to a total of 262 students, sampled from 7 schools in each of 3 regions, with test booklets
in 2 languages. This is a larger number of items than will be included on the final test, but there are typically many items that are developed
for a test that will not function
well for a variety of reasons (e.g., too easy or too difficult, confusing
instructions). A test development process may produce
two or three times as many items as will be used in the final test. Most of these items will be rejected by review panels prior to the pretest stage. However, a national
assessment team should still pretest at least 50% more items
than are required
for the final test. This
pilot test is intended to test the operational protocols for the survey as well as to determine the composition of items in the final test, which will be administered in the national
assessment to a different sample of students. The student response data file contains each student’s multiple-choice answers to each of the 80 items as well as some school-level
variables (region identification, school identification, school type and school size) and some student- level information (sex and language).
From the main menu, click the first menu option,
“Response data analysis”, to enter the analysis
workflow, as shown in Figure 9.1. If, at any stage in the workflow, you receive an error or receive results that are different than expected, return to a previous step or begin the analysis again from the main menu.
Figure 9.1 Select the “Response data analysis” workflow
9.2. Step 1: LOADING RESPONSE DATA
Regardless of the analysis path chosen, you must direct IATA to
load previously collected or produced
data (for example,
national assessment pilot test data, or an item data file). IATA is flexible and has simple procedures and buttons for loading response data, item data, or both. Regardless of the
analysis path or type of data, you must
tell IATA which data file to
import and which data in the file to use. IATA can import data
in SPSS (*.sav), EXCEL
(*.xls/*.xlsx), tab-delimited (*.txt), and comma-
separated (*.csv) formats. Because EXCEL data files can contain several separate tables, you must specify which
table is to be imported for the analysis.
The first screen in this analysis path requires you to import a response
data file into IATA. The data-loading
interface is shown in Figure 9.2. The instructions begin with the words “EXAMINEE
RESPONSE DATA…” to indicate that you are loading data containing responses
to items and explain the general contents expected
to be in the data file. Below the instructions are two boxes: a file path summary,
and a drop-down menu
for selecting data tables in the selected
file. To the right of these boxes is the button
labelled “Open File”. The table at the bottom of the interface
displays the data for a selected data source. If there are more than 500 rows of data, only the first 500 will be displayed. If you have selected a data format that supports multiple tables, such as Excel or Access, then the name of the first table in the data file will appear in the drop-down box. Otherwise, the name of the file will appear in the drop-down box. For multi-table data files, the desired data may not be in the first table. You should verify that the appropriate data are selected
by reviewing the contents of the data table, which will appear in the large area at the bottom of the interface. If the active table does
not contain the desired data, you can select a different table by clicking
the drop-down menu.
Figure 9.2 Response
data loading interface
1. Click Open File to select a data file. In the file browser,
navigate to the folder on
your desktop that contains the IATA sample data.
2. Choose the Excel (*.xls) file format.
If you see (*.xlsx) in the box to the right of the file name field, use the dropdown
arrow and click on (*.xls).
3. Select (or type) PILOT1.xls.
4. Click Open or press the Enter key.
When the file opens, a pop-up dialog will remind you to confirm that the data you have selected
contain the correct item-response data. Click OK to continue. Confirm that the sample pilot data are correctly loaded; your interface
should look like Figure 9.2.
The data shown in Figure 9.2 are the records for each student who took the pilot test. The first seven variables from the left describe demographic and sampling information about the students:
· PILOT1STDID – unique student identification code;
· SCHOOLID – unique school identification code;
· Sex – the sex of the student (1=male,
2=female);
· SchoolSize – the total number of students in the school;
· Rural – the location
of the school (0=urban, 1=rural);
· Region – a numeric
identifier for the geographic region;
· Language – a numeric
identifier for the language of the test administration.
The first mathematics test item appears in column 8 and is labeled MATHC1019. Scroll across to see that the file contains data on 80 items; the item in the last column is labeled MATHC1041.
The item names are arbitrary and do not reflect their position on the test. Most cells have values A, B, C or D indicating
students’ choice of options. Cells
which have 9 indicate that a student did not respond to the item.
As with most pilot samples,
the students represent
a sample of convenience, rather than a scientific representation of the population.
Sample weights are only valid when they are produced
as a product of a
scientific sample design.
Accordingly, there are no sample
weights in the PILOT1 response data file.
After verifying that you have loaded the correct response
data file, click the “Next>>”
button.
9.3. Step 2: LOADING THE ANSWER KEY
You must also load the item answer keys so that IATA can perform
the analysis correctly. As with the response data, the item data are in Excel format in the IATA data folder on your desktop.
1. Click Open
File to select a data file. In the file browser, navigate to the folder on
your desktop that contains the IATA sample data.
2. Choose the Excel (*.xls) file
format.
3. Select (or type)
ItemDataAllTests.xls.
4. Click Open or press the Enter
key.
When the file opens, a pop-up dialog will remind you that IATA will estimate any missing item parameters. Click OK to continue. The selected data file contains tables for all the different
examples in this book. Ensure that you have correctly selected the table named “PILOT1” in the dropdown
menu. Confirm that the correct
item data are correctly loaded; your interface
should look like Figure 9.3. If you wish to find information
on a
specific item easily, you can sort the items by clicking on the header for the Name column.
Figure 9.3 Item data for the PILOT1 response
data
When you have confirmed that the correct item data have been loaded, click the “Next>>” button to continue.
9.4. Step 3: ANALYSIS SPECIFICATIONS
Every
workflow that uses response data requires you to provide certain specifications
that will affect the results of all subsequent analyses. These specifications include
answer key and item metadata, respondent identification variable,
sample design weighting, and treatment
of missing data codes. The interface for providing these specifications is shown in Figure 9.4. The large panel on the left contains a table of the test items in the response data file with the columns headers “Name”, “Key”, “Level” and “Content”. If an item data file has been loaded, the table will only contain variables that have been identified as test items; otherwise, the table will contain all variables. If you had skipped the loading of an item data file, you would need to manually enter the answer key specifications for each item in this table (see section 8.3.2.119).
In the center section of the interface,
there is a button labelled
“Update response value list”. You will need to click this button if you change the answer key specifications, either by manually entering
answer keys or deleting existing
answer keys. When you click this button, IATA will populate the two drop-down menus with lists of variables in the response data that have not been assigned an answer key and
list all of the response values
present for the variables identified as test items. If you have loaded an item data file, these menus will already
be populated with values.
Below the “Update response value list” button, there are several controls for providing
optional specifications: a drop-down menu for specifying the identification (ID) variable,
a drop-down menu for selecting
the weight variable,
and a table for specifying treatment of missing
value codes. Specifying an ID variable
may be necessary to merge the test results produced
by IATA with other data sources. The ID variable
should uniquely identify each student;
if you do not specify an ID variable, IATA will produce a variable named “UniqueIdentifier” to serve this purpose. The weight variable
is used to ensure that the statistics
produced during the analysis are appropriate for the sample design of the national
assessment. If no weight variable
is provided, IATA will assume that all students in the data receive the same weight, equal to 1.
Figure 9.4 Analysis
specifications for the PILOT1 data
You can inform IATA that a response value is a missing response
code by clicking one of the checkboxes
next to the value in the “Specify missing treatment” table. By default,
IATA assumes that all response values represent actual student responses.
If the box in the “Incorrect” column is checked, then IATA will treat that value as an invalid response
that will be scored as incorrect. If the box in the “Do Not Score” column is checked, then IATA will treat that value as omitted,
and the value will not affect a student’s test results. By
default, if there are any completely empty or blank cells in the response data, IATA will treat them as incorrect, unless you have manually specified
“Do Not Score” treatment.
For this walkthrough, the answer key and response
data have both been entered, so the list of items shown in Figure 9.4 contains only those variables
with answer keys in the item data. It is a good idea to review the answer key table to confirm that the keys and other data about each item are correct and complete, because any errors at this stage will
produce even more errors in subsequent tasks in the workflow. In the middle of the screen, you will need to specify the additional analysis details. Use the following specifications:
1. Use the first drop-down menu to select the PILOT1STDID variable as the ID variable.
2. These data do not have a sample
weight, so you may leave the second drop-down menu blank.
3. The value of 9 will be treated
as incorrect, so check the appropriate box in the table of values in the “Specify missing
treatment” section. Although there are no blank entries in the PILOT1 data, you
can leave the default specification of treating blank entries as incorrect.
When the specifications have been entered,
the interface should look the same as Figure 9.4.
Confirm that your specifications are
correct and click the “Next>>” button to continue.
The data will begin processing automatically. The processing stages are:
Setting up data, Scoring, Estimating
parameters, IRT scaling, Calculating True Scores, and Factor analysis. As the processing continues, the interface
will display the current stage of
processing. Depending on the speed of your computer and the size of your data, this analysis may take seconds to
minutes to complete processing. When IATA finishes
processing, it will display the results in the item analysis interface.
9.5. Step 4: ITEM ANALYSIS
When the data processing has finished, the item analysis
interface will be updated with the results, shown in Figure 9.5. Using the item analysis
interface, you can access these results as well as view and save diagnostic information about each test item.
There
are four types of results displayed in this interface:
1. Statistics
and statistical parameters describing each item (on the left);
2. A graphical illustration of the relationship between student proficiency and the probability of correctly responding to an item, also known as an Item Response
Function or IRF (at the top right);
3. A contingency table describing the proportions of students with high, medium, and low test scores who endorsed
each of the different item responses, also known as a distractor analysis (at the middle right); and
4. A plain-language
summary of the item analysis
results (at the bottom right)
Figure 9.5 Item analysis results for the PILOT1 data, item MATHC1019
The table on the left side of the item analysis
interface presents statistical information as well as a symbol describing
the overall suitability of each item (see page 23). The Name of each item is in the column to the right of the summary symbols. You can examine the detailed results for an individual item by using the arrow keys or mouse to highlight
the row in which the item appears.
You can use the checkboxes
in the “Use” column for each row to include or exclude
items from the analysis. Uncheck one of these item boxes to remove the item from the analysis.
You may then click the “Analyze” button to rerun the analysis with the reduced set of items. Return all items to their original
state by clicking
the “Reset Items” button. Note that clicking
“Reset Items” will
reset all items, so if you wish to permanently remove an item from the analysis, you should delete its answer key in the analysis
specifications interface. The “Scale” button does not re-estimate any item parameters; it simply calculates IRT scale scores for the response data using the item parameters that have already been estimated or loaded into IATA from an external
data file.
9.5.1. Item Statistics
The three columns to the right of the item name contain classical
item statistics: the item discrimination index (“Discr”), the point-biserial correlation (“PBis”), and the item facility (“PVal”), also sometimes
referred to as item difficulty, although larger values indicate an easier test item. The final three columns, which may be hidden from view and require you to scroll across the table, are estimates of item response
theory (IRT) parameters: the slope parameter (“a”), the location
parameter (“b”) and the pseudo-guessing parameter (“c”). In-depth discussions of these statistics
and parameters and how they relate to each other are presented
in Chapter 15 (page 149).
In general, the classical statistics
may be interpreted directly. The item facility (PVal) ranges between 0 and 1 and describes how easy an item is for the given sample: a value of 0 indicates
that no students responded correctly, and a value of 1 indicates all students responded
correctly. The discrimination index and point-biserial correlation provide alternate
measures of the same relationship: how strongly responses to each item are related to the overall test score. For both statistics, the value should be greater than 0.2. These guidelines should not be considered absolute,
because these indices are also affected by factors other than the discrimination of the items,
including the accuracy
of the overall test. For example, the item facility
tends to limit the absolute value of both the discrimination index and the point-biserial correlation. If the item facility differs substantially from 0.5 (e.g., less than 0.2 or greater than 0.8), the discrimination index and point-biserial correlation will underestimate the relationship between proficiency and performance of students on a test item.
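The following sketch (not IATA's internal code) shows how these three classical statistics could be computed from a matrix of scored (0/1) responses. The use of upper and lower 27 percent groups for the discrimination index is a common convention and an assumption here; IATA's exact definitions may differ.

```python
import numpy as np

def classical_item_stats(scores, top_frac=0.27):
    """Facility (PVal), discrimination index (Discr), and point-biserial
    correlation (PBis) for each column of a 0/1 scored response matrix."""
    n_students, n_items = scores.shape
    total = scores.sum(axis=1)                     # each student's raw score
    order = np.argsort(total)                      # students from low to high
    n_group = max(1, int(round(top_frac * n_students)))
    low, high = order[:n_group], order[-n_group:]

    pval = scores.mean(axis=0)                     # item facility
    discr = scores[high].mean(axis=0) - scores[low].mean(axis=0)

    pbis = np.empty(n_items)
    for j in range(n_items):
        # some implementations correlate with the total score excluding item j
        pbis[j] = np.corrcoef(scores[:, j], total)[0, 1]
    return pval, discr, pbis
```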
Although extremely easy or difficult items tend to reduce the observed relationships
with proficiency, they may also cover important curriculum
content that should be included in the test or they may (in the case of easy items for instance)
be required to sustain student motivation during testing. For these or other reasons,
it is often desirable to include a relatively small number of very easy or difficult
items.
In contrast, the IRT parameters
should not be interpreted in isolation; although
each describes a specific behaviour
of the test item, the relationship between responses to the item and overall
proficiency are the result of interactions between all three parameters as well as the proficiency level of individual students.
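As a concrete reference, the sketch below shows a common form of the three-parameter logistic (3PL) item response function that combines the a, b, and c parameters. Whether IATA includes the 1.7 scaling constant in its metric is an assumption.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response at proficiency theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# Example: a shallow item (low a) barely separates low and high performers.
theta = np.linspace(-3, 3, 7)
print(irf_3pl(theta, a=0.5, b=0.0, c=0.2))
```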
Most items in the current analysis have a green circle, indicating
that they have no major problems and are relatively
satisfactory. By scrolling
down the item list on the left, you will see 13 items with diamond-shaped
caution symbols (MATHC1047,
MATHC1013, MATHC1002, MATHC1070, MATHC1034, MATHC1035,
MATHC1032, MATHC1010, MATHC1068, MATHC1046, MATHC1024, MATHC1058, and MATHC1030). One item (MATHC1075) has a triangular warning symbol and is considered
a potentially problematic item. The best practice is to examine
the results for all items, regardless of the summary symbol IATA assigns, but for this walkthrough, we will focus on a few examples.
By default, the results for the first item are displayed in the graph and table on the right. IATA has assigned this item, MATHC1019, a green circle[1]. Each of the results IATA produces
for this item is explained
in the following sections.
9.5.2. Item Response Function (IRF)
In the graphics window on the right-hand side of the item analysis interface,
IATA will display the Item Response
Function (IRF) for a selected
test item. Reviewing
the IRF is typically more intuitive than examining the IRT parameters
or item statistics to determine the relative usefulness
of different test items. A useful item will have a strong relationship with proficiency, indicated
by an IRF that has a strong S-shape, with
a narrow region in which the curve is almost vertical. The slope of the IRF for MATHC1019 is consistently positive,
but the relationship is weak, without any region with
a notably steeper slope. This shallow slope corroborates the low discrimination
index (Discr=0.36) and low point-biserial
correlation (PBis=0.35).
As with any statistical modeling method, IRT is only useful if the data fit the
theoretical model. For each item or score value, IATA produces a graphic of the theoretical IRF produced using the estimated
parameters as well as
the empirical IRF estimated directly from the proportions of correct responses
at each proficiency level. The
graphic can be used to assess the suitability of using IRT to describe each item. If the IRT model is appropriate, the red dashed line will appear to be very similar to the solid black line, where deviations
are less than 0.05, particularly in the region between -1 and 1, where there are many students.
For MATHC1019, the theoretical and empirical IRF’s are almost identical, indicating that, although the item itself may have a weak relationship with proficiency, its statistical properties
are accurately described
by the IRF.
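A minimal sketch of this model-fit check follows, assuming the empirical IRF is formed from proportions correct within bins of the proficiency score; the number of bins and the 1.7 constant are assumptions rather than IATA's actual settings.

```python
import numpy as np

def irf_fit_check(theta, item, a, b, c, n_bins=10, tol=0.05):
    """Flag proficiency bins in the range (-1, 1) where the empirical and
    theoretical IRFs for one item differ by more than tol."""
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    bin_index = np.clip(np.digitize(theta, edges) - 1, 0, n_bins - 1)
    flags = []
    for k, mid in enumerate(centers):
        in_bin = bin_index == k
        if not in_bin.any() or not (-1 <= mid <= 1):
            continue
        empirical = item[in_bin].mean()            # observed proportion correct
        theoretical = c + (1 - c) / (1 + np.exp(-1.7 * a * (mid - b)))
        if abs(empirical - theoretical) > tol:
            flags.append((round(float(mid), 2),
                          round(float(empirical - theoretical), 3)))
    return flags   # an empty list suggests the IRT model describes the item well
```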
9.5.3. Distractor Analysis
In the bottom right of the item analysis interface, IATA produces
statistics for each response value (including missing value codes and incorrect
response values) and a textual
summary of the analysis.
The statistics
are estimated separately
for groups of low, medium and high performing
students, based on their percent-correct
test score, as well as the entire sample. This table, shown in detail in Figure 9.6, is also referred
to as a distractor analysis.
Figure 9.6 Distractor
analysis for item MATHC1019, PILOT1 data
There are many reasons why an item may have a low or even a negative discrimination relationship with proficiency. These include: poor wording, confusing
instructions, sampling errors, and miskeying
or miscoding of responses. Distractor
analysis may be used to detect and remediate some of these common errors by looking
at patterns in item responses. A well-functioning item should have the following characteristics:
1. The column for the correct option, denoted
by the asterisk (*), should have a high percentage for the high group, and successively lower percentages for the medium
and low groups. MATHC1019 satisfies this condition, with values of
47.9, 19.9 and 11.4 for the high, medium and low groups, respectively.
2. For the low skilled group, the percentage choosing the
correct option should be lower than the percentage choosing any one of the other
options. All of the incorrect options (A, B and C) for MATHC1019 exhibit this
pattern.
3. Each of the columns corresponding to incorrect response
values should have approximately equal percentages in each skill level and overall
compared to the other incorrect response values. MATHC1019 violates this
pattern, because option B is endorsed by almost twice as many incorrect
respondents as either A or C.
4. For the high-skilled group, the percentage choosing the
correct option should be higher than the percentage choosing any one of the other
options. MATHC1019 satisfies this pattern: 47.9 is greater than the values for A
(14.1), B (23.9) and C (14.1).
5. For all groups, the percentage of missing value codes (denoted
by an X) should be close to 0. A substantial proportion of students had missing
responses (code 9), but the occurrence was greater in low performers than high performers,
suggesting that the decision to treat the code as incorrect (rather than omitted)
was reasonable.
6. Missing response codes that are treated as omitted (denoted
by OMIT) should have equal percentages of students in each skill level. This code
was not used for these data.
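To make the construction of such a table concrete, here is a minimal sketch that tabulates, for one item, the percentage of students in each score group endorsing each response value, so that the characteristics listed above can be checked. The raw option codes, the three-way split of the sample, and the array names are illustrative assumptions, not IATA's actual procedure.

```python
import numpy as np

def distractor_table(responses, total, cuts=(1/3, 2/3)):
    """responses: array of raw option codes (e.g., 'A'-'D', '9') for one item;
    total: each student's percent-correct test score."""
    lo_cut, hi_cut = np.quantile(total, cuts)
    groups = {"Low": total <= lo_cut,
              "Medium": (total > lo_cut) & (total < hi_cut),
              "High": total >= hi_cut,
              "All": np.ones(total.size, dtype=bool)}
    table = {}
    for opt in sorted(set(responses)):
        table[opt] = {name: 100 * np.mean(responses[mask] == opt)
                      for name, mask in groups.items()}
    return table   # e.g., table['B']['High'] -> percent of high group choosing B
```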
IATA provides a textual summary about the item performance, including
warnings if the discrimination is unacceptably low and, if so, suggests what may be done to improve
it. For example, IATA will identify
distractors that are not effective
in eliciting endorsements from respondents (or have statistical profiles similar to correct
responses)[2]. If IATA does detect any common problems in the data, a
verbal summary of the results is
displayed in the text box beneath the distractor analysis table.
Examining the results for MATHC1019,
the textual summary
on the bottom right recommends examining the response option coded as “A”. Looking at the distractor
analysis table, we can see that response “A” is endorsed by approximately the same proportion of high-performing students as low-performing students, indicating that it does not function
well as a distractor.
The distractor analysis
of national assessment data may also be useful to providers
of in-service education courses for teachers
and also to curriculum personnel. The results may help identify common misconceptions and errors made by students.
Curriculum authorities can also use the data to judge the appropriateness of specific material
for a particular grade level.
9.5.4. Comparing Different Items
Turning to the second item on the test, MATHC1027, which is shown in Figure 9.7, we find that, compared
to the previous item, it has a stronger relationship with proficiency, indicated by the steeper IRF and the larger discrimination (0.65) and point-biserial
correlation (0.53). The theoretical and empirical IRFs are almost identical, indicating that the statistical item response model is appropriate to the response data. The distractor analysis table shows that 73.2 percent of students in the “High”
group selected the correct option (C) compared
to 19.9 percent in the medium and
8.6 percent in the low group. All of the incorrect
response values (A, B and D), as well as the missing response
code (9), were more likely to be selected
by low-performing students than high-performing students.
Figure 9.7 Item analysis results for PILOT1 data, item MATHC1027
In contrast to the two items we have examined, items with triangular warning symbols are
typically poor items whose inclusion
on the test may produce
misleading or less useful results. The number of poor items that appear in a pilot test such as this one can be minimized by following item-creation
guidelines described in Volume 2 in this series (Anderson
and Morgan 2008). The only item with a warning
symbol in these data is MATHC1075, shown in Figure 9.8. By clicking
on the item, you will see that the results indicate an almost nonexistent relationship between either the correct or incorrect responses and proficiency. Although a missing
response code is still related to
proficiency, the expected
pattern was not evident. Students
in the lowest group were
not most likely to select each of
the three incorrect options, nor were students in the high group
least likely to do so (this item was particularly weak at
discriminating between medium- and low-level students). The discrimination index is low (0.14), as is the point-biserial correlation (0.16). This item may be related to proficiency, but because so few students
answered correctly (PVal=0.12), it is not possible to estimate the relationship. As responses to this item are not clearly dependent
on proficiency, including this item in the test would tend to increase the influence of random factors in
the test scores. Including
this item (and other problematic items) in the analysis may also reduce the accuracy of statistical estimates
for other test items, because the item statistics and parameters are analyzed using the test scores.
Figure 9.8 Item analysis results for PILOT1 data, item MATHC1075
Items can be removed from the analysis by clicking the check box to the left of each item
name. After removing
an item, the results should be recalculated by clicking on the “Analyze” button before removing any other items. The removal of a single item will affect the results of all other items. If
there are many problematic items, you should
remove only one at a time, because some items flagged as problematic may only appear so because
of the influence of worse items on the analysis
results. If you accidentally remove too many items, you may individually recheck
each item or click the “Reset
Items” button above the item list to reset the entire item list. For this example, we will remove MATHC1075 and rerun the analysis, producing
the results in Figure 9.9, in which the results for MATHC1075 are highlighted after removal. Note that the Discr and PBis data for this item have been replaced by NaN (meaning “not a number”) or out-of-range values; they will not affect subsequent
calculations. For removed items, the distractor
analysis table on the right does not appear, and there is a message in the textual summary to re-analyse the test data.
Because
we only removed a single item, the statistics for the remaining
items are relatively unchanged.
Figure 9.9 Item analysis results for PILOT1 data, item MATHC1075
You may continue to review all the items by clicking
on each row in the item list or by navigating with the up and down arrow keys. Note that the verbal summaries provided by IATA are based solely on statistical evidence
and are not informed by the content of items. An item that is given a poor rating by IATA may not be a poor item universally; a poor rating indicates
that the item may not provide useful information when the current test is used with the current population.
In general, the recommendations IATA provides for editing or removing items should be
considered in the context of the purpose of the test and the initial reasons for including the specific item. For example,
some items should be retained
regardless of their statistical properties
due to (a) their positive
effect on student motivation (such as easy initial items) or (b) the need to adequately
represent key aspects of the curriculum. However,
all items with negative discrimination indices should be removed or re-keyed (if the key has been entered incorrectly) before proceeding with other analyses.
Such items introduce
noise or unwanted variation into the item response data and reduce the accuracy
of estimates for other items. Removing
some apparently weak items during analysis of pilot data will help increase the accuracy of the statistical results. However,
the selection of the final set of items following the pilot or trial testing should be carried out jointly by subject matter specialists working closely with the person or team responsible for the overall quality of the national assessment test.
When you have finished reviewing
all the items, click the “Next>>” button
to continue.
9.6. Step 5: TEST DIMENSIONALITY
One of the statistical assumptions of IRT, as well as a requirement for the valid interpretation of test results,
is that performance on the test items represents a single interpretable construct or dimension. Ideally a national achievement test of a construct such as mathematics or science should measure the single construct
or dimension that it is designed to measure and should not measure other constructs or dimensions such as reading ability. The purpose of the test dimensionality interface is to detect any violations of the assumptions that: 1) there is only a single dominant dimension influencing test performance, and 2) the relationships between
performance on pairs or groups of items can be explained by this dominant
dimension. In most cases, the second assumption proceeds from the first, but for long tests (e.g., with more than 50 items), small groups of items may be locally
dependent without having a noticeable effect on the overall test dimensionality.
The analysis of test dimensionality determines the degree to which the test measures
different dimensions of proficiency and the extent to which each item relates to each dimension. The fewer number of dimensions that strongly influence
the test items, the more valid any interpretations of the test scores are. Although
this evidence is insufficient to confirm a test’s validity,
it can provide important information on the content
of specific items. Other aspects of validity,
such as content validity (which is very important in the context of a national assessment) are typically considered
more important than statistical data when determining the validity of a test or an item (see Anderson and Morgan, 2008 for a description of procedures designed
to ensure that a test has adequate
content validity).
From a statistical
perspective, the estimation of IRT parameters and scores depends on the concept of likelihood, which
assumes that the probability of an event (e.g., a correct response) is conditional on a single dimension
representing proficiency. If different
items are conditional on different dimensions, then the estimated parameters and scores will be incorrect.
When this interface
appears, the graph on the right illustrates both the scree plot for the overall
test and the squared factor loadings for the first item, MATHC1019,
shown in Figure 9.10. On the left hand side of the interface is a table similar to that in the item analysis interface. Summary symbols (explained on page 23) in the column labelled “F” next to the item “Name” column describe the overall suitability of an item in terms of its relationship to the primary dimension common to most other items on the test. To the right of the “Name” column,
the classical item facility (“PVal”)
is displayed, along with the loading of the item on the primary dimension
(“Loading”). The loading ranges from -1 to 1 and is the correlation between performance on each item
and the primary test dimension. For example, the value of 0.34 for MATHC1019 indicates that the scored responses to this item have a correlation of 0.34 with the overall
test score (percent-correct). There is no ‘ideal’ value[3], but better items are indicated by loadings closer to 1.
Figure 9.10 Test and item dimensionality for PILOT1 data, item MATHC1019
The results in the table should be interpreted together with the graphical results displayed on the right hand side of the interface. The main result displayed in the graphics window is the scree plot,
which describes the proportion of variance
explained by each potential
dimension (eigenvalue). The dashed red line, connecting
circle-shaped markers arranged
from left to right, illustrates the relative influence
of each potential dimension (eigenvalue[4]) on the overall test results,
and the solid blue line, connecting box-shaped
markers, describes the relative influence
of each potential dimension on the individual
test items (squared
loading). The magnitude
of the eigenvalues is less important than the pattern of the scree plot. The scree plot for the overall test should have a single point on the upper left of the chart (at approximately
0.30 in Figure 9.10) that connects to a near-horizontal straight
line at the bottom of the chart continuing
to the right side of the graph. This “L”-shaped pattern
with only two distinct line segments, shown in Figure 9.10, suggests
that a single common dimension is responsible for the PILOT1 test results. The greater the number of distinct line segments it takes to connect the top-left point to the near-horizontal line at the bottom, the more dimensions
are likely to be underlying test performance.
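A rough sketch of the computation behind such a scree plot follows, assuming an eigendecomposition of the Pearson correlation matrix of the scored responses; IATA's actual factor-analysis method (for example, whether it uses tetrachoric correlations) may differ.

```python
import numpy as np

def scree_values(scores):
    """Proportion of variance explained by each potential dimension for a
    0/1 scored response matrix (assumes every item has some variance)."""
    corr = np.corrcoef(scores, rowvar=False)      # item-by-item correlations
    eigvals = np.linalg.eigvalsh(corr)[::-1]      # eigenvalues, largest first
    return eigvals / eigvals.sum()

# A single dominant first value followed by a flat tail matches the
# "L"-shaped pattern described above, suggesting one dominant dimension.
```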
Selecting each item in the list on the left will display the item-specific scree plot on the right. Ideally, the scree plot for each individual item should be similar to the overall
test: the highest value in the item-specific
line should be on the far left (corresponding to the main dimension of the test). However, item-specific characteristics may introduce different patterns, and these item-specific patterns
are not necessarily problematic. For example, item MATHC1019 in Figure 9.10 is not problematic; although
there are some non-zero loadings on other dimensions, the strongest loading is on the primary dimension.
In general, the item-specific
results only need to be consulted if there is clearly more than one dimension underlying test performance (i.e., there are more than two distinct line segments making up the red line). In that case, you should identify and examine items whose item-specific
plots have squared
loading values corresponding to the same dimensions as the problematic
eigenvalues.
One caveat in the interpretation of scree plots is the effect of item facility.
In tests where most items have similar
item facilities, items with facilities much higher or lower than the other items tend to produce artificial “difficulty factors,” particularly
with non-normal distributions of percent-correct
test scores. The items with extreme facilities may appear to define a separate factor simply because
certain students (e.g., high or low performers) will generate patterns
of response that appear unusually strongly-related compared to the relationships between other tests items. However, these ‘difficulty factors’ are not inherently problematic. Reviewing the item loadings
may help determine
if secondary factors
are artefacts or actual problems. To determine
if a secondary factor is a difficulty factor, examine the item loadings
of the items with low (<0.2) or high (>0.8)
item facilities (PVal). If the item loadings
of these items
have a peak that corresponds to the position
of the secondary factor, it is most likely a difficulty factor and can be ignored.
Item Loadings
The IRT model
assumes “local independence” between items, meaning that responses
to an item should not depend on the responses
to another item. Ideally, under IRT, a test should have questions
that are independent in all dimensions except for the primary
test dimension. Significant local item dependency can result in inaccurate estimation of item parameters, test statistics and student proficiency. For example, a math test that includes a complex problem solving question might assign a set of different scores for each of the logical steps required to compute the final answer.
If the test-taker answered step 1 incorrectly, it influences the probability of correct response on each subsequent step. This set of dependent
test items would be inappropriate for IRT modeling; in this case, the set of steps should properly be treated as a single partial-credit item.
Local
dependence is typically
problematic only in items that are weakly related to the primary
dimension, so the most effective
way to use this interface
is to sort the items by the “Loading” column by clicking
on the column header once[5]
(see Figure 9.11), and comparing
the poorly loading items to identify common peaks in their item loading graphs. If many poorly-loading
items have peaks in their loading plots that correspond to the same dimension, they may have some local dependency. These statistics tend to be sensitive to sampling error, so any results from this statistical
review should be used to motivate more detailed item content review rather than make definitive decisions.
After sorting the items, the selected
item is MATHC1075; because
this item was removed from the analysis
in the previous item analysis
step, the loading for this item is NaN, and no results are shown for the item (the graph only displays the scree plot for the entire test). IATA assigns a triangular warning symbol to any item whose dimensionality may be problematic in terms of affecting the estimation of other statistics. Note that IATA has flagged only one other item with the triangular warning symbol. Figure 9.11 displays
the results for this item, MATHC1035.
Item MATHC1035 is relatively weakly related to the primary dimension and has a noticeably stronger relationship to the second dimension, which suggests it may be measuring a dimension that is distinct from that of most other items. However, these results by themselves are not conclusive evidence to warrant removal of this item from the test. Curriculum experts
and experienced teachers
should review any statistically problematic items to determine
if there is a content-related
issue that might warrant their removal or revision.
Figure 9.11 Comparison
of item dimensionality results for PILOT1 data, items MATHC1035 and MATHC1034
IATA assigns a diamond-shaped caution symbol to any item that has a stronger loading on a secondary
dimension than on the primary test dimension but whose results are likely not problematic for any subsequent calculations. A typical example is shown in Figure 9.12, for item MATHC1002. This item is related to several dimensions, but because these dimensions have so little influence on the overall test results, as indicated
by the relatively
small eigenvalues (dashed
red line) corresponding to the peaks of the strong loadings (solid blue line), determination of whether the dimensionality of the item is acceptable
or not should be a matter of test content rather than one of statistics.
Figure 9.12 Item dimensionality results for PILOT1 data,
item MATHC1002
All tests are multidimensional to some extent, because it is impossible
for all items to test the exact same thing without actually
being the exact same item. Therefore,
if the overall scree plot does not indicate any problems then it is likely that the effects of any item-level multidimensionality or codependence will be negligible.
For this
example, all items will be retained for subsequent analyses
because the overall scree plot
does not indicate
any problems.
When you have finished reviewing
the items, click the “Next>>” button to continue to
the differential item functioning analysis
interface.
9.7. Step 6: DIFFERENTIAL ITEM FUNCTIONING
The principles and rationale for analysis of Differential Item Functioning (DIF) are discussed in detail in Chapter 15 (page 192). In brief, DIF analysis examines
the extent to which the IRF of an item is stable across different
groups of students.
If the IRF is different
for two different groups, then the scores that are estimated using the IRF may be biased either universally or for students
within specific ranges of proficiency. The DIF analysis controls for differences in average group proficiency, meaning that the relative
advantages and disadvantages expressed by the DIF results are
independent of differences in the average proficiency
in the different groups.
The DIF analysis
interface is shown in Figure 9.13. On the left hand side is the set of four controls used to specify the analysis. The drop-down menu at the top allows you to select a variable from the list of variables
in the response data that are not test items.
Once you select a variable,
IATA will list the unique values of this variable in the “Possible
values” table, along with the un-weighted percentage of students
who have each value. To select the groups to compare, first click on the value that you wish to be the focus group, and then click on the value representing the reference group. The focus and reference group specification determines how the summary statistics are calculated; the estimations use the weighted
sample distribution of proficiency of the focus group to calculate
average bias and stability statistics. To change focus and reference
groups, click on different
values in the “Possible values” table; the values assigned to focus and reference groups will be updated in the text boxes at the bottom left. The statistics
are most sensitive to the focus group, so the usual practice
is to ensure that the focus group is a minority or historically disadvantaged group.
Figure 9.13 DIF analysis results for PILOT1 data by sex,
item MATHC1046
For this example,
we will perform a DIF analysis using the variable “sex”. We wish to see if female students
are disadvantaged, relative
to their male counterparts. In order to specify this analysis and review the results, perform the following steps:
1. From the drop-down menu on the left, select the “sex” variable. When you do so, the table beneath will be populated with the values “1.00” and “2.00”, with values of 50% for each value, indicating that the sample has equal numbers of males and females.
2. In the table of values, click
on the value “1.00” – this will cause the value of 1.00 (representing females) to
be entered as the Focus group in the text box beneath.
3. In the table of values, click
on the value “2.00” – this will cause the value of 2.00 (representing males) to
be entered as the Reference group in the text box beneath.
4. Click the “Calculate” button
and wait for the calculation to complete.
5. When the calculation is complete,
in the item list, click on the header of the “S-DIF” column to sort all the items
by the value of the S-DIF statistics.
When you have completed these steps, the interface will appear as illustrated in
Figure 9.13. There are 15 items in this example that IATA flags with either a warning or
caution symbol. For each item, two statistics
are calculated, S-DIF and U-DIF. S-DIF describes the average vertical
difference between the groups (focus minus reference), and U-DIF describes
the average absolute
differences between the groups. The value of the U-DIF statistic
is always positive
and at least as large in absolute value as that of S-DIF. Even if there is no systematic advantage for one group (S-DIF is close to 0), an item may have a stronger
relationship with proficiency in one group, which would
produce a larger U-DIF statistic.
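The sketch below illustrates how these two summaries could be computed from group-specific empirical IRFs, weighting each proficiency level by the number of focus-group students, as described above. The binning scheme and the expression of results in percentage points are assumptions, not IATA's documented algorithm.

```python
import numpy as np

def dif_statistics(theta, item, is_focus, n_bins=10):
    """S-DIF: weighted mean of (focus - reference) differences in proportion
    correct; U-DIF: weighted mean of their absolute values, in percent."""
    edges = np.linspace(theta.min(), theta.max(), n_bins + 1)
    bin_index = np.clip(np.digitize(theta, edges) - 1, 0, n_bins - 1)
    s_dif = u_dif = weight = 0.0
    for k in range(n_bins):
        focus = (bin_index == k) & is_focus
        ref = (bin_index == k) & ~is_focus
        if focus.sum() == 0 or ref.sum() == 0:
            continue
        diff = item[focus].mean() - item[ref].mean()   # focus minus reference
        w = focus.sum()                                # focus-group count
        s_dif, u_dif, weight = s_dif + w * diff, u_dif + w * abs(diff), weight + w
    return 100 * s_dif / weight, 100 * u_dif / weight
```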
An example of an item with consistent DIF, where the absolute values of S-DIF and U-DIF are identical is MATHC1035, illustrated in Figure 9.14. For this item, the female advantage
is apparent across the entire proficiency range.
The consistent difference suggests that females are more likely to perform better on this item than males,
even if they have the exact same level of proficiency. The S-DIF statistic indicates that, on average,
the probability of correct response
for females was over 23 percentage points higher than for males of comparable proficiency.
Figure 9.14 DIF analysis results for PILOT1 data by sex,
item MATHC1035
With DIF analysis, the statistics and figures tend to be very sensitive
to sampling error, which may lead to items appearing to have differences that might not be present in a larger sample.
IATA
assigns a warning
symbol when the coefficient of sampling variation[6]
for the S-DIF statistic
is less than 0.2, indicating
that the observed difference is most likely not due to sampling error, or where there are very large differences in either S-DIF or U-DIF that should be examined
even in small samples.
Because
of the sensitivity to sampling
error, occasionally the graphical results may be misleading. At the upper and lower ends of the proficiency scale, there tend to be few respondents, particularly with small samples such as the current example. Often, the responses of one or two respondents may dictate the appearance of the graphs at these extremes. As summary statistics weight the calculation by the number of focus group students at each proficiency level, they are not affected
as much by random error as the graphs. The graph of the results for MATHC1042 in Figure 9.15 provides an example of how graphical results can mislead in some instances.
Although the graph suggests a very large disadvantage for females (the lightly shaded region), the actual S-DIF statistic (-2.01) indicates a relatively weak disadvantage.
Figure 9.15 DIF analysis results for PILOT1 data by sex,
item MATHC1042
Observed evidence
of DIF might
also be found when item-specific content is not as strongly
aligned with the primary test dimension as other items. For example,
in mathematics, a common learning
objective for younger students is to recognize different measurement tools for different units (such as centimeters, kilograms,
degrees centigrade). Students
in remote or disadvantaged
areas, even if they are strong in mathematics, may not have the same exposure to these tools as students in urban areas. As a result, they may be systematically disadvantaged on test items requiring
this specific knowledge. However, this disadvantage is not a property of the test items; it is a consequence of a specific
disadvantage in proficiency. Before
reaching any conclusions about bias against specific students,
curriculum content experts who are sensitive to possible ethnic,
geographical or gender differences should
examine the test items to confirm that there is evidence of bias from a content perspective that agrees with the statistical evidence.
DIF analysis should be performed
for all demographic characteristics and groups that will be compared in major analyses
of results; presence of DIF with respect to one characteristic typically has no relation
to the presence or absence of DIF with respect
to another characteristic. Usually, the most important variables to
consider for DIF are the sampling
stratification variables (such as Region), or possibly variables from the background questionnaire. The PILOT1
data have three demographic variables: Sex, Language
and Region. As an independent exercise, you can carry out similar DIF analyses for Language and Region
by completing the same steps as for the sex DIF analysis, making sure to select the minority group as the
focus group and click Calculate to
update the results.
Figure
9.16 illustrates a common DIF result in translation situations, where errors in translation render a good test item confusing
to students in the translated
version. The results
are from a DIF analysis
for the Language variable
for item MATHC1064. This item is an extreme
example of DIF in that correct response is strongly related to proficiency in one population (in this case, language=2) and has a weak or nonexistent relationship in the other (language=1).
Figure 9.16
DIF analysis results for PILOT1 data by language, item MATHC1064
The DIF analysis in IATA can serve as a research tool to determine
if specific groups of students
have problems with specific sub-domains. DIF analysis can also facilitate
an understanding of differences that may be introduced in different language
versions of a test that have been translated. Statistical evidence of DIF can be used to help translators to correct translation errors revealed during pilot or trial testing. It can also be used to perform exploratory research
into actual performance differences that might
exist among students.
The primary purpose of DIF analysis is to prompt discussion and review
of the pilot test items and to guide the interpretation of results. For each DIF
analysis that is run, IATA saves the results to a data table[7]. These
results, and any particularly interesting graphs, should be copied[8], saved and shared with curriculum content specialists to determine possible
explanations for the pattern of differences between the focus and reference groups. If there is clear agreement that an item is biased, it should be removed
from the analysis specifications on page 2 of IATA and the previous IATA analyses should be repeated. Finally, it is worth repeating that, as the results of DIF analyses are notoriously susceptible to sampling error, any decision
about whether or not to include a particular test item in the final version of the test based on the suspicion of bias should have a strong curriculum
or content justification. We will proceed in this walkthrough without removing any of the test items.
When you have finished performing
DIF analyses and reviewing the results, click the “Next>>” button.
9.8. Step 7: SCALE REVIEW
The technique of developing a numeric metric for interpreting test performance is called scaling. IATA reports the test results using the following scale scores: PercentScore, Percentile, RawZScore, ZScore, IRTscore
and TrueScore. These scales are
explained in greater detail in Table 8.1. Performance on these default scales is either summarized
on a scale of 0 to 100 or on the standard scale, which has a mean of 0 and standard deviation of 1. You should use the scale that is most useful to the intended purpose of communicating results – different
stakeholders may prefer different types of scales. In general, the IRTscore is the most useful score across the widest range of purposes,
but it has the communication disadvantage that approximately half the students
have scores less than 0. Many stakeholders do not know how to interpret negative scale scores, so it is preferable to create a new scale so that none of the student scores have values less than 0.
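For reference, the sketch below shows how the simpler derived scores relate to the percent-correct score, using common definitions; the exact formulas IATA uses (particularly for the percentile rank) may differ, and the IRTscore and TrueScore come from the IRT model itself rather than from these transformations.

```python
import numpy as np

def simple_scales(scores):
    """scores: 0/1 matrix of shape (n_students, n_items)."""
    percent = 100 * scores.mean(axis=1)                     # PercentScore
    raw_z = (percent - percent.mean()) / percent.std()      # RawZScore
    ranks = np.argsort(np.argsort(percent))                 # ties broken arbitrarily
    percentile = 100 * (ranks + 0.5) / len(percent)         # Percentile
    return percent, raw_z, percentile
```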
The interface for reviewing the scale scores and creating
additional scale scores is shown in Figure 9.17. On the left hand side, there is a drop-
down menu and a graph window. You can select any of the scale score
types from the drop-down menu, which
will graph the distribution of the selected
scale score. Figure 9.17 presents
the graph for the selected scale score, PercentScore. On the right is a panel presenting summary statistics for the selected
score. At the bottom right is a set of controls for rescaling the IRTscore by applying
a new standard deviation and mean. The rescaling
procedure applies only to the IRTscore, which is the primary score output of IATA.
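The rescaling itself is a simple linear transformation, sketched below with an illustrative target mean of 500 and standard deviation of 100; these target values are assumptions chosen for the example, not IATA defaults.

```python
def rescale(irt_score, new_mean=500.0, new_sd=100.0):
    """Map an IRTscore (mean 0, SD 1) onto a reporting scale with the chosen
    mean and standard deviation so that scores are no longer negative."""
    return new_mean + new_sd * irt_score

# e.g., an IRTscore of -1.2 becomes 500 + 100 * (-1.2) = 380 on the new scale.
```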
Figure 9.17 The scale review and scale setting
interface
9.8.1. Test Score Distributions and Test Information
IATA displays score distributions as histograms, where each bar represents a range of scores, and the height of each bar represents the proportion of students with scores in that range. For score types that are expressed
on scales with means approximately 0 and standard deviations
approximately 1 (StandardizedZscore, RawZScore, and IRTscore),
IATA also plots the test information function as a solid line. The test information function
describes how accurate
the test is at different
proficiency levels on
the standard scale on which the items are scaled (for more information, refer to Chapter
15, page 185). The test information function is inversely
related to the
standard error of measurement; if the test information is high, the standard error of
measurement will be low. The test information function
should be interpreted in relation to the specific
testing needs or purpose of the test. For example,
if the purpose of the test is to identify low proficiency students, a test that is most accurate for high proficiency level students would be unsuitable and would not serve as an appropriate
measure for identifying low proficiency students. In general, the average error of measurement for all students will be minimized
if the information function for a test is slightly
wider than, but about the same shape and location as, the distribution of proficiency for the students
being tested. Comparing
the test information function to the distribution of test scores can illuminate whether the test design would benefit from modifying the balance of items with greater
accuracy for high or low
performers.
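A sketch of how a test information function and the corresponding standard error of measurement can be computed for a set of 3PL items is shown below. The item parameters are invented for illustration, and the 1.7 constant is again an assumption about the metric.

```python
import numpy as np

def test_information(theta, a, b, c, D=1.7):
    """Sum of 3PL item information functions at each value of theta."""
    theta = np.atleast_1d(theta)[:, None]
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))     # IRF per item
    info = (D * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p
    return info.sum(axis=1)                                  # sum over items

theta = np.linspace(-3, 3, 13)
info = test_information(theta, a=np.array([1.0, 0.8]),
                        b=np.array([-0.5, 0.5]), c=np.array([0.2, 0.25]))
sem = 1 / np.sqrt(info)   # high information means low measurement error
```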
9.8.2. Summary Statistics
1. Mean
2. Standard
deviation
3. Skewness
4. Kurtosis
5. Interquartile range
6. 25th percentile
7. Median
8. 75th percentile
9. Response
rate
10. Reliability
11. Total number of respondents
12. Number
of items in the test
13. Number
of items included
in the analysis.
The first eight statistics describe the distribution of estimated scores. Use the scrollbar on the right of the table to view the last three rows.
These
statistics help determine
the adequacy of the scale scores for various purposes (e.g., secondary
statistical analysis or reporting by quantiles). The last five statistics describe the conditions under which the analysis was conducted and provide a holistic rating of the test, which should be checked
to confirm that the analysis
was conducted on the proper data according
to correct specifications. These statistics were described in Part 1 of this volume.
Response rate describes
the average number of valid (non-missing) responses
on each of the items. Reliability is an overall
summary measure of a test’s average accuracy
for the given sample of students. Both response rate and reliability range from 0 to 1 and should be as high as possible. The total number of items included in the analysis
reflects the fact that some items may be dropped
from the analysis because they were considered inadequate due to poor wording, were confusing to students, or had other technical
inadequacies. For the current walkthrough, the number of respondents is 262, the number of items is 80, and the number of “Okay” items is 79, because item MATHC1075 was removed from the analysis.
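As a point of reference, one widely used internal-consistency estimate is Cronbach's alpha, sketched below. IATA may report a different reliability coefficient (for example, an IRT-based one), so this is only an illustration of the concept.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a 0/1 matrix of shape (n_students, n_items)."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)
```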
The scaling interface
is more useful for final assessment administrations than for pilot testing.
The unweighted pilot test sample is not representative, so the distributions of results should not be generalized to population performance. Also, because no test scores will be reported,
there is no need to generate derived scale scores, and further results
from the scaling interface are not relevant to the analysis of the PILOT1 data. The scaling interface
will be discussed in greater detail in Chapter
10. You may click the “Next>>” button to continue to the next task.
9.9. Step 8: SELECTING TEST ITEMS
Optimal selection of items using IATA is available
whenever an item data file has been loaded or created during an analysis
of response data. IATA can automatically
select items based on their statistical item characteristics in order to produce the most efficient test for a given test length and purpose. The basic principle
underlying IRT-based test construction is that the test designer
has some expectation about the degree of measurement error that a test should have at different levels of proficiency in addition to requirements about the balance of content that must be included in the test.
In general, the more items there are in a test, the more information it can generate
about the proficiency level of examinees. Unfortunately, tests with too many items are generally neither
practical nor desirable; they can be unnecessarily disruptive in school and can result in test-taker fatigue and deterioration of student motivation, resulting in less accurate results. Overly long tests are also costly to develop, administer, score, and process.
To be most efficient, a test should only include the most informative test items from the pool of available
items. IATA can help develop a test with the minimum number of test items necessary to meet the purposes of policy makers and other stakeholders.
Determining an acceptable level of standard
error depends on the purpose of the assessment. While it would be ideal to build a test with high information at all proficiency levels, this would require many items, which increases the length of time each
student spends taking the test, which in turn may lower the validity of the test results by allowing fatigue and boredom
to influence test scores. If a test is norm- referenced, then detailed information (and lower error of measurement) is required for all levels of proficiency. In contrast, if a test is criterion-referenced, then information is only required around the proficiency thresholds at which decisions are made.
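The following sketch illustrates the standard IRT relationship behind this trade-off: test information is the sum of item information, and the standard error of measurement at proficiency level theta is one over the square root of the test information at theta. The code uses the common 3PL formulation with the usual D = 1.7 scaling constant and hypothetical item parameters; it is not IATA's internal implementation. A criterion-referenced test only needs the standard error to be small near its cut scores, whereas a norm-referenced test needs it to be small across the whole proficiency range.

# Sketch (standard IRT formulas, hypothetical item parameters): test information
# and standard error of measurement for a set of 3PL items.
import numpy as np

D = 1.7  # scaling constant commonly used with the logistic IRT model

def item_information(theta, a, b, c):
    theta = np.asarray(theta, dtype=float)
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))          # probability correct
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def test_sem(theta, items):
    info = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / np.sqrt(info)   # standard error of measurement at theta

# Hypothetical parameters (discrimination a, difficulty b, guessing c):
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.25), (1.5, 0.7, 0.15)]
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, float(test_sem(theta, items)))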
However, item selection at the pilot stage should not be determined solely by the results of statistical analysis. The validity of the interpretation of results is the most important consideration in constructing national achievement tests (and indeed most other tests). The test scores should adequately and accurately represent
the domain being measured. The most important tools for maintaining test validity are the theoretical
frameworks and the table of specifications or test blueprint. A blueprint helps determine the balance of content and cognitive skill levels to be included
in a test (see Anderson and Morgan, 2008).
The interface for selecting optimal test items is shown in Figure 9.18. On the left, a drop-down menu allows you to select a source of items for item selection. In this example, the “Items1” table is available, which contains the results of the current analysis[9]. Beneath the data source selection
are fields that allow you to specify the name that will be applied to the item selection
and the total number of items to select from
the item data. The table beneath these fields contains
a list of all the calibrated items in the selected
data source, along with the proficiency level (“Level”) and content area (“Content”) associated with each item. Although the latter two data fields are typically read into IATA in an item data file, the data may also be manually edited directly in the table. The statistical selection process does not require Level and Content specifications, but having detailed information about each item will help you optimize the selection of items while maintaining the desired content representation.
Clicking the checkbox to the left of an item name will force IATA to select the item, regardless of its statistical properties.
Beneath
the item table, there are two sliding controls that allow you to specify the range
of proficiency within which you wish to maximize test accuracy.
The controls are set such that the minimum value corresponds to the 2nd percentile of proficiency and the maximum corresponds to the 98th percentile (the currently selected
value is displayed to the right of each sliding
control). You can specify a narrower range in which to maximize the information by modifying
upper and lower bounds to reflect your assessment goals. IATA will select items to produce the minimum standard error of measurement in the range of proficiency between the lower and upper bounds, assuming a normal distribution of proficiency.
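IATA's exact selection algorithm is not documented here, but the principle can be sketched as follows: weight each point in the target proficiency range by an assumed normal population density, compute each item's expected information over that range (reusing the hypothetical item_information() function from the sketch above), and rank items from most to least informative. A real implementation would also need to respect content constraints and any items you force into the selection.

# Sketch (illustrative only, not IATA's algorithm): rank calibrated items by
# expected information between two percentile bounds of a normal proficiency
# distribution; a fixed-length test would take the top-ranked items.
import numpy as np
from scipy.stats import norm

def rank_items(item_params, lower_pct=2, upper_pct=80):
    lo, hi = norm.ppf(lower_pct / 100), norm.ppf(upper_pct / 100)
    theta = np.linspace(lo, hi, 41)            # grid over the target range
    weights = norm.pdf(theta)                  # assumed normal proficiency density
    score = {name: float(np.sum(weights * item_information(theta, a, b, c)))
             for name, (a, b, c) in item_params.items()}
    return sorted(score, key=score.get, reverse=True)

# e.g., a 50-item selection would be the first 50 names in this ranking:
# selection = rank_items(calibrated_params)[:50]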
Figure 9.18 Item selection results for PILOT1 data, 50 items
The primary purpose of pilot testing
the assessment items is to determine which items will
be most useful in the final administration of the national
assessment. Because the items have been calibrated with a non-representative sample,
it may be useful to triangulate the item selection
process using several
criteria. Because the sample is entirely from urban schools, the average proficiency in the sample is likely somewhat higher than in the overall population. In other words, selecting test items to optimize accuracy for students with slightly below-average proficiency in the current sample will likely optimize accuracy for students of average proficiency in the full sample. Keeping in mind that we wish to create a 50-item final test, we can enter these specifications into IATA as follows:
1. In the “Name of item selection” box, type “50Items” (the name is arbitrary; we use the name here so that you may compare the results you produce to the results
in the IATA sample data folder).
2. In the “Total number of items”
box, enter the number 50.
3. Move the slider control for the upper bound so that it has a value of 80; this specification indicates that the item selection will not attempt to maximize accuracy above the 80th percentile of the proficiency distribution of the current sample, in order to offset the higher average proficiency of the pilot sample relative to the general population (see the short sketch following these steps).
4. Click the “Select Items”
button.
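As a rough check on what this upper bound means on the score scale, assume the proficiency scale in the pilot sample is standardized to mean 0 and standard deviation 1 (consistent with the graph in Figure 9.18, where accuracy peaks near 0); this is an assumption for illustration, not a documented IATA detail. Under that assumption, the slider percentiles translate to proficiency scores as follows:

# Sketch: converting slider percentiles to proficiency scores, assuming a
# standard normal (mean 0, SD 1) proficiency scale for the pilot sample.
from scipy.stats import norm
print(norm.ppf(0.80))   # about  0.84: the 80th-percentile upper bound used here
print(norm.ppf(0.02))   # about -2.05: the default slider minimum (2nd percentile)
print(norm.ppf(0.98))   # about +2.05: the default slider maximum (98th percentile)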
When IATA has performed
the task, your interface should appear as in Figure 9.18. On the left hand side in the items list, you can view the actual 50 items that have been selected. (The last one is MATHC1041). On the right hand side, the graph displays the collective
information and expected
error of measurement of the selected items if they were administered as a test. The results indicate that the item selection is most accurate around the proficiency score of 0 (average
proficiency in the current sample).
The table beneath the graph summarizes the distribution of selected items across different content areas and cognitive levels (for these data, all items have been assigned a default value of 1; values may be edited directly in the item table or uploaded
in the initial item data file). If the data in this table indicate that the statistically optimal selection does not adequately conform to the test blueprint, you can modify the balance
of content by manually selecting
and deleting specific
items using the checkboxes next to each item name in the table on the left. As you manually select items, the summary of the test properties on the right will be automatically updated.
The item selection
is also recorded as an item data table in IATA with the name “CustomTest50ItemsA.” As with all results produced by IATA, you can
view and export this data table by
advancing to the final interface of the workflow (see Section 9.11). The items in the table are sorted in the order of
suitability for the selection criteria,
with the most suitable items at the top.
Given the small number of items in the current analysis, a user may use IATA simply to rank all of the items in order of suitability to the desired range of proficiency (i.e., below the 80th
percentile in the current sample). The test development team may then review the item data file produced
by IATA and, when selecting
items for the final test, use a ranking of the items in terms of suitability while ensuring that the appropriate balance of the different content is maintained. To create a new item selection, perform the following
steps:
1. Click the “Clear” button to remove all previous selections
from the item list.
2. Enter a new name for the item selection, “79Items” (if you reused the previous name, the earlier results would be overwritten).
3. Enter the maximum number of
items available (79) as the total number of items. If you enter a number that is
greater than the number of available items, IATA will only select from the available
items.
4. You may leave the upper bound
at 80, since the target range of proficiency has not changed.
5. Click the “Select Items”
button.
Figure 9.19 presents some of the results of the analysis of the 79-item pilot test. A table of results (named “CustomTest79Items”) has been added to the IATA result set,
which may be viewed on the final interface
of the workflow. Test developers can use this information to help improve the quality of items to be used in the national
assessment.
Figure 9.19 Item selection results for PILOT1 data, 79 items
The process of item selection is dependent on the quality of available
items. IATA cannot introduce accuracy
to specific regions
of proficiency if there are no items with information in those regions.
The automated process can help select the best items that are available, but it cannot make the items themselves more accurate.
When you are finished reviewing
the results, click the “Next>>” button to continue.
9.10. Step 9: PERFORMANCE STANDARDS
At the pilot test stage, there is insufficient evidence to support the setting of performance standards. Although some information is available about the statistical
item properties and the specifications that were used to create the items, there is not yet any detailed
information about the distribution of proficiency in the student population. Therefore, any attempt to set performance standards at the pilot stage would be unnecessary and potentially misleading.
Because
this walkthrough example
of the analysis of pilot test data does not require any
standard setting, you can click the “Next>>” button to continue to the results viewing and saving interface.
9.11. Step 10: VIEWING AND SAVING RESULTS
For all analysis
workflows, IATA produces a number of different results in data table format. Data table results from IATA can be viewed and saved on the final interface of each workflow. The results viewing and saving interface allows you to review each of the data tables of results produced during the analysis workflow.
The interface will display the data table that is selected in the drop-down menu. To change the data source, select a different table from the drop-down menu, as shown in Figure 9.20. Table 8.5 provides a complete list and description of the available data tables produced by IATA.
Note that, although
you did not specify the creation of any performance standards, the table “PLevels” is created automatically using default specification values.
Figure 9.20 Viewing
results from the analysis of PILOT1 data
You may save these tables of results in a single output file or multiple
files by clicking the “Save Data” button. You may save a single table or all tables at once to a variety of formats.
There are two recommended
file formats for saving IATA output: Excel (*.xls/*.xlsx) and SPSS (*.sav). In general, Excel is preferable, because all data tables may be saved into a single data file. The Excel format may also be opened in free software such as OpenOffice (downloadable from http://www.openoffice.org/). However, early versions of Excel are limited to a maximum
of 255 variables. If your data file has more variables, IATA will only save the first 255 into the *.xls. To save larger data files, you must use the *.sav or *.xlsx
formats. SPSS files have the advantage that they can store larger data tables efficiently and can store metadata (if they are edited in the SPSS software package).
Note, however, that SPSS has one main limitation: each data table will be saved into a separate
file.
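The practical difference between the two formats can be illustrated outside IATA with a short Python sketch (the table names and column contents below are hypothetical examples, and the pandas and pyreadstat libraries are assumed to be available): an Excel workbook can hold every result table as a separate sheet, whereas SPSS output requires one *.sav file per table.

# Sketch (outside IATA): one Excel workbook with many sheets versus one SPSS
# (*.sav) file per table. Table names and columns are examples only.
import pandas as pd
import pyreadstat

tables = {
    "Items1": pd.DataFrame({"Name": ["MATHC1001", "MATHC1002"], "Discr": [0.45, 0.62]}),
    "PLevels": pd.DataFrame({"Level": [1, 2], "Cutoff": [-1.0, 0.0]}),
}

with pd.ExcelWriter("PILOT1_results.xlsx") as writer:    # one workbook, many sheets
    for name, df in tables.items():
        df.to_excel(writer, sheet_name=name, index=False)

for name, df in tables.items():                          # SPSS: one file per table
    pyreadstat.write_sav(df, "PILOT1_" + name + ".sav")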
A file dialog will ask you to specify the file name and location
for the results, as well as the output format. Choose the desired data format and click the “Save” button to finish
saving the table or tables[10]. The resulting files contain all of the tabular results produced during the entire analysis workflow,
providing documentation of the analysis.
For reference, the item data results of this analysis
walkthrough from the table named “Items1” are included in the ItemDataAllTests.xls file in the worksheet named
“ReferenceP1.”
For a real pilot test analysis (i.e., not using simulated data), the results tables and any graphics that you have copied and pasted during the analysis
workflow should be provided to the test developers, who would then use the information to modify the test, selecting,
ordering, and adding items, as required, in order to maximize the accuracy and usefulness of the final test form.
9.12. SUMMARY
In this chapter,
you were introduced to the analysis
of pilot test data with IATA. You used the “Response data analysis” workflow
to analyze response
data using an answer key file. The different stages in the workflow
included loading data, specifying the analysis, item analysis, dimensional analysis, analysis of differential
item functioning, and item selection. Creating scale scores and developing
performance standards were not performed, because the distribution of proficiency in the pilot sample was not representative of the population.
In the next chapter, the example continues
with the same national assessment
program, after the final test has been constructed and administered to the complete national assessment sample.
[1] See chapter 9 for a discussion of the symbols
and their meanings.
[2] For more information on common issues
identifiable with distractor analysis, see Chapter 15,
page 170.
[3] It is unreasonable to have a loading equal to 1, because
this would require each respondent to have the same score on every item. This requirement implies that the test could produce only two distinct score values, which is not very informative.
[4] The values displayed in IATA have been standardized to express the proportion of total variance accounted
for by each eigenvalue.
[5] Clicking on the header twice will sort the column in descending order.
[6] The coefficient of sampling variation is calculated as the standard error of the S-DIF statistic
divided by the absolute value of the S-DIF statistic.
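For example (hypothetical values), if an item's S-DIF statistic is −4.0 and its standard error is 1.0, the coefficient of sampling variation is 1.0 / |−4.0| = 0.25.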
[7] All results from this walkthrough are available for reference and comparison in the IATA sample data folder, in the Excel file named “ReferencePILOT1.xls.” The DIF result tables are in the worksheets with names beginning with “DIF_”.
[8] You can copy any of the DIF analysis graphs by placing the cursor on the graph and using the Copy and Paste functions from the right-click menu.
[9] For different analyses that involve linking, you may select from previously calibrated item data (“Items2”) or the set of items that are common to two item data sources (“MergedItems”).
[10] If you save all tables and select the SPSS (*.sav) output
format, each result table will be exported
as a separate *.sav data file, with the name you provide
as a prefix to all the table names.