CHAPTER 15. PSYCHOMETRIC CONCEPTS
This annex chapter provides some theoretical background about the statistical
analyses performed by IATA. Section one presents an overview of classical test theory. The overview is not intended
to be exhaustive and may be skimmed
by those familiar with this aspect of test theory.
It begins with a
discussion of test scores as statistical
estimates. The basic formulation of classical test theory is presented, using the concept of standard
error of measurement. The term test reliability is introduced in relation to standard error of measurement
and the impact of item characteristics on test
reliability. The section concludes with a discussion of practical methods for using classical item statistics to develop efficient
tests.
Section
two extends the concepts of classical test theory into item response
theory. It develops
the principles of item response
models and item information. It explains the fundamental concept
of population invariance and describes several
applications of item response theory, including item analysis, test construction, and test equating.
15.1. Classical Test Theory
15.1.1. Describing Accuracy of Tests
Under the Classical Test Theory (CTT) approach, we use student performance on a collection of items on a single test to make generalizations about performances on all other possible
collections of similar
items. This principle
relies on the assumption that there are a very large number of possible
items to measure the particular
skill that is being assessed. This assumption is reasonable in most cases. For example,
even in a specific curriculum area such as multiplication of one and two-digit numbers, there are almost 10,000 possible
test items, enough to keep a grade three student busy for
an entire year. As it is unreasonable to test a student using such a large number of items, a much smaller sample of items can be used instead
to predict what the performance of a student would have been on the complete
set, or universe of possible items. Items are sampled from the item universe
to produce tests. When the tests are administered, they elicit test scores, which are then generalized back to the item universe. The goal of testing students is always to make inferences about student performance on the universe of test items.
This interpretation of test performance is very similar
to the statistical concept of probability. The probability of an event occurring can never be known without an infinite
number of observations. However, we often make inferences about probability using a very small number of observations (usually, as few as possible). The typical technique for estimating probability is to take a sample of events and calculate
the number of times a particular event occurs divided by the total number of observations.
For instance, we can test the fairness of a coin by estimating the probability of heads and tails occurring. We can estimate the probability by tossing the coin repeatedly and counting the proportion of outcomes that are heads or tails. If the proportion of heads is one-half of the total number of tosses, then the probability is estimated to be 0.5 and the coin is probably a fair coin.
Test scores can be interpreted in a similar
way; each student’s
test score represents
the probability that the student will correctly
respond to a test item from a particular universe of items (e.g., a mathematics item from the set of all possible
mathematics items). The probability is calculated as the number of items the student answered correctly on a specific test divided by the total number of items he or she responded to.
The following example in Table 15.1 demonstrates this principle, using the observed performance of an individual student on a set of reading test items. Column one
shows the number of items the student has taken or answered. Column two
is the student’s score on a particular
item; 0 is incorrect, and 1 is correct. Column three shows the cumulative score after each item, which is
calculated as the sum of all item scores
up to each new item. Column four is the average item score, or estimated probability, updated
after each item by dividing
the cumulative score by the number of
items.
Table 15.1 Observed performance of an individual student on a set of reading test items

Item number | Item score | Cumulative score | Average score (proportion correct)
1 | 0 | 0 | 0.00
2 | 1 | 1 | 0.50
3 | 0 | 1 | 0.33
4 | 1 | 2 | 0.50
5 | 1 | 3 | 0.60
6 | 1 | 4 | 0.67
7 | 1 | 5 | 0.71
8 | 1 | 6 | 0.75
9 | 0 | 6 | 0.67
10 | 1 | 7 | 0.70
11 | 0 | 7 | 0.64
12 | 1 | 8 | 0.67
13 | 0 | 8 | 0.62
14 | 1 | 9 | 0.64
15 | 0 | 9 | 0.60
16 | 0 | 9 | 0.56
17 | 1 | 10 | 0.59
18 | 1 | 11 | 0.61
19 | 1 | 12 | 0.63
20 | 0 | 12 | 0.60
After 20 test items, the estimated probability of correct response
for this student is 0.60; this suggests
that, if this student responded
to all possible Grade 1 reading items, he or she would probably get 60% of the items correct.
However, this estimate varies as the test length increases from one to 20. Since a student can only get a score of 0 or 1 with a single item, the estimated probability
after only one item will be 0 or 1. However, common sense tells us that students will rarely get all items correct or incorrect from the entire universe of items. By the time the student completes
the third item, the probability is 0.33, which is a more reasonable estimate. As the number of test items increases, the estimate converges
to its final value of around 0.60, and the difference between successive estimates becomes smaller.
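To make the running estimate concrete, the following minimal Python sketch recomputes the cumulative score and running proportion correct from the item scores in Table 15.1 (the function name is illustrative; the scores are those listed in the table).

```python
def running_proportion_correct(item_scores):
    """Cumulative score and running proportion correct after each item."""
    cumulative = 0
    estimates = []
    for n, score in enumerate(item_scores, start=1):
        cumulative += score                      # sum of item scores so far
        estimates.append((cumulative, round(cumulative / n, 2)))
    return estimates

# Item scores for the student in Table 15.1 (1 = correct, 0 = incorrect)
scores = [0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]
for item_number, (cumulative, proportion) in enumerate(running_proportion_correct(scores), start=1):
    print(item_number, cumulative, proportion)   # last line: 20 12 0.6
```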
The graph in Figure 15.1 illustrates the relationship between increasing stability
of the test score and increasing number of items for this student.
When the number of items is small, the zigzagging of the line in Figure 15.1 suggests
that the estimates of probability are unreliable. If a test score changes dramatically just by adding a single item, then the estimate is probably
not useful for generalizing to the thousands of comparable possible
items not included
on the test. As the number of
items increases, the severity of the zigzags decreases, and beyond a certain point, adding more items will not noticeably change the estimated probability. In general, as the number of items on a test increases, provided the items are randomly equivalent to each other, the estimated probability will provide a more accurate estimate of the proportion of the universe of items that would be answered correctly.
Although it is still not possible to know the true probability without administering an infinite number of test items, the estimates can be made stable enough that using additional items is not cost-
effective.
15.1.2. Error of Measurement
The probability that the test scores are trying to estimate is assumed to be fixed for each student, regardless of the number of items on a test. For each student,
the observed test scores will eventually converge
towards this probability as the number
of items on a test increases. Since we assume that this probability does not change
as we add new items to the test, we are also assuming that this probability is not affected
by the sample of items a student is administered. In other words, we assume that the proficiency of a student in a particular subject area does not depend on the items used to test this proficiency.
This assumption further
implies that every student has a certain
level of proficiency or ability
in the skill that is being assessed,
even if they are not administered a single test item. Unfortunately, it is impossible
to determine what this level is without testing. When we do test a student,
it is useful to make the distinction between the observed score, which is based on the test items used in the test, and the true score, which does not depend on test items. The observed score, as the name implies,
is the score the student
gets on an actual test. The true score is a hypothetical score; it can be interpreted as the average
of a very large number of scores on very similar tests administered under identical conditions
to the same student. In practice, the true score cannot be known, since it requires
administering a large number of tests while the student's proficiency remains constant. The true score of a student will not change unless the student's proficiency changes, whereas the observed score will change with each test administration. The observed score may vary for a person (or for people with the exact same true score) depending on the sample of items used, but the true score does not.
The reason why the observed
score changes is because it is influenced by random error. In testing situations, random error refers to factors that randomly affect the assessment of reading proficiency, such as a student's level of motivation or fatigue. Because these factors are random, they may result in the observed test score being higher or lower than the true score. The difference between the true score and the observed score for a particular student on a particular test is the error of measurement[1]. This key concept
in classical test theory can be stated as follows:
for any student, the observed
score on a test is equal to the true score plus or minus some error
of measurement. The expected magnitude
of the error for students on a specific test is called the standard error of measurement (SEM). For any given observed
score, the SEM of the score describes
the probable location
of the true score. A small standard error of measurement suggests the true score is probably similar
to the observed score, and a large standard error of
measurement suggests the true score may be very different
from the observed score.
As an illustration, imagine a group of students
with the same true score of 0.60 in grade 1 reading.
Since this score represents the probability that each student will correctly respond to any Grade 1 reading test item, each student would theoretically
get a score of 60% on all similar grade 1 reading tests. However, on any specific reading test, the score for a student will probably not equal 60% -- it will be 60% plus or minus some error that depends on the characteristics
of the test or the testing situation. If the test is accurate,
the error might be small, and if the test is inaccurate
(for instance it may have some poorly worded items or it may have a limited
number of items),
the error may be large. Figure 15.2 illustrates this example. Even though all 10 students depicted
in Figure 15.2 have the same true score of 0.60, the observed
scores of those who took the accurate test are quite similar and are clustered
around 0.60. In contrast, the test scores of the five students
who took the inaccurate test are quite different from each other, even though the average score is still 0.60.
Figure 15.2 Observed scores on two tests of different accuracy
for students of same proficiency (true score=0.60)
The SEM of a particular test represents the degree to which students
with the same true score differ in their observed
scores. If we were to collect scores from 100
students with the exact same proficiency (true score=0.60), the expected distributions
of observed scores for the two tests above would appear as in Figure 15.3. The two sets of bars in Figure 15.3 display the expected
frequency of each score occurring
for a sample of 100 students, each of whose true score is 0.60. This example assumes the accurate test is 32 items long, while the inaccurate test is 16 items long. Thus, there are 33 possible scores for the accurate test (including 0), and 17 possible scores for the inaccurate test. For both tests, the most likely score to be observed
is 0.625, which is the closest to the true score. However,
despite having more unique scores,
the scores are much more densely clustered
for the 32-item test than for the 16-item test. This
example demonstrates how students with the exact same true score can have very different observed
scores if the SEM of the test is large.
Figure 15.3 Distribution of Observed Scores for Tests of Different Accuracy levels when True Score is Constant (0.60)
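The clustering effect can be checked numerically. The sketch below treats each item response as an independent Bernoulli trial with probability 0.60, which is an idealization of the example rather than the exact model behind Figure 15.3, and compares how much probability each test length places within 0.10 of the true score.

```python
from math import comb

def score_distribution(n_items, true_score=0.60):
    """Probability of each possible proportion-correct score on an n-item test."""
    return {k / n_items: comb(n_items, k) * true_score**k * (1 - true_score)**(n_items - k)
            for k in range(n_items + 1)}

for n_items in (16, 32):
    dist = score_distribution(n_items)
    near_true = sum(p for score, p in dist.items() if abs(score - 0.60) <= 0.10)
    print(n_items, round(near_true, 2))
# The 32-item test places more probability near the true score than the 16-item test.
```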
Recall
that the observed score on a test is the average of the sample of individual item scores. Similarly, the standard error for an individual is equal to the standard deviation of his or her item scores divided by the square root of the number of items. However, this standard error for the individual
is also affected by factors
related to the individual, and so is not a good representation of the accuracy
of the test. Taking the average[2] of these errors across all students provides
a better representation of the SEM. Thus, even though the true scores are not known, we can still estimate how reasonable we believe the observed scores are at representing the true scores.
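A minimal sketch of this estimation, assuming the formulas just described (each student's personal standard error is the standard deviation of his or her item scores divided by the square root of the number of items, and the SEM is the quadratic mean of these personal errors; the data and function names are illustrative):

```python
import math

def personal_standard_error(item_scores):
    """Standard error of one student's proportion-correct score."""
    n = len(item_scores)
    mean = sum(item_scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in item_scores) / (n - 1))
    return sd / math.sqrt(n)

def standard_error_of_measurement(score_matrix):
    """Quadratic mean of the personal standard errors across all students."""
    errors = [personal_standard_error(row) for row in score_matrix]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Toy data: three students by eight items (1 = correct, 0 = incorrect)
responses = [[1, 0, 1, 1, 0, 1, 1, 0],
             [1, 1, 1, 0, 1, 1, 1, 1],
             [0, 0, 1, 0, 1, 0, 0, 1]]
print(round(standard_error_of_measurement(responses), 3))
```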
15.1.3. Reliability
National assessment
reports should always report the reliabilities of tests used in the assessment. Although
the term “reliability” has a common meaning,
test reliability is a specific
statistic used to provide an indicator of the accuracy
of a test for all
students. Reliability[3] is frequently used to refer to the consistency of test scores.
In statistical terms test reliability is the proportion of variability in observed scores that can be explained
by variation in true scores. Reliability cannot be estimated
directly;
to do so would require knowing each student’s true score which, as we noted earlier,
is not possible. We can, however, get an estimate of test score reliability by using the SEM. The relationship in Equation 1 shows that the larger the SEM, the lower the reliability (we use σ² to represent the variance of the observed scores):

reliability = 1 - SEM² / σ²     (Equation 1)
The test reliability statistic ranges between 0 and 1. A value of 0 represents a test whose
scores do not relate in any way to what is being measured (e.g., a test where all students guess randomly on all items) and 1 represents a test which measures the domain (such as reading or mathematics) with perfect accuracy. Generally,
test reliability around 0.70 or higher is considered adequate
for a large-scale assessment.
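As a minimal illustration of Equation 1, with assumed values rather than data from the chapter, the reliability implied by a given SEM and observed-score variance can be computed directly:

```python
def reliability(sem, observed_variance):
    """Reliability implied by Equation 1: 1 - SEM^2 / observed-score variance."""
    return 1 - sem ** 2 / observed_variance

# Assumed example: observed-score standard deviation 0.15, SEM 0.08
print(round(reliability(sem=0.08, observed_variance=0.15 ** 2), 2))  # about 0.72
```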
Usually, results of a test are important only in the context of some decision or relationship with other variables.
The correlation between observed scores for a test and another variable,
such as school attendance, will always be lower than the correlation between true scores on these variables.
The degree to which the correlation based on observed scores is lower (the attenuation) depends
upon the reliability of the test scores.
As test reliability increases, the observed-score correlation will become more similar to the true correlation. If p is the true-score correlation and r is the test reliability, then the maximum possible observed-score correlation will be p√r (the true correlation multiplied by the square root of the reliability). As a consequence, if the reliability of a test decreases, the scores become less useful to describe the relationship between test performance and other variables.
The function in Figure 15.4 demonstrates the effect of attenuation for tests of different reliabilities on a true correlation of 0.80. Only when the test reliability is perfect does the correlation of the two test scores equal its true value.
Figure 15.4 The effects of attenuation on a true correlation of 0.80
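The following sketch reproduces the idea behind Figure 15.4 using the attenuation formula p√r, under the assumption that the other variable is measured without error; the reliability values are illustrative.

```python
import math

def attenuated_correlation(true_correlation, test_reliability):
    """Maximum observed-score correlation for a given true correlation and reliability."""
    return true_correlation * math.sqrt(test_reliability)

for rel in (0.50, 0.70, 0.90, 1.00):
    print(rel, round(attenuated_correlation(0.80, rel), 2))
# Only with perfect reliability (1.00) does the observed correlation reach the true value of 0.80.
```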
15.1.4. Using classical item statistics in test development
This section
discusses methods of selecting or constructing items with desirable
properties. The aim of the national assessment team is to select items that provide the maximum differentiation between individuals using items that are strongly
related to the skill being measured. This should lead to improved reliability and a decrease
in the standard error of measurement. When applying these principles, it is important to remember that the primary purpose of a test is to support inferences
about a specific domain of proficiency. Although inferences cannot be supported
if the test is not accurate, they also cannot be supported
if the test does not adequately represent the domain. Thus, even while pursuing statistical accuracy, it is always important
to ensure adequate representation of the objectives
in the original table of specifications.
15.1.5. Item facility (difficulty)
In classical test theory, item facility is specific
to a particular population or group of students. For a given population of students, item facility estimates
the probability that an average
student will correctly
respond to the item. If a question
is very easy, then the probability that an average student in the specific population will correctly respond is close to 1. On the other hand, if an item is very difficult, the probability of correctly responding will be close to 0. Sometimes, item facility is referred to as item "difficulty," even though increasing values indicate easier items. Commonly, the statistic used to describe facility is called the p-value, in reference
to the concepts of proportion and probability.
Since the principal purpose of a test is to sort or compare students (either with respect to each other or to some standard of proficiency), the test must be able to produce different scores for students at different levels of proficiency. Indeed, if all students achieved
the same score, the test would not have provided
any additional information over not administering it at all, since all would be assigned to the same group and be given the same rank order. Such a test would provide relatively
little information of use to policy makers interested in giving support,
for instance, to low performing groups. Therefore, one factor affecting
the usefulness of a test is its ability to produce different scores for different
students. Since the entire test is effectively composed of many one-item tests, the same principle
applies to each individual item in the test. In general, the best items are those that minimize
the number of students with the same score.
To illustrate this principle, consider the 10 items in Table 15.2, which contains the probability of correct and incorrect response
for each item in a particular population
of students. Column five presents
a count of the number of people (per hundred respondents in this population) who are expected
to have the same score on each item, as well as the most common score (correct or incorrect). If an item is most effective when the fewest people have the same score, then the best item in this test in terms of differentiating between individuals is the item where there are equal numbers of
correct and incorrect
respondents, which occurs when facility
is 0.5.
In general, the best test items are ones where the probability of correct response
(item facility) for the target individuals being assessed is around 0.5. There are some exceptions to this general rule, particularly for different types of multiple
choice items. If there is a real possibility that students will randomly guess instead of attempting to answer an item correctly, the ideal item facility is approximately halfway between the probability attributable to guessing and one. For example,
if an item has four options and students tend to guess randomly, the probability attributable to guessing is 1/4 = 0.25, so the ideal item facility would be approximately (0.25 + 1)/2 = 0.625.
In general, if it is clear that students are randomly guessing on specific
items, these items should be replaced with easier items that are more likely to elicit real effort.
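A small hypothetical helper makes this rule of thumb explicit: the ideal facility is approximately halfway between the chance level and 1.

```python
def ideal_facility(num_options=None):
    """Approximate ideal item facility: halfway between the guessing probability and 1."""
    chance = 1 / num_options if num_options else 0.0   # no options: constructed-response item
    return (chance + 1) / 2

print(ideal_facility())               # 0.5   (no guessing)
print(ideal_facility(num_options=4))  # 0.625 (four-option multiple-choice item)
```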
Table 15.2 Relationship between item facility and usefulness
of a test item
Without
deviating from the design in the original
table of test specifications, it is possible
to change the facility of an item or the chance that students will guess randomly rather than attempt to answer an item. Some methods for doing so include:
- Increasing or decreasing the amount of text to be read in the item stimulus
for language items, described
in Chapter 3, Volume 2 in this series (Developing Tests and Questionnaires for a National
Assessment of Educational
Achievement);
- Increasing
or decreasing the number of steps a student must perform in order to produce a response
(mathematics or science items); or
- For multiple
choice items, increase or decrease the “plausibility” or “relative correctness”
of incorrect options by using responses that reflect reasoning or solution paths
that students might use.
One technique that should NOT be used to change the facility
of a selected-response-type item is to change the number of distracters (such as making an item easier by changing from 5 to 3 options).
Although the probability of correct response will increase due to the increased chance of guessing,
this increase is unrelated to the skill
levels of students.
Therefore, it does not provide any additional information. For similar
reasons, use of True-False type items should generally be avoided on large-scale assessments, because they provide little information about student proficiency.
What is the ideal item facility for criterion-referenced tests? In criterion-referenced
tests, such as mastery tests or minimum competency tests, we are primarily interested in assigning scores such that students
whose skills are above a specific level of proficiency have higher scores than students
whose scores are below this level. The specific level might be set to determine passing or failing or to distinguish between adequate and excellent students.
The majority of the students
will more than likely have scored clearly above or below the criterion, so there is no need to further distinguish between the students
who are clearly above or the students
who are clearly below. Accordingly, the items for this type of test should be selected as if the population being assessed consists
only of those students with skills around the level of the criterion.
We could start developing and piloting or field testing
items for a criterion-referenced test by selecting
a sample of students whose skills or achievement levels
are considered close to the level of the criterion.
The assessment team might ask teachers to identify students
who are on the borderline
between passing and failing or between adequate performance and excellent
performance based on their own perceptions and/or by previous test results. Using this sample of “borderline” students, we could then use the results of the pilot test to select items which were most effective
in terms of differentiating between these selected
students.
An alternative method is to define criteria
in terms of percentile rank scores of the full population on the item universe. Percentile rank scores express each student’s
score in terms of the percentage
of students with lower scores.
Students with lower percentile rank scores have lower percent-correct scores than students
with higher percentile ranks. We can apply this interpretation of test performance to individual items (see Figure 15.5). If the errors on individual item responses are normally distributed, then a student with a percentile rank score of x should have a 50% chance of scoring correctly any item that was correctly
answered by (100-x)% of all students. Another way of stating this principle is that, if x% of students correctly responded to an item, we would expect that the (100-x)% of students with the lowest total test scores would likely get it incorrect
and the x% with the highest total test scores would likely get it correct. Thus, to create a test that determines
whether or not students are above a certain
percentile, we should select items with the corresponding complementary
facility. For example,
if we wanted to determine
whether or not students are above or below the 75th percentile, we would select items with a facility
around 0.25, as shown in Figure 15.5.
Figure 15.5 The probability of correct response (item
facility) for students at different levels of
proficiency, expressed as percentile rank
To use this principle for item selection, it is necessary
to have a fairly good idea of the percentile rank score corresponding to the criterion.
For example, if a test were to be used to decide which members of the final year of the primary school population were to receive scholarships, the criterion for determining scholarship recipients might be set at the 85th percentile to ensure that the number of scholarship recipients would correspond to the top-scoring
15% of the primary school leaving population. The percentile rank score of a student who has a 50-50 chance of receiving
a scholarship is 85. Applying the principle
defined above, we find that the ideal item facility
index for selecting scholarship students is equal to (100 - 85)/100, which is 0.15.
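Expressed as a hypothetical helper, the rule for choosing item facility from a percentile criterion is simply the complement of the percentile; the scholarship example above corresponds to the 85th percentile.

```python
def facility_for_percentile_criterion(percentile):
    """Ideal item facility for a criterion set at a given percentile rank (0-100)."""
    return (100 - percentile) / 100

print(facility_for_percentile_criterion(75))  # 0.25 -> criterion at the 75th percentile
print(facility_for_percentile_criterion(85))  # 0.15 -> scholarship example
```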
15.1.6. Item discrimination
The item discrimination index is another good indicator
of the usefulness of an item. The term “discrimination” is used to refer to the ability of items to elicit different item scores from students with different levels of proficiency. If all students produce
the same response to an item, regardless
of their level of proficiency, the item cannot discriminate between
different levels of student proficiency. Item discrimination indicates the extent to which an individual item measures what the whole test is measuring. Given pre-defined
groupings of students
according to percentile
rank, the probability of performing correctly on an item should be greater
for the higher proficiency groups
than for the lower proficiency groups. Items are most useful when the probability of correctly responding is very different
between different proficiency
groups.
Consider the following example,
where a head teacher wishes to use a mathematics
test to group 100 students
into three skill levels (low, medium, and high). If the head teacher wishes to have equal size groups, the criteria for determining group membership, in percentile rank scores, are 33 and 66 on the mathematics test results. Figure 15.6 illustrates the differences in effectiveness of three specific
items in terms of their ability to discriminate between
these three proficiency groups. The items correspond to the following tasks:
- Item
1: Identify the union of these two sets: {34,16,45,7,11,2,8,28} U
{1,67,9,2,26,8,4}
- Item 2: Identify the union of these two
sets: {a,5,j,5,12,Q, r,45,2} U
{w,t,q,A,9,b,5,twelve,j}
- Item 3: Identify the union of these two
sets: {1,2,3,4,5,6,7,8,9,10} U
{6,7,8,9,10,11,12,13,14,15}
Responses to Item 1 show very distinct differences between the proportions of students in the different
groups who answer the item correctly. The low-skilled
students have only a 1% probability of responding correctly, compared to 48% for the medium-skilled students
and 97% for the high-skilled students.
Item 2 has a relatively similar probability of correct response for all three skill levels (42%, 50%, 58%). Responses to Item 3 reveal that low-skilled students have a small probability of correct response (16%), while medium- and high-skilled students have a large probability of correct response (92% and 100%). The item facility is the same for Item 1 and Item 2 (0.50), but Item 3 is easier (0.70).
Figure 15.6 Discrimination of three test items as group-specific
probabilities by proficiency
level
The statistical information can be used to understand how students approach
the different items. Item 1 is most effective in discriminating between below-average and above-average students, because the requirements are clear, yet the task is complex.
In contrast, Item 2 may be confusing to students. The figure shows that students
at each of the three levels of ability had about equal probabilities
of success on Item 2. Some may have been confused by the inclusion
of letters with numbers, and others may
have wondered if upper case letters should be treated
differently from lower case ones
or if the number “12” is equal to the word “twelve,”
or if repeated elements were counted as unique. Some high proficiency students may have found the question ambiguous and opted for incorrect answers,
while some lower-skilled
students may have selected the correct response
for inappropriate reasons.
This poor-quality item should be edited to remove the various sources of ambiguity and retested in a further pilot. Item 3 provides
much clearer information than Item 2, indicated by the strong discrimination primarily
between low performers
and the other two groups. Because all
of the elements in each set have already been ordered, the sets have the same number of elements, and the elements are consecutive, the students have to do less analysis to answer item 3 than either of the previous two items. Item 3 discriminates
well between the low and the other two groups; it is also easier than Items 1 and 2.
But of these three items, which is the best? The answer to this question
depends on the purpose of the test. It is clear that Item 2 is the weakest and is not likely to serve any
useful purpose. If the purpose of the test is to distinguish between low, medium and high-proficiency levels, then Item 1 is the best item, because it has the largest differences between
all skill levels in terms of the probability of getting the item correct. In contrast, Item 3 has very similar
probabilities for the medium and high groups, suggesting that medium and high skilled students would probably get the same score on this item. However, if the test were a criterion-referenced test used to distinguish between low and medium-skilled students, then Item 3 is the best item, because it has the largest change in probability of getting a correct
response between low and medium skill levels.
Although the criteria used for choosing
the best items are dependent
on each specific testing situation, the following guidelines
can be used to select items based on the property of item discrimination:
- For norm-referenced tests, select items where the change in probability is large and
relatively equal in magnitude between low-to-medium
and medium-to- high skill levels.
- For criterion-referenced
tests, if the criterion is to distinguish between low and medium skill levels, select
items with large differences in probability between low and medium skill groups
and similar probabilities for medium and high groups. For criterion-referenced tests
that seek to distinguish between medium and high skill levels, select items that
have large differences in probability between medium and high groups and a similar
probability for low and medium groups.
In the context of national assessments, different
stakeholders typically expect the national assessments to perform both functions.
For example, the Ministry of Education may be interested in a national assessment
primarily for research purposes to understand why males and females perform differently.
Because there are both males and females at all ranges of proficiency, this assessment
would be best served with a test built following the norm-referenced principles.
On the other hand, if the national assessment is specifically interested in identifying
which types of students are performing below the expected standard, then the test
should use criterion- referenced principles to select test items that show the greatest
discrimination at lower proficiency levels.
Analyzing pilot test data would be time-consuming if you were required to examine figures similar to Figure 15.6 for every test item. An alternative approach is to use the classical index of discrimination. This index can be calculated in a number of different ways. In the following example, it is calculated simply as the difference in probability of correct response
between low-skilled and high-skilled students. Table
15.3 contains the item facilities for the high, middle and low performing
students of a group and the index of discrimination for five items. Students were assigned to these groups based on their overall test performance. In this example,
the index estimates
in the final column of the table correspond to the differences between item facilities of high and low groups. Because the index of discrimination is simple to calculate
and captures the general usefulness
of an item, it is a common metric used to select test items.
Table 15.3 Index of discrimination
As a rule of thumb, suitable test items should have a discrimination value above 0.25. For pilot test data, this may be relaxed to 0.20. Further examples of using discrimination indices
in selecting items can be found in Anderson and Morgan (2008).
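A minimal sketch of this calculation is shown below; it assumes students have already been assigned to "high", "middle", and "low" groups on the basis of their total test scores, and the data and function names are illustrative.

```python
def facility(item_scores):
    """Proportion of students answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(scores_by_group):
    """Difference in item facility between the high and low scoring groups."""
    return facility(scores_by_group["high"]) - facility(scores_by_group["low"])

# Toy item: 0/1 scores for students grouped by overall test performance
item_scores = {"high":   [1, 1, 1, 0, 1, 1, 1, 1],
               "middle": [1, 0, 1, 1, 0, 1, 0, 1],
               "low":    [0, 1, 0, 0, 1, 0, 0, 1]}
print(discrimination_index(item_scores))
# 0.5: facility 0.875 (high) minus 0.375 (low), comfortably above the 0.25 rule of thumb
```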
15.1.7. Distractor analysis
There are many reasons why an item may have a low or even a negative
index of discrimination. These reasons include poor wording, confusing instructions in how to answer a question, sampling
errors, and miskeying
or miscoding. This section provides an overview of how item statistics may be used to detect and remediate some common errors that become apparent in analyzing pilot test items and, in some instances, in analyzing the test data used in the national assessment itself.
In its simplest
form, distractor analysis looks at how each option (or score code) discriminates between three student skill levels (high 1/3, middle 1/3, and low 1/3), based on the overall test score. Table 15.4 presents a typical
distractor analysis for an individual item.
Table 15.4 Response
error (or distractor) analysis
Item Q9 has 4 response options and two missing value codes (8 and 9). The missing response code 8 indicates
that it was not possible
to score the student response,
either because the response was illegible, because two options were selected, or because of some other operational problem. The missing response
code 9 indicates that the response was left blank
by the student. The asterisk (*) beside Option 1 indicates that it is the item key, or correct response.
The total percentage of students selecting
Option 1 is equal to the item facility. In general, a well-functioning item should have the following
characteristics:
- the correct column option should have a high percentage for the high group, and
successively lower percentages for the middle and low groups;
- the columns corresponding to
incorrect options should have approximately equal percentages within each skill
level and overall;
- for the high skilled group,
the percentage choosing the correct option should be higher than the percentage
choosing other options;
- for the low skilled group,
the percentage choosing the correct option should be lower than the percentage choosing
other options;
- for all groups, the percentage of missing value codes should be close to
0;
- if there are a large number
of missing responses, the percentages should be equal across skill levels.
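A distractor analysis table of this kind can be produced with a simple cross-tabulation. The sketch below assumes each record pairs a student's skill group (assigned from the total test score) with the option or missing-value code recorded for the item; the data and function names are illustrative.

```python
from collections import Counter, defaultdict

def distractor_table(records, codes=("1", "2", "3", "4", "8", "9")):
    """Percentage of each skill group selecting each option or missing-value code."""
    counts = defaultdict(Counter)
    for group, code in records:
        counts[group][code] += 1
    table = {}
    for group, tally in counts.items():
        total = sum(tally.values())
        table[group] = {code: round(100 * tally[code] / total, 1) for code in codes}
    return table

# Toy records: (skill group, selected option or missing-value code) for one item
records = [("high", "1"), ("high", "1"), ("high", "3"), ("high", "1"),
           ("middle", "1"), ("middle", "2"), ("middle", "1"), ("middle", "9"),
           ("low", "2"), ("low", "3"), ("low", "1"), ("low", "4")]
for group, row in distractor_table(records).items():
    print(group, row)
```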
When an item does not have these desirable characteristics, it is usually the result of one of the following errors: miskeyed or miscoded responses,
multiple correct responses, confusing item requirements, or item content that is irrelevant, too difficult, or too easy. Examples of each of these types of problem items are shown in Table 15.5, Table 15.6, Table 15.7, and Table 15.8.
Table 15.5 Miskeyed or miscoded responses
The distractor analysis data in Table 15.5 show that the item key for Q9 was specified as 3, rather than 1. Using the simple approach towards
calculating a discrimination
index shown in Table 15.3, the item discrimination for Q9 would equal -0.23 (0.147- 0.381). The presence of a negative index of discrimination suggests that an item has probably been miskeyed. You can identify the correct key by finding the option that best satisfies the conditions described
above. In this case, Option 1 (which has been shaded) is the only option where the percentage
of students choosing
the correct option is higher than the percentage
choosing the other options. Note that the selection of the correct response
is primarily the responsibility of the subject matter specialist,
and not that of the person entrusted
with data analysis.
Table 15.6 Low Discrimination:
more than one “correct” response
Q9 | 1* | 2 | 3 | 4 | 8 | 9 | Total
high 1/3 | 55.2 | 0.1 | 37.9 | 2.5 |  | 4.2 | 100.0
middle 1/3 | 52.5 | 2.6 | 30.8 | 6.9 | 1.1 | 6.2 | 100.0
low 1/3 | 41.3 | 17.4 | 24.2 | 17.6 | 0.7 | 8.1 | 100.0
Total | 50.5 | 6.6 | 27.4 | 8.9 | 0.6 | 6.1 | 100.0
The distractor analysis in Table 15.6 presents
the results of Q9 when Option 3 has been identified as a correct
response, in addition
to Option 1. This error can happen when the assessment team tries to increase the difficulty or facility of an item by increasing the attractiveness of specific distracters. From the students’
perspective, the instructions or question stem may be ambiguous, and they may be forced to rely on “common
sense” rather than appropriate knowledge or skill to choose a response. For items with these patterns,
the item developers should focus on clarifying the item such that there is an unambiguously correct
answer.
Table 15.7 Low Discrimination:
not measuring the correct domain
Q9 | 1* | 2 | 3 | 4 | 8 | 9 | Total
high 1/3 | 38.6 | 18.1 | 13.9 | 25.2 |  | 4.2 | 100.0
middle 1/3 | 27.0 | 12.7 | 20.5 | 32.5 | 1.1 | 6.2 | 100.0
low 1/3 | 21.5 | 25.4 | 34.2 | 10.1 | 0.7 | 8.1 | 100.0
Total | 29.0 | 18.7 | 22.9 | 22.6 | 0.6 | 6.1 | 100.0
A distractor analysis similar to the one shown in Table 15.7 suggests that an item may have little to do with the subject area being assessed by the other test items. There is a weak relationship with proficiency for the correct response, a substantial amount of missing
data and ambiguous relationships with proficiency for the distractors. There are several reasons why this may happen, even when the item is valid from a content perspective. These include:
- The reading requirement may be too demanding, particularly if the test is not intended to measure
reading skill. This error can be remedied
by reducing the reading demands of the item so that its meaning is clearly understood by all students.
- The wording of the question may be ambiguous, making it
unclear what information the item requires. This error can be remedied by field
testing items in an interview setting with students and asking them to think aloud
as they respond to each item. Misconceptions produced by instructions (that may
make sense to teachers and item writers but not to students) can be identified and
clarified.
- The item may be biased towards specific groups of students.
For example, a mathematics item that uses real football statistics may be biased
towards boys, who may have knowledge of the statistics without being able to solve
the mathematics problem. Item bias can be reduced by using think-aloud procedures
in pilot test settings, where students describe their thought processes as they
complete each test item.
Table 15.8 Low Discrimination:
too easy or too difficult
Q9 | 1* | 2 | 3 | 4 | 8 | 9 | Total
high 1/3 | 10.2 | 32.9 | 23.2 | 29.5 |  | 4.2 | 100.0
middle 1/3 | 5.1 | 34.6 | 20.5 | 32.5 | 1.1 | 6.2 | 100.0
low 1/3 | 2.1 | 24.3 | 34.2 | 30.6 | 0.7 | 8.1 | 100.0
Total | 5.8 | 30.6 | 26.0 | 30.9 | 0.6 | 6.1 | 100.0
The distractor analysis in Table 15.8 does not indicate
that anything is wrong with the item per se. The correct option (1) has a higher percentage
for the high group, and successively lower percentages for the middle and low groups. Even though the relative probability of correct
response changes across skill levels, the discrimination
is too small for it to provide much useful information. Each of the distracters has a greater
chance than the correct option of being selected by students of all ability levels. Extremely difficult items should be avoided on large-scale assessments, to the extent
that is possible without reducing test validity.
A similar problem occurs when items are too easy. If an item is almost universally answered correctly, it will not be able to discriminate between students. However,
it may still be desirable
to retain very easy items to increase
student motivation at the beginning of a test, and it may be necessary to include either very easy or very difficult items if they are required to satisfy the test specifications.
15.1.8. Summary
A test is constructed of individual items. These items are drawn from a universe (possibly an infinite number) of items that measure the same subject area. A person’s true score is the test score that he or she would have achieved
if the entire universe of items had been included
on the test. The true score is equivalent to the probability that a student will correctly
respond to a test item in the subject area. A person’s observed score is the score he or she achieves
on the sample of items included on the test. Observed scores are used to estimate true scores. The standard error of measurement of a particular observed
score indicates the uncertainty with which the observed score reflects the true score. For an entire test, reliability summarizes
the accuracy of the observed scores. The property of attenuation due to unreliability can help determine
the minimum reliability required for a test to satisfy certain purposes.
The characteristics of individual items can be analyzed to determine the best items for the purpose of the test. Two item characteristics in particular are important: item facility and item discrimination. As much as is possible,
the facility of items on a test should be around 0.50, or halfway between
the chance of guessing correctly
and 1. In order to improve their measurement quality,
items can be made more or less difficult by increasing or decreasing their cognitive requirements without changing their content. Item discrimination
also plays a role in determining the quality of a test item. Suitable items should have a large difference in the probability of correct response
between students of different skill levels. The difference in probability between students of low and high skill levels should be greater than 0.25. For criterion-
referenced tests, the largest change in probability should be in the region where the criterion is to be applied.
15.2. Item Response Theory (IRT)
The previous section introduced two aspects of the classical
test theory approach
to measuring proficiency, namely item facility
(or difficulty) and item discrimination. In the remainder
of this annex we examine an alternative approach, Item Response
Theory (IRT), which unifies the concepts of item facility and discrimination. IRT has also been described
as latent trait theory. It is the most widely used theoretical
approach in large-scale assessments.
A good starting point to understand IRT is to contrast what constitutes a good test item from the CTT perspective
and the IRT perspective. The classical
item statistics of facility and discrimination were focused on estimating and comparing the probability of correct response
for different students.
In contrast, IRT characterizes students by the type of item response
they are likely to produce and tries to describe the distributions of proficiency for students who respond in different ways. A good test item from a CTT perspective has large differences in probability of correct response for students of different proficiency levels. From an IRT perspective, a good test item is one where the distribution of proficiency for students who correctly answered is different from the distribution of proficiency for the incorrect
students.
Whereas
CTT focuses on the probability of correct response, IRT focuses on the estimation
of the distribution(s) of proficiency. While these two perspectives are generally in agreement, the IRT perspective describes items in a much richer and more useful way.
The two distributions in Figure 15.7 illustrate some fundamental features
of IRT. The two curves represent distributions of proficiency[4] for respondents to a single test item.
The distribution on the left describes the proficiency of students who responded incorrectly, and the distribution on the right describes the proficiency of students who responded correctly. This item has a facility
of 0.50, which is reflected in the identical height of the two distributions along the vertical axis -- there are as many correct respondents as incorrect respondents. The mean proficiency of correct respondents is 0.10, which is reflected
in the graph by the peak of the distribution for correct students
being directly above 0.10 on the proficiency axis. Because the overall mean of both populations is 0, and they are equal in size, the mean proficiency of incorrect
respondents is symmetric, at -0.10. The two distributions are very similar
to each other in terms of both size and location, indicating
that there is very little difference in proficiency between the type of students
who correctly respond
and the type who incorrectly respond. If there were no difference at all, both distributions would be identical with means equal to 0, and the item responses would have no relationship with proficiency.
Figure 15.7 Distributions of proficiency for correct
and incorrect respondents to a single test item, facility = 0.50, mean proficiency
of correct respondents = 0.10
A much
more accurate test item, also with facility
of 0.50, is illustrated in Figure 15.8.
This item illustrates the strongest relationship between
item response and proficiency, where the mean proficiency of the correct respondents is approximately 1 and the mean proficiency of the incorrect
respondents is approximately -1. There is
no overlap in the distributions, indicating that, in terms of proficiency, the correct respondents are completely distinct
from the incorrect respondents.
Figure 15.8 Distributions of proficiency for correct
and incorrect respondents to a single test item, facility = 0.50, mean proficiency
of correct respondents = 0.99
In practice, it is extremely
rare that correct respondents are completely
distinct from incorrect respondents. There is typically a wide region of proficiency in which the two distributions overlap.
In this region, there is a smooth transition as students with increasing proficiency become less likely to be members of the incorrect distribution
and more likely to be members of the correct distribution. This transition is illustrated in Figure 15.9 for an item with facility
of 0.60 (indicating the distribution for correct respondents is larger than that for incorrect respondents) and the mean proficiency of the correct respondents is 0.40. The solid curved line, which is also known as an item response function
(IRF), describes the size of the distribution of correct respondents
relative to the size of the distribution of incorrect respondents. In other words, in regions
of proficiency where the height of the correct distribution is lower than the height
of the incorrect distribution, the IRF is below 0.5, and when the reverse is true, the value is above 0.5. The IRF can be interpreted as the probability that a respondent
with a given proficiency level will belong to the group of correct
respondents. The exact values of the IRF can be calculated by dividing the probability for the distribution of correct respondents by the sum of probabilities of both distributions. For example, at the proficiency value of -1, the probability value of the Correct respondents is approximately 0.06 and the value for the Incorrect respondents is approximately 0.15; 0.06/(0.06+0.15) = 0.29. Because the proportion
of incorrect respondents is the complement
of the proportion of correct
respondents and the mean proficiency of incorrect respondents can be calculated from the mean proficiency of the correct respondents (given that the overall mean equals 0), the IRF is a function of the item facility and the mean proficiency of the correct respondents. Exact calculations are presented in section 15.2.1.
Figure 15.9 Distributions of proficiency for correct
and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.60, mean of correct respondents = 0.40
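The calculation just described can be written compactly: the IRF value at a given proficiency level is the height of the correct-respondent distribution divided by the sum of the two heights. The sketch below uses the approximate heights read from Figure 15.9 at a proficiency of -1.

```python
def irf_from_heights(correct_height, incorrect_height):
    """IRF value: height of the correct distribution relative to the total height."""
    return correct_height / (correct_height + incorrect_height)

# Approximate heights at proficiency -1 in Figure 15.9
print(round(irf_from_heights(0.06, 0.15), 2))  # 0.29
```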
The mathematical equation using distributional parameters to describe the IRF is quite lengthy. The common practice
is to describe the IRF in terms of two other parameters (which are described in greater detail in the following section).
The simplest parameter, denoted b, is identified
as the threshold where the two distributions
intersect (the vertical line in Figure 15.9, equal to -0.408).
At this threshold on the proficiency scale,
students have an equal likelihood of belonging to either the correct distribution or the incorrect
distribution. Consequently, it is the location where the item is most useful in distinguishing between the two types of students. The degree of accuracy of the item in distinguishing between the two types of students is
proportional to the slope of the IRF at this location
(the oblique line in Figure 15.9), denoted as the a parameter. The a parameter is typically a transformation of the slope[5]; in this case, the value of a is 0.85. As the differences between the distributions of correct and incorrect respondents increase, in either overall probability or location, the slope of the IRF increases, reflecting
the stronger relationship. For comparison, Figure 15.10 illustrates an item where the mean proficiency of correct respondents is the same as in Figure 15.9, but the facility is much higher, equal to 0.70. The greater difference between the two distributions is reflected in the greater value of the a parameter (reflected in the steeper slope), which is 1.25 for this item. In general, accurate
test items will have steep S-shaped curves, indicated by high values of the a parameter.
Figure 15.10 Distributions of proficiency for correct
and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.70, mean of correct respondents = 0.40
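The chapter does not write out the IRF equation, but the two-parameter logistic form with the 1.7 scaling constant (see note 5) is a common way to express it and serves as a working sketch here; the parameter values are those quoted for Figure 15.9.

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function with the 1.7 scaling constant."""
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

# Item from Figure 15.9: a = 0.85, b = -0.408
for theta in (-2.0, -0.408, 0.0, 2.0):
    print(theta, round(irf_2pl(theta, a=0.85, b=-0.408), 2))
# At theta equal to b the probability is exactly 0.50; the slope at that point is proportional to a.
```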
A special case of the IRF describes
a situation where lower-proficiency students tend to guess rather than attempt to correctly
answer an item. This situation
occurs in multiple choice items that are confusing or have implausible distractors. In this case, many
students correctly respond
to an item without having the same distribution of proficiency as the 'true' correct respondents.
As a result, the item is characterized better by the proficiency of the students
who incorrectly respond
than by the proficiency of correct respondents. As shown in Figure 15.11, the population of incorrect respondents is divided into two components: one component is the proportion
of respondents who actually scored incorrectly (0.40),
whereas the other component is the proportion
of students who guessed correctly
(0.20), labelled ‘incorrect guessers.’
As a result of this division of the incorrect
respondents into two distributions, the IRF, which
is the distribution of correct
respondents divided by the sum of the other distributions, has a lower asymptote, meaning that even the students
at the lowest end of the distribution have a non-zero chance of responding correctly to the item. The minimum
chance of responding correctly is denoted as the c parameter, and it is equal to the proportion
of the incorrect respondent population that guessed correctly. In Figure 15.11, the threshold, or b parameter, is no longer at the intersection of the distributions of incorrect and correct respondents; instead, it is located where the
intersection would have been if the incorrect-guessers had correctly been assigned to the incorrect
respondent distribution.
Figure 15.11 Distributions of proficiency for correct
and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.60, mean of correct respondents = 0.80, lower asymptote
of IRF=0.33
In practice, it is difficult
and statistically complex to distinguish between Correct respondents and Incorrect guessers.
This type of estimation is available in the advanced functionality of IATA, where the proportion
of Incorrect guessers
is described by a third item parameter, denoted as the c parameter. Estimating the c parameter tends to require larger samples
of respondents, and the estimates
of the a and b parameters in this scenario
are not as reliable as when they are estimated
by themselves. For this reason, it is typically more efficient
to assume the proportion of Incorrect
guessers is equal to 0 and simply remove from analysis any items where this assumption is not met (i.e., where the empirical
IRFs indicate that the lowest proficiency respondents have a higher probability of correct response
than predicted from the theoretical IRFs).
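When guessing is modelled, the IRF gains the lower asymptote c described above. The three-parameter logistic form below is a minimal sketch: c is taken from Figure 15.11 (0.33), while the a and b values are assumed purely for illustration.

```python
import math

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic IRF; c is the lower asymptote due to guessing."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# c = 0.33 as in Figure 15.11; a and b are assumed values for this sketch
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(irf_3pl(theta, a=1.0, b=0.0, c=0.33), 2))
# Even the lowest-proficiency students retain a probability near c of responding correctly.
```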
[1] The term ‘error of measurement’ refers to the variation
or uncertainty in measurement; it should not be considered a mistake on the part of the student.
[2] The quadratic mean.
[3] Different forms of test reliability include test-retest, alternate forms, split half, inter-rater, and internal consistency.
[4] In IRT, student proficiency is described on a scale (often called theta) that is similar to the Z-score scale: the theoretical average proficiency
level is 0, and the standard deviation is 1. Most students usually
have scores between
-2 and 2, and less than one in a thousand
students will have scores less than -3 (or greater than 3).
[5] For algebraic reasons, the a parameter is four times the value of the slope at the location of the b parameter. Usually, this value is further divided
by 1.7 so that the value may be used in the easy-to-use logistic
model while approximating the cumulative normal distribution function.
[6] Although there are other IRT models appropriate to different methods
of scoring, the model described here is most appropriate to the rubric-based scoring schemes typically found in national assessments, where higher scores are assumed to represent success
on the requirements associated with lower scores.
[7] If, for example,
a curriculum fundamentally
changes such that the definition of mathematics
proficiency changes from computational speed to visualizing patterns, then the property of invariance
would no longer hold true. In this case, the statistical behaviour of the items in populations for whom mathematics is primarily computational
would not be consistent with their behaviour in populations where mathematics is primarily
visualizing patterns.
[8] There is no equation for transforming the c parameter, nor is there a standard method for equating a or b parameters when the c parameter differs
substantially across different populations.
[9] It is more common in practice
to add the logarithms of the likelihood values
rather than multiplying them. This approach minimizes
the effects of rounding
error and significant digit truncation during
the calculation process.
[10] The score corresponding to the maximum value of the posterior distribution is called the maximum a posteriori (MAP) estimate. The EAP estimate
is preferred over the MAP because it tends to be more stable to calculate.
[11] The presentation of information for partial credit items is more complex
but essentially has the same interpretation, in that it represents
the conditional variance of the item scores at a given level of proficiency. See Samejima
(1974) for a complete
discussion of the item information function
used for partial credit items in IATA.
[12] For illustration purposes, the items all have a parameters that are much greater than would ordinarily be encountered in educational
assessment. Large a parameters accentuate the effects of each test item on the TCC.
[13] The examples in this section compare parametric IRFs for the two populations for reasons
of clarity. This method is not as sensitive
to group differences as non-parametric methods. The DIF results in IATA are based on the observed proportion of correct responses at each proficiency score. However, the interpretation of the results
is identical for both methods.