Thursday, November 5, 2015

Guide to using the IATA test analysis software (Chapter 15, converted from PDF to Word, full text)



CHAPTER 15: PSYCHOMETRIC CONCEPTS

This annex chapter provides some theoretical background about the statistical analyses performed by IATA. Section one presents an overview of classical test theory. The overview is not intended to be exhaustive and may be skimmed by those familiar with this aspect of test theory. It begins with a discussion of test scores as statistical estimates. The basic formulation of classical test theory is presented, using the concept of standard error of measurement. The term test reliability is introduced in relation to standard error of measurement and the impact of item characteristics on test reliability. The section concludes with a discussion of practical methods for using classical item statistics to develop efficient tests.
Section two extends the concepts of classical test theory into item response theory. It develops the principles of item response models and item information. It explains the fundamental concept of population invariance and describes several applications of item response theory, including item analysis, test construction, and test equating.

15.1.  Classical Test Theory

15.1.1.                Describing Accuracy of Tests

Under the Classical Test Theory (CTT) approach, we use student performance on a collection of items on a single test to make generalizations about performances on all other possible collections of similar items. This principle relies on the assumption that there are a very large number of possible items to measure the particular skill that is being assessed. This assumption is reasonable in most cases. For example, even in a specific curriculum area such as multiplication of one and two-digit numbers, there are almost 10,000 possible test items, enough to keep a grade three student busy for
an entire year. As it is unreasonable to test a student using such a large number of items, a much smaller sample of items can be used instead to predict what the performance of a student would have been on the complete set, or universe of possible items. Items are sampled from the item universe to produce tests. When the tests are administered, they elicit test scores, which are then generalized back to the item universe. Always, the goal of testing students is to make inferences about student performance on the universe of test items.
This interpretation of test performance is very similar to the statistical concept of probability. The probability of an event occurring can never be known without an infinite number of observations. However, we often make inferences about probability using a very small number of observations (usually, as few as possible). The typical technique for estimating probability is to take a sample of events and calculate the number of times a particular event occurs divided by the total number of observations. For instance you can test the fairness of a coin by calculating the probability of heads and tails occurring. We can estimate the probability by tossing the coin repeatedly and counting the proportion of outcomes that are heads or tails. If the proportion of heads is one-half of the total number of tosses, then the probability is estimated to be 0.5 and the coin is probably a fair coin.
Test scores can be interpreted in a similar way; each student’s test score represents the probability that the student will correctly respond to a test item from a particular universe of items (e.g., a mathematics item from the set of all possible mathematics items). The probability is calculated as the number of items the student answered correctly on a specific test divided by the total number of items he or she responded to.
The following example in Table 15.1 demonstrates this principle, using the observed performance of an individual student on a set of reading test items. Column one shows the number of items the student has taken or answered. Column two is the student’s score on a particular item; 0 is incorrect, and 1 is correct. Column three shows the cumulative score after each item, which is calculated as the sum of all item scores up to each new item. Column four is the average item score, or estimated probability, updated after each item by dividing the cumulative score by the number of items.
Table 15.1 Grade 1 Reading Test Scores by Item
Item number    Item score    Cumulative score    Average score (proportion correct)
1              0             0                   0.00
2              1             1                   0.50
3              0             1                   0.33
4              1             2                   0.50
5              1             3                   0.60
6              1             4                   0.67
7              1             5                   0.71
8              1             6                   0.75
9              0             6                   0.67
10             1             7                   0.70
11             0             7                   0.64
12             1             8                   0.67
13             0             8                   0.62
14             1             9                   0.64
15             0             9                   0.60
16             0             9                   0.56
17             1             10                  0.59
18             1             11                  0.61
19             1             12                  0.63
20             0             12                  0.60
After 20 test items, the estimated probability of correct response for this student is 0.60; this suggests that, if this student responded to all possible Grade 1 reading items, he or she would probably get 60% of the items correct.
However, this estimate varies as the test length increases from one to 20. Since a student can only get a score of 0 or 1 with a single item, the estimated probability after only one item will be 0 or 1. However, common sense tells us that students will rarely get all items correct or incorrect from the entire universe of items. By the time the student completes the third item, the probability is 0.33, which is a more reasonable estimate. As the number of test items increases, the estimate converges to its final value of around 0.60, and the difference between successive estimates becomes smaller.
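The running estimate in Table 15.1 can be reproduced with a few lines of code. The sketch below (Python) simply recomputes columns three and four of the table from the item scores in column two; nothing beyond the column definitions given above is assumed.

```python
# Reproduce the running estimate in Table 15.1.
# Item scores (0 = incorrect, 1 = correct) copied from column two of the table.
item_scores = [0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]

cumulative = 0
for n, score in enumerate(item_scores, start=1):
    cumulative += score                  # cumulative score after n items
    estimate = cumulative / n            # average item score = estimated probability
    print(f"{n:2d}  {score}  {cumulative:2d}  {estimate:.2f}")
# After the 20th item the estimate is 12/20 = 0.60, matching the table.
```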
The graph in Figure 15.1 illustrates the relationship between increasing stability of the test score and increasing number of items for this student.
When the number of items is small, the zigzagging of the line in Figure 15.1 suggests that the estimates of probability are unreliable. If a test score changes dramatically just by adding a single item, then the estimate is probably not useful for generalizing to the thousands of comparable possible items not included on the test. As the number of items increases, the severity of the zigzags decreases, and beyond a certain point, adding more items will not noticeably change the estimated probability. In general, as the number of items on a test increases (provided the items are randomly equivalent to each other), the estimated probability provides a more accurate estimate of the proportion of the universe of items that would be answered correctly. Although it is still not possible to know the true probability without administering an infinite number of test items, the estimates can be made stable enough that using additional items is not cost-effective.

15.1.2.                Error of Measurement

The probability that the test scores are trying to estimate is assumed to be fixed for each student, regardless of the number of items on a test. For each student, the observed test scores will eventually converge towards this probability as the number of items on a test increases. Since we assume that this probability does not change as we add new items to the test, we are also assuming that this probability is not affected by the sample of items a student is administered. In other words, we assume that the proficiency of a student in a particular subject area does not depend on the items used to test this proficiency.
This assumption further implies that every student has a certain level of proficiency or ability in the skill that is being assessed, even if they are not administered a single test item. Unfortunately, it is impossible to determine what this level is without testing. When we do test a student, it is useful to make the distinction between the observed score, which is based on the test items used in the test, and the true score, which does not depend on test items. The observed score, as the name implies, is the score the student gets on an actual test. The true score is a hypothetical score; it can be interpreted as the average of a very large number of scores on very similar tests administered under identical conditions to the same student. In practice, the true score cannot be known, since it requires administering a large number of tests while the student’s proficiency remains constant. The true score of a student will not change unless the student’s proficiency changes, whereas the observed score will change according to each test administration. The observed score may vary for a person (or for people with the exact same true score) depending on the sample of items used, but the true score does not.
The observed score changes because it is influenced by random error. In testing situations, random error refers to factors that randomly affect the assessment of reading proficiency, such as a student’s level of motivation or fatigue. Because these factors are random, they may result in the observed test score being higher or lower than the true score. The difference between the true score and the observed score for a particular student on a particular test is the error of measurement[1]. This key concept in classical test theory can be stated as follows: for any student, the observed score on a test is equal to the true score plus or minus some error of measurement. The expected magnitude of the error for students on a specific test is called the standard error of measurement (SEM). For any given observed score, the SEM of the score describes the probable location of the true score. A small standard error of measurement suggests the true score is probably similar to the observed score, and a large standard error of measurement suggests the true score may be very different from the observed score.
As an illustration, imagine a group of students with the same true score of 0.60 in grade 1 reading. Since this score represents the probability that each student will correctly respond to any Grade 1 reading test item, each student would theoretically get a score of 60% on all similar grade 1 reading tests. However, on any specific reading test, the score for a student will probably not equal 60%; it will be 60% plus or minus some error that depends on the characteristics of the test or the testing situation. If the test is accurate, the error might be small, and if the test is inaccurate (for instance, it may have some poorly worded items or it may have a limited number of items), the error may be large. Figure 15.2 illustrates this example. Even though all 10 students depicted in Figure 15.2 have the same true score of 0.60, the observed scores of those who took the accurate test are quite similar and are clustered around 0.60. In contrast, the test scores of the five students who took the inaccurate test are quite different from each other, even though the average score is still 0.60.
Figure 15.2 Observed scores on two tests of different accuracy for students of same proficiency (true score=0.60)
The SEM of a particular test represents the degree to which students with the same true score differ in their observed scores. If we were to collect scores from 100 students with the exact same proficiency (true score=0.60), the expected distributions of observed scores for the two tests above would appear as in Figure 15.3. The two sets of bars in Figure 15.3 display the expected frequency of each score occurring for a sample of 100 students, each of whose true score is 0.60. This example assumes the accurate test is 32 items long, while the inaccurate test is 16 items long. Thus, there are 33 possible scores for the accurate test (including 0), and 17 possible scores for the inaccurate test. For both tests, the most likely score to be observed is 0.625, which is the closest to the true score. However, despite the larger number of possible scores, the observed scores are much more densely clustered for the 32-item test than for the 16-item test. This example demonstrates how students with the exact same true score can have very different observed scores if the SEM of the test is large.
Figure 15.3 Distribution of Observed Scores for Tests of Different Accuracy levels when True Score is Constant (0.60)
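The expected score distributions in Figure 15.3 can be approximated with a very simple model. The sketch below assumes that each item response is an independent Bernoulli trial with probability equal to the true score of 0.60 (an idealization consistent with the description above, not necessarily the exact model behind the figure); it compares how tightly observed scores cluster around the true score on a 16-item test and a 32-item test.

```python
# Approximate the expected score distributions for a student with true score 0.60,
# assuming independent item responses (binomial model). Illustrative only.
from math import comb

def score_distribution(n_items, true_score=0.60):
    """Return {observed score: expected percentage of a group of students}."""
    return {k / n_items: 100 * comb(n_items, k) * true_score**k * (1 - true_score)**(n_items - k)
            for k in range(n_items + 1)}

for n in (16, 32):
    dist = score_distribution(n)
    near_true = sum(pct for score, pct in dist.items() if abs(score - 0.60) <= 0.10)
    print(f"{n}-item test: about {near_true:.0f} of every 100 students "
          f"are expected to score within 0.10 of the true score")
# The longer test concentrates more of the observed scores near 0.60,
# which corresponds to a smaller standard error of measurement.
```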
Recall that the observed score on a test is the average of the sample of individual item scores. Accordingly, the standard error for an individual is equal to the standard deviation of his or her item scores divided by the square root of the number of items. However, this standard error for the individual is also affected by factors related to the individual, and so is not a good representation of the accuracy of the test. Taking the average[2] of these errors across all students provides a better representation of the SEM. Thus, even though the true scores are not known, we can still estimate how well we believe the observed scores represent the true scores.
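A minimal sketch of the calculation just described, assuming the individual standard error is the standard deviation of a student's item scores divided by the square root of the number of items, and that the test SEM is the quadratic mean of these individual errors (footnote [2]). The response data are purely illustrative.

```python
# Estimate the SEM from a (tiny, illustrative) matrix of item scores.
import math

responses = [                 # rows = students, columns = item scores (0/1)
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 1, 1, 1],
]

def individual_se(scores):
    n = len(scores)
    p = sum(scores) / n                                      # observed score
    variance = sum((x - p) ** 2 for x in scores) / (n - 1)   # variance of item scores
    return math.sqrt(variance) / math.sqrt(n)

errors = [individual_se(s) for s in responses]
sem = math.sqrt(sum(e * e for e in errors) / len(errors))    # quadratic mean across students
print(f"Estimated SEM: {sem:.3f}")
```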

15.1.3.                Reliability

National assessment reports should always report the reliabilities of the tests used in the assessment. Although the term “reliability” has a common meaning, test reliability is a specific statistic used to provide an indicator of the accuracy of a test for all students. Reliability[3] is frequently used to refer to the consistency of test scores. In statistical terms, test reliability is the proportion of variability in observed scores that can be explained by variation in true scores. Reliability cannot be estimated directly; to do so would require knowing each student’s true score which, as we noted earlier, is not possible. We can, however, get an estimate of test score reliability by using the SEM. The relationship in Equation 1, reliability = 1 − SEM²/σ², shows that the larger the SEM, the lower the reliability (we use the term σ² to represent the variance of the observed scores).
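As a rough numerical illustration of this relationship (using the formula above with purely illustrative values), the sketch below shows how reliability falls as the SEM grows relative to the spread of observed scores.

```python
# Reliability = 1 - SEM^2 / variance of observed scores (Equation 1 above).
def reliability(sem, observed_sd):
    return 1 - (sem ** 2) / (observed_sd ** 2)

observed_sd = 0.15                       # illustrative standard deviation of observed scores
for sem in (0.05, 0.08, 0.12):
    print(f"SEM = {sem:.2f} -> reliability = {reliability(sem, observed_sd):.2f}")
# A larger SEM relative to the observed-score spread means lower reliability.
```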
The test reliability statistic ranges between 0 and 1. A value of 0 represents a test whose scores do not relate in any way to what is being measured (e.g., a test where all students guess randomly on all items) and 1 represents a test which measures the domain (such as reading or mathematics) with perfect accuracy. Generally, test reliability around 0.70 or higher is considered adequate for a large-scale assessment.
Usually, results of a test are important only in the context of some decision or relationship with other variables. The correlation between observed scores on a test and another variable, such as school attendance, will always be lower than the correlation between the true scores on these variables. The degree to which the correlation based on observed scores is lower (the attenuation) depends upon the reliability of the test scores. As test reliability increases, the observed-score correlation will become more similar to the true correlation. If p is the true-score correlation and r is the test reliability, then the maximum possible observed-score correlation will be p√r. As a consequence, if the reliability of a test decreases, the scores become less useful for describing the relationship between test performance and other variables. The function in Figure 15.4 demonstrates the effect of attenuation for tests of different reliabilities on a true correlation of 0.80. Only when the test reliability is perfect does the observed-score correlation equal its true value.
Figure 15.4 The effect of attenuation on a true correlation of 0.80
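A quick sketch of the attenuation effect shown in Figure 15.4, assuming the relationship p√r given above and the true correlation of 0.80.

```python
# Maximum observed-score correlation = true correlation * sqrt(test reliability).
import math

true_correlation = 0.80
for test_reliability in (1.0, 0.9, 0.7, 0.5):
    observed = true_correlation * math.sqrt(test_reliability)
    print(f"reliability = {test_reliability:.1f} -> "
          f"maximum observed correlation = {observed:.2f}")
# Only with perfect reliability (1.0) does the observed correlation reach 0.80.
```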

15.1.4.                 Using classical item statistics in test development

This section discusses methods of selecting or constructing items with desirable properties. The aim of the national assessment team is to select items that provide the maximum differentiation between individuals using items that are strongly related to the skill being measured. This should lead to improved reliability and a decrease in the standard error of measurement. When applying these principles, it is important to remember that the primary purpose of a test is to support inferences about a specific domain of proficiency. Although inferences cannot be supported if the test is not accurate, they also cannot be supported if the test does not adequately represent the domain. Thus, even while pursuing statistical accuracy, it is always important to ensure adequate representation of the objectives in the original table of specifications.

15.1.5.                 Item facility (difficulty)

In classical test theory, item facility is specific to a particular population or group of students. For a given population of students, item facility estimates the probability that an average student will correctly respond to the item. If a question is very easy, then the probability that an average student in the specific population will correctly respond is close to 1. On the other hand, if an item is very difficult, the probability of correctly responding will be close to 0. Sometimes, item facility is referred to as item “difficulty,” even though increasing values indicate easier items. Commonly, the statistic used to describe facility is called the p-value, in reference to the concepts of proportion and probability.
Since the principal purpose of a test is to sort or compare students (either with respect to each other or to some standard of proficiency), the test must be able to produce different scores for students at different levels of proficiency. Indeed, if all students achieved the same score, the test would not have provided any additional information over not administering it at all, since all students would be assigned to the same group and be given the same rank order. Such a test would provide relatively little information of use to policy makers interested in giving support, for instance, to low performing groups. Therefore, one factor affecting the usefulness of a test is its ability to produce different scores for different students. Since the entire test is effectively composed of many one-item tests, the same principle applies to each individual item in the test. In general, the best items are those that minimize the number of students with the same score.
To illustrate this principle, consider the 10 items in Table 15.2, which contains the probability of correct and incorrect response for each item in a particular population of students. Column five presents a count of the number of people (per hundred respondents in this population) who are expected to have the same score on each item, as well as whether the most common score is correct or incorrect. If an item is most effective when the fewest people have the same score, then the best item in this test in terms of differentiating between individuals is the one with equal numbers of correct and incorrect respondents, which occurs when facility is 0.5.
In general, the best test items are ones where the probability of correct response (item facility) for the target individuals being assessed is around 0.5. There are some exceptions to this general rule, particularly for different types of multiple choice items. If there is a real possibility that students will randomly guess instead of attempting to answer an item correctly, the ideal item facility is approximately halfway between the probability attributable to guessing and one. For example, if an item has four options, and students tend to guess randomly, the ideal item facility would be approximately (0.25 + 1)/2 ≈ 0.63. In general, if it is clear that students are randomly guessing on specific items, these items should be replaced with easier items that are more likely to elicit real effort.
Table 15.2 Relationship between item facility and usefulness of a test item
Without deviating from the design in the original table of test specifications, it is possible to change the facility of an item or the chance that students will guess randomly rather than attempt to answer an item. Some methods for doing so include:
-  Increasing or decreasing the amount of text to be read in the item stimulus for language items, described in Chapter 3, Volume 2 in this series (Developing Tests and Questionnaires for a National Assessment of Educational Achievement);
-  Increasing or decreasing the number of steps a student must perform in order to produce a response (mathematics or science items); or
-  For multiple choice items, increasing or decreasing the “plausibility” or “relative correctness” of incorrect options by using responses that reflect reasoning or solution paths that students might use.
One technique that should NOT be used to change the facility of a selected-response-type item is to change the number of distracters (such as making an item easier by changing from 5 to 3 options). Although the probability of correct response will increase due to the increased chance of guessing, this increase is unrelated to the skill levels of students. Therefore, it does not provide any additional information. For similar reasons, use of True-False type items should generally be avoided on large-scale assessments, because they provide little information about student proficiency.
What is the ideal item facility for criterion-referenced tests? In criterion-referenced tests, such as mastery tests or minimum competency tests, we are primarily interested in assigning scores such that students whose skills are above a specific level of proficiency have higher scores than students whose scores are below this level. The specific level might be set to determine passing or failing or to distinguish between adequate and excellent students. The majority of the students will more than likely have scored clearly above or below the criterion, so there is no need to further distinguish between the students who are clearly above or the students who are clearly below. Accordingly, the items for this type of test should be selected as if the population being assessed consists only of those students with skills around the level of the criterion.
We could start developing and piloting or field testing items for a criterion-referenced test by selecting a sample of students whose skills or achievement levels are considered close to the level of the criterion. The assessment team might ask teachers to identify students who are on the borderline between passing and failing or between adequate performance and excellent performance, based on their own perceptions and/or previous test results. Using this sample of “borderline” students, we could then use the results of the pilot test to select items which were most effective in terms of differentiating between these selected students.
An alternative method is to define criteria in terms of percentile rank scores of the full population on the item universe. Percentile rank scores express each student’s score in terms of the percentage of students with lower scores. Students with lower percentile rank scores have lower percent-correct scores than students with higher percentile ranks. We can apply this interpretation of test performance to individual items (see Figure 15.5). If the errors on individual item responses are normally distributed, then a student with a percentile rank score of x should have a 50% chance of scoring correctly any item that was correctly answered by (100-x) % of all students. Another way of stating this principle is that, if x% of students correctly responded to an item, we would expect that the 100-x% of students with the lowest total test scores would likely get it incorrect and the x% with the highest total test scores would likely get it correct. Thus, to create a test that determines whether or not students are above a certain percentile, we should select items with the corresponding complementary facility. For example, if we wanted to determine whether or not students are above or below the 75th percentile, we would select items with a facility around 0.25, as shown in Figure 15.5.
Figure 15.5 The probability of correct response (item facility) for students at different levels of proficiency, expressed as percentile rank

To use this principle for item selection, it is necessary to have a fairly good idea of the percentile rank score corresponding to the criterion. For example, if a test were to be used to decide which members of the final year of the primary school population were to receive scholarships, the criterion for determining scholarship recipients might be set at the 85th percentile to ensure that the number of scholarship recipients would correspond to the top-scoring 15% of the primary school leaving population. The percentile rank score of a student who has a 50-50 chance of receiving a scholarship is 85. Applying the principle defined above, we find that the ideal item facility index for selecting scholarship students is equal to (100 − 85)/100, which is 0.15.
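The rule used in these two examples reduces to a one-line calculation; the sketch below simply restates it.

```python
# Ideal item facility for a criterion set at a given percentile rank.
def ideal_facility(criterion_percentile):
    return (100 - criterion_percentile) / 100

print(ideal_facility(75))   # 0.25, the example in Figure 15.5
print(ideal_facility(85))   # 0.15, the scholarship example
```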

15.1.6.                 Item discrimination

The item discrimination index is another good indicator of the usefulness of an item. The term “discrimination” is used to refer to the ability of items to elicit different item scores from students with different levels of proficiency. If all students produce the same response to an item, regardless of their level of proficiency, the item cannot discriminate between different levels of student proficiency. Item discrimination indicates the extent to which an individual item measures what the whole test is measuring. Given pre-defined groupings of students according to percentile rank, the probability of performing correctly on an item should be greater for the higher proficiency groups than for the lower proficiency groups. Items are most useful when the probability of correctly responding is very different between different proficiency groups.
Consider the following example, where a head teacher wishes to use a mathematics test to group 100 students into three skill levels (low, medium, and high). If the head teacher wishes to have equal size groups, the criteria for determining group membership, in percentile rank scores, are 33 and 66 on the mathematics test results. Figure 15.6 illustrates the differences in effectiveness of three specific items in terms of their ability to discriminate between these three proficiency groups. The items correspond to the following tasks:
- Item 1: Identify the union of these two sets: {34,16,45,7,11,2,8,28} U {1,67,9,2,26,8,4}
- Item 2: Identify the union of these two sets: {a,5,j,5,12,Q,r,45,2} U {w,t,q,A,9,b,5,twelve,j}
- Item 3: Identify the union of these two sets: {1,2,3,4,5,6,7,8,9,10} U {6,7,8,9,10,11,12,13,14,15}
Responses to Item 1 show very distinct differences between the proportions of students in the different groups who answer the item correctly. The low-skilled students have only a 1% probability of responding correctly, compared to 48% for the medium-skilled students and 97% for the high-skilled students. Item 2 has a relatively similar probability of correct response for all three skill levels (42%, 50%, 58%). Responses to Item 3 reveal that low-skilled students have a small probability of correct response (16%), while the medium- and high-skilled students have a large probability of correct response (92% and 100%). The item facility is the same for Item 1 and Item 2 (0.50), but Item 3 is easier (0.70).
Figure 15.6 Discrimination of three test items as group-specific probabilities by proficiency level
The statistical information can be used to understand how students approach the different items. Item 1 is most effective in discriminating between below-average and above-average students, because the requirements are clear, yet the task is complex.
In contrast, Item 2 may be confusing to students. The figure shows that students at each of the three levels of ability had about equal probabilities of success on Item 2. Some may have been confused by the inclusion of letters with numbers, and others may have wondered if upper case letters should be treated differently from lower case ones, if the number “12” is equal to the word “twelve,” or if repeated elements were counted as unique. Some high proficiency students may have found the question ambiguous and opted for incorrect answers, while some lower-skilled students may have selected the correct response for inappropriate reasons. This poor-quality item should be edited to remove the various sources of ambiguity and then retested in a further pilot. Item 3 provides much clearer information than Item 2, as indicated by the strong discrimination primarily between low performers and the other two groups. Because all of the elements in each set are already ordered, the sets have the same number of elements, and the elements are consecutive, students have to do less analysis to answer Item 3 than either of the previous two items. Item 3 discriminates well between the low group and the other two groups; it is also easier than Items 1 and 2.
But of these three items, which is the best? The answer to this question depends on the purpose of the test. It is clear that Item 2 is the weakest and is not likely to serve any useful purpose. If the purpose of the test is to distinguish between low, medium and high-proficiency levels, then Item 1 is the best item, because it has the largest differences between all skill levels in terms of the probability of getting the item correct. In contrast, Item 3 has very similar probabilities for the medium and high groups, suggesting that medium and high skilled students would probably get the same score on this item. However, if the test were a criterion-referenced test used to distinguish between low and medium-skilled students, then Item 3 is the best item, because it has the largest change in probability of getting a correct response between low and medium skill levels.
Although the criteria used for choosing the best items are dependent on each specific testing situation, the following guidelines can be used to select items based on the property of item discrimination:
-     For norm-referenced tests, select items where the change in probability is large and relatively equal in magnitude between low-to-medium and medium-to-high skill levels.
-     For criterion-referenced tests, if the criterion is to distinguish between low and medium skill levels, select items with large differences in probability between low and medium skill groups and similar probabilities for medium and high groups. For criterion-referenced tests that seek to distinguish between medium and high skill levels, select items that have large differences in probability between medium and high groups and a similar probability for low and medium groups.
In the context of national assessments, different stakeholders typically expect the national assessments to perform both functions. For example, the Ministry of Education may be interested in a national assessment primarily for research purposes to understand why males and females perform differently. Because there are both males and females at all ranges of proficiency, this assessment would be best served with a test built following the norm-referenced principles. On the other hand, if the national assessment is specifically interested in identifying which types of students are performing below the expected standard, then the test should use criterion-referenced principles to select test items that show the greatest discrimination at lower proficiency levels.
Analyzing pilot test data would be time consuming if you were required to examine figures similar to Figure 15.6 for every test item. An alternative approach is to use the classical index of discrimination. This index can be calculated in a number of different ways. In the following example, it is calculated simply as the difference in probability of correct response between low-skilled and high-skilled students. Table 15.3 contains the item facilities for the high, middle and low performing students of a group and the index of discrimination for five items. Students were assigned to these groups based on their overall test performance. In this example, the index estimates in the final column of the table correspond to the differences between the item facilities of the high and low groups. Because the index of discrimination is simple to calculate and captures the general usefulness of an item, it is a common metric used to select test items.


Table 15.3 Index of discrimination
As a rule of thumb, suitable test items should have a discrimination value above 0.25. For pilot test data, this may be relaxed to 0.20. Further examples of using discrimination indices in selecting items can be found in Anderson and Morgan (2008).
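As a worked sketch of the simple index described above: group students into thirds by total test score, then take the difference in item facility between the top and bottom thirds. The response data and the helper function below are illustrative, not part of IATA.

```python
# Simple index of discrimination: facility(high third) - facility(low third).
def discrimination_index(item_scores, total_scores, fraction=1/3):
    """item_scores, total_scores: parallel lists with one entry per student."""
    n_group = max(1, int(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = ranked[:n_group], ranked[-n_group:]
    p_low = sum(item_scores[i] for i in low) / n_group
    p_high = sum(item_scores[i] for i in high) / n_group
    return p_high - p_low

item_scores = [0, 0, 1, 0, 1, 1, 1, 1, 1]         # scores on one item
total_scores = [3, 5, 8, 10, 12, 15, 17, 18, 20]  # total test scores
d = discrimination_index(item_scores, total_scores)
print(f"discrimination = {d:.2f}, keep item: {d >= 0.25}")
```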

15.1.7.                 Distractor analysis

There are many reasons why an item may have a low or even a negative index of discrimination. These reasons include poor wording, confusing instructions on how to answer a question, sampling errors, and miskeying or miscoding. This section provides an overview of how item statistics may be used to detect and remediate some common errors that become apparent in analyzing pilot test items and, in some instances, in analyzing the item data from the national assessment tests themselves.
In its simplest form, distractor analysis looks at how each option (or score code) discriminates between three student skill levels (high 1/3, middle 1/3, and low 1/3), based on the overall test score. Table 15.4 presents a typical distractor analysis for an individual item.

Table 15.4 Response error (or distractor) analysis
Item Q9 has 4 response options and two missing value codes (8 and 9). The missing response code 8 indicates that it was not possible to score the student response, because the response was illegible, two options were selected, or there was some other operational problem. The missing response code 9 indicates that the response was left blank by the student. The asterisk (*) beside Option 1 indicates that it is the item key, or correct response. The total percentage of students selecting Option 1 is equal to the item facility. In general, a well-functioning item should have the following characteristics:
- the column for the correct option should have a high percentage for the high group, and successively lower percentages for the middle and low groups;
- the columns corresponding to incorrect options should have approximately equal percentages within each skill level and overall;
- for the high skilled group, the percentage choosing the correct option should be higher than the percentage choosing other options;
- for the low skilled group, the percentage choosing the correct option should be lower than the percentage choosing other options;
- for all groups, the percentage of missing value codes should be close to 0;
- if there are a large number of missing responses, the percentages should be equal across skill levels.
When an item does not have these desirable characteristics, it is usually the result of one of the following errors: miskeyed or miscoded responses, multiple correct responses, confusing item requirements, or item content that is irrelevant, too difficult, or too easy. Examples of each of these types of problem items are shown in Table 15.5, Table 15.6, Table 15.7, and Table 15.8.

Table 15.5 Miskeyed or miscoded responses
The distractor analysis data in Table 15.5 show that the item key for Q9 was specified as 3, rather than 1. Using the simple approach towards calculating a discrimination index shown in Table 15.3, the item discrimination for Q9 would equal -0.23 (0.147 - 0.381). The presence of a negative index of discrimination suggests that an item has probably been miskeyed. You can identify the correct key by finding the option that best satisfies the conditions described above. In this case, Option 1 (which has been shaded) is the only option where the percentage of students choosing the correct option is higher than the percentage choosing the other options. Note that the selection of the correct response is primarily the responsibility of the subject matter specialist, and not that of the person entrusted with data analysis.
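The kind of check described here can be partly automated. The sketch below, using illustrative response data loosely patterned on the situation in Table 15.5, cross-tabulates option choices by proficiency group and flags the option that behaves most like the key (largest high-minus-low difference). As noted above, the final decision about the key rests with the subject matter specialist, not the analyst.

```python
# Cross-tabulate option choices by proficiency group and flag a likely key.
from collections import Counter

def distractor_table(options_by_group, options="1234"):
    """options_by_group: dict of group name -> list of selected options (strings)."""
    table = {}
    for group, choices in options_by_group.items():
        counts = Counter(choices)
        n = len(choices)
        table[group] = {opt: 100 * counts.get(opt, 0) / n for opt in options}
    return table

responses = {                                   # illustrative data only
    "high 1/3":   ["1"] * 45 + ["2"] * 10 + ["3"] * 15 + ["4"] * 30,
    "middle 1/3": ["1"] * 35 + ["2"] * 20 + ["3"] * 25 + ["4"] * 20,
    "low 1/3":    ["1"] * 20 + ["2"] * 30 + ["3"] * 35 + ["4"] * 15,
}
table = distractor_table(responses)
likely_key = max("1234", key=lambda o: table["high 1/3"][o] - table["low 1/3"][o])
print(f"Option behaving most like the key: {likely_key}")
for group, row in table.items():
    print(group, {opt: round(pct, 1) for opt, pct in row.items()})
```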
Table 15.6 Low Discrimination: more than one “correct” response
Group          Q9:  1*      2       3       4       8       9      Total
high 1/3            55.2    0.1     37.9    2.5             4.2    100.0
middle 1/3          52.5    2.6     30.8    6.9     1.1     6.2    100.0
low 1/3             41.3    17.4    24.2    17.6    0.7     8.1    100.0
Total               50.5    6.6     27.4    8.9     0.6     6.1    100.0
The distractor analysis in Table 15.6 presents the results of Q9 when Option 3 has been identified as a correct response, in addition to Option 1. This error can happen when the assessment team tries to increase the difficulty or facility of an item by increasing the attractiveness of specific distracters. From the students’ perspective, the instructions or question stem may be ambiguous, and they may be forced to rely on “common sense” rather than appropriate knowledge or skill to choose a response. For items with these patterns, the item developers should focus on clarifying the item such that there is an unambiguously correct answer.
Table 15.7 Low Discrimination: not measuring the correct domain

Group          Q9:  1*      2       3       4       8       9      Total
high 1/3            38.6    18.1    13.9    25.2            4.2    100.0
middle 1/3          27.0    12.7    20.5    32.5    1.1     6.2    100.0
low 1/3             21.5    25.4    34.2    10.1    0.7     8.1    100.0
Total               29.0    18.7    22.9    22.6    0.6     6.1    100.0
A distractor analysis similar to the one shown in Table 15.7 suggests that an item may have little to do with the subject area being assessed by the other test items. There is a weak relationship with proficiency for the correct response, a substantial amount of missing data and ambiguous relationships with proficiency for the distractors. There are several reasons why this may happen, even when the item is valid from a content perspective. These include:
- The reading requirement may be too demanding, particularly if the test is not intended to measure reading skill. This error can be remedied by reducing the reading demands of the item so its meaning is clearly understood by all students.
- The wording of the question may be ambiguous, making it unclear what information the item requires. This error can be remedied by field testing items in an interview setting with students and asking them to think aloud as they respond to each item. Misconceptions produced by instructions (that may make sense to teachers and item writers but not to students) can be identified and clarified.
- The item may be biased towards specific groups of students. For example, a mathematics item that uses real football statistics may be biased towards boys, who may have knowledge of the statistics without being able to solve the mathematics problem. Item bias can be reduced by using think-aloud procedures in pilot test settings, where students describe their thought processes as they complete each test item.
Table 15.8 Low Discrimination: too easy or too difficult

Group          Q9:  1*      2       3       4       8       9      Total
high 1/3            10.2    32.9    23.2    29.5            4.2    100.0
middle 1/3          5.1     34.6    20.5    32.5    1.1     6.2    100.0
low 1/3             2.1     24.3    34.2    30.6    0.7     8.1    100.0
Total               5.8     30.6    26.0    30.9    0.6     6.1    100.0
The distractor analysis in Table 15.8 does not indicate that anything is wrong with the item per se. The correct option (1) has a higher percentage for the high group, and successively lower percentages for the middle and low groups. Even though the relative probability of correct response changes across skill levels, the discrimination is too small for it to provide much useful information. Each of the distracters has a greater chance than the correct option of being selected by students of all ability levels. Extremely difficult items should be avoided on large-scale assessments, to the extent that is possible without reducing test validity. A similar problem occurs when items are too easy. If an item is almost universally answered correctly, it will not be able to discriminate between students. However, it may still be desirable to retain very easy items to increase student motivation at the beginning of a test, and it may be necessary to include either very easy or very difficult items if they are required to satisfy the test specifications.

15.1.8.                Summary

A test is constructed of individual items. These items are drawn from a universe (possibly an infinite number) of items that measure the same subject area. A person’s true score is the test score that he or she would have achieved if the entire universe of items had been included on the test. The true score is equivalent to the probability that a student will correctly respond to a test item in the subject area. A person’s observed score is the score he or she achieves on the sample of items included on the test. Observed scores are used to estimate true scores. The standard error of measurement of a particular observed score indicates the uncertainty with which the observed score reflects the true score. For an entire test, reliability summarizes the accuracy of the observed scores. The property of attenuation due to unreliability can help determine the minimum reliability required for a test to satisfy certain purposes.
The characteristics of individual items can be analyzed to determine the best items for the purpose of the test. Two item characteristics in particular are important: item facility and item discrimination. As much as is possible, the facility of items on a test should be around 0.50, or halfway between the chance of guessing correctly and 1. In order to improve their measurement quality, items can be made more or less difficult by increasing or decreasing their cognitive requirements without changing their content. Item discrimination also plays a role in determining the quality of a test item. Suitable items should have a large difference in the probability of correct response between students of different skill levels. The difference in probability between students of low and high skill levels should be greater than 0.25. For criterion-referenced tests, the largest change in probability should be in the region where the criterion is to be applied.

15.2.  Item Response Theory (IRT)

The previous section introduced two aspects of the classical test theory approach to measuring proficiency, namely item facility (or difficulty) and item discrimination. In the remainder of this annex we examine an alternative approach, Item Response Theory (IRT), which unifies the concepts of item facility and discrimination. IRT has also been described as latent trait theory. It is the most widely used theoretical approach in large-scale assessments.
A good starting point to understand IRT is to contrast what constitutes a good test item from the CTT perspective and the IRT perspective. The classical item statistics of facility and discrimination were focused on estimating and comparing the probability of correct response for different students. In contrast, IRT characterizes students by the type of item response they are likely to produce and tries to describe the distributions of proficiency for students that respond in different ways. A good test item from a CTT perspective has large differences in probability of correct response for students of different proficiency levels. From an IRT perspective, a good test item is one where the distribution of proficiency for students who correctly answered is different from the distribution of proficiency for students who answered incorrectly.
Whereas CTT focuses on the probability of correct response, IRT focuses on estimating the distribution(s) of proficiency. While these two perspectives are generally in agreement, the IRT perspective describes items in a much richer and more useful way.
The two distributions in Figure 15.7 illustrate some fundamental features of IRT. The two curves represent distributions of proficiency[4] for respondents to a single test item. The distribution on the left describes the proficiency of students who responded incorrectly, and the distribution on the right describes the proficiency of students who responded correctly. This item has a facility of 0.50, which reflects the identical height of the two distributions along the vertical axis -- there are as many correct respondents as incorrect respondents. The mean proficiency of correct respondents is 0.10, which is reflected in the graph by the peak of the distribution for correct students being directly above 0.10 on the proficiency axis. Because the overall mean of both populations is 0, and they are equal in size, the mean proficiency of incorrect respondents is symmetric, at -0.10. The two distributions are very similar to each other in terms of both size and location, indicating that there is very little difference in proficiency between the type of students who correctly respond and the type who incorrectly respond. If there were no difference at all, both distributions would be identical with means equal to 0, and the item responses would have no relationship with proficiency.
Figure 15.7 Distributions of proficiency for correct and incorrect respondents to a single test item, facility = 0.50, mean proficiency of correct respondents = 0.10
A much more accurate test item, also with facility of 0.50, is illustrated in Figure 15.8. This item illustrates the strongest relationship between item response and proficiency, where the mean proficiency of the correct respondents is approximately 1 and the mean proficiency of the incorrect respondents is approximately -1. There is no overlap in the distributions, indicating that, in terms of proficiency, the correct respondents are completely distinct from the incorrect respondents.

Figure 15.8 Distributions of proficiency for correct and incorrect respondents to a single test item, facility = 0.50, mean proficiency of correct respondents = 0.99
In practice, it is extremely rare that correct respondents are completely distinct from incorrect respondents. There is typically a wide region of proficiency in which the two distributions overlap. In this region, there is a smooth transition as students with increasing proficiency become less likely to be members of the incorrect distribution and more likely to be members of the correct distribution. This transition is illustrated in Figure 15.9 for an item with facility of 0.60 (indicating the distribution for correct respondents is larger than that for incorrect respondents) and a mean proficiency of correct respondents of 0.40. The solid curved line, which is also known as an item response function (IRF), describes the size of the distribution of correct respondents relative to the size of the distribution of incorrect respondents. In other words, in regions of proficiency where the height of the correct distribution is lower than the height of the incorrect distribution, the IRF is below 0.5, and when the reverse is true, the value is above 0.5. The IRF can be interpreted as the probability that a respondent with a given proficiency level will belong to the group of correct respondents. The exact values of the IRF can be calculated by dividing the probability for the distribution of correct respondents by the sum of probabilities of both distributions. For example, at the proficiency value of -1, the probability value of the Correct respondents is approximately 0.06 and the value for the Incorrect respondents is approximately 0.15; 0.06/(0.06+0.15) = 0.29. Because the proportion of incorrect respondents is the reverse of the proportion of correct respondents, and the mean proficiency of incorrect respondents can be calculated from the mean proficiency of the correct respondents (given that the overall mean equals 0), the IRF is a function of the item facility and the mean proficiency of the correct respondents. Exact calculations are presented in section 15.2.1.
Figure 15.9 Distributions of proficiency for correct and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.60, mean of correct respondents = 0.40
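The worked calculation above is simply a ratio of the two distribution heights at a given proficiency; a minimal sketch, using the approximate heights read from Figure 15.9 at a proficiency of -1:

```python
# IRF value at a given proficiency = height of the "correct" distribution divided
# by the sum of the heights of both distributions.
def irf_value(height_correct, height_incorrect):
    return height_correct / (height_correct + height_incorrect)

print(round(irf_value(0.06, 0.15), 2))   # 0.29, as in the worked example above
```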
The mathematical equation using distributional parameters to describe the IRF is quite lengthy. The common practice is to describe the IRF in terms of two other parameters (which are described in greater detail in the following section). The simplest parameter, denoted b, is identified as the threshold where the two distributions intersect (the vertical line in Figure 15.9, equal to -0.408). At this threshold on the proficiency scale, students have an equal likelihood of belonging to either the correct distribution or the incorrect distribution. Consequently, it is the location where the item is most useful in distinguishing between the two types of students. The degree of accuracy of the item in distinguishing between the two types of students is proportional to the slope of the IRF at this location (the oblique line in Figure 15.9), denoted as the a parameter. The a parameter is typically a transformation of the slope[5]; in this case, the value of a is 0.85. As the differences between the distributions of correct and incorrect respondents increase, in either overall probability or location, the slope of the IRF increases, reflecting the stronger relationship. For comparison, Figure 15.10 illustrates an item where the mean proficiency of correct respondents is the same as in Figure 15.9, but the facility is much higher, equal to 0.70. The greater difference between the two distributions is reflected in the greater value of the a parameter (reflected in the steeper slope), which is 1.25 for this item. In general, accurate test items will have steep S-shaped curves, indicated by high values of the a parameter.

Figure 15.10 Distributions of proficiency for correct and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.70, mean of correct respondents = 0.40
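Once the a and b parameters are known, the IRF itself is easy to evaluate. The sketch below uses the standard two-parameter logistic form with the 1.7 scaling constant mentioned in footnote [5] and the a and b values quoted for Figure 15.9; the exact parameterization used internally by IATA may differ.

```python
# Two-parameter logistic IRF: P(theta) = 1 / (1 + exp(-1.7 * a * (theta - b))).
import math

def irf_2pl(theta, a, b, scale=1.7):
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))

a, b = 0.85, -0.408                      # values quoted for the item in Figure 15.9
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: P(correct) = {irf_2pl(theta, a, b):.2f}")
# At theta = b the probability is exactly 0.5, the threshold described above.
```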
A special case of the IRF describes a situation where lower-proficiency students tend to guess rather than attempt to correctly answer an item. This situation occurs in multiple choice items that are confusing or have implausible distractors. In this case, many students correctly respond to an item without having the same distribution of proficiency as the ‘true’ correct respondents. As a result, the item is characterized better by the proficiency of the students who incorrectly respond than by the proficiency of correct respondents. As shown in Figure 15.11, the population of incorrect respondents is divided into two components: one component is the proportion of respondents who actually scored incorrectly (0.40), whereas the other component is the proportion of students who guessed correctly (0.20), labelled ‘incorrect guessers.’
As a result of this division of the incorrect respondents into two distributions, the IRF, which is the distribution of correct respondents divided by the sum of the other distributions, has a lower asymptote, meaning that even the students at the lowest end of the distribution have a non-zero chance of responding correctly to the item. The minimum chance of responding correctly is denoted as the c parameter, and it is equal to the proportion of the incorrect respondent population that guessed correctly. In Figure 15.11, the threshold, or b parameter, is no longer at the intersection of the distributions of incorrect and correct respondents; instead, it is located where the intersection would have been if the incorrect-guessers had correctly been assigned to the incorrect respondent distribution.
Figure 15.11 Distributions of proficiency for correct and incorrect respondents to a single test item and conditional probability of correctly responding, facility = 0.60, mean of correct respondents = 0.80, lower asymptote of IRF=0.33
In practice, it is difficult and statistically complex to distinguish between Correct respondents and Incorrect guessers. This type of estimation is available in the advanced functionality of IATA, where the proportion of Incorrect guessers is described by a third item parameter, denoted as the c parameter. Estimating the c parameter tends to require larger samples of respondents, and the estimates of the a and b parameters in this scenario are not as reliable as when they are estimated by themselves. For this reason, it is typically more efficient to assume the proportion of Incorrect guessers is equal to 0 and simply remove from analysis any items where this assumption is not met (i.e., where the empirical IRFs indicate that the lowest proficiency respondents have a higher probability of correct response than predicted from the theoretical IRFs).
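For completeness, a sketch of the three-parameter logistic form implied by this discussion, in which the lower asymptote c keeps the probability of a correct response above zero even at very low proficiency. The a and b values here are purely illustrative; c = 0.33 echoes the lower asymptote in Figure 15.11.

```python
# Three-parameter logistic IRF with lower asymptote c (the "guessing" parameter).
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

a, b, c = 1.0, 0.0, 0.33                 # illustrative a and b; c as in Figure 15.11
for theta in (-3, -1, 0, 1, 3):
    print(f"theta = {theta:+d}: P(correct) = {irf_3pl(theta, a, b, c):.2f}")
# As theta decreases, P(correct) approaches c rather than 0.
```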



[1] The term ‘error of measurement’ refers to the variation or uncertainty in measurement; it should not be considered a mistake on the part of the student.
[2] The quadratic mean.
[3] Different forms of test reliability include test-retest, alternate forms, split half, inter-rater and internal consistency.
[4] In IRT, student proficiency is described on a scale (often called theta) that is similar to the Z-score scale: the theoretical average proficiency level is 0, and the standard deviation is 1. Most students usually have scores between -2 and 2, and less than one in a thousand students will have scores less than -3 (or greater than 3)
[5] For algebraic reasons, the a parameter is four times the value of the slope at the location of the b parameter. Usually, this value is further divided by 1.7 so that it may be used in the easy-to-use logistic model while approximating the cumulative normal distribution function.
[6] Although there are other IRT models appropriate to different methods of scoring, the model described here is most appropriate to the rubric-based scoring schemes typically found in national assessments, where higher scores are assumed to represent success on the requirements associated with lower scores.
[7] If, for example, a curriculum fundamentally changes such that the definition of mathematics proficiency changes from computational speed to visualizing patterns, then the property of invariance would no longer hold true. In this case, the statistical behaviour of the items in populations for whom mathematics is primarily computational would not be consistent with their behaviour in populations where mathematics is primarily visualizing patterns.
[8] There is no equation for transforming the c parameter, nor is there a standard method for equating a or b parameters when the c parameter differs substantially across different populations.
[9] It is more common in practice to add the logarithms of the likelihood values rather than multiplying them. This approach minimizes the effects of rounding error and significant digit truncation during the calculation process.
[10] The score corresponding to the maximum value of the posterior distribution is called the maximum a posteriori (MAP) estimate. The EAP estimate is preferred over the MAP because it tends to be more stable to calculate.
[11] The presentation of information for partial credit items is more complex but essentially has the same interpretation, in that it represents the conditional variance of the item scores at a given level of proficiency. See Samejima (1974) for a complete discussion of the item information function used for partial credit items in IATA.
[12] For illustration purposes, the items all have a parameters that are much greater than would ordinarily be encountered in educational assessment. Large a parameters accentuate the effects of each test item on the TCC.
[13] The examples in this section compare parametric IRFs for the two populations for reasons of clarity. This method is not as sensitive to group differences as non-parametric methods. The DIF results in IATA are based on the observed proportion of correct responses at each proficiency score. However, the interpretation of the results is identical for both methods.
