I think I’ll return to the derail topic of ‘have standards changed?’, just because I’m interested in it, and perhaps others are too.
As one might guess, it is quite complicated.
One could ask “Have questions (aka ‘items’) got easier?” There is a way to explore this using Item Response Theory (https://en.wikipedia.org/wiki/Item_response_theory), which relies on a number of fairly plausible assumptions.
For each item, we can develop an Item Characteristic Curve (ICC), which plots the probability of a correct response against candidate ability; where the curve sits on the ability scale tells us the item’s absolute difficulty. We could in principle compare the ICCs for items administered at various points in the past with those for current items, if we assume that candidate ability has remained unchanged (but see the Flynn effect: https://en.wikipedia.org/wiki/Flynn_effect).
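To make the ICC idea concrete, here is a minimal sketch using the standard three-parameter logistic (3PL) model; the parameter values are invented purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a=1.0, b=0.0, c=0.0):
    """3PL Item Characteristic Curve: probability of a correct response
    given ability theta, discrimination a, difficulty b, and a guessing
    floor c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)  # the ability scale
# Two hypothetical items differing only in difficulty b:
plt.plot(theta, icc(theta, a=1.2, b=-0.5), label="easier item (b = -0.5)")
plt.plot(theta, icc(theta, a=1.2, b=1.0), label="harder item (b = 1.0)")
plt.xlabel("ability (theta)")
plt.ylabel("P(correct)")
plt.legend()
plt.show()
```

Raising b shifts the whole curve to the right: a candidate needs more ability to have the same chance of answering correctly, which is exactly what ‘a harder item’ means in this framework.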
However, “Have items got easier (or more difficult)” is not the same question as “Have standards decreased (or increased)”. In fact, it is perfectly possible for standards to become more demanding, even when items become easier.
The first, and relatively trivial, way this can happen is if the level of performance required (e.g. the pass mark) is raised. In my day job, we routinely measure the difficulty of individual tests and adjust the ‘pass score’ accordingly, since we are aiming for the same level of performance across a number of test administrations.
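As a toy illustration of that adjustment (not our actual procedure, which uses formal equating and standard-setting methods such as Angoff): if a common anchor group averages lower on the current form, the raw pass score comes down by the same amount, so the ability required to pass stays constant.

```python
def adjusted_pass_score(reference_pass, anchor_mean_reference, anchor_mean_current):
    """Crude mean-equating sketch: shift the raw pass score by the
    difference in an anchor group's mean score on the two forms.
    (Real programmes use proper equating, but the principle is the same.)"""
    return reference_pass + (anchor_mean_current - anchor_mean_reference)

# Hypothetical numbers: the anchor group averaged 62/100 on the reference
# form but only 58/100 on the (harder) current form, so a pass mark of 60
# becomes 56.
print(adjusted_pass_score(60, 62, 58))  # -> 56
```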
The second relates to validity (‘what the test is intended to measure’). An item may be difficult but of low validity for the test’s purpose (‘true but trivial’ is one quite common failure mode in testing). Validity itself is a complicated construct, but in general I’m particularly interested in predictive validity: does the test predict later educational or workplace performance?
A test may therefore change to have items of lower abstract difficulty but disproportionately greater validity. In such a case, standards would have increased, even though item and test difficulty had decreased. This is similar to the situation Ken McKenzie described previously with regard to Physics degrees.
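For what it’s worth, predictive validity is usually operationalised as nothing fancier than a correlation between test scores and a later outcome measure. A minimal sketch, with made-up data:

```python
import numpy as np

# Hypothetical paired data: candidates' test scores and a later
# performance measure (say, a workplace assessment a year on).
test_scores = np.array([52, 61, 58, 70, 66, 74, 49, 80, 63, 57])
later_perf  = np.array([55, 60, 62, 71, 64, 78, 50, 77, 60, 59])

# Pearson correlation as a simple index of predictive validity.
r = np.corrcoef(test_scores, later_perf)[0, 1]
print(f"predictive validity (r) = {r:.2f}")
```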
A test can also be designed to be sensitive in a particular area. This is shown by something called the Test Information Curve, also derived from Item Response Theory. We could design a test to be sensitive at the highest levels of performance, with many difficult items. But the top-performing students are generally few in number. It is generally more useful to design a test to be sensitive around a critical boundary, such as the pass/fail boundary. In healthcare education, for instance, I’m more interested in the question “do those who pass deserve to pass?” than “have I ranked the top 5% correctly?”. I might therefore shift the mix of item difficulties to cluster around the pass/fail boundary, making the test as a whole technically ‘easier’ but more valid for my purpose.
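Again as a sketch: under the two-parameter logistic model, each item contributes Fisher information a² · P(θ) · (1 − P(θ)), and the test information curve is simply the sum over items. The item parameters below are invented to show the contrast between the two designs:

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-4, 4, 401)

# Hypothetical item banks (discrimination a, difficulty b):
# one clustered near the pass/fail boundary (b around 0),
# one weighted toward very difficult items (b around 2).
boundary_items = [(1.2, -0.3), (1.0, 0.0), (1.4, 0.2), (1.1, 0.1)]
hard_items     = [(1.2, 1.8), (1.0, 2.2), (1.4, 2.5), (1.1, 2.0)]

info_boundary = sum(item_information(theta, a, b) for a, b in boundary_items)
info_hard     = sum(item_information(theta, a, b) for a, b in hard_items)

# The first test measures most precisely near theta = 0 (the cut score);
# the second near theta = 2, telling us little about borderline candidates.
print("boundary test peaks at theta =", theta[np.argmax(info_boundary)])
print("hard test peaks at theta =", theta[np.argmax(info_hard)])
```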
Note: even the simple-seeming, but dubiously relevant, question of “has item difficulty decreased?” for something like A levels hasn’t been comprehensively answered. To calculate an ICC, a large number of candidate responses is needed; four hundred is sometimes suggested as a minimum. The Jones et al paper we have discussed previously on this thread managed to find just 66 candidate responses, at only 4 time points, and for only one subject. But I hope I’ve made it clear that item difficulty is not the key question.
I think I’ll pause there, and see if this has been of any interest to anyone. There is more I could say.
Oh boy, is there more I could say…