Pianissimo wrote: Wed May 13, 2020 8:48 am
Hoping a stats person can help me out here
Let's say, hypothetically, that you had two groups of test data where one group of lower ability took an easier test than the other on the same subject. Let's call them, hypothetically, the foundation and higher groups. If you then needed to rank both groups on one scale, what would be the best way of calibrating the data in the tests so that the two groups could be quantifiably compared?
Oh, I wouldn't start from where you are....(but you know that already!)
What kind of items (questions) were they? Multiple Choice Questions (MCQs), Short Answer Questions, or something else? It would help to know.
If the items were all drawn from a bank that had previously been used with a larger* number of candidates, then Item Response Theory (IRT) approaches will allow you to calculate an absolute difficulty for each item, via its Item Characteristic Curve. A Rasch (one-parameter) model is sufficient for many purposes, but I would prefer a three-parameter model (where one parameter is a guessing correction) for MCQs, especially if they were one-best-of-four rather than one-best-of-five or more.
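For concreteness (this is a generic sketch, not anything fitted to your data), the three-parameter curve is just a shifted logistic with a guessing floor; in Python it might look like this, with a, b, c being the standard discrimination, difficulty, and guessing parameters:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta : candidate ability
    a     : item discrimination (slope)
    b     : item difficulty (location on the ability scale)
    c     : pseudo-guessing floor (e.g. ~0.25 for one-best-of-four MCQs)
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# The Rasch model is the special case with a fixed at 1 and no guessing floor:
def icc_rasch(theta, b):
    return icc_3pl(theta, a=1.0, b=b, c=0.0)
```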
Then you can calculate the difficulty of each test, based on its item difficulties, on a common scale, and job's a good 'un.
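Purely by way of illustration, with made-up b-values: once every item has a fitted difficulty on a common logit scale, the crudest per-test summary is just the mean item difficulty.

```python
# Hypothetical fitted difficulties (b parameters) on a common logit scale
foundation_b = [-1.8, -1.2, -0.9, -0.4, 0.1]
higher_b     = [-0.3,  0.2,  0.8,  1.1,  1.7]

# A crude per-test difficulty: the mean of its item difficulties
print(sum(foundation_b) / len(foundation_b))  # -0.84
print(sum(higher_b) / len(higher_b))          #  0.70
```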
I'm not getting the sense that you have that kind of information, however: it sounds a bit like someone has given you this task, despite there being completely insufficient data to do it properly...and you still have to do it!
...so I'm going to make something up, in the hope that it is helpful. I'm going to assume that there is some overlap between both tests and candidates: if all the items in the foundation test are easier than any item in the higher test, and all the candidates in the higher group are brighter than any candidate in the foundation group, you will be no further forward. I'll also assume that you have a decent number of students in each group.
Combine the items from the two tests into one randomised list. Ask as many colleagues as you can who are familiar with (a) the topic under examination and (b) the kinds of student under discussion to rank the items in terms of difficulty. Average the ranks across colleagues, then order all the items by difficulty. (Also ask them how high up this ranked list a 'foundation' and a 'higher' student should get in order to pass at their level.)
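The averaging step is trivial, but for the avoidance of doubt, something like this (item names and rankings entirely made up):

```python
from statistics import mean

# Each colleague ranks every item, 1 = easiest. Invented data.
rankings = {
    "colleague_1": {"F1": 1, "F2": 2, "H1": 3, "H2": 4},
    "colleague_2": {"F1": 2, "F2": 1, "H1": 4, "H2": 3},
    "colleague_3": {"F1": 1, "F2": 3, "H1": 2, "H2": 4},
}

items = rankings["colleague_1"].keys()
avg_rank = {item: mean(r[item] for r in rankings.values()) for item in items}

# Order all items by average judged difficulty
for item in sorted(avg_rank, key=avg_rank.get):
    print(item, avg_rank[item])
```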
If you are lucky, there will be some overlap in judged difficulty between the 'foundation' items and the 'higher' items.
Now plot the facility (% correct) of each item, for each group of students, against the rank order of difficulty. The best outcome for you is two clear distributions with a good overlap in the middle. By inspection, can you assume that the higher students would have passed most of the items in the easier test? If yes, you have a rank order. Is that enough, or do you need to assign a score to each candidate as well?
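If it helps, the plot I have in mind is something like this (matplotlib, with invented facility values; note the overlap at ranks 4 to 6, where the higher group unsurprisingly does better on the same items):

```python
import matplotlib.pyplot as plt

# Invented data: judged difficulty rank (x) and facility, % correct (y),
# for the items each group actually sat
foundation_rank = [1, 2, 3, 4, 5, 6]
foundation_fac  = [92, 88, 81, 70, 55, 40]
higher_rank     = [4, 5, 6, 7, 8, 9]
higher_fac      = [95, 90, 84, 72, 60, 45]

plt.scatter(foundation_rank, foundation_fac, label="foundation group")
plt.scatter(higher_rank, higher_fac, label="higher group")
plt.xlabel("item difficulty rank (judged, 1 = easiest)")
plt.ylabel("facility (% correct)")
plt.legend()
plt.show()
```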
If you need to assign a score as well as a rank, then we could consider the 'pass marks' your reference group came up with: but I'll leave it there for the moment.
Feel free to PM me if you like; I do assessment as a day job. Some more details would help! If this is a high-stakes assessment (people's lives at risk), that makes a difference, for instance.
*The exact number is contested, but a minimum of 400 candidates is sometimes suggested.