Pianissimo wrote: Wed May 13, 2020 8:48 am
Hoping a stats person can help me out here
Let's say, hypothetically, that you had two groups of test data where one group of lower ability took an easier test than the other on the same subject. Let's call them, hypothetically, the foundation and higher groups. If you then needed to rank both groups on one scale, what would be the best way of calibrating the data in the tests so that the two groups could be quantifiably compared?
Oh, I wouldn't start from where you are....(but you know that already!)
What kind of items (questions) were they? Multiple Choice Questions (MCQs), Short Answer Questions, or something else? It would help to know.
If the items were all drawn from a bank that had previously been used with a larger* number of candidates, then Item Response Theory (IRT) approaches will allow you to calculate an absolute difficulty for each item, via its Item Characteristic Curve. A Rasch (one-parameter) model is sufficient for many purposes, but I would prefer a three-parameter model (where one parameter is a guessing correction) for MCQs, especially if they were one-best-of-four rather than one-best-of-five or more.
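For concreteness (this is a generic sketch, not anything fitted to your data), the three-parameter curve is just a shifted logistic with a guessing floor; in Python it might look like this, with a, b, c being the standard discrimination, difficulty, and guessing parameters:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta : candidate ability
    a     : item discrimination (slope)
    b     : item difficulty (location on the ability scale)
    c     : pseudo-guessing floor (e.g. ~0.25 for one-best-of-four MCQs)
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# The Rasch model is the special case with a fixed at 1 and no guessing floor:
def icc_rasch(theta, b):
    return icc_3pl(theta, a=1.0, b=b, c=0.0)
```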
Then you can calculate the difficulty of each test, based on its item difficulties, on a common scale, and job's a good 'un.
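Purely by way of illustration, with made-up b-values: once every item has a fitted difficulty on a common logit scale, the crudest per-test summary is just the mean item difficulty.

```python
# Hypothetical fitted difficulties (b parameters) on a common logit scale
foundation_b = [-1.8, -1.2, -0.9, -0.4, 0.1]
higher_b     = [-0.3,  0.2,  0.8,  1.1,  1.7]

# A crude per-test difficulty: the mean of its item difficulties
print(sum(foundation_b) / len(foundation_b))  # -0.84
print(sum(higher_b) / len(higher_b))          #  0.70
```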
I'm not getting the sense that you have that kind of information, however: it sounds a bit like someone has given you this task, despite there being completely insufficient data to do it properly...and you still have to do it!
...so I'm going to make something up, in the hope that it is helpful. I'm going to assume that there is some overlap between both tests and candidates: if all the items in the foundation test are easier than any item in the higher test, and all the candidates in the higher group are brighter than any candidate in the foundation group, you will be no further forward. I'll also assume that you have a decent number of students in each group.
Combine the items from the two tests into one randomised list. Ask as many colleagues as you can who are familiar with (a) the topic under examination and (b) the kinds of student under discussion to rank the items in terms of difficulty. Average the ranks across colleagues, then order all the items by difficulty. (Also ask them how high up this ranked list a 'foundation' and a 'higher' student should get in order to pass at their level.)
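The averaging step is trivial, but for the avoidance of doubt, something like this (item names and rankings entirely made up):

```python
from statistics import mean

# Each colleague ranks every item, 1 = easiest. Invented data.
rankings = {
    "colleague_1": {"F1": 1, "F2": 2, "H1": 3, "H2": 4},
    "colleague_2": {"F1": 2, "F2": 1, "H1": 4, "H2": 3},
    "colleague_3": {"F1": 1, "F2": 3, "H1": 2, "H2": 4},
}

items = rankings["colleague_1"].keys()
avg_rank = {item: mean(r[item] for r in rankings.values()) for item in items}

# Order all items by average judged difficulty
for item in sorted(avg_rank, key=avg_rank.get):
    print(item, avg_rank[item])
```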
If you are lucky, there will be some overlap in judged difficulty between the 'foundation' items and the 'higher' items.
Now plot the facility (% correct) of each item, for each group of students, against the rank order of difficulty. The best outcome for you is two clear distributions with a good overlap in the middle. By inspection, can you assume that the higher students would have passed most of the items in the easier test? If yes, you have a rank order. Is that enough, or do you need to assign a score to each candidate as well?
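If it helps, the plot I have in mind is something like this (matplotlib, with invented facility values; note the overlap at ranks 4 to 6, where the higher group unsurprisingly does better on the same items):

```python
import matplotlib.pyplot as plt

# Invented data: judged difficulty rank (x) and facility, % correct (y),
# for the items each group actually sat
foundation_rank = [1, 2, 3, 4, 5, 6]
foundation_fac  = [92, 88, 81, 70, 55, 40]
higher_rank     = [4, 5, 6, 7, 8, 9]
higher_fac      = [95, 90, 84, 72, 60, 45]

plt.scatter(foundation_rank, foundation_fac, label="foundation group")
plt.scatter(higher_rank, higher_fac, label="higher group")
plt.xlabel("item difficulty rank (judged, 1 = easiest)")
plt.ylabel("facility (% correct)")
plt.legend()
plt.show()
```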
If you need to assign a score as well as a rank, then we could consider the 'pass marks' your reference group came up with: but I'll leave it there for the moment.
Feel free to PM me if you like; I do assessment as a day job. Some more details would help! If this is a high-stakes assessment (people's lives at risk), that makes a difference, for instance.
*The exact number is contested, but a minimum of 400 candidates is sometimes suggested.