Thoughts
OpenAI published the scores GPT-4 got on several standardized exams. It does well on most of them, but the one that
immediately stands out to me is the AMC 10. The AMC is the American Mathematics Competition (the AMC 10 is for 10th grade and below). GPT-4 got a 30/150 on the test, which is the 10th percentile. This is much worse than its performance on other tests. (For example, it's in the 89th percentile on the math section of the SAT.)
To try to understand what's going on here, I'm going to quote part of the grading description for the test.
> This is a 25-question, multiple choice test. Each question is followed by answers marked A, B, C, D and E. Only one of these is correct.
> You will receive 6 points for each correct answer, ... 1.5 points for each problem left unanswered ... and 0 points for each incorrect answer.
The AMC is supposed to be hard. You are not supposed to be able to answer every question, which is why you get points for leaving questions blank. If you left every question blank, you would get a 37.5, and do better than GPT-4! But GPT isn't designed to leave questions blank. It will generate a plausible-sounding BS answer for every single question. Now, if you guessed randomly on every question, you'd expect 5 correct answers for a score of 30, which happens to be exactly the score GPT got.
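To make the arithmetic explicit, here's a quick sketch of both strategies under the scoring rules quoted above (nothing here beyond those rules):

```python
QUESTIONS = 25
CORRECT_PTS = 6.0   # points per correct answer
BLANK_PTS = 1.5     # points per question left unanswered
WRONG_PTS = 0.0     # points per incorrect answer
CHOICES = 5         # answers A through E

# Strategy 1: leave every question blank.
all_blank = QUESTIONS * BLANK_PTS

# Strategy 2: guess uniformly at random on every question.
# Expected number correct is 25 * (1/5) = 5.
expected_random = QUESTIONS * (CORRECT_PTS / CHOICES + WRONG_PTS * (CHOICES - 1) / CHOICES)

print(all_blank)        # 37.5
print(expected_random)  # 30.0
```

So a blank sheet beats a random guesser by 7.5 points in expectation, and GPT-4's 30 is exactly the random-guessing expectation.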
The type of problem solving that you have to do on the AMC is exactly the type of stuff GPT struggles with. I remember geometry problems that require you to visualize spatial objects in a novel way. As a language model, GPT has no spatial reasoning ability. I think GPT is guessing randomly here, hurting its score.
I'm calling this the AMC effect. Text generation models would rather hallucinate than not answer, which hurts their performance in some settings. I think teachers could use this to detect cheating: include a nonsensical honeypot question, and instruct students that they don't have to answer every question. Any student who generates BS for the honeypot question either doesn't understand the material or is cheating with a language model.
Another weird note: GPT does better on the AMC 10 when it doesn't have visual input. Some of these questions are only possible to answer correctly with the diagram. OpenAI says they replace each image with the text "IMAGE:" plus "a non-meaningful filename wherever an image would be missing" (Appendix A.4). I think what's happening is that the text-only GPT, knowing it's missing information, opts not to answer these questions, and leaving them blank earns more points than guessing wrong.
But here's the weirdest part. GPT-4 does better on the AMC 12, landing around the 50th percentile with a 60/150. Why? This should be a harder test; it covers more concepts. I have no idea why it performs respectably here. It kind of throws a wrench in my whole theory.
I'm really curious whether (and how) their prompt for these tests explained that the test taker can leave questions blank; I think that's an important piece of information.
=> https://arxiv.org/abs/2303.08774 The GPT-4 paper (click PDF in the upper right)
=> https://artofproblemsolving.com/wiki/index.php/2022_AMC_10A_Problems The 2022 AMC 10