(Background/disclaimer: I work at Anthropic, far from any efforts related to math reasoning or evaluations; opinions are my own. I was an IMO gold medalist in 2012.)
I way missed the news window on this one, but I thought it was interesting and a bit underappreciated that the fraction of gold medalists at the 2025 IMO (72/630 = 11.4%) is the highest it’s been since 1981.
Crudely, IMO gold medals are awarded to the highest-scoring 1/12 of contestants.1 However, because scores are integers up to 42 and there’s no provision for tiebreaking, it’s possible for a lot of contestants to be tied around the threshold. In that case, either all of them get a gold medal or none do, and the fraction of gold medalists might deviate substantially from 1/12. That’s what happened this year: 46 contestants all won a gold medal by scoring exactly 35 points.
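To make the tie effect concrete, here's a minimal sketch in Python using only the numbers quoted above (630 contestants, 46 tied at exactly 35, and 72 golds, so 26 contestants at 36 or more). Everyone below 35 is lumped at 0, which doesn't change the fractions; the point is just that the two candidate cutoffs straddle 1/12, so there's no way to award gold to exactly a twelfth of the field.

```python
# Simplified stand-in for the IMO 2025 scores, built only from the
# numbers quoted above: 26 contestants at 36+, 46 tied at exactly 35,
# and the remaining contestants lumped at 0 (their exact scores don't
# affect the gold fraction).
scores = [36] * 26 + [35] * 46 + [0] * (630 - 26 - 46)

def gold_fraction(scores: list[int], cutoff: int) -> float:
    """Fraction of contestants scoring at or above an integer cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Because 46 contestants are tied at 35, the two candidate cutoffs
# straddle the nominal 1/12 ~ 8.3%.
print(f"cutoff 36: {gold_fraction(scores, 36):.1%}")  # ~4.1%
print(f"cutoff 35: {gold_fraction(scores, 35):.1%}")  # ~11.4%
```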
Here’s how 2025 compares with the fraction of IMO gold medalists in each year since 2000. (See appendix 1 for a full table and appendix 2 for discussion of the choice of 2000.)
In fact, bizarrely, 35 is the mode of the scores this year; the last time the modal score was a gold medal score was in 1994. And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others.
We can also consider the IMO 2025 problems individually. In the Epoch AI newsletter, Greg Burnham combines a subjective analysis with Evan Chen’s MOHS ratings to argue that the first five problems at IMO 2025 were unusually easy and the sixth was unusually hard, so it’s not surprising that the first five problems were exactly the ones solved by these AIs. Though I’m not sure the MOHS scale is rigorous enough to make sense as the x-axis of a bar chart,2 it’s easy to corroborate the high-level story with the official IMO statistics. Based on average scores, this year’s Problem 6 was the fourth hardest, and this year’s Problem 3 by far the easiest, of all Problem 3s and 6s since 2000.3
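Here's a sketch of the comparison behind that claim: given per-problem average scores by year (which the official IMO statistics pages publish), rank every Problem 3 and Problem 6 since 2000 from hardest to easiest. The averages below are placeholders, not the real numbers.

```python
def rank_hard_problems(avg_scores: dict[int, list[float]]) -> list[tuple[float, int, int]]:
    """avg_scores maps a year to its six per-problem average scores (out of 7).
    Returns (average, year, problem_number) for every P3 and P6,
    sorted from hardest (lowest average) to easiest."""
    rows = []
    for year, avgs in avg_scores.items():
        for pnum in (3, 6):
            rows.append((avgs[pnum - 1], year, pnum))
    return sorted(rows)

# Placeholder example with two years of made-up averages:
example = {
    2024: [5.8, 4.1, 1.2, 5.5, 3.9, 0.3],  # made up
    2025: [6.1, 4.7, 2.5, 5.2, 4.4, 0.6],  # made up
}
for avg, year, pnum in rank_hard_problems(example):
    print(f"{year} P{pnum}: average {avg:.1f}")
```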
Despite all this, 88th percentile among some of the best high school students in the world is still impressive. I was particularly impressed that the AI systems produced natural language proofs from natural language statements, rather than formal proofs from formal statements (with a proof assistant like Lean, as Google DeepMind’s AlphaProof did last year).4
I would rather bang on the drum that the IMO is a lousy benchmark even for humans. It’s not primarily intended to be one! There are only six problems on it! Scores carry a lot of variance based on whether the specific problems chosen for that year’s IMO happen to be similar to ones you’ve practiced, or ones you’re just better at for whatever reason. I hope it’s intuitive why you might not want to evaluate, say, lawyers with a six-question law exam. To say nothing of the once-in-a-lifetime circumstances — jet-lagged in a foreign country, eating unfamiliar local food and drink, possibly missing all their usual math stationery because of delayed luggage5 — under which nearly all official contestants solve these problems.
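To make the small-sample point concrete, here's a toy simulation of how much a six-problem, 7-points-per-problem score can swing for a contestant of fixed ability. The per-problem solve probabilities are invented, and the model ignores partial credit; it's only meant to show the spread you get from six coin flips of varying bias.

```python
import random

random.seed(0)

def simulate_score(solve_probs: list[float], points_per_problem: int = 7) -> int:
    # All-or-nothing scoring: solve each problem independently with its
    # own probability and earn full marks if you do.
    return sum(points_per_problem * (random.random() < p) for p in solve_probs)

solve_probs = [0.9, 0.8, 0.4, 0.85, 0.6, 0.1]  # invented "ability" profile
scores = [simulate_score(solve_probs) for _ in range(10_000)]
mean = sum(scores) / len(scores)
sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
print(f"mean {mean:.1f}, sd {sd:.1f}, range {min(scores)}-{max(scores)}")
```

Under these made-up assumptions the standard deviation comes out to roughly a full problem's worth of points, from luck alone.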
One redeeming quality of using the IMO as an evaluation might be that, because IMO problems are widely scrutinized for originality, the models are very unlikely to have seen the problems in training (i.e., test-set contamination is less likely). I have no reason to think contamination happened this year, but I don’t think it’s impossible, and because there are, again, only six problems, even a small probability of it is a bigger deal. Funny anecdote: when IMO 2018 Problem 3 came out, several internet sleuths recognized the problem from older publications, one of which had among its coauthors… yours truly, while he was in fourth grade.
Many researchers around the world have developed many benchmarks and other methods for evaluating frontier AI systems. Some of these evaluations have their own issues, but at least they’re trying to be evaluations, and together, I think they form a compelling case that frontier AI systems have impressive capabilities, including in solving difficult math problems. I don’t think that these systems’ raw scores and medal ranking at IMO 2025 add much to the mix.
For more commentary from a math contest insider, read Terence Tao.
Appendix 1: Full gold medal stats
| Year | Gold % | Cutoff | Score Distribution |
|------|--------|--------|---------------------|
I threw a page with even “fuller” stats onto GitHub, though see the caveats in the next appendix.
Appendix 2: What’s the right reference class?
To argue that an IMO is easier than average, what IMOs should I average over?
I think the easiest answer to defend is “IMOs from 1981 onward”, since there was no “official” IMO 1980 due to political sanctions, and the jump from 1979 to 1981 is the last time the maximum possible score changed. With that reference class, 2025 is still an outlier, but it pales in comparison to 1981.
Since the IMO 1981 scores are apparently lost to time, we might start per-problem comparisons from 1982 out of practicality. 2025’s P3 also looks like less of an outlier:
However, by slicing this data differently we can see that problems have gotten much harder since then:
I picked 2000 because it’s a round number and it matches the start of MOHS ratings, but based on the IMO alone I don’t have a good reason to pick it over 1999 or 2001.
Appendix 3: Full problem distribution
P.S.
If you like building bespoke visualizations to make sense of the messy, surprising behavior of LLMs, you may be interested in joining Anthropic’s interpretability team 🙂