(Background/disclaimer: I work at Anthropic, far from any efforts related to math reasoning or evaluations; opinions are my own. I was an IMO gold medalist in 2012.)
I way missed the news window on this one, but I thought it was interesting and a bit underappreciated that the fraction of gold medalists at the 2025 IMO (72/630 = 11.4%) is the highest it’s been since 1981.
Crudely, IMO gold medals are awarded to the highest-scoring 1/12 of contestants.1 However, because scores are integers up to 42 and there’s no provision for tiebreaking, it’s possible for a lot of contestants to be tied around the threshold. In that case, either all of them get a gold medal or none do, and the fraction of gold medalists might deviate substantially from 1/12. That’s what happened this year: 46 contestants all won a gold medal by scoring exactly 35 points.
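To make the tie effect concrete, here's a minimal sketch in Python using only the numbers quoted above (630 contestants, 46 tied at exactly 35, and 72 golds, so 26 contestants at 36 or more). Everyone below 35 is lumped at 0, which doesn't change the fractions; the point is just that the two candidate cutoffs straddle 1/12, so there's no way to award gold to exactly a twelfth of the field.

```python
# Simplified stand-in for the IMO 2025 scores, built only from the
# numbers quoted above: 26 contestants at 36+, 46 tied at exactly 35,
# and the remaining contestants lumped at 0 (their exact scores don't
# affect the gold fraction).
scores = [36] * 26 + [35] * 46 + [0] * (630 - 26 - 46)

def gold_fraction(scores: list[int], cutoff: int) -> float:
    """Fraction of contestants scoring at or above an integer cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Because 46 contestants are tied at 35, the two candidate cutoffs
# straddle the nominal 1/12 ~ 8.3%.
print(f"cutoff 36: {gold_fraction(scores, 36):.1%}")  # ~4.1%
print(f"cutoff 35: {gold_fraction(scores, 35):.1%}")  # ~11.4%
```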
Here’s how 2025 compares with the fraction of IMO gold medalists in each year since 2000. (See appendix 1 for a full table and appendix 2 for discussion of the choice of 2000.)
In fact, bizarrely, 35 is the mode of the scores this year; the last time the modal score was a gold medal score was in 1994. And, of course, 35 is the same score claimed by AI systems from Google, OpenAI, and others.
We can also consider the IMO 2025 problems individually. In the Epoch AI newsletter, Greg Burnham combines a subjective analysis with Evan Chen’s MOHS ratings to argue that the first five problems at IMO 2025 were unusually easy and the sixth was unusually hard, so it’s not surprising that the first five problems were exactly the ones solved by these AIs. Though I’m not sure the MOHS scale is rigorous enough to make sense as the x-axis of a bar chart,2 it’s easy to corroborate the high-level story with the official IMO statistics. Based on average scores, this year’s Problem 6 was the fourth hardest, and this year’s Problem 3 by far the easiest, of all Problem 3s and 6s since 2000.3
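Here's a sketch of the comparison behind that claim: given per-problem average scores by year (which the official IMO statistics pages publish), rank every Problem 3 and Problem 6 since 2000 from hardest to easiest. The averages below are placeholders, not the real numbers.

```python
def rank_hard_problems(avg_scores: dict[int, list[float]]) -> list[tuple[float, int, int]]:
    """avg_scores maps a year to its six per-problem average scores (out of 7).
    Returns (average, year, problem_number) for every P3 and P6,
    sorted from hardest (lowest average) to easiest."""
    rows = []
    for year, avgs in avg_scores.items():
        for pnum in (3, 6):
            rows.append((avgs[pnum - 1], year, pnum))
    return sorted(rows)

# Placeholder example with two years of made-up averages:
example = {
    2024: [5.8, 4.1, 1.2, 5.5, 3.9, 0.3],  # made up
    2025: [6.1, 4.7, 2.5, 5.2, 4.4, 0.6],  # made up
}
for avg, year, pnum in rank_hard_problems(example):
    print(f"{year} P{pnum}: average {avg:.1f}")
```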
Despite all this, 88th percentile among some of the best high school students in the world is still impressive. I was particularly impressed that the AI systems produced natural language proofs from natural language statements, rather than formal proofs from formal statements (with a proof assistant like Lean, as Google DeepMind’s AlphaProof did last year).4
I would rather bang on the drum that the IMO is a lousy benchmark even for humans. It’s not primarily intended to be one! There are only six problems on it! Scores carry a lot of variance based on whether the specific problems chosen for that year’s IMO happen to be similar to ones you’ve practiced, or ones you’re just better at for whatever reason. I hope it’s intuitive why you might not want to evaluate, say, lawyers with a six-question law exam. To say nothing of the once-in-a-lifetime circumstances — jet-lagged in a foreign country, eating unfamiliar local food and drink, possibly missing all their usual math stationery because of delayed luggage5 — under which nearly all official contestants solve these problems.
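To make the small-sample point concrete, here's a toy simulation of how much a six-problem, 7-points-per-problem score can swing for a contestant of fixed ability. The per-problem solve probabilities are invented, and the model ignores partial credit; it's only meant to show the spread you get from six coin flips of varying bias.

```python
import random

random.seed(0)

def simulate_score(solve_probs: list[float], points_per_problem: int = 7) -> int:
    # All-or-nothing scoring: solve each problem independently with its
    # own probability and earn full marks if you do.
    return sum(points_per_problem * (random.random() < p) for p in solve_probs)

solve_probs = [0.9, 0.8, 0.4, 0.85, 0.6, 0.1]  # invented "ability" profile
scores = [simulate_score(solve_probs) for _ in range(10_000)]
mean = sum(scores) / len(scores)
sd = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
print(f"mean {mean:.1f}, sd {sd:.1f}, range {min(scores)}-{max(scores)}")
```

Under these made-up assumptions the standard deviation comes out to roughly a full problem's worth of points, from luck alone.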
One redeeming quality of using the IMO as an evaluation might be that, because IMO problems are widely scrutinized for originality, the models are very unlikely to have seen the problems in training (i.e., test-set contamination is less likely). I have no reason to think contamination happened this year, but I don’t think it’s impossible, and because there are, again, only six problems, even a small probability of it is a bigger deal. Funny anecdote: when IMO 2018 Problem 3 came out, several internet sleuths recognized the problem from older publications, one of which had among its coauthors… yours truly, while he was in fourth grade.
Many researchers around the world have developed many benchmarks and other methods for evaluating frontier AI systems. Some of these evaluations have their own issues, but at least they’re trying to be evaluations, and together, I think they form a compelling case that frontier AI systems have impressive capabilities, including in solving difficult math problems. I don’t think that these systems’ raw scores and medal ranking at IMO 2025 add much to the mix.
For more commentary from a math contest insider, read Terence Tao.
Appendix 1: Full gold medal stats
| Year | Gold % | Cutoff | Score Distribution |
|------|--------|--------|---------------------|
I threw a page with even “fuller” stats onto GitHub, though see the caveats in the next appendix.
Appendix 2: What’s the right reference class?
To argue that an IMO is easier than average, what IMOs should I average over?
I think the easiest answer to defend is “IMOs from 1981 onward”, since there was no “official” IMO 1980 due to political sanctions, and the jump from 1979 to 1981 is the last time the maximum possible score changed. With that reference class, 2025 is still an outlier, but it pales in comparison to 1981.
Since the IMO 1981 scores are apparently lost to time, we might start per-problem comparisons from 1982 out of practicality. 2025’s P3 also looks like less of an outlier:
However, by slicing this data differently we can see that problems have gotten much harder since then:
I picked 2000 because it’s a round number and it matches the start of MOHS ratings, but based on the IMO alone I don’t have a good reason to pick it over 1999 or 2001.
Appendix 3: Full problem distribution
P.S.
If you like building bespoke visualizations to make sense of the messy, surprising behavior of LLMs, you may be interested in joining Anthropic’s interpretability team 🙂