
🏅 Top traders
# | Name | Total profit |
---|---|---|
1 | | Ṁ306 |
2 | | Ṁ162 |
3 | | Ṁ140 |
4 | | Ṁ138 |
5 | | Ṁ135 |
@Bayesian What is this resolution based on? o3 isn't out, and there are no official HLE results for it anywhere that I am aware of. Am I missing something?
Is tool use allowed?
@Frankas hmmmm good question, i'm inclined to resolve NO because deep research with tool use should do strictly better than o3 without tool use, and deep research got 26%?
@Bayesian Closing is fine (though I don't see the need, what's the hurry?), but you're also resolving it with a definite answer for which you don't provide the grounds. If you intend to resolve, it should at least be N/A then, citing the lack of hard facts.
Humanity's Last Exam appears to be primarily a knowledge benchmark rather than a problem-solving benchmark. All frontier models score very highly on other knowledge benchmarks but poorly on Humanity's Last Exam. o3 is unlikely to be significantly more knowledgeable than other frontier models.
@Haiku I don’t fully agree. The benchmark was created mostly by filtering for questions that none of the frontier models (at that time) could answer.
In math, a lot of these questions require problem solving. I assume o3 is very good at problem solving.