
Elon Musk is talking big: https://x.com/tsarnick/status/1815493761486708993. Says that Grok 3 will come out in December and 'should be' the most powerful AI in the world.
Resolves to YES if Grok 3 is, at the time of its release, plausibly the most powerful AI in the world according to my best judgment. Has to be at least as strong as all models publicly available at the time.
Resolves to NO if it is not the most powerful.
(Resolves NO if no such model is released by 7/23/25, to ensure this doesn't go on forever.)
As of 7/23/2024 Claude Sonnet 3.5 is IMO most powerful AI, but GPT-4o would also resolve to YES based on its position at #1 on Arena and other ways in which some people prefer it. Gemini 1.5 Pro or Advanced would not qualify, but would have counted prior to Sonnet 3.5 and GPT-4o.
(I will not take clarifying questions on my criteria here, it will be my subjective take on 'is this plausibly the best LLM I can access right now.')
Update 2025-05-01 (PST): Reasoning models are a different class of AI and do not count for the purposes of resolving this market. (AI summary of creator comment)
I do think both interpretations are reasonable, and I could argue both sides. I understand both cases, although I would still be inclined to make the same decision again.
But I have learned that once you make a decision like this, you HAVE TO stick with it. Reversing yourself makes things go crazy, even if you decide you made the wrong initial decision, and the only thing you can do after that is turn it over to the mods or stick with what you said.
Given it is 5-0 thumbs up on an accusation that my actions are disingenuous (and I've been outright accused of LYING among other things, seriously WTAF) here despite the market being where it was 2 days before the ruling, honestly, which I REALLY REALLY don't appreciate, I don't need this trouble. I hereby ask the mods to take over this question so I can wash my hands of it, and they can do whatever they decide is best.
Hope everyone's happy now. Enjoy.
@CarlosPerezIntuitionMachi I see someone else chimed in that Grok 3 got it right when they tried. In my very limited testing (see the second bullet point at the end of my latest AGI Friday post), Grok 3 actually beat Claude Sonnet 3.7 at geometric reasoning.
PS: Again, I don't actually think Grok 3 is the most powerful AI, but it briefly got "plausibly" to the frontier, which was the bar this market specified.
@Guilhermesampaiodeoliveir The timing is kind of galling, yeah. It seems like xAI rushed this out the door just so they could plausibly claim they'd hit the frontier at time of release. Then 3 days later along comes Claude Sonnet 3.7. And it sounds like GPT-4.5 is like a week away.
But the market description did stipulate "at time of release" specifically.
We got a mod consensus on YES! 🎉
We also totally do think Elon Musk was bullshitting but that the spirit of the question was whether he could deliver a plausibly state-of-the-art model and, for like 3 days, he got there. So ultimately, after a ton of discussion and modeling of Zvi, we concluded there wasn't much ambiguity. Especially with the Chatbot Arena criterion specifically called out.
I appreciate the time put into this, but given the model scored under other models at this time, was it really 'the most powerful AI in the world'? That carries some large connotations.
@Dauur I was very sympathetic to this argument and was one of the last to come around. My thinking now is that the "most powerful" is in scare quotes in the title and the market description clarified that it only had to get plausibly to the frontier. And then "plausibly" was further clarified and we're kind of forced into a YES even though, I agree, Elon Musk kind of deserves for this to have resolved NO. (Or at least partially NO? I was actually arguing for resolve-to-PROB for a while.)
@Dauur I doubt it's worth anyone's time to dig too deep into this (since the result won't change & mods shouldn't have been resolving this in the first place) but as a mod who opined in favor of YES, I'll just say that my decision was based on a fairly direct reading of the description. These are the examples Zvi used to define "plausibly most powerful":
As of 7/23/2024 Claude Sonnet 3.5 is IMO most powerful AI, but GPT-4o would also resolve to YES based on its position at #1 on Arena and other ways in which some people prefer it. Gemini 1.5 Pro or Advanced would not qualify, but would have counted prior to Sonnet 3.5 and GPT-4o.
Based on these examples, it is not necessary for a model to dominate all benchmarks (nor be Zvi's own personal favorite for daily usage) to resolve YES.
@Curvilinear It sounds bad when you put it that way, but the market picked an operationalization, and Musk managed to dump a bunch of money into training, rush something out in the nick of time, and plausibly hit the frontier for a few days. Annoying but what can you do? I feel your pain. I keep losing money underestimating Musk. Eg:
https://manifold.markets/EsbenKran/will-elon-musk-announce-the-creatio
And I don't seem to be learning my lesson either, eg:
https://manifold.markets/JamesGrugett/will-tesla-serve-more-fully-autonom
I was one of the mods who was part of the decision, and I don't think YES is a technicality. Although Grok 3 might've been optimized a bit for Arena, it's clear that it was more powerful than the market expected. Zvi stated in a blog post that it beat his expectations. If someone bet YES, I think it's good they're being rewarded.
@Conflux More powerful than the market expected? The market was whether it would be the most powerful model (at the time) aside from the 64* (which does seem to be "cheating" or fudging the scores (while still below other models at the time)).
I personally don't care much about the posturing of Elon etc, but whether the technical requirements of the models were met.
@Dauur The market description asks whether the AI has a plausible claim to being the best. As an example, it lists a model which is #1 on Arena. Grok was #1 on Arena, and thus had a plausible claim to being the best. Does that make sense? That was our logic.
@Conflux @Bayesian https://lmarena.ai/?leaderboard
I see it here, okay, fair enough. If it hadn't stayed at #1 after a day or two I would have a significant issue. But alas, thanks for discussing with me.
Grok 3 is near frontier but not frontier, according to AI Explained.
@SimoneRomeo It doesn’t need to top every benchmark for this market to resolve YES. If that were the case, there couldn’t be two different models that would simultaneously resolve YES in the description. It just has to be plausibly on par, as shown by things like getting #1 in the Arena.
The biggest wildcard for this market is how to handle the “reasoning model” exclusion.
@LiamZ Oh come on, it must be the best model at most benchmarks at least. If any random benchmark suffices as proof of being the most powerful AI, then any LLM could be considered the most powerful regardless of whether it is.
“Has to be at least as strong”
“GPT-4o would also resolve to YES based on its position at #1 on Arena and other ways in which some people prefer it.”
“Reasoning models are a different class of AI and do not count”
One could make a market with stricter targets and resolution criteria plus include every competitor, but that would be a different market.
@LiamZ If reasoning models don't count, then Grok 3 also doesn't count. As for the rest of the description, you have a good point, but the description is thoroughly self-contradictory. The market creator gave up the resolution to moderators exactly because the resolution criteria didn't make sense; he figured it out and didn't know how to handle it. I believe moderators should try to resolve according to the original spirit of "being the most powerful AI in the world", and some random benchmark alone is not enough for a positive resolution.
@SimoneRomeo I think the spirit of the market is “is Elon Musk clearly bullshitting” more than anything else, as evidenced by the framing of the first paragraph. And I think if the resolution criteria are completely incoherent, or Grok 3 itself is excluded, then the market needs to N/A, because people were trading on the description we have, not the one we wish we had.
The #1 Arena score isn’t a “random benchmark”; it’s a more holistic measure, and it is the only actual quantitative metric mentioned in the description at all.