Will an LLM become a Pokémon Master by the end of 2025?
2026
47% chance

I'll give bounties to people who suggest reasonable improvements to the criteria.

https://www.twitch.tv/claudeplayspokemon

Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking

Will any large language model become a Pokémon Master by the end of 2025? To count, it must:

  • Complete a regular Pokémon game (any of the base games, such as Red, Gold, Sapphire, or Black) by getting all gym badges and beating the Elite Four and the rival.

  • Without assistance or steering mid-game. This means help aimed at a specific point where it is stuck, as opposed to general improvements. Tweaks to the system midway through are fine as long as they are in the spirit of general improvements: the LLM should be able to complete the game end to end afterwards without additional changes. The intent is that small tweaks can be made to Claude Plays Pokemon without invalidating the run. That said, if they become looser with hints and unblocking, the run will not count.

  • With minimal non-LLM programmatic assistance. To give a sense of the spirit of this market: I think the automatic pathfinding that Claude is using is already a little bit of cheating. Something roughly twice as bad might start to not count.

Any number of "shots" is allowed: the model can try as many times as it likes. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.

RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.

See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync

  • Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:

    • Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.

    • Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.


The Safari Zone is a massive problem: the step count is limited and each attempt costs money. You need a system that is cracked at navigation to get past that.

Does this count as "assistance or steering mid-game."?

@Lorenzo ummm… I’m going to say no, but I won’t lie, it’s in part because it seems a shame to disqualify Claude for a small tweak this early. If they continually tweak it throughout the run, that feels unfair.

I will make some more explicit criteria around this soon, I guess.

@Lorenzo I updated the criteria in a way that allows for this kind of thing midgame, as long as it's not directly hinting towards specific blockers.

Things Claude has hallucinated in the 5 minutes I've watched this stream:
- thinks Bulbasaur has a type disadvantage against Squirtle
- thinks the exit to Oak's Lab is at the top of the screen
- successfully exited Oak's Lab, and then went back into it, thinking it was Route 1
- went back to the top of the screen after re-entering Oak's Lab

I don't think 3.7 Sonnet is going to be able to do this in any number of tries. I assume it ended up stuck in some sort of infinite loop in the midgame that it couldn't break out of.
