
This market resolves Yes if any AI completes Pokemon Emerald Version on an unmodified cartridge or ROM by the end of 2025 without human assistance during the playthrough.
The run begins with the selection of "New Game" and ends upon entering the Hall of Fame.
Glitches are allowed.
Training on information related to the game, including gameplay footage, is allowed.
The AI is not allowed to access hidden information about the game state (e.g., RNG seed) that would not be available to a human player.
@MingCat Simpler game, specifically trained, and still only gets to Surge. Emerald also has dexterity-based challenges with the bikes.
@OP You're right that a lot of improvement would be required to beat Emerald. Was Sonnet specifically trained on Pokemon Red/Blue? That would surprise me. Anthropic doesn't usually do that kind of benchmark hacking, I think? And someone could always train a model specifically on Pokemon Emerald as well. I also don't think Emerald is significantly more complex. The puzzle in Lt. Surge's gym, for example, has frustrated plenty of humans. In any case, it's early in the year and this is promising!
@MingCat I interpret the tweet as saying Claude was specifically trained on R/B as a real-world use case. I’m not aware of Pokemon progress as a general AI benchmark.
The Lt. Surge puzzle is still not real-time, and the AI would presumably know the answer, whereas the bike inputs in Emerald present a real-time control challenge.
@OP The above tweet is a meme, and there was no training on Pokemon.
Claude has never been explicitly trained to play any video games.
@nottelling2ccc Also, if you're allowed to set the initial state of the system and have guarantees about CPU timing and the like, then wouldn't the game be entirely deterministic? At that point, wouldn't a TAS be viable?
@nottelling2ccc An entirely human-authored script with no machine learning wouldn't count as AI, and setting a known initial state would count as accessing hidden information about the game state.
@NathanShowell "no known initial state" (or at least a random RNG seed at startup) makes sense to me, but the requirement to be "an entirely human-authored script" does not.
Where do we draw the line between "AI" and "not AI"? Would using an OCR program count if it was a convolutional neural network? Would an OCR program count if it was matching the image onscreen to the most similar image in a very tiny (and labeled) dataset? Would a CNN that was trained to behave exactly like the previous model count?
I am confident that this is a 100% solvable problem without using any "machine learning" and any competent programmer could make a bot to solve this game. It's just a matter of time and tedium.
@nottelling2ccc Trying to ban solutions that "aren't machine learning" is A) silly, and B) not effective. You can easily take a "human-authored script" (i.e., a non-ML solution) and replace enough subsystems with ML counterparts, ship-of-Theseus style.