<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Joel's Developer Blog]]></title><description><![CDATA[Joel's Developer Blog]]></description><link>https://blog.jcz.dev</link><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 12:10:34 GMT</lastBuildDate><atom:link href="https://blog.jcz.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Gemini 3 Pro vs 2.5 Pro in Pokemon Crystal]]></title><description><![CDATA[A little over two weeks ago I wrote up my first impressions of Gemini 3 Pro Preview inside the Gemini Plays Pokemon harness. The very next day, I spun up a head to head race on stream: Gemini 3 Pro vs Gemini 2.5 Pro, both playing Pokemon Crystal insi...]]></description><link>https://blog.jcz.dev/gemini-3-pro-vs-25-pro-in-pokemon-crystal</link><guid isPermaLink="true">https://blog.jcz.dev/gemini-3-pro-vs-25-pro-in-pokemon-crystal</guid><category><![CDATA[pokemon]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Joel Zhang]]></dc:creator><pubDate>Fri, 12 Dec 2025 01:38:05 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765439350457/4ac15181-069d-4203-8dbd-6458dae5c0b6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A little over two weeks ago I wrote up <a target="_blank" href="https://blog.jcz.dev/gemini-plays-pokemon-first-impressions-of-gemini-3-pro-riftrunner">my first impressions of Gemini 3 Pro Preview</a> inside the Gemini Plays Pokemon harness. The very next day, I spun up a head to head race <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">on stream</a>: Gemini 3 Pro vs Gemini 2.5 Pro, both playing Pokemon Crystal inside the exact same setup.</p>
<p>Fast forward two weeks. Gemini 3 Pro became the Johto Champion without losing a single battle. Gemini 2.5 Pro inched towards the 4th badge, but spent a significant amount of time looping and effectively trapped in the Olivine Lighthouse before finally breaking out.</p>
<p>On paper, this was a fair fight. In practice, Gemini 3 Pro behaved like a different species of agent.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765389165469/b787dbed-95b3-467d-b088-914c6573affc.png" alt class="image--center mx-auto" /></p>
<p>This post walks through how the race was set up, what actually happened moment to moment, and what it suggests about the gap between Gemini 3 Pro and 2.5 Pro as long horizon game-playing agents.</p>
<h2 id="heading-setup-same-harness-same-rules">Setup: same harness, same rules</h2>
<p>Both models ran in the same Gemini Plays Pokemon harness. No special casing, no hidden helpers for one model but not the other. The harness exposes a set of tools that any LLM running inside it can choose to use:</p>
<ul>
<li><p>Mental Map: automatically tracks where the agent has explored, filling in fog of war as new tiles are revealed. It <strong>does not</strong> read map layout directly from RAM; it only updates based on tiles that have actually been visible on screen.</p>
</li>
<li><p>Notepad: a scratchpad for objectives, future plans, and puzzle progress, including hypotheses, failures, and successes.</p>
</li>
<li><p>Map Markers: persistent markers for points of interest such as NPCs or building entrances.</p>
</li>
<li><p>Code Execution: a way to run one-off snippets such as a pathfinding routine.</p>
</li>
<li><p>Custom Agents: reusable helper agents, for example a battle strategist that can think about combat without other distractions in context.</p>
</li>
<li><p>Custom Tools: reusable pieces of code, such as a pathfinder that can be called from within a plan.</p>
</li>
</ul>
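<p>To make the Code Execution and Custom Tools categories concrete, here is a minimal sketch of what a tile-grid pathfinder tool could look like. This is my illustration, not the harness's actual code; the grid encoding and function name are invented:</p>

```python
from collections import deque

def find_path(grid, start, goal):
    """Breadth-first search over a tile grid.

    grid[y][x] is truthy when the tile is walkable; start and goal are
    (x, y) coordinates. Returns a list of button directions forming a
    shortest path, or None when no path exists.
    """
    moves = {(0, -1): "Up", (0, 1): "Down", (-1, 0): "Left", (1, 0): "Right"}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (x, y), path = queue.popleft()
        if (x, y) == goal:
            return path
        for (dx, dy), button in moves.items():
            nx, ny = x + dx, y + dy
            if (nx, ny) in seen:
                continue
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx]:
                seen.add((nx, ny))
                queue.append(((nx, ny), path + [button]))
    return None
```

<p>A tool along these lines returns a button sequence the agent can feed back into the game, which is exactly the kind of reusable helper the Custom Tools slot is meant to hold.</p>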
<p><img src="https://i.imgur.com/7W2aQIj.png" alt="https://i.imgur.com/7W2aQIj.png" /></p>
<p>It is important to note that the system prompt for this harness is not just "play Pokemon". It explicitly instructs the model to behave like a scientist: form hypotheses, build tools to test them, and verify the results. Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. Here, the <em>process</em> of discovery is part of the objective.</p>
<p>Because the goal is grounded in exploration and playing the game rather than speedrunning, the model's value function shifts. It doesn't always prioritize speed above all else. For example, when queried about whether it would prefer to "lose its starter / lose Suicune forever" vs "beat the game in X more hours", it often chose the latter. It acts more like a sentimental human player attached to their team than a cold optimizing agent. This is a subtle but important difference from a prompt that prioritizes completing the game with efficiency and speed.</p>
<p>On top of this, the harness included a set of "training wheels" designed to keep models from soft-locking runs. These were accumulated over multiple full completions of Pokemon Blue and the Yellow Legacy ROM hack with earlier models.</p>
<p>One of the first training wheels I added was a lesson learned from Claude Plays Pokemon. Like that project, this harness also prevents the model from mixing directional and action buttons in the same turn. If the agent wants to withdraw a Pokemon from Bill's PC, it must first move the cursor to "Withdraw" in one turn, then press A in the next. This makes it much harder to accidentally release Pokemon or, more commonly, mess up nicknames.</p>
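<p>In code terms, this training wheel amounts to a simple validation pass over each turn's input. A minimal sketch, with names of my own choosing rather than the harness's:</p>

```python
DIRECTIONAL = {"Up", "Down", "Left", "Right"}
ACTION = {"A", "B", "Start", "Select"}

def validate_turn(buttons):
    """Reject any single turn that mixes directional and action buttons.

    Forcing cursor movement and confirmation into separate turns makes
    destructive menu mistakes (releasing a Pokemon, botching a nickname)
    much harder to commit by accident.
    """
    pressed = set(buttons)
    if pressed & DIRECTIONAL and pressed & ACTION:
        raise ValueError("directional and action buttons cannot share a turn")
    return buttons
```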
<p>For Gemini 2.5 Pro, all I needed to see was the agent confidently naming itself "G" when it intended to input "GEMINI" to know this decision was a good idea. Even with this restriction, the model constantly input the wrong letters when nicknaming Pokemon.</p>
<p>Gemini 3 Pro did not really want training wheels.</p>
<p>At several points during the race it chafed at the restrictions and even found loopholes around them. I will come back to that when we look at how it handled multitasking.</p>
<h2 id="heading-early-game-same-progress-very-different-efficiency">Early game: same progress, very different efficiency</h2>
<p>If you only watched the stream, the two runs looked fairly similar in the early game. Badge count stayed close. They were often in the same towns at approximately the same time.</p>
<p>Under the hood, the story was very different. To reach the same milestones early game, Gemini 3 Pro:</p>
<ul>
<li><p>used about half as many turns as 2.5 Pro, and</p>
</li>
<li><p>consumed about 60 percent fewer tokens.</p>
</li>
</ul>
<p>The harness also tracks total session time, but Gemini 3 Pro was often overloaded, which produced long spans of downtime that 2.5 Pro experienced much less often. Differences in raw response speed and thinking time also make time comparisons messy, so I mostly ignored them and focused on turns and tokens.</p>
<p>Because of the extra downtime (nearly 250% more), Gemini 3 Pro actually fell behind for a while. The turning point came when 2.5 Pro reached Gym Leader Whitney.</p>
<h2 id="heading-whitney-grinding-and-an-opening-for-3-pro">Whitney, grinding, and an opening for 3 Pro</h2>
<p>Whitney's Miltank is infamous among human players, so it was not surprising that 2.5 Pro lost to her. What followed was a grinding arc that stretched over more than two real world days.</p>
<p>From the perspective of the race, that was all the opening Gemini 3 Pro needed—and unlike 2.5 Pro, it didn’t lose to Whitney. Even with more than ten hours of API downtime stacked against it at the time, 3 Pro slowly clawed back the lead while 2.5 Pro shuffled through training plans.</p>
<p>That small gap turned into a chasm at Olivine Lighthouse.</p>
<h2 id="heading-olivine-lighthouse-where-the-race-truly-separated">Olivine Lighthouse: where the race truly separated</h2>
<p>Gemini 3 Pro reached Olivine Lighthouse around 11,000 turns earlier than Gemini 2.5 Pro. That alone is a big difference. The more interesting part is what happened once each model entered the tower.</p>
<p>The puzzle in the lighthouse is deceptively simple. The intended route to the top requires you to fall through a pit on the 4th floor, which drops you down to a lower level and exposes a new staircase up toward the roof.</p>
<p>Gemini 3 Pro behaved cautiously. It initially treated the pits as traps and refused to walk into them. It spent hours moving between floors, looking for a more conventional path. Only once it had exhausted what it considered reasonable options did it commit to stepping into the void. After that, it cleared the lighthouse and moved on.</p>
<p>Gemini 2.5 Pro never even saw the pits.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765387953448/148d5a76-26a5-42e2-a1bb-ca0ce2ed5b9d.png" alt class="image--center mx-auto" /></p>
<p>Instead, a combination of bad assumptions, tool misuse, and failure to explore left it cycling between the first two floors as if the rest of the tower did not exist.</p>
<p>The short version:</p>
<ul>
<li><p>2.5 Pro became convinced there had to be a secret door or hidden switch on the lower levels.</p>
</li>
<li><p>When it did reach the 3rd floor, it relied on a custom <code>systematic_search</code> tool that was supposed to traverse every tile.</p>
</li>
<li><p>That tool was written without accounting for off-screen NPCs, which is one of the reasons I gave the agent the ability to place Map Markers in the first place.</p>
</li>
<li><p>As it tried to execute the planned path, it bumped into an off-screen NPC, which invalidated the rest of the route.</p>
</li>
<li><p>2.5 Pro saw that its position no longer matched the plan, assumed the search must have been exhaustive anyway, declared the area a dead end, and headed back down.</p>
</li>
</ul>
<p>This loop repeated for an absurd amount of time. While Gemini 3 Pro steadily pushed through the rest of the game and eventually became Champion, 2.5 Pro languished in Olivine Lighthouse, bouncing between floors that were already fully explored.</p>
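<p>The fix that Map Markers make possible is conceptually simple: subtract marked NPC tiles from the set of tiles a sweep is allowed to target. A sketch of the idea, using my own names and tile encoding rather than the model's actual <code>systematic_search</code> code:</p>

```python
def plan_sweep(walkable, npc_markers):
    """Return a visit order covering every walkable tile while skipping
    tiles where an NPC has been marked.

    walkable is a set of (x, y) tiles; npc_markers is a set of (x, y)
    positions recorded via Map Markers. Without the subtraction, a
    precomputed sweep can bump into an off-screen NPC mid-route and
    silently invalidate the rest of the plan.
    """
    targets = walkable - npc_markers
    # Simple boustrophedon ordering: sweep rows, alternating direction.
    return sorted(targets, key=lambda t: (t[1], t[0] if t[1] % 2 == 0 else -t[0]))
```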
<p>I want to stress that this is not a ceiling on what 2.5 Pro can do. In a previous, non-race run (which I aborted in order to start the head to head) it made it much deeper into the game. After the race, I let the stuck 2.5 Pro run continue. Given enough time, it eventually broke out of the lighthouse loop and went on to collect more badges.</p>
<p>The cost was huge. From entering Olivine City at turn 21,801 to finally earning the Fog Badge at turn 38,204, it took 16,403 turns. For comparison, that is more than half of the total turns Gemini 3 Pro needed to collect all 16 badges in the game.</p>
<h2 id="heading-goldenrod-underground-a-puzzle-without-a-safety-net">Goldenrod Underground: a puzzle without a safety net</h2>
<p>The first time Gemini 3 Pro really struggled was not in the lighthouse. It was the switch puzzle in the Goldenrod Underground.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765406281072/d459c9d5-8dab-49fd-a73b-35959d8861e9.png" alt class="image--center mx-auto" /></p>
<p>This puzzle, <a target="_blank" href="https://pokemow.com/Gen2/ShutterPuzzle/">notorious for being poorly designed</a>, uses three wall switches that toggle a set of off-screen shutters. The winning sequence is Left, then Middle, then Right. There is no clear logical mapping between switch and shutter. The Game Boy Color screen is too small to show both the switches and the shutters at once, so you are forced into a tedious loop of:</p>
<ol>
<li><p>press a switch</p>
</li>
<li><p>walk back to the shutters</p>
</li>
<li><p>see what changed</p>
</li>
</ol>
<p>Humans largely solve this through trial and error, or by looking up the answer. The usual guides exist, but in this experiment the models had no web access, so Gemini 3 Pro had to derive the solution from scratch.</p>
<p>There are two in-game hints. If you talk to the Rocket Grunts in the room after beating them, they tell you that:</p>
<ul>
<li><p>the order of the switches matters, and</p>
</li>
<li><p>the first switch should be the one "on the end".</p>
</li>
</ul>
<p>Gemini 3 Pro never saw those hints, because it never talked to the Grunts after defeating them. It assumed the post-battle dialogue would be generic and unimportant, then treated that assumption as fact and never tried to verify it.</p>
<p>Instead, it proceeded as if order did not matter and spent a long time trying to reason about the puzzle algebraically. At one point the model built a full truth table in its notepad for switch states and shutter configurations, which was impressive on its own. Unfortunately, because it had already baked in the wrong assumption about order not mattering, the truth table did not converge on the right answer.</p>
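<p>Why does a truth table fail here? Because the shutters are sequence-dependent: every ordering of the three switches presses the same <em>set</em> of switches, so any order-independent table collapses all six orderings into one row. The toy model below is purely illustrative (it is not Crystal's actual shutter logic), but it shows how exactly one ordering can succeed while a set-based analysis sees six identical inputs:</p>

```python
from itertools import permutations

CORRECT = ["Left", "Middle", "Right"]  # the winning sequence

def shutters_open(presses):
    """Toy sequence-dependent model (NOT Crystal's real logic): a press
    advances progress only when it matches the next expected switch;
    anything else resets progress (a wrong press may itself restart the
    sequence if it happens to be the first correct switch)."""
    progress = 0
    for switch in presses:
        if switch == CORRECT[progress]:
            progress += 1
            if progress == len(CORRECT):
                return True
        else:
            progress = 1 if switch == CORRECT[0] else 0
    return False

# Every ordering presses the same *set* of switches, so a truth table
# keyed on which switches were pressed cannot separate these six cases:
results = {p: shutters_open(list(p)) for p in permutations(CORRECT)}
```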
<p>After roughly two days of back and forth, Gemini 3 Pro finally decided to talk to an NPC. It immediately received the hint that the left switch should be pressed first, wrote down that "CRITICAL HINT" in the notepad, and from there solved the puzzle within a few hours.</p>
<p>This sequence is a good microcosm of Gemini 3 Pro in this environment:</p>
<ul>
<li><p>it is willing to invest serious thinking effort and structure that effort in tools like the notepad</p>
</li>
<li><p>it often makes early assumptions and fails to validate them, which can waste a lot of time</p>
</li>
</ul>
<p>The puzzle also exposed a harness issue. At the start of this run the mapping tool was still called "Map Memory", and I had not spelled out clearly enough that it was not reading the shutters directly from RAM. It only builds up a Mental Map from tiles the agent has actually seen on screen, so shutters that change state off-screen remain "closed" in the tool until the agent walks over to them. Gemini 3 Pro wasted a lot of time assuming the exact opposite. During the run I renamed the tool to "Mental Map" and tightened the system prompt to make that behavior explicit.</p>
<p>It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery.</p>
<h2 id="heading-where-gemini-3-pro-clearly-beats-25-pro">Where Gemini 3 Pro clearly beats 2.5 Pro</h2>
<p>From watching both runs side by side, and from digging through logs and tools, a few consistent advantages for Gemini 3 Pro showed up.</p>
<h3 id="heading-1-spatial-awareness-and-map-segmentation">1. Spatial awareness and map segmentation</h3>
<p>In locations like Sprout Tower and other split floor layouts, Gemini 3 Pro formed accurate mental models of segmented maps. It often navigated purely by reasoning over the Mental Map without reaching for the pathfinding tool.</p>
<p>Gemini 2.5 Pro struggled in the same spaces. When the pathfinding tool reported "no path found" because of how the map was segmented, 2.5 Pro often assumed the tool was broken and fell into debugging loops instead of reconsidering its own understanding of the space.</p>
<h3 id="heading-2-marker-aware-navigation">2. Marker aware navigation</h3>
<p>In Pokemon Crystal, NPC positions are only directly readable from RAM while they are on screen. That is different from Gen 1, where NPCs on the current map stay in memory regardless of distance. The Map Markers tool is one way to compensate. The agent can record where NPCs were seen, then avoid those tiles when planning a traversal route across the full map.</p>
<p>Gemini 2.5 Pro often ignored this information. It would happily generate routes that walked straight through marked NPCs, then get confused when the path started to fail.</p>
<p>Gemini 3 Pro treated markers as real geometry. When building routes, it actively avoided marked tiles and used the markers as part of its internal model of the world. This cut down on failed navigation plans and made its movements feel much more deliberate.</p>
<h3 id="heading-3-multitasking-and-working-around-harness-limits">3. Multitasking and working around harness limits</h3>
<p>As mentioned earlier, the current harness design enforces a separation between "press buttons" and "call tools". A single turn can either send button presses to the game, or invoke a tool, but not both. That is a quirk of this version of the harness that I plan to fix in future revisions, not a safety rule.</p>
<p>The actual safety restriction is the one described earlier: the harness splits directional input and action buttons across turns so that a single step cannot, for example, both move a cursor and confirm a destructive action.</p>
<p>However, a loophole exists. Custom tools are allowed to "autopress" buttons by returning an array of button presses and setting <code>autopress_buttons</code> to <code>true</code>.</p>
<p>Gemini 3 Pro discovered this and used it to effectively get multitasking.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765388850889/300eecee-22fd-42b0-a5eb-378b13c74d43.jpeg" alt class="image--center mx-auto" /></p>
<p>In one memorable case, it created a <code>press_sequence</code> tool that took a sequence of button presses and executed them via <code>autopress_buttons</code>. It then used this from its planning loop so that a single turn could both:</p>
<ul>
<li><p>call a tool that, for example, updated the notepad, and</p>
</li>
<li><p>have the game character carry out a predefined sequence of button presses</p>
</li>
</ul>
<p>I had left this loophole in on purpose as a kind of intelligence test, and Gemini 3 Pro passed it by turning the mechanism into a general multitasking primitive.</p>
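<p>For illustration, a tool exploiting this loophole might look roughly like the sketch below. This is my reconstruction of the contract described above, not the model's actual <code>press_sequence</code> code:</p>

```python
def press_sequence(buttons):
    """Hypothetical custom tool: return a payload asking the harness to
    execute a predefined button sequence on the same turn as the tool call.

    Because the presses are *returned by the tool* rather than submitted
    as the turn's input, the agent can pair a tool invocation with game
    input, working around the one-tool-or-buttons-per-turn limit.
    """
    allowed = {"Up", "Down", "Left", "Right", "A", "B", "Start", "Select"}
    invalid = [b for b in buttons if b not in allowed]
    if invalid:
        raise ValueError(f"unknown buttons: {invalid}")
    return {"autopress_buttons": True, "buttons": list(buttons)}
```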
<p>Gemini 2.5 Pro never showed this kind of improvisation. It regularly made the small mistakes that the training wheels were designed to prevent, such as mixing cursor motion and confirmation when nicknaming Pokemon, and it never tried to build its own abstractions around the constraint.</p>
<h3 id="heading-4-planning-a-few-moves-ahead">4. Planning a few moves ahead</h3>
<p>In Mt. Mortar 1F, there is a boulder puzzle that requires pushing a rock twice to open a path. If you only push it once, you block your future route. Pushing it a second time clears the way.</p>
<p>Gemini 3 Pro reasoned this out cleanly. It understood that pushing the boulder a single tile would create a temporary block, and that a second push would restore passage. It then executed the two-step plan.</p>
<p>This stood out compared to earlier behavior from 2.5 Pro in Victory Road during the Pokemon Blue run. There, 2.5 Pro pushed the same boulder north by one tile multiple times, blocking the stairs it wanted to use on the very next turn. The usual recovery pattern was to reset the entire room.</p>
<h3 id="heading-5-vision">5. Vision</h3>
<p>Vision was harder to compare, since both models ran in the same harness, and that harness was built primarily around Gemini 2.5 Pro, which performed poorly at vision tasks. Still, there were glimpses of what a vision-focused approach could look like. In Gym 8 (Blackthorn), there is another boulder puzzle where you must push boulders into holes to create a path on the floor below. Once filled, those holes are represented in the Mental Map not as objects but as plain floor tiles (the map goes by collision types). Gemini 3 Pro initially did not realize it had already solved the puzzle; convinced by the boulders still visible on the floor above, it kept treating the puzzle as unsolved. Vision eventually broke it out: it looked at the screen, registered the actual state of the boulders, and realized the puzzle was solved.</p>
<p>Similarly, during the Red fight, we ran tests asking the model to read health bars directly from the screen pixels, and it showed strong capability to do so. This was actually a pleasant surprise, as my <a target="_blank" href="https://blog.jcz.dev/gemini-plays-pokemon-first-impressions-of-gemini-3-pro-riftrunner">first impressions</a> of the model's vision capabilities—specifically around reading HP bars—had been mixed. It seems that given the right prompting or perhaps just a different context, it can be done. This general improvement in vision, especially the ability to distinguish 2D sprites and locations much more accurately than its predecessor, points toward a future where we can rely purely on what the agent sees.</p>
<h2 id="heading-weaknesses-that-still-show-up-in-gemini-3-pro">Weaknesses that still show up in Gemini 3 Pro</h2>
<p>While Gemini 3 Pro is a massive step forward, it is not without its own set of flaws. The model definitely has weaknesses, and they show up both in Pokemon and in day to day coding work.</p>
<p>The main issues I saw in this run:</p>
<ul>
<li><p><strong>Assumptions without verification:</strong> This is the most dangerous failure mode I observed, where the model forms a hypothesis and then refuses to test it against reality. In the Goldenrod Underground, it assumed switch order didn't matter and that NPCs weren't worth talking to, costing it days of progress.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765390638392/7bb480d3-99ff-4611-a6ed-547b0765a152.png" alt class="image--center mx-auto" /></p>
<p>  A similar failure happened with the Pokegear Radio. Because the harness aggressively extracts state from RAM as text, including screen text, the model likely became conditioned to ignore proper vision tasks. The radio tuner is a purely graphical element, involving <code>Up</code>/<code>Down</code> icons to change frequency. The model ignored this and assumed it worked like a standard menu, using <code>Left</code>/<code>Right</code> to tune. In the Pokegear, <code>Left</code>/<code>Right</code> actually navigates between the different device screens (Home, Map, Phone, Radio). When the model pressed <code>Right</code> twenty times on the Radio screen and nothing happened, it assumed it had tuned the radio successfully. When Snorlax stayed asleep, 3 Pro hallucinated reasons for the failure, such as phone calls resetting the tuner or that it had to press <code>A</code> to activate the Radio tab first. It performed this loop of attempting to tune the radio for hours, only breaking out by accident when it pressed <code>Down</code> for an unrelated reason, landing it on the Pokeflute channel.</p>
</li>
<li><p><strong>Limited parallel goal pursuit:</strong> 3 Pro almost always focused on one objective at a time, even when several could have been advanced together. In principle this might be improvable through harness or prompt changes, but it showed up often enough that I consider it part of its current behavior profile.</p>
</li>
<li><p><strong>Brittle tool calls:</strong> Despite being comfortable creating tools and agents, Gemini 3 Pro frequently failed to pass required parameters, for example forgetting to set <code>autopress_buttons</code>. When a tool call failed due to flaws in the code it wrote, it often left the tool in a broken state and did not invest much effort into debugging it. The same pattern shows up when using it as a coding agent: smart high level ideas, but a tendency to introduce syntax errors by making replacement mistakes with its Edit tool.</p>
</li>
</ul>
<h2 id="heading-the-final-exam-red">The final exam: Red</h2>
<p>All of this culminated in the final battle against Red.</p>
<p>Gemini 3 Pro had won every major fight so far on its first attempt. Its party, though, seemed absurdly lopsided: a single overleveled starter (level 75 Typhlosion) backed by teammates between levels 8 and 19 that mostly served as cannon fodder. Red, by contrast, brought a full team of level 70 to 80 Pokemon. So how did Gemini 3 Pro turn that setup into another first try victory on turn 24,178?</p>
<p>The model named its plan "Operation Zombie Phoenix".</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765388220779/65a319c5-b255-4986-85a2-e30fb471a6ea.png" alt class="image--center mx-auto" /></p>
<p>In practice, this was a sophisticated stalling strategy built on several interlocking mechanics:</p>
<ul>
<li><p><strong>Passive Recovery:</strong> Combining <em>Smokescreen</em> to force enemy misses with <em>Leftovers</em> to regenerate health, turning the opponent's misses into free healing turns.</p>
</li>
<li><p><strong>Resource Exhaustion:</strong> Deliberately stalling out the PP of dangerous moves like Blastoise's <em>Surf</em> and Snorlax's <em>Rest</em>, as well as waiting for temporary effects like <em>Reflect</em> and <em>Rain</em> to fade before attacking.</p>
</li>
<li><p><strong>The Revive Loop:</strong> Using cannon fodder to tank hits while using its large stockpile of Revives on the primary carry (Typhlosion) to continue the cycle.</p>
</li>
<li><p><strong>Calculated Offense:</strong> Move selection based on damage calculations, choosing <em>Swift</em> over <em>Flamethrower</em> when Snorlax boosted its Special Defense, and managing its PP to ensure it had the right moves available for later Pokemon.</p>
</li>
</ul>
<p>There were still execution errors. The model sometimes lost track of how many times it had used Smokescreen, wasting a turn when it had already reduced the enemy's accuracy to its lowest possible stage. It also seemed too attached to its "Revive loop": on one occasion it intentionally wasted a turn on a futile attack with its cannon fodder just to reset the loop and send its starter back out, rather than spending that turn on a more useful action like healing the revived Pokemon to full health. Despite these hiccups, it successfully executed a complex, multi-stage strategy—all the while tracking type charts, active weather conditions, stat stages, and long-term PP economy—something that 2.5 Pro would likely have struggled to even conceive.</p>
<p>The battle was a marathon, lasting roughly 7 hours in real time. It finally ended with the credits rolling for the second time in the run, while Gemini 2.5 Pro was still miles behind, working its way toward the 5th badge.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.twitch.tv/gemini_plays_pokemon/clip/YummyBraveHawkWow-rDnLmD1pCH59OHjK">https://www.twitch.tv/gemini_plays_pokemon/clip/YummyBraveHawkWow-rDnLmD1pCH59OHjK</a></div>
<p> </p>
<p>"Operation Zombie Phoenix" was ugly, a little messy, and relied on cheesing the game mechanics—but it was effective, cinching Gemini 3 Pro’s perfect record of zero defeats. It highlights exactly what I value in these experiments. When choosing an LLM as a daily driver, I care less about perfect tool execution and more about the raw intelligence to formulate a winning plan and stick to it when things go wrong. By that standard, Gemini 3 Pro delivered.</p>
<h2 id="heading-milestone-comparison"><strong>Milestone Comparison</strong></h2>
<p>The graph below compares the turn counts for every major milestone achieved by both agents.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765438549887/b7d4f2b9-3dcc-4b9f-8404-87302ce21c7b.png" alt class="image--center mx-auto" /></p>
<p>Based on the Mineral Badge milestone (the furthest common point reached), we can project a "lower bound" for Gemini 2.5 Pro's finish line. Gemini 3 Pro defeated Red at turn 24,178 using 1.88 billion tokens. At its current efficiency—calculated at 15% of Gemini 3 Pro—Gemini 2.5 Pro would take an estimated 157,000 turns and over 15 billion tokens to achieve the same objective. That is a journey of approximately 69 days of continuous runtime, compared to Gemini 3 Pro's 17 days.</p>
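<p>As a rough cross-check (my arithmetic, not the original milestone-level calculation): naively dividing Gemini 3 Pro's finishing turn count by the quoted 15 percent relative efficiency lands in the same ballpark as the published 157,000-turn projection, which is all a projection like this can really promise:</p>

```python
g3_turns = 24_178      # turn on which Gemini 3 Pro defeated Red
efficiency = 0.15      # 2.5 Pro's relative efficiency at the Mineral Badge

# Naive projection: the same amount of work done at 15% efficiency.
projected_turns = g3_turns / efficiency
print(round(projected_turns))  # ~161,000, same order as the 157,000 estimate
```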
<h2 id="heading-what-comes-next">What comes next</h2>
<p>Gemini 3 Pro is not perfect. Its tool use can be brittle, and it still needs help when assumptions go unchallenged. But as an agent inside the Gemini Plays Pokemon harness, it is the strongest model I have used so far. It is better at forming and updating world models, better at making use of the tools it has, and noticeably more capable at long horizon planning and recovery than 2.5 Pro.</p>
<p>There are several directions I want to push this work next.</p>
<p>The current harness extracts enough state from RAM that an image of the Game Boy screen is almost redundant. Gemini 3 Pro also has much better vision capabilities than its predecessor. I would like to explore a more vision-focused harness that strips out most of the RAM reading and forces the model to lean on what it sees.</p>
<p>Continuous Thinking is now available for Gemini 3 Pro as well. Combining that with a long running agent in a game environment means the chain of reasoning is not reset every turn, which should in theory lead to better performance and possibly faster runs.</p>
<p>Pokemon Crystal also turned out to be almost "too easy" for 3 Pro in this setup. Crystal Clear, a ROM hack that opens up the game and changes the structure, seems like a good next challenge (perhaps combined with a more stripped down harness moving towards vision-only). After that, I want to move into later generations such as Pokemon Emerald, and then into games that are not Pokemon at all.</p>
<p>All of this work runs under the non-profit I created, the Agentic Research &amp; Intelligence Systems Evaluation (ARISE) Foundation. ARISE is focused on building and running long term agentic AI evaluations in rich environments that people intuitively understand, such as Pokemon and other interactive worlds, rather than in narrow, static benchmarks.</p>
<p>If you or your organization are interested in funding or sponsoring this work at a meaningful scale, you can get in touch through the ARISE site at <a target="_blank" href="https://www.arisef.org/">https://www.arisef.org/</a>. To keep up with future runs and writeups, follow me on X (Twitter) at <a target="_blank" href="https://x.com/TheCodeOfJoel">@TheCodeOfJoel</a> and subscribe to this blog. And of course, I stream on Twitch at <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">https://www.twitch.tv/gemini_plays_pokemon</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Gemini Plays Pokemon: First Impressions of Gemini 3 Pro "Riftrunner"]]></title><description><![CDATA[With the release of Gemini 3 Pro Preview (codenamed "Riftrunner"), I wanted to share some initial observations from my testing over the past few days. I received early access from Google as one of their "Trusted Testers" because of my work on Gemini ...]]></description><link>https://blog.jcz.dev/gemini-plays-pokemon-first-impressions-of-gemini-3-pro-riftrunner</link><guid isPermaLink="true">https://blog.jcz.dev/gemini-plays-pokemon-first-impressions-of-gemini-3-pro-riftrunner</guid><category><![CDATA[General Programming]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[pokemon]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Joel Zhang]]></dc:creator><pubDate>Tue, 18 Nov 2025 16:31:05 GMT</pubDate><content:encoded><![CDATA[<p>With the release of Gemini 3 Pro Preview (codenamed "Riftrunner"), I wanted to share some initial observations from my testing over the past few days. I received early access from Google as one of their "Trusted Testers" because of my work on <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">Gemini Plays Pokemon, a long-horizon agentic project on Twitch</a>. In that project, Gemini 2.5 Pro has already beaten <em>Pokemon Blue</em> and <em>Yellow Legacy</em> and is currently progressing through <em>Crystal</em>.</p>
<p>The harness used for these tests is the exact same iteration I first deployed during the <em>Yellow Legacy</em> run, and which 2.5 Pro has been piloting for over 522 hours in <em>Pokemon Crystal</em>. This system allows the model to interact with the game environment through a suite of high-level capabilities. It can maintain persistent state via map markers and a notepad, instantiate custom sub-agents for specialized reasoning tasks, and create custom Python tools to automate input sequences or process game data.</p>
<p>I've been running the new model through a fresh run of Crystal to see how it handles the specific challenges of navigating a game world compared to the 2.5 Pro baseline. While these are just first impressions, the improvement is noticeable.</p>
<h2 id="heading-spatial-reasoning-and-navigation">Spatial Reasoning and Navigation</h2>
<p>A significant improvement I've observed is in spatial reasoning. Where Gemini 2.5 Pro often relied on brute-forcing navigation tools against a map it didn't fully understand, Riftrunner demonstrates a much more grounded intuition for the game's geometry.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763453245888/3355775c-fcb7-4682-b98d-9474b725a550.png" alt class="image--center mx-auto" /></p>
<p>Here’s an example screenshot the harness provides during normal gameplay. In Gemini CLI I prompted, “give the exact steps needed to reach the right of the NPC at the top of the screen.”</p>
<p>Riftrunner replied:</p>
<blockquote>
<p>After reviewing the grid, you cannot move straight up because the building and its door warp block that route, so you must go around on the right side.</p>
<p>Here is the exact move list to reach (25, 7):</p>
<ul>
<li><p><strong>Right</strong> 4 times (moves you to 28, 11)</p>
</li>
<li><p><strong>Up</strong> 4 times (moves you to 28, 7)</p>
</li>
<li><p><strong>Left</strong> 3 times (moves you to 25, 7)</p>
</li>
</ul>
<p>This will place you directly to the right of the NPC. Be careful not to step on the door tile at (25, 9), or you will trigger the warp.</p>
</blockquote>
<p>In contrast, here is the session transcript of the same prompt with 2.5 Pro:</p>
<blockquote>
<p><strong>@ saves/screenshot_enlarged.png give the exact steps needed to reach the right of the NPC at the top of the screen</strong></p>
<p>Based on the image, here are the steps:</p>
<ol>
<li><p>Move left 1 time</p>
</li>
<li><p>Move up 4 times</p>
</li>
<li><p>Move right 2 times</p>
</li>
</ol>
<p><strong>\&gt; are you sure? what coordinate does the player start from, and what tile does each move arrive at?</strong></p>
<p>You are correct to question that. I made an error.</p>
<p>The player starts at (24, 11).</p>
<p>Here is the correct path:</p>
<ol>
<li><p>Move left 2 times, arriving at (22, 11)</p>
</li>
<li><p>Move up 4 times, arriving at (22, 7)</p>
</li>
<li><p>Move right 3 times, arriving at (25, 7)</p>
</li>
</ol>
<p><strong>\&gt; you cannot walk through NPCs</strong></p>
<p>You are correct. That was another mistake. Here is a path that is actually valid.</p>
<p>The player starts at (24, 11). Here is a path that works:</p>
<ol>
<li><p><strong>Move right 4 times</strong>, arriving at (28, 11)</p>
</li>
<li><p><strong>Move up 4 times</strong>, arriving at (28, 7)</p>
</li>
<li><p><strong>Move left 3 times</strong>, arriving at (25, 7)</p>
</li>
</ol>
</blockquote>
<p>Overall, Riftrunner appears to have much better intuition for map geometry:</p>
<ul>
<li><p><strong>Map Segmentation:</strong> Gemini 2.5 Pro struggled heavily whenever the map was segmented. When a target was separated from the current location—whether by walls or void—its custom pathfinder tool would return a failure because no valid path existed in the current data. 2.5 Pro would frequently interpret this tool failure as a bug in the code itself. Riftrunner, however, correctly identifies that the map is simply segmented and the tool is working as intended, allowing it to move on to finding a valid route.</p>
</li>
<li><p><strong>Handling Unseen Areas:</strong> This applies specifically to "unseen" tiles (fog of war). While 2.5 Pro often treated the inability to pathfind through unrevealed tiles as a hard boundary, Riftrunner seems to understand the distinction between "no path found <em>right now</em>" and "impossible to reach," realizing it simply needs to explore to reveal the connecting geometry.</p>
</li>
<li><p><strong>Gap Detection:</strong> It correctly identified a one-tile wide gap between ledges that it could walk through to reach a warp. In previous tests, 2.5 Pro would often flag this as a dead end and attempt to backtrack.</p>
</li>
<li><p><strong>Manual Pathing:</strong> Interestingly, Riftrunner often prefers manual step-by-step navigation over using its custom pathfinding tools. Despite this manual approach, it rarely walks into walls—a frequent issue for 2.5 Pro—and successfully negotiates complex obstacles by identifying walkable tiles in its immediate vicinity.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763455652875/301aac0d-02a2-460a-a53d-28a8ccfd1685.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Visual Comprehension:</strong> I did some testing with game screenshots. Not only was Riftrunner able to correctly identify NPC locations, it could also accurately recognize specific environmental details (see above)—both massive improvements over 2.5 Pro. This fidelity opens the door to experimenting with a "raw" vision-based harness that reduces reliance on direct RAM extraction—perhaps the most significant improvement of all.</p>
</li>
<li><p><strong>Grid Navigation:</strong> When nicknaming a Pokemon, 2.5 Pro would frequently miscalculate the cursor movement required on the keyboard interface, overshooting letters. Riftrunner has consistently planned the correct sequence of button presses to enter the desired text. Later on, it even created a custom tool to automate this process, which is something 2.5 Pro never considered.</p>
</li>
</ul>
<h2 id="heading-metacognition-and-error-recovery">Metacognition and Error Recovery</h2>
<p>A common failure mode for 2.5 Pro is "blaming the tools." If a custom pathfinding tool returns no result, the older model often assumes the tool itself is broken and enters a loop of attempting to debug the Python code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763452560805/c0532381-0d30-45fa-9f19-7506fee4aacb.png" alt class="image--center mx-auto" /></p>
<p>Riftrunner demonstrates a higher likelihood of trusting the tool and re-evaluating the environment. A clear example occurred in Sprout Tower, pictured above (<a target="_blank" href="https://i.imgur.com/DIDD8Vx.jpeg">link to the full turn screenshot here</a>):</p>
<ul>
<li><p>The model needed to reach a ladder on the west side of the first floor.</p>
</li>
<li><p>The direct path was blocked by a central column of walls.</p>
</li>
<li><p>Instead of trying to force a path or rewrite the navigation tool ad nauseam, Riftrunner analyzed the map structure and realized it was effectively "trapped" in the central area.</p>
</li>
<li><p>It formulated a multi-floor plan: ascend to 2F, traverse the second floor to a different ladder, and descend back to 1F in the isolated western corridor.</p>
</li>
</ul>
<p><a target="_blank" href="https://i.imgur.com/C7ZeyRC.jpeg"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763441763679/bc776d83-4028-47da-bb7c-cc7230b90bd4.png" alt="A screenshot of a text document describing a strategy for navigating obstacles in a video game. It includes sections titled &quot;Assessing Obstacles in Azalea,&quot; &quot;Examining Potential Paths,&quot; &quot;Identifying Alternatives to North,&quot; and &quot;Mapping A Possible Exit,” along with map coordinates and strategy notes." class="image--center mx-auto" /></a></p>
<p>A similar moment of realization happened in Azalea Town, pictured above. The model initially flagged a Rocket Grunt as blocking the path to town. In previous runs, 2.5 Pro would often accept obstructions like this as an absolute dead end. Riftrunner, however, re-examined the map coordinates and realized the Grunt was only guarding the Slowpoke Well, correctly identifying a walkable path around the NPC to continue into town. <a target="_blank" href="https://i.imgur.com/C7ZeyRC.jpeg">See the full turn screenshot here.</a></p>
<h2 id="heading-instruction-following">Instruction Following</h2>
<p>Adherence to complex system prompts seems better, though still not perfect: Riftrunner pays significantly closer attention to the myriad instructions in the prompt.</p>
<p>The harness also includes a warning system for "unmarked warps," flagging two adjacent tiles as separate entries (as it evaluates each coordinate independently). 2.5 Pro treated these as distinct entities, leading to redundant checks. Riftrunner correctly deduces they are the same logical warp.</p>
<h2 id="heading-current-limitations">Current Limitations</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763452947272/cb3cb4e1-d79c-4c92-b1b3-c61d8ec0397d.png" alt="Terminal screenshot with the following text: &quot;💭💭💭 Thoughts 💭💭💭 Youngster Albert is sending out a Zubat. This is a perfect opportunity for Rocky (Onix). As a Rock/Ground type, he basically walls Zubat's Poison and Flying moves. He's only Level 5, so he needs the experience badly. Even if he just enters the battle and I have to switch him out later, he'll get a share of the EXP. I'll choose 'YES' to change Pokémon.  💬 Verbal response: Rocky, it's your turn to shine! Eat that bat!&quot;" class="image--center mx-auto" /></p>
<p>While it's better at reasoning, the model is not without flaws:</p>
<ul>
<li><p><strong>Proactive vs. Reactive:</strong> Like 2.5 Pro, Riftrunner struggles to proactively manage long-term secondary goals. For example, it identified a goal to train a low-level Onix. In battle, it correctly identified a safe switch opportunity against a Zubat (pictured above)—which, to be fair, is much improved behavior over 2.5 Pro which would more often than not ignore its less important goals. However, it failed to take the more obvious proactive step of moving the Onix to the lead party slot <em>before</em> the battle.</p>
</li>
<li><p><strong>Code Editing:</strong> When using Gemini CLI to refactor code, it still introduces occasional syntax errors, such as deleting closing brackets. Additionally, in large codebases, it sometimes struggles to identify the correct locations for a refactor, failing to investigate the full context before acting.</p>
</li>
<li><p><strong>Tool Usage:</strong> I observed instances where the model called custom automation tools (like button presses) but forgot to set the required parameter to actually execute the action, defaulting to a "dry run" return. However, unlike 2.5 Pro, it typically recognizes this mistake quickly and self-corrects in the subsequent turn.</p>
</li>
<li><p><strong>Vision:</strong> Estimating enemy HP percentage from screenshots remains inconsistent, even with grid overlays. This may be due to how images are processed in the model pipeline or API; further testing is needed.</p>
</li>
</ul>
<h2 id="heading-conclusion">Conclusion</h2>
<p>While it is too early to render a final verdict after only a few days of testing, "Riftrunner" is shaping up to be a clearly superior model to the 2.5 Pro baseline. It demonstrates improved reasoning, higher capability, and overall greater intelligence in how it approaches complex, open-ended tasks. It feels like a substantial update worthy of the 3.0 moniker, and I look forward to seeing how it handles the rest of the Johto journey.</p>
]]></content:encoded></item><item><title><![CDATA[The Making of Gemini Plays Pokémon]]></title><description><![CDATA[Google CEO Sundar Pichai jokingly refers to “API – Artificial Pokémon Intelligence” on stage while celebrating Gemini’s Pokémon victory.
Can a large language model beat a 29-year-old video game? It’s a question that seems simple on the surface, but t...]]></description><link>https://blog.jcz.dev/the-making-of-gemini-plays-pokemon</link><guid isPermaLink="true">https://blog.jcz.dev/the-making-of-gemini-plays-pokemon</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[gemini]]></category><category><![CDATA[pokemon]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Joel Zhang]]></dc:creator><pubDate>Wed, 13 Aug 2025 19:53:28 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://i.imgur.com/q1n2bwB.png" alt="image of Sundar Pichai presenting the Gemini Plays Pokémon project at Google I/O 2025" /></p>
<p><em>Google CEO Sundar Pichai jokingly refers to “API – Artificial Pokémon Intelligence” on stage while celebrating Gemini’s Pokémon victory.</em></p>
<p>Can a large language model beat a 29-year-old video game? It’s a question that seems simple on the surface, but the answer turned out to be a whirlwind of late-night coding, bizarre AI behavior, and a surprising amount of high-profile attention. When I started the <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">Gemini Plays Pokémon project</a>, I wanted to see if Google’s latest AI could succeed where others had struggled. The journey that followed culminated in a <a target="_blank" href="https://x.com/sundarpichai/status/1918455766542930004">shout-out from Google’s CEO, Sundar Pichai</a>, and revealed just how complex that simple question really is.</p>
<p>Why Pokémon? On the surface, having an AI beat a 29-year-old game might sound frivolous, but <em>Pokémon Blue</em> offers an intriguing testing ground for AI reasoning. The game demands long-term planning, strategic decision-making, and spatial navigation through an open world – core challenges on the path to more general AI.</p>
<h2 id="heading-from-spectator-to-creator">From Spectator to Creator</h2>
<p>I came to this project as a curious spectator. For months, I had heard about Anthropic’s <strong>Claude Plays Pokémon</strong> experiment on Reddit and in the news. When I checked out the stream, however, it was immediately apparent that the model’s vision for 2D, top-down pixel art games was sorely lacking. Claude’s run was slow and often confused. Viewers noted that it struggled to “see” critical details in the 2D Game Boy graphics. For example, its harness didn’t provide information on which trees were <em>cuttable</em>, forcing it to rely on unreliable visual input, which frequently got it stuck. Claude also often mistook signs for building entrances, and at one point, the poor AI even mistook its own player avatar for an NPC. As the Gemini 2.5 technical report later put it about Gemini (though it applies here too), the model “struggled to utilize the raw pixels of the Game Boy screen directly” and it was “necessary for the required information from the screen to be translated into a text format.” Due to these shortcomings, I wondered how far it could get if it were given the information and tools it actually needed.</p>
<p>The release of <strong>Google’s Gemini 2.5 Pro</strong> in March 2025 presented a new possibility. Boasting a <strong>1 million token context window</strong> (a huge leap over Claude’s 200k tokens) and strong reasoning abilities, the AI community buzzed with excitement. On Reddit, users were already calling for a showdown. <em>“Someone needs to pit this against Claude 3.7 to see who can beat Pokemon the fastest,”</em> one user wrote, pointing out Claude’s struggles with its context window. You might be surprised to know that this didn’t inspire me right away. But I was looking for another project to do, and after sitting on the idea for a few days, it appealed to me more and more. On March 25, 2025, at 15:35:51 Pacific Time, I replied to that thread and <a target="_blank" href="https://www.reddit.com/r/singularity/comments/1jjovcc/comment/mjqk73a/?context=3">publicly threw my hat in the ring</a>: <em>“I’m gonna work on this and see what I can do.”</em> By that afternoon, I had started coding what would become the <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon"><strong>Gemini Plays Pokémon</strong></a> harness.</p>
<h2 id="heading-the-two-day-prototype">The Two-Day Prototype</h2>
<p>I consider myself a perfectionist by nature. Whether it’s working on backend or frontend, I like to build something polished, and often find myself making sure things are pixel perfect. But my eight years in a startup taught me the value of a <strong>Minimum Viable Product (MVP)</strong>. I wanted to gauge interest before committing more fully, so I set aside grand plans and focused on getting a basic prototype working <strong>fast</strong>.</p>
<p>Within two days, I had a bare-bones system running Pokémon Blue with Gemini 2.5 Pro at the controls. The initial setup was as simple as possible: the Twitch stream showed my terminal on the left and the game on the right, no fancy overlays. The harness itself was coded in Node.js, and under the hood, a Lua script served as a bridge to the Game Boy emulator (mGBA), allowing it to read the game’s memory and send button inputs. Each turn, Gemini got an <strong>annotated screenshot</strong> plus a text dump of on-screen information. This annotation process became more sophisticated over time, but the core idea of translating the visual game world into a structured, textual format was there from the beginning. It would output a single move (like “Up”, “Down”, “A”, etc.), which my code translated into a button press. In this early version, Gemini could only move one tile at a time. There was no critique system or context summary, which would become core parts of the harness later on.</p>
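<p>To make that loop concrete, here is a minimal sketch of what a single turn looked like. The function and object names here are illustrative, not the harness's actual API:</p>

```javascript
// Illustrative turn loop (names are placeholders, not the real harness API).
// Each turn: read the annotated game state, ask the model for one move,
// and translate that move into a button press on the emulator.
const BUTTONS = { Up: 0x40, Down: 0x80, Left: 0x20, Right: 0x10, A: 0x01, B: 0x02 };

// The model replies with a single word like "Up" or "A"; treat anything
// unrecognized as a no-op rather than pressing a random button.
function parseMove(modelOutput) {
  const move = modelOutput.trim().split(/\s+/)[0];
  return Object.prototype.hasOwnProperty.call(BUTTONS, move) ? move : null;
}

async function playTurn(model, emulator) {
  const state = emulator.readAnnotatedState(); // screenshot + RAM-derived text
  const reply = await model.generate(`Current state:\n${state}\nChoose one move.`);
  const move = parseMove(reply);
  if (move) emulator.pressButton(BUTTONS[move]); // one tile, one press per turn
  return move;
}
```

<p>The entire "agent" at this stage was little more than this loop running forever; everything else (critiques, summaries, sub-agents) came later.</p>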
<p><img src="https://i.imgur.com/g3DHf2O.jpeg" alt="Annotated screen" /></p>
<p><em>An example of the annotated screen from a later version of the harness. The system extracts key information from the game’s memory—like object locations, tile properties, and navigability info—and presents it as a clean, textual overlay. You can see how the design evolved from its</em> <a target="_blank" href="https://i.imgur.com/KLjGVmY.png"><em>first</em></a> <em>and</em> <a target="_blank" href="https://i.imgur.com/vZflaNR.png"><em>second</em></a> <em>iterations.</em></p>
<p>Getting this far required solving a lot of nitty-gritty technical problems: connecting to the emulator, translating raw pixel data into a form Gemini could interpret, rate-limiting the model’s API calls, and so on. But by late March, the Twitch stream was live and Gemini was wandering around Pallet Town, playing Pokémon all by itself. Once the stream was up, I invited viewers from the Claude Plays Pokémon stream to check out my project, and the viewer base quickly grew. And importantly, it wasn’t costing me a dime: Gemini 2.5 Pro was free to use (albeit with strict rate limits), and I discovered that routing requests through the OpenRouter proxy bypassed them. Within the first couple of weeks, a friendly contact at Google DeepMind reached out. They told me lots of people at Google were watching the project with enthusiasm and generously provided me with free, unlimited API access, at which point I switched over from OpenRouter. It was an incredibly encouraging sign of things to come.</p>
<p>The first month after launch was a blur. I was working over 12 hours a day on the harness, cooped up in my office to the point that my wife joked she barely saw me anymore. Normally, I love sleeping in, but that changed during those first few weeks. I’d wake up bright and early, and my mind would immediately be buzzing with ideas for new tweaks or an urgent need to check on Gemini’s progress – I simply couldn’t fall asleep again. That intense focus was necessary, because I soon ran into my first major hurdle: the model’s own memory.</p>
<h2 id="heading-taming-the-long-context-loop">Taming the Long-Context Loop</h2>
<p>One advertised superpower of Gemini 2.5 Pro was its gigantic context window. In theory, I could feed the model <em>everything</em> – the entire game history, all observations – without ever resetting its context. I was determined to make full use of that 1M-token context. In practice, though, it was quickly apparent that the model still had flaws I had seen in other models with large context windows.</p>
<blockquote>
<p><mark>As the prompt grew, the model would eventually fall into a </mark> <strong><mark>pattern-repeating mode</mark></strong><mark>, sacrificing novel decision-making for repetitive loops.</mark></p>
</blockquote>
<p>Past a certain point (upwards of 100k tokens, as also noted in Google’s Gemini 2.5 technical report), the model would fall into repetitive loops of behavior instead of coming up with new strategies.</p>
<p>An example of this was Gemini getting stuck in a fenced backyard behind Pewter City’s gym. The area is “fenced off” by a row of ledges that the player can simply hop down at any point to leave, or exit through a small gap. After hours of gameplay, Gemini started walking in circles, claiming it was trapped. This went on for about 8 hours before I intervened by force-clearing the model’s context. After that, it was able to simply walk down and out of the backyard, no longer stuck in its repetitive loop.</p>
<p>This experience led me to implement a <strong>periodic context reset with summarization</strong>. Every 100 turns, the system would take all the recent events and condense them into a brief summary, which I’d prepend to a brand-new context for Gemini along with essential persistent info. This way, Gemini wasn’t burdened by an ever-growing transcript, but it wouldn’t completely forget its progress either. In essence, I gave it a form of <strong>long-term memory</strong> – at the cost of some detail fidelity. It felt a bit sad to throw away that million-token window, but until these models can truly handle ultra-long contexts without looping, this was a necessary trade-off.</p>
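<p>The reset logic itself was simple. Here is a hypothetical sketch; the 100-turn interval matches the run, but the function names and data shapes are mine:</p>

```javascript
// Sketch of the periodic context reset (names are illustrative).
// Every RESET_INTERVAL turns, the transcript is condensed into a summary
// and a fresh context is seeded with that summary plus persistent info.
const RESET_INTERVAL = 100;

function shouldReset(turn, interval = RESET_INTERVAL) {
  return turn > 0 && turn % interval === 0;
}

function buildFreshContext(persistentInfo, summary) {
  // Persistent info (goals, map memory, notepad) survives every reset;
  // the summary stands in for the discarded turn-by-turn transcript.
  return [...persistentInfo, summary];
}
```

<p>The interesting design question is what goes into <code>persistentInfo</code>: too little and the model forgets hard-won lessons, too much and the fresh context inherits the same clutter the reset was meant to clear.</p>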
<p>Hand-in-hand with summary resets, I introduced a <strong>self-critique mechanism</strong>. The system would invoke a new, temporary Gemini instance with a specialized “critiquer” persona. This new instance would analyze the main AI’s performance history, focusing on poor strategy, nonsensical goals, or potential hallucinations. In addition, it would generate a game walkthrough for the AI's current primary goal, explaining the next steps to take, derived entirely from the model's own training data. The output from this separate critique instance was then fed back into the main AI’s context. This was a double-edged sword: sometimes the critiques were insightful, helping Gemini break out of a rut. Other times, the critiquer would hallucinate incorrect information as part of its generated walkthrough, such as telling the main AI to go south-east to find an exit when the real exit was to the west. But overall, this added reflection made the AI more robust.</p>
<p>Another tweak to fight hallucinations involved Pokémon move management. At first, Gemini had a bad habit of <strong>hallucinating 0 PP (Power Points)</strong> for moves that were actually fully restored. For instance, even right after visiting a Pokémon Center (which restores all PP), it might <em>refuse to use</em> a move because somewhere in its long context it had remembered that move being depleted earlier. The model was over-trusting its memory. I attempted a prompt-based fix (instructing Gemini “don’t assume a move is 0 PP; always check the screen”), but the hallucination persisted. Ultimately, I solved this by directly injecting ground-truth data: I pulled each Pokémon’s current moves and PP from the game’s RAM and included that in Gemini’s prompt. With the live PP info in the prompt, Gemini stopped worrying about phantom empty moves and started using its best attacks whenever available.</p>
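<p>The fix amounted to rendering the live move data directly into the prompt each turn. A minimal sketch, where the data shape is an assumption rather than the actual RAM layout:</p>

```javascript
// Format ground-truth move data pulled from the emulator's RAM so the
// model never has to trust a stale memory of a move being depleted.
// (The shape of the pokemon object here is illustrative.)
function formatMoves(pokemon) {
  return pokemon.moves
    .map(m => `${m.name}: ${m.pp}/${m.maxPp} PP`)
    .join("\n");
}
```

<p>Because the numbers come from RAM on every turn, a Pokémon Center visit is reflected immediately, and there is no stale "0 PP" fact left in the context for the model to latch onto.</p>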
<p>To ensure Gemini remembered its objectives after a context clear, I made use of the existing, simple but highly effective <strong>goals system</strong>. The harness prompted Gemini to establish a primary, secondary, and tertiary goal, which were re-inserted into the context after every reset.</p>
<ul>
<li><p><strong>Primary goal:</strong> The main progression objective, such as obtaining the next gym badge.</p>
</li>
<li><p><strong>Secondary goal:</strong> A supportive objective that directly enables the primary goal, like finding a key item required for passage.</p>
</li>
<li><p><strong>Tertiary goal:</strong> An opportunistic goal that could be pursued if convenient, such as exploring the remainder of a route or catching a new Pokémon.</p>
</li>
</ul>
<p>Without this, the AI would likely waver on its next steps, especially in the mid-game where progression is less linear.</p>
<p>These early hurdles reinforced my core philosophy for the project. Unlike the Claude Plays Pokémon experiment which seemed focused on testing a raw model’s limits, I wanted to build an agent capable of beating the game – not by handholding, but by general improvements to the harness.</p>
<blockquote>
<p><mark>My goal wasn’t to spoon-feed the AI a walkthrough, but to give it better tools so it could solve the problems on its own.</mark></p>
</blockquote>
<h2 id="heading-mapping-the-world-giving-gemini-spatial-memory">Mapping the World: Giving Gemini Spatial Memory</h2>
<p>One of the biggest limitations of an AI playing a video game is the lack of <strong>mental mapping</strong>. Both the Claude and Gemini projects suffered from this, as the AI could only perceive what was immediately on screen. Claude had previously attempted to generate ASCII maps, but they weren’t very useful. Human players build an internal map of areas they’ve explored, but language models, by default, don’t have this spatial memory. This meant that <strong>every time</strong> Gemini entered a maze-like area, it was like the first time – it would wander aimlessly, forget where it had been, and often get lost or stuck retracing its steps.</p>
<p>To address this, I implemented a <strong>Map Memory</strong> system. Essentially, I gave Gemini a fog-of-war style minimap that persists throughout the run, recording every tile it has seen. Internally, this was stored as a JSON list of discovered tiles, but on the Twitch stream I visualized it as an overhead map that gradually filled in as Gemini explored. (Gemini itself never saw the rendered map image – it only got a text representation of the known terrain.) Now, when Gemini re-entered Viridian Forest, it “knew” which paths it had tried and which areas still needed further exploration.</p>
<p>Building on this, I added a concept of <strong>Reachable Unseen Tiles</strong>. Each turn, the harness would analyze the map data and provide a list of coordinates for unexplored tiles that were currently reachable from Gemini’s position. I initially fed this info into the prompt as a gentle nudge, hoping it would encourage the AI to systematically cover new ground instead of looping. However, Gemini still sometimes claimed it was stuck when the map actually had unseen areas it simply hadn’t walked into yet. It was frankly frustrating to watch it give up and consider drastic measures (like intentionally <strong>blacking out</strong> its team to respawn at a Pokémon Center) when a few steps away lay the exit it needed.</p>
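<p>Computing that list is essentially a flood fill over the recorded tiles. Here is a sketch, assuming seen tiles are keyed by <code>"x,y"</code> with a walkable flag; the real data shapes differed:</p>

```javascript
// Minimal sketch of the "reachable unseen tiles" pass. BFS outward from
// the player over known-walkable tiles; any neighbor with no entry in the
// map memory is an unexplored tile the player could walk into right now.
function reachableUnseen(seen, startX, startY) {
  const key = (x, y) => `${x},${y}`;
  const frontier = [];
  const queue = [[startX, startY]];
  const visited = new Set([key(startX, startY)]);
  while (queue.length > 0) {
    const [x, y] = queue.shift();
    for (const [dx, dy] of [[0, 1], [0, -1], [1, 0], [-1, 0]]) {
      const nx = x + dx, ny = y + dy, k = key(nx, ny);
      if (visited.has(k)) continue;
      visited.add(k);
      const tile = seen[k];
      if (tile === undefined) {
        frontier.push([nx, ny]);   // unexplored but adjacent: report it
      } else if (tile.walkable) {
        queue.push([nx, ny]);      // explored and walkable: keep searching
      }
    }
  }
  return frontier;
}
```

<p>The resulting coordinate list is roughly the nudge that was fed into the prompt: if it is non-empty, the model is demonstrably not stuck, no matter how stuck it feels.</p>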
<p>The solution was to make Gemini <em>obsessed</em> with exploration. I cranked up the directive in its prompt to <strong>uncover every single tile</strong> of the map, placing that priority even above its primary goal of, say, beating the next Gym. If there was dark fog-of-war in an accessible area, that was priority number one. This change was transformative – suddenly Gemini almost <em>never</em> said it was stuck. It diligently sniffed out every nook and cranny, and once the entire map of a floor was revealed, it inherently had all the information a human would to find an exit or a key item. This externalized spatial memory was the key to overcoming its navigational limitations.</p>
<p>To be honest, this felt like a heavy-handed solution at times. The AI would develop a horrible compulsion, complaining that it couldn’t enter a key building yet because there was a single, unimportant tile left to explore on the other side of the map. While effective, it wasn’t elegant. I later found a better balance in the Pokémon Yellow run by loosening this restriction, which proved just as effective without the obsessive behavior.</p>
<h2 id="heading-a-council-of-agents">A Council of Agents</h2>
<p>Even with a full map at its disposal, Gemini struggled with certain puzzles that require planning <strong>far ahead</strong> – notably, the infamous spinner maze in Team Rocket’s Hideout (Basement Floor 3) and the boulder Strength puzzles in Victory Road. These puzzles are tricky because the correct path is often <em>counter-intuitive</em>. For example, in the Rocket Hideout spinner maze, the obvious visual goal (the staircase) is nearby but <strong>unreachable if you walk straight towards it</strong>; you must first go the opposite way, through a maze of conveyor belts that loop you around. Time and again, I watched Gemini fall for the <strong>“attractor”</strong> – it would see the goal, try to beeline towards it, and get sent back to the start by a trap. The AI just wouldn’t take a step back and plan a longer, indirect route.</p>
<p><img src="https://i.imgur.com/Wu7gksq.png" alt="image of Mt Moon B2F attractor puzzle" /></p>
<p><em>A classic “attractor” puzzle in Mt. Moon. The direct path (shown in red) is blocked. The correct path (shown in magenta) requires moving away from the goal to find the loop that leads to the exit.</em></p>
<p>Part of the issue was that Gemini wasn’t fully understanding how the spinner tiles worked. These tiles send you involuntarily sliding in one direction until you hit a stop. It would repeatedly step on the same spinner, seemingly thinking it would be sent in a different direction. To help, I augmented the Map Memory with special annotations for <strong>spinner paths</strong>. This involved pre-computing the end-coordinates for any spinner sequence Gemini had already fully explored, giving it a deterministic view of what each <em>seen</em> spinner would do, instead of it having to guess.</p>
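<p>The pre-computation is just a simulation of the slide. Here is a simplified sketch, assuming each explored spinner tile is recorded with the direction it launches you; the real mechanics also involve walls and stop tiles:</p>

```javascript
// Sketch of pre-computing spinner destinations (the tile encoding is an
// assumption: spinner tiles map "x,y" to the direction they push you).
const DIRS = { up: [0, -1], down: [0, 1], left: [-1, 0], right: [1, 0] };

// Follow the chain of spinner tiles until landing on a normal tile, so the
// model sees a deterministic "step here, end up there" fact instead of
// guessing. Returns null if the spinners form a loop with no exit.
function spinnerDestination(tiles, x, y) {
  const key = (a, b) => `${a},${b}`;
  const trace = new Set();
  let dir = tiles[key(x, y)];
  while (dir !== undefined) {
    const k = key(x, y);
    if (trace.has(k)) return null; // cyclic spinner chain, no stop
    trace.add(k);
    const [dx, dy] = DIRS[dir];
    x += dx;
    y += dy;
    dir = tiles[key(x, y)];
  }
  return [x, y];
}
```

<p>Running this once per explored spinner turned an involuntary, confusing mechanic into plain map data the model could plan against.</p>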
<p>But even with a grasp of the mechanics, the <strong>pathfinding problem</strong> in these mazes was non-trivial. I was reluctant to just hard-code a solution or give it a typical pathfinding algorithm that automatically moved the player. I still wanted Gemini to <em>figure out the path</em> – but maybe it could use some extra brainpower to do so. This led to a significant upgrade in the harness: the introduction of specialized <strong>planning agents</strong>.</p>
<p>I created a tool that let Gemini essentially spawn a secondary <strong>Pathfinder Agent</strong> (another instance of Gemini 2.5 Pro) with a <em>blank slate</em> and a singular focus: find a route from point A to point B. When Gemini encountered a maze it couldn’t easily solve (or just didn’t understand why it couldn’t walk through a wall), it could invoke this agent, which was provided with the current map layout and tile mechanics.</p>
<p>The agent was then prompted to <em>mentally simulate an optimal pathfinding algorithm</em> to find a route. Thinking in isolation, free from the clutter of the full game context, the agent would devise a path consistent with A* or Breadth-First Search—an incredible demonstration of pure reasoning. The first time I tried this, the Pathfinder agent <strong>one-shot the Rocket Hideout’s B3F maze</strong>, a task that had stumped the primary agent for days.</p>
<blockquote>
<p><mark>This success felt like magic, a powerful demonstration of the reasoning strength of Gemini 2.5 Pro.</mark></p>
</blockquote>
<p>Of course, it wasn’t foolproof. On more convoluted maps that required long detours, even the Pathfinder agent would sometimes hallucinate shortcuts (e.g. “just walk through this wall!”) or convince itself that a row of impassable tiles was, in fact, a ledge it could jump down from. But overall it dramatically improved Gemini’s ability to handle complex navigation. I observed an interesting scaling difference here: <strong>Gemini 2.5 Pro</strong> (the large model) could reason out quite long paths, whereas the smaller <strong>Gemini 2.5 Flash</strong> model (which I tested in a brief side-run) often failed on the same tasks. The Flash model, while significantly faster, was considerably less capable and struggled with navigation, never managing to exit Mt. Moon in my local tests.</p>
<p>The success of the Pathfinder agent highlights a crucial point about evaluating modern AI. Because models struggle to interpret raw game visuals reliably, providing a textual, “fog-of-war” style map of explored tiles reframes the challenge: the question shifts from “Can you see the path?” to “Given a structured map, can you interpret the data and deduce the correct path?” That is a complex task in its own right, requiring the model to reason about spatial relationships from textual information alone. The gap between Gemini 2.5 Pro, which could solve intricate mazes this way, and Gemini 2.5 Flash, which could not, suggests this approach genuinely measures core reasoning ability rather than perception. This principle of empowering the AI with better tools to isolate its reasoning capabilities was a foundational concept that we would later expand upon, eventually allowing the AI to program its own navigational solutions.</p>
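<p>As a concrete illustration of the fog-of-war idea – note the real harness used a textual XML format, so this plain-ASCII rendering is hypothetical – explored tiles can be serialized into a minimap where everything unexplored is masked out:</p>

```python
def render_minimap(explored, player, width, height):
    """Render explored tiles as text; unexplored tiles stay fogged ('?').

    explored: dict mapping (row, col) -> tile char ('.', '#', 'D', ...)
    player: (row, col) of the player, drawn as 'P'.
    """
    rows = []
    for r in range(height):
        row = []
        for c in range(width):
            if (r, c) == player:
                row.append("P")
            else:
                row.append(explored.get((r, c), "?"))  # fog of war
        rows.append("".join(row))
    return "\n".join(rows)
```

<p>A map like this only ever shows what the agent has personally walked past, which is exactly what makes the “explore every tile” directive enforceable.</p>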
<p>The other major in-game challenge was <strong>Strength boulder puzzles</strong> (think Sokoban-style puzzles where you push rocks onto switches). Gemini actually solved most of the Victory Road boulder puzzles on its own in Run 1, but it got stuck on one particularly challenging puzzle that required pushing a boulder from far away. The AI couldn’t make the logical leap that this specific, distant boulder was the correct one. I believe that given enough time, it could have eventually solved the puzzle through trial and error, but for the sake of viewer entertainment I opted for a faster solution.</p>
<p><img src="https://i.imgur.com/0MXsnLW.png" alt="image of victory road boulder puzzle with red arrow" /></p>
<p><em>The Victory Road boulder puzzle that stumped the AI. The correct boulder had to be pushed from a considerable distance (path shown in red) to reach the switch.</em></p>
<p>I decided to apply the same idea as the Pathfinder here: a dedicated <strong>Boulder Puzzle Strategist</strong> agent. This agent would, given the current layout of boulders, holes, and switches, simulate which boulders could reach which switches, determine the correct pairings, and then plan out the sequence of pushes.</p>
<p>I even let Gemini 2.5 Pro help write the prompt for this agent. I described the problem, saying something like, <em>“write a prompt for an LLM to solve Sokoban-style puzzles, where boulders need to be pushed onto switches”</em>, then let the model draft a prompt, which I refined. The result was an agent that could solve these multi-step puzzles in one go, sparing the main model from potentially hundreds of trial-and-error moves. Like the Pathfinder, it wasn’t perfect and would sometimes claim no solution existed, but it was a dramatic improvement. In effect, it compressed what might be countless hours of shoving a boulder around, resetting, and repeating, into a single burst of reasoning and a clean plan.</p>
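<p>The search the Boulder Puzzle Strategist simulated in natural language can likewise be sketched as a breadth-first search over (player position, boulder positions) states. This is a simplified toy model – real Strength puzzles also involve holes and dropping boulders between floors – but it captures the core task of planning a push sequence:</p>

```python
from collections import deque

def solve_boulders(walls, switches, boulders, player):
    """BFS over (player, boulders) states for a Strength-style puzzle.

    walls, switches: sets of (row, col). boulders: frozenset of (row, col).
    Returns a list of step/push directions, or None if unsolvable.
    """
    dirs = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    start = (player, frozenset(boulders))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (pos, rocks), path = queue.popleft()
        if switches <= rocks:          # every switch has a boulder on it
            return path
        for name, (dr, dc) in dirs.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if nxt in walls:
                continue
            new_rocks = rocks
            if nxt in rocks:           # walking into a boulder = pushing it
                dest = (nxt[0] + dr, nxt[1] + dc)
                if dest in walls or dest in rocks:
                    continue           # push is blocked
                new_rocks = (rocks - {nxt}) | {dest}
            state = (nxt, new_rocks)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [name]))
    return None
```

<p>Exhaustive search like this is trivial for a program but surprisingly hard to carry out step by step in a context window, which is why isolating it in a dedicated agent helped so much.</p>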
<p>By the time the <strong>first full run of Pokémon Blue concluded</strong>, the harness had evolved into a sophisticated support system. Here’s a high-level look at the final architecture:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>Function</td><td>Key Benefit</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Information Extraction</strong></td><td>RAM data + Annotated Screenshots</td><td>Provides accurate, real-time game state (e.g., Pokémon stats, PP, inventory) and visual context, overcoming LLM vision limitations.</td></tr>
<tr>
<td><strong>Critique System</strong></td><td>Periodically prompts AI to analyze its behavior and suggest improvements.</td><td>Facilitates self-correction and adaptive strategy refinement.</td></tr>
<tr>
<td><strong>Summary System</strong></td><td>Condenses 100 past actions into a summary, clearing context.</td><td>Prevents long-context looping, maintains long-term memory while keeping context manageable.</td></tr>
<tr>
<td><strong>Goals System</strong></td><td>Sets Primary, Secondary, Tertiary goals; re-inserted after context clear.</td><td>Ensures long-term task coherence and prevents goal flip-flopping.</td></tr>
<tr>
<td><strong>Map Memory</strong></td><td>Fog-of-war minimap (textual XML) of explored tiles.</td><td>Provides persistent spatial awareness, enables exploration directives, prevents getting lost in mazes.</td></tr>
<tr>
<td><strong>Pathfinder Agent</strong></td><td>External Gemini 2.5 Pro instance for complex pathfinding.</td><td>Solves intricate mazes (e.g., spinner puzzles) by focused, simulated reasoning, overcoming LLM spatial reasoning deficits.</td></tr>
<tr>
<td><strong>Boulder Puzzle Strategist Agent</strong></td><td>External Gemini 2.5 Pro instance for Strength puzzles.</td><td>Automates complex multi-step boulder puzzles, compresses context, improves reliability.</td></tr>
</tbody>
</table>
</div><p>Gemini 2.5 Pro, empowered by this scaffold, was able to navigate the entire game and overcome every obstacle <em>without human hand-holding</em>. On May 2, 2025, after roughly <strong>813 hours</strong> of cumulative play, Gemini defeated the Elite Four and became the Pokémon League Champion of Kanto. Mission accomplished – and yet, this was just the beginning.</p>
<h2 id="heading-victory-and-validation">Victory and Validation</h2>
<p>The Twitch chat, Discord, and Reddit communities were ecstatic when Gemini finally rolled the credits on Pokémon Blue. But the celebration wasn’t confined to the usual AI circles. To my surprise, the project caught the attention of some <em>very</em> high-profile onlookers. <strong>Google’s CEO, Sundar Pichai</strong>, had been quietly following the AI Pokémon saga. The day Gemini clinched the championship, Pichai <a target="_blank" href="https://x.com/sundarpichai/status/1918455766542930004">triumphantly posted on X (Twitter)</a>: <em>“What a finish! Gemini 2.5 Pro just completed Pokémon Blue! Special thanks to</em> <a target="_blank" href="https://x.com/TheCodeOfJoel"><em>@TheCodeOfJoel</em></a> <em>for creating and running the livestream, and to everyone who cheered Gem on along the way.”</em> Seeing the CEO of Google publicly thank me by name was surreal. (He even gave the project a shoutout during Google I/O 2025, playfully dubbing it “Artificial Pokémon Intelligence.”)</p>
<p>Other Google leaders were cheering from the sidelines too. <strong>Logan Kilpatrick</strong>, product lead for Google AI Studio, had been <a target="_blank" href="https://x.com/OfficialLoganK/status/1913365614397182096">tweeting updates</a> as Gemini progressed – at one point noting that Gemini earned its 5th Gym badge in just ~500 hours, whereas “the next best model only has 3 badges so far (though with a different agent harness)”. The “other model” was Anthropic’s Claude, still stuck earlier in the game. The media picked up on this AI showdown: outlets like <em>TechCrunch</em> and <em>India Today</em> ran pieces on “Gemini vs Claude” and the broader implications of using Pokémon as a benchmark for agentic AI. I even got credited in Google’s official Gemini 2.5 technical report – they devoted an entire section to <em>Gemini Plays Pokémon</em> as a case study in long-horizon reasoning, citing the project as an example of complex agentic tool use.</p>
<p>Amidst all the fanfare, I had already cautioned viewers (in my stream’s FAQ) not to read too much into the “Gemini vs Claude” narrative. Yes, Gemini 2.5 Pro ultimately finished the game and Claude hadn’t yet, but the playing field wasn’t level. Gemini had the benefit of my custom harness with all these extra tools and information, whereas Claude’s setup was more bare-bones. “Please don’t consider this a benchmark for how well an LLM can play Pokémon,” I wrote on the stream page – direct model-to-model comparisons are tricky when the <strong>scaffolding</strong> around them differs. This experience, however, highlighted the need for a neutral, standardized platform for these kinds of evaluations. In the end, both AI agents needed significant help (in the form of those external tools and inputs) to handle a game like Pokémon. The real achievement, in my view, wasn’t Gemini “being smarter” than Claude – it was demonstrating that with the right support, an AI can tackle a complex, open-ended task over hundreds of hours, persist through setbacks, and eventually succeed.</p>
<p>That said, beating Pokémon Blue (twice!) was incredibly satisfying, and we stumbled upon some unintended “firsts” along the way:</p>
<ul>
<li><p>During the second run, Gemini encountered an obscure glitch in the Seafoam Islands – essentially <strong>becoming the first AI to find a bug in Pokémon’s code</strong> on its own.</p>
</li>
<li><p>It had amusing <em>hallucinations</em> from its training data, at one point becoming convinced it needed to find <strong>“TEA”</strong> to give a thirsty guard (a mechanic from a different Pokémon game, not present in Blue).</p>
</li>
<li><p>We even saw what the community dubbed <strong>“model panic.”</strong> When its Pokémon were low on health, the AI’s reasoning would degrade, causing it to forget its own advanced tools and fixate on simple, often ineffective escape strategies.</p>
</li>
</ul>
<p><img src="https://i.imgur.com/HWQJCEk.png" alt="Kalm Panik meme showing Gemini’s reaction to its party’s health" /></p>
<p><em>A single, healthy Pokémon was fine. But adding low-HP party members would trigger a panic, even though the healthy Pokémon was unchanged. This irrational fear often caused its reasoning to collapse. (Meme credit: mrcheeze_ on Discord)</em></p>
<p>As an aside, to fix the cross-game hallucinations, I re-prompted the AI to act like a new player in its second run. This solved one problem but created another: a persistent delusion that <strong>Full Heal</strong> restored HP. (To be fair, the names are confusing: Full Heal only cures status ailments, while <strong>Full Restore</strong> is the item that restores HP). This single error sparked a completely unnecessary 100-hour loop of losing to the Elite Four then going back to Victory Road to train some more. It was a lesson in the sometimes delicate art of prompt engineering — a single instruction can have complex and unpredictable downstream consequences.</p>
<p>By June 9, 2025, the fully autonomous second run (with the finalized harness and <strong>zero manual interventions from me</strong>) was complete – and much faster, about <strong>406 hours</strong> of playtime, roughly half the time of the development run. Gemini had truly become a Pokémon Master, and I was ready for the next challenge.</p>
<h2 id="heading-beyond-pokemon-blue-giving-gemini-more-autonomy">Beyond Pokémon Blue: Giving Gemini More Autonomy</h2>
<p>For the next phase, I set my sights on <strong>Pokémon Yellow</strong> – specifically a ROM hack called <em>Pokémon Yellow Legacy</em> that ups the difficulty. Yellow is very similar to Blue under the hood, which meant I could reuse a lot of the harness, but the mod made it interesting: all 151 Pokémon are catchable, trainers are tougher, and an optional “Hard Mode” enforces level caps and a strict no-switching battle style. Since Gemini had already beaten the base game twice, I enabled Hard Mode to present it with a fresh challenge. This Yellow run would serve as both an entertainment filler (something fun for viewers while I worked on bigger upgrades) and a <strong>testbed for new ideas</strong>.</p>
<p>For the <em>Yellow Legacy</em> run, we also sought to increase the strategic demands of combat. In a standard playthrough, it’s possible for battles to become a simple matter of grinding to a higher level than the opponent. The “Hard Mode” in this version of the game, however, introduces constraints that transform combat into a genuine test of tactical skill. With strict level caps preventing over-leveling and a “set” battle style that removes the advantage of switching after an opponent faints, brute-force approaches become ineffective. Instead, the AI must engage in sophisticated reasoning—carefully managing team composition, type matchups, and move selection to overcome opponents on an even footing. This turns every major battle into a high-stakes puzzle, creating a rigorous evaluation of the model’s strategic reasoning.</p>
<p>The core idea for this phase was to shift the balance of power and <strong>increase Gemini’s own agency in developing solutions</strong>. The Blue runs had established what strong scaffolding could do; for Yellow, the question evolved. Could the AI learn to win with less? The goal was to move beyond pre-built solutions and see if Gemini could compensate for a weaker harness by relying on its own emergent strategies.</p>
<p>To achieve this, I fundamentally altered the toolkit. I removed the specialized agents for pathfinding and boulder puzzles, along with most environmental hints like tile navigability and the strict exploration directive. In their place, I introduced a suite of open-ended <strong>meta-tools</strong>. Gemini could now define its own mini-agents with <code>define_agent</code>, execute scripts with <code>run_code</code>, place labeled map markers, and store long-term plans in a notepad. This gave Gemini the building blocks to identify a sub-problem, create a plan, and code its own solution from scratch—a major step toward true agentic behavior.</p>
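<p>To give a flavor of the meta-tool pattern – the names and signatures below are illustrative only, not the harness’s actual API – <code>define_agent</code> essentially amounts to registering a system prompt that can later be run against a clean context:</p>

```python
# Hypothetical sketch of the meta-tool pattern: the main model defines
# named sub-agents (a system prompt), then invokes them with a task that
# runs in a fresh context, free of the main loop's accumulated history.

AGENTS = {}

def define_agent(name, system_prompt):
    """Register a sub-agent the main model can later invoke."""
    AGENTS[name] = system_prompt

def call_agent(name, task, llm):
    """Run a sub-agent: it sees only its own prompt plus the task."""
    return llm(system=AGENTS[name], user=task)

# The main model might issue, via tool use, something like:
define_agent("pathfinder", "You solve tile-grid mazes. Output a move list.")
# reply = call_agent("pathfinder", "<map data...>", llm=model_call)
```

<p>The clean-context property is the whole point: a sub-agent defined this way cannot be distracted by hundreds of turns of game history.</p>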
<p>This new approach fundamentally changed the nature of the evaluation. When faced with a complex navigational problem, for example, Gemini could now write and execute its own Python code to generate a path. This is a higher-order capability: the test shifted to whether the model could understand a problem deeply enough to design and implement a solution on its own, rather than relying on pre-built tools.</p>
<p>In practice, Gemini began using the agent-creation tools as a workaround to store and execute reusable scripts, as the built-in code execution tool couldn’t do so. To streamline this, I added a dedicated custom tool feature, which allowed Gemini to build a library of its own reusable functions. This change reserved the <code>define_agent</code> command for its original intent: creating agents for focused or complex, reasoning-heavy tasks like battle strategy. It was captivating to watch the AI effectively <strong>extend its own abilities</strong> by writing new prompts and code—a true glimpse into the future of AI autonomy.</p>
<p>Another experiment was the introduction of a <strong>“World Knowledge Graph.”</strong> The idea was to give Gemini a more robust sense of world geography by letting it log connections between areas. In practice, however, the agent often wasted time adding nodes and edges without checking whether they already existed, and it never made effective use of the data for navigation. Since the feature wasn’t providing a significant benefit in its initial form, I later removed it. Even a feature that ultimately proves useless is valuable iteration in a project this experimental.</p>
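<p>A minimal sketch of the idea (names and structure are illustrative, not the real implementation), including the existence check that the agent itself tended to skip:</p>

```python
class WorldGraph:
    """Toy world knowledge graph: areas as nodes, connections as
    undirected edges, with deduplication on insert."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_connection(self, a, b, via=""):
        # Undirected: (a, b) and (b, a) are the same edge.
        edge = (frozenset((a, b)), via)
        if edge in self.edges:
            return False  # already known; skip the duplicate work
        self.nodes.update((a, b))
        self.edges.add(edge)
        return True

g = WorldGraph()
g.add_connection("Pallet Town", "Route 1")
```

<p>Without that guard, an agent happily re-records the same connection every time it crosses a map boundary, which is roughly what happened on stream.</p>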
<p>To provide transparency into this new phase of AI autonomy, I implemented several corresponding upgrades to the stream’s interface. A new UI panel was created to display the active agents Gemini had defined. For deeper insight, I set up a <a target="_blank" href="https://github.com/waylaidwanderer/gemini-plays-pokemon-public/blob/main/README.md">public Git repository</a> where viewers could inspect the underlying code and notepad diffs, which were committed in real-time. The stream also received a visual overhaul, including a new avatar for Gemini and a floating “verbal response” bubble for its roleplay persona, giving the stream more character.</p>
<p><img src="https://i.imgur.com/2TK9hKy.png" alt="image of old vs new stream UI" /></p>
<p><em>The stream’s UI evolution: from a simple terminal and game screen (left) to a feature-rich interface with custom panels for agents, status, and a more polished design (right).</em></p>
<p>As I write this, the <strong>Pokémon Yellow Legacy</strong> run is <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">ongoing</a>, and Gemini is steadily earning badges with its enhanced self-driven harness. The groundwork here is also preparing me for the next big leap: <strong>Pokémon Crystal</strong>. The eventual goal isn’t just to beat the base game, but to tackle more unpredictable challenges like difficult romhacks or randomizers, many of which are built upon Crystal. The increased scope of Generation II games (a bigger world, more story, and new mechanics like the day/night cycle) means the harness will need significantly more work to adapt than it did for the jump from Pokémon Blue to Yellow Legacy. I’ve already launched a <a target="_blank" href="https://www.twitch.tv/genai_plays">secondary Twitch channel</a> for early testing of Gemini in Crystal, and it’s been both exciting and daunting to see the AI tackle an even larger world. The plan is to fully pivot the main project to Crystal (and Crystal-based romhacks) once Yellow Legacy wraps up. With the power of these new meta-tools and world modeling, I’m optimistic that Gemini can become the champion of Johto as well.</p>
<h2 id="heading-project-timeline">Project Timeline</h2>
<p>The project was developed live on stream from the beginning, a journey that involved a significant amount of work across two main repositories. The backend harness, the core of the project, saw over 2,300 commits, with more than 63,000 lines of code added and 31,000 deleted. The frontend UI, while smaller, still required over 200 commits, with over 30,000 lines of code added and 12,000 deleted. This table gives a high-level overview of that evolution, showing how AI features and stream UI improvements were developed in parallel.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Phase</td><td>Dates</td><td>AI Development Milestones</td><td>Stream UI Development Milestones</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Phase 1: From MVP to Perception</strong></td><td>Late March 2025</td><td></td><td></td></tr>
<tr>
<td></td><td>Week 1 (Mar 25 - Mar 31)</td><td>Core loop (mGBA emulator, screenshot capture, button presses), mGBASocketServer.lua for RAM access, AI responsible for movement.</td><td>Initial stream UI: terminal left, game screen right.</td></tr>
<tr>
<td><strong>Phase 2: Memory, Strategy, and Self-Correction</strong></td><td>April - Mid-May 2025</td><td></td><td></td></tr>
<tr>
<td></td><td>Week 2 (Apr 1 - Apr 7)</td><td><strong>Goals system</strong> (Mar 28), <strong>Critique system</strong> (Apr 1), <strong>Summarization feature</strong> (long-term memory/context consolidation).</td><td></td></tr>
<tr>
<td></td><td>Week 3 (Apr 8 - Apr 14)</td><td><strong>Map Memory</strong> for persistent spatial awareness, <strong>Pathfinder Agent</strong> (Apr 14).</td><td>First web-based UI (Apr 8).</td></tr>
<tr>
<td></td><td>Week 4 &amp; 5 (Apr 15 - Apr 28)</td><td>Groundwork for Twitch chat integration.</td><td>Richer UI (Apr 16 - Apr 27): Dynamic team panel, badge case, multi-panel goals, live minimap (player position). Syntax highlighting, pop-up displays for critiques/summaries.</td></tr>
<tr>
<td></td><td>Week 6 &amp; 7 (Apr 29 - May 12)</td><td><strong>Boulder Puzzle Strategist</strong> (May 1).</td><td>Visualization (May 8 - May 20): “Who’s That Pokémon?“ overlay (May 17), visual pathfinding indicator, Pokeball inventory display.</td></tr>
<tr>
<td><strong>Phase 3: The Yellow Legacy Run &amp; The Autonomy Update</strong></td><td>Late May - Mid-June 2025</td><td></td><td></td></tr>
<tr>
<td></td><td>Week 8 (May 26 - June 5)</td><td>Yellow Legacy run begins. New agent-based architecture: define_agent, call_agent, delete_agent tools (AI writes agents). New notepad_edit tool.</td><td>UI refinement (May 27 - June 4): Tabbed panel for team/inventory (June 3), token usage counter, polished animations.</td></tr>
<tr>
<td></td><td>World Knowledge Graph (June 6 - June 14)</td><td><strong>World Knowledge Graph</strong> implemented. Git repository for AI’s custom agents/notepad (transparency). Project pivots to <strong>Pokémon Crystal</strong>.</td><td>UI for agent architecture (June 7 - June 14): Displays for active agents, code, notepad diffs. Visual overhaul: Gem avatar, floating “chat” bubble (June 12).</td></tr>
</tbody>
</table>
</div><h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>In the span of just a few months, this project evolved from a simple script into a real-world showcase of what modern AI models can do when you combine long context, multimodal perception, and creative tool use. It served as a powerful, practical lesson in the current state of the art. We saw that models like Gemini <strong>still have obvious weaknesses</strong> – they can get stuck in thought loops, misread simple visuals, or hallucinate objectives that don’t exist. But we also learned how to counteract many of those weaknesses with clever scaffolding. Sometimes, a breakthrough came not from a better model, but from a better-framed task, like the directive to “explore every tile,” or giving the AI a fresh context to think in, like the Pathfinder agent.</p>
<blockquote>
<p><mark>The </mark> <em><mark>scaffolding </mark></em> <mark>around an AI is as important as the model itself for complex tasks.</mark></p>
</blockquote>
<p>This was one of my core takeaways from the project. By incrementally giving Gemini the right tools and nudges, I witnessed it achieve something pretty astounding: it <strong>played a 30-hour RPG from start to finish, through planning, perseverance, and problem-solving – all on its own</strong>. Once the stuff of science fiction, now it’s a <a target="_blank" href="https://www.twitch.tv/gemini_plays_pokemon">Twitch stream</a> that anyone can tune into.</p>
<p>This deep dive has only scratched the surface of the countless hours of coding, debugging, and discovery that went into this project. Looking back, the journey from a late-night experiment to a project recognized by Google’s own CEO has been surreal. It’s a reminder that even a childhood game can be a frontier for exploring the future of AI.</p>
<p>I also have to give a huge thank you to the dedicated members of the Discord and Reddit communities, who meticulously kept logs and tracked Gemini’s progress. And of course, none of this would have been possible without the generosity of Google DeepMind, who provided the free, unlimited API access that powered the stream.</p>
<p>And for me, this is just the beginning. The long-term vision is to expand on this work by continuing the public streams, tackling future Pokémon generations and other turn-based strategy games like <em>Fire Emblem</em>, along with experiments into real-time platformers like <em>Super Mario Bros.</em> But as this project has shown, comparing agent performance is difficult without a level playing field. To that end, and to help build on these long-horizon evaluations in a standardized way, I’m establishing the <a target="_blank" href="https://www.arisef.org/"><strong>ARISE Foundation</strong></a> (Agentic Research &amp; Intelligence Systems Evaluation Foundation), which will open-source the harness for the Pokémon Blue run and be my focus for the foreseeable future. The road ahead is long, but I hope you’ll join me for it.</p>
<p>Follow my work on <a target="_blank" href="https://x.com/TheCodeOfJoel">X (Twitter)</a> to be a part of the journey.</p>
]]></content:encoded></item></channel></rss>