
The Making of Gemini Plays Pokémon


image of Sundar Pichai presenting the Gemini Plays Pokémon project at Google I/O 2025

Google CEO Sundar Pichai jokingly refers to “API – Artificial Pokémon Intelligence” on stage while celebrating Gemini’s Pokémon victory.

Can a large language model beat a 29-year-old video game? It’s a question that seems simple on the surface, but the answer turned out to be a whirlwind of late-night coding, bizarre AI behavior, and a surprising amount of high-profile attention. When I started the Gemini Plays Pokémon project, I wanted to see if Google’s latest AI could succeed where others had struggled. The journey that followed culminated in a shout-out from Google’s CEO, Sundar Pichai, and revealed just how complex that simple question really is.

Why Pokémon? On the surface, having an AI beat a 29-year-old game might sound frivolous, but Pokémon Blue offers an intriguing testing ground for AI reasoning. The game demands long-term planning, strategic decision-making, and spatial navigation through an open world – core challenges on the path to more general AI.

From Spectator to Creator

My journey with this project began as a curious spectator. For months, I had heard about Anthropic’s Claude Plays Pokémon experiment on Reddit and in the news. When I checked out the stream, however, it was immediately apparent that the model’s vision for 2D, top-down pixel-art games was sorely lacking. Claude’s run was slow and often confused. Viewers noted that it struggled to “see” critical details in the 2D Game Boy graphics. For example, its harness didn’t indicate which trees were cuttable, forcing Claude to rely on unreliable visual input and frequently leaving it stuck. It also often mistook signs for building entrances, and at one point the poor AI even mistook its own player avatar for an NPC. As the Gemini 2.5 technical report later put it about Gemini (though it applies equally here), the model “struggled to utilize the raw pixels of the Game Boy screen directly” and it was “necessary for the required information from the screen to be translated into a text format.” Watching these shortcomings, I wondered how far a model could get if it was given the information and tools it actually needed.

The release of Google’s Gemini 2.5 Pro in March 2025 presented a new possibility. With its 1 million token context window (a huge leap over Claude’s 200k tokens) and strong reasoning abilities, the model had the AI community buzzing with excitement. On Reddit, users were already calling for a showdown. “Someone needs to pit this against Claude 3.7 to see who can beat Pokemon the fastest,” one user wrote, pointing out Claude’s struggles with its context window. You might be surprised to know that this didn’t inspire me right away. But I was looking for a new project, and after sitting on the idea for a few days, it appealed to me more and more. On March 25, 2025, at 15:35:51 Pacific Time, I replied to that thread and publicly threw my hat in the ring: “I’m gonna work on this and see what I can do.” By that afternoon, I had started coding what would become the Gemini Plays Pokémon harness.

The Two-Day Prototype

I consider myself a perfectionist by nature. Whether I’m working on backend or frontend, I like to build something polished, often making sure things are pixel-perfect. But my eight years at a startup taught me the value of a Minimum Viable Product (MVP). I wanted to gauge interest before committing more fully, so I set aside grand plans and focused on getting a basic prototype working fast.

Within two days, I had a bare-bones system running Pokémon Blue with Gemini 2.5 Pro at the controls. The initial setup was as simple as possible: the Twitch stream showed my terminal on the left and the game on the right, no fancy overlays. The harness itself was coded in Node.js, and under the hood, a Lua script served as a bridge to the Game Boy emulator (mGBA), allowing it to read the game’s memory and send button inputs. Each turn, Gemini got an annotated screenshot plus a text dump of on-screen information. This annotation process became more sophisticated over time, but the core idea of translating the visual game world into a structured, textual format was there from the beginning. It would output a single move (like “Up”, “Down”, “A”, etc.), which my code translated into a button press. In this early version, Gemini could only move one tile at a time. There was no critique system or context summary, which would become core parts of the harness later on.
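The early prototype’s decision cycle can be sketched roughly as follows. This is an illustrative Python mock, not the project’s actual code — the real harness was Node.js talking to mGBA through a Lua socket script, and all function names here are hypothetical:

```python
# Hypothetical sketch of the prototype's observe -> prompt -> act cycle.
# In the real harness, annotation and RAM reads happened via the Lua bridge.

VALID_BUTTONS = {"Up", "Down", "Left", "Right", "A", "B", "Start", "Select"}

def parse_move(reply: str):
    """The model replies with a single move per turn; anything else is ignored."""
    token = reply.strip().strip('".').capitalize()
    return token if token in VALID_BUTTONS else None

def play_turn(capture_screen, read_ram, ask_model, press_button):
    """One decision cycle, with the emulator and model passed in as callables."""
    prompt = (
        f"Screen:\n{capture_screen()}\n"
        f"State:\n{read_ram()}\n"
        "Reply with exactly one move (Up, Down, Left, Right, A, B, Start, Select)."
    )
    move = parse_move(ask_model(prompt))
    if move:
        press_button(move)
    return move
```

The key constraint was the single-move output: one model call, one tile of movement (or one button press), which made every turn cheap to validate.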

Annotated screen

An example of the annotated screen from a later version of the harness. The system extracts key information from the game’s memory—like object locations, tile properties, and navigability info—and presents it as a clean, textual overlay. You can see how the design evolved from its first and second iterations.

Getting this far required solving a lot of nitty-gritty technical problems: connecting to the emulator, translating raw pixel data into a form Gemini could interpret, rate-limiting the model’s API calls, and so on. But by late March, the Twitch stream was live and Gemini was wandering around Pallet Town, playing Pokémon all by itself. Once live, I invited viewers from the Claude Plays Pokémon stream to check out my project, and the viewer base quickly grew. Importantly, it wasn’t costing me a dime: Gemini 2.5 Pro was free to use, and I discovered that routing requests through the OpenRouter proxy bypassed its strict rate limits. Within the first couple of weeks, a friendly contact at Google DeepMind reached out. They told me lots of people at Google were watching the project with enthusiasm and generously provided me with free, unlimited API access, at which point I switched over from OpenRouter. It was an incredibly encouraging sign of things to come.

The first month after launch was a blur. I was working over 12 hours a day on the harness, cooped up in my office to the point that my wife joked she barely saw me anymore. Normally, I love sleeping in, but that changed during those first few weeks. I’d wake up bright and early, and my mind would immediately be buzzing with ideas for new tweaks or an urgent need to check on Gemini’s progress – I simply couldn’t fall asleep again. That intense focus was necessary, because I soon ran into my first major hurdle: the model’s own memory.

Taming the Long-Context Loop

One advertised superpower of Gemini 2.5 Pro was its gigantic context window. In theory, I could feed the model everything – the entire game history, all observations – without ever resetting its context. I was determined to make full use of that 1M-token context. In practice, though, it was quickly apparent that the model still had flaws I had seen in other models with large context windows.

As the prompt grew past a certain point (upwards of 100k tokens, as also noted in Google’s Gemini 2.5 technical report), the model would fall into repetitive loops of behavior, sacrificing novel decision-making for pattern repetition instead of coming up with new strategies.

An example of this was Gemini getting stuck in a fenced backyard behind Pewter City’s gym. The area is “fenced off” by a row of ledges that you could simply hop down at any point to leave, or walk through a small gap. After hours of gameplay, Gemini started walking in circles over and over again, claiming it was trapped. This went on for about 8 hours before I intervened by force-clearing the model’s context. After that, it was able to simply walk down and out of the backyard, no longer stuck in its repetitive loop.

This experience led me to implement a periodic context reset with summarization. Every 100 turns, the system would take all the recent events and condense them into a brief summary, which I’d prepend to a brand-new context for Gemini along with essential persistent info. This way, Gemini wasn’t burdened by an ever-growing transcript, but it wouldn’t completely forget its progress either. In essence, I gave it a form of long-term memory – at the cost of some detail fidelity. It felt a bit sad to throw away that million-token window, but until these models can truly handle ultra-long contexts without looping, this was a necessary trade-off.
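The reset logic itself is simple. Here is a minimal sketch under stated assumptions — the 100-turn interval is from the post, but the data shapes and the idea of summarization as a separate model call are illustrative, not the harness’s actual code:

```python
# Illustrative sketch of the periodic context reset with summarization.
# `summarize` stands in for a separate Gemini call that condenses history.

TURNS_PER_CONTEXT = 100

def maybe_reset_context(history, summarize, persistent_info):
    """Every 100 turns, collapse the transcript into a summary and start a
    fresh context seeded with that summary plus essential persistent info
    (goals, party state, etc.)."""
    if len(history) < TURNS_PER_CONTEXT:
        return history  # keep accumulating until the reset threshold
    summary = summarize(history)
    return [persistent_info, f"Summary of last {len(history)} turns: {summary}"]
```

The trade-off is exactly the one described above: detail fidelity is lost at each reset, but the context never grows into the loop-inducing 100k+ token range.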

Hand-in-hand with summary resets, I introduced a self-critique mechanism. The system would invoke a new, temporary Gemini instance with a specialized “critiquer” persona. This new instance would analyze the main AI’s performance history, focusing on poor strategy, nonsensical goals, or potential hallucinations. In addition, it would generate a game walkthrough for the AI's current primary goal, explaining the next steps to take, derived entirely from the model's own training data. The output from this separate critique instance was then fed back into the main AI’s context. This was a double-edged sword: sometimes the critiques were insightful, helping Gemini break out of a rut. Other times, the critiquer would hallucinate incorrect information as part of its generated walkthrough, such as telling the main AI to go south-east to find an exit when the real exit was to the west. But overall, this added reflection made the AI more robust.

Another tweak to fight hallucinations involved Pokémon move management. At first, Gemini had a bad habit of hallucinating 0 PP (Power Points) for moves that were actually fully restored. For instance, even right after visiting a Pokémon Center (which restores all PP), it might refuse to use a move because somewhere in its long context it had remembered that move being depleted earlier. The model was over-trusting its memory. I attempted a prompt-based fix (instructing Gemini “don’t assume a move is 0 PP; always check the screen”), but the hallucination persisted. Ultimately, I solved this by directly injecting ground-truth data: I pulled each Pokémon’s current moves and PP from the game’s RAM and included that in Gemini’s prompt. With the live PP info in the prompt, Gemini stopped worrying about phantom empty moves and started using its best attacks whenever available.
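The fix amounts to rendering live RAM data into the prompt each turn. A minimal sketch (the party data structure here is an assumption; in the real harness the Lua bridge read this from mGBA’s emulated memory):

```python
# Sketch of injecting ground-truth PP into the prompt so the model never
# has to trust its possibly-stale memory of earlier battles.
# The party/move data layout is hypothetical.

def format_move_pp(party):
    """Render each Pokemon's moves with their current PP as prompt text."""
    lines = []
    for mon in party:
        moves = ", ".join(f"{name} ({pp} PP)" for name, pp in mon["moves"])
        lines.append(f"{mon['name']}: {moves}")
    return "\n".join(lines)
```

Because this text is regenerated from RAM every turn, a Pokémon Center heal is immediately visible to the model, overriding whatever it “remembers” about depleted moves.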

To ensure Gemini remembered its objectives after a context clear, I made use of the existing, simple but highly effective goals system. The harness prompted Gemini to establish a primary, secondary, and tertiary goal, which were re-inserted into the context after every reset.

  • Primary goal: The main progression objective, such as obtaining the next gym badge.

  • Secondary goal: A supportive objective that directly enables the primary goal, like finding a key item required for passage.

  • Tertiary goal: An opportunistic goal that could be pursued if convenient, such as exploring the remainder of a route or catching a new Pokémon.

Without this, the AI would likely waver on its next steps, especially in the mid-game where progression is less linear.

These early hurdles reinforced my core philosophy for the project. Unlike the Claude Plays Pokémon experiment which seemed focused on testing a raw model’s limits, I wanted to build an agent capable of beating the game – not by handholding, but by general improvements to the harness.

My goal wasn’t to spoon-feed the AI a walkthrough, but to give it better tools so it could solve the problems on its own.

Mapping the World: Giving Gemini Spatial Memory

One of the biggest limitations of an AI playing a video game is the lack of mental mapping. Both the Claude and Gemini projects suffered from this, as the AI could only perceive what was immediately on screen. Claude had previously attempted to generate ASCII maps, but the results weren’t very useful. Human players build an internal map of areas they’ve explored, but language models, by default, don’t have this spatial memory. This meant that every time Gemini entered a maze-like area, it was like the first time – it would wander aimlessly, forget where it had been, and often get lost or stuck retracing its steps.

To address this, I implemented a Map Memory system. Essentially, I gave Gemini a fog-of-war style minimap that persists throughout the run, recording every tile it has seen. Internally, this was stored as a JSON list of discovered tiles, but on the Twitch stream I visualized it as an overhead map that gradually filled in as Gemini explored. (Gemini itself never saw the rendered map image – it only got a text representation of the known terrain.) Now, when Gemini re-entered Viridian Forest, it “knew” which paths it had tried and which areas still needed further exploration.
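The idea of a persistent, textual fog-of-war map can be sketched like this. The rendering characters and function are illustrative (the post only says discovered tiles were stored as a JSON list and shown to Gemini as text):

```python
# Sketch of rendering the fog-of-war map memory as text for the prompt.
# '?' = never seen, '#' = seen blocked tile, '.' = seen walkable tile.
# Tile sets persist across context resets, unlike the transcript itself.

def render_known_map(seen, walls, width, height):
    """Render the explored portion of a map as a character grid."""
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            if (x, y) not in seen:
                row += "?"
            else:
                row += "#" if (x, y) in walls else "."
        rows.append(row)
    return "\n".join(rows)
```

Because the map is plain text, it survives context resets for free — it is simply re-rendered into every new prompt.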

Building on this, I added a concept of Reachable Unseen Tiles. Each turn, the harness would analyze the map data and provide a list of coordinates for unexplored tiles that were currently reachable from Gemini’s position. I initially fed this info into the prompt as a gentle nudge, hoping it would encourage the AI to systematically cover new ground instead of looping. However, Gemini still sometimes claimed it was stuck when the map actually had unseen areas it simply hadn’t walked into yet. It was frankly frustrating to watch it give up and consider drastic measures (like intentionally blacking out its team to respawn at a Pokémon Center) when a few steps away lay the exit it needed.
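Computing those reachable unseen tiles is a flood fill over already-seen walkable tiles, collecting the unseen frontier. A sketch (the exact harness implementation isn’t public in this post; this is the standard approach):

```python
# Sketch of the "Reachable Unseen Tiles" computation: BFS through tiles
# the AI has already seen and knows are walkable, collecting any unseen
# neighbor as a candidate for exploration.

from collections import deque

def reachable_unseen(start, seen_walkable, seen):
    """Return unseen tiles adjacent to anything reachable from `start`
    via already-seen walkable tiles."""
    frontier, queue, visited = set(), deque([start]), {start}
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in visited:
                continue
            if nxt not in seen:
                frontier.add(nxt)          # unexplored but one step away
            elif nxt in seen_walkable:
                visited.add(nxt)
                queue.append(nxt)
    return frontier
```

An empty frontier is a strong, ground-truth signal: if no unseen tile is reachable, the AI genuinely has explored everything accessible, and claims of being “stuck” can be checked against it.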

The solution was to make Gemini obsessed with exploration. I cranked up the directive in its prompt to uncover every single tile of the map, placing that priority even above its primary goal of, say, beating the next Gym. If there was dark fog-of-war in an accessible area, that was priority number one. This change was transformative – suddenly Gemini almost never said it was stuck. It diligently sniffed out every nook and cranny, and once the entire map of a floor was revealed, it inherently had all the information a human would to find an exit or a key item. This externalized spatial memory was the key to overcoming its navigational limitations.

To be honest, this felt like a heavy-handed solution at times. The AI would develop a horrible compulsion, complaining that it couldn’t enter a key building yet because there was a single, unimportant tile left to explore on the other side of the map. While effective, it wasn’t elegant. I later found a better balance in the Pokémon Yellow run by loosening this restriction, which proved just as effective without the obsessive behavior.

A Council of Agents

Even with a full map at its disposal, Gemini struggled with certain puzzles that require planning far ahead – notably, the infamous spinner maze in Team Rocket’s Hideout (Basement Floor 3) and the boulder Strength puzzles in Victory Road. These puzzles are tricky because the correct path is often counter-intuitive. For example, in the Rocket Hideout spinner maze, the obvious visual goal (the staircase) is nearby but unreachable if you walk straight towards it; you must first go the opposite way, through a maze of conveyor belts that loop you around. Time and again, I watched Gemini fall for the “attractor” – it would see the goal, try to beeline towards it, and get sent back to the start by a trap. The AI just wouldn’t take a step back and plan a longer, indirect route.

image of Mt Moon B2F attractor puzzle

A classic “attractor” puzzle in Mt. Moon. The direct path (shown in red) is blocked. The correct path (shown in magenta) requires moving away from the goal to find the loop that leads to the exit.

Part of the issue was that Gemini wasn’t fully understanding how the spinner tiles worked. These tiles send you involuntarily sliding in one direction until you hit a stop. It would repeatedly step on the same spinner, seemingly thinking it would be sent in a different direction. To help, I augmented the Map Memory with special annotations for spinner paths. This involved pre-computing the end-coordinates for any spinner sequence Gemini had already fully explored, giving it a deterministic view of what each seen spinner would do, instead of it having to guess.
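Pre-computing a spinner’s endpoint is a small deterministic simulation. A sketch, with assumptions stated in the comments (the direction encoding and stop condition are mine; the post only says endpoints were pre-computed for fully explored spinner sequences):

```python
# Sketch of pre-computing where a spinner sequence deposits the player.
# Assumption: each spinner tile forces one slide direction, and the slide
# continues until the player lands on a non-spinner tile.

DELTA = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def spinner_endpoint(start, spinners):
    """Follow the chain of spinner tiles from `start`; return the tile where
    the forced slide stops, or None for a cyclic loop that never stops.
    `spinners` maps tile coordinates -> slide direction."""
    pos, seen = start, set()
    while pos in spinners:
        if pos in seen:
            return None                    # spinner cycle detected
        seen.add(pos)
        dx, dy = DELTA[spinners[pos]]
        pos = (pos[0] + dx, pos[1] + dy)
    return pos
```

Feeding these pre-computed endpoints into the prompt turns each spinner from a guess into a known, deterministic edge in the map.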

But even with a grasp of the mechanics, the pathfinding problem in these mazes was non-trivial. I was reluctant to just hard-code a solution or give it a typical pathfinding algorithm that automatically moved the player. I still wanted Gemini to figure out the path – but maybe it could use some extra brainpower to do so. This led to a significant upgrade in the harness: the introduction of specialized planning agents.

I created a tool that let Gemini essentially spawn a secondary Pathfinder Agent (another instance of Gemini 2.5 Pro) with a blank slate and a singular focus: find a route from point A to point B. When Gemini encountered a maze it couldn’t easily solve (or just didn’t understand why it couldn’t walk through a wall), it could invoke this agent, which was provided with the current map layout and tile mechanics.

The agent was then prompted to mentally simulate an optimal pathfinding algorithm to find a route. Thinking in isolation, free from the clutter of the full game context, the agent would devise a path consistent with A* or Breadth-First Search—an incredible demonstration of pure reasoning. The first time I tried this, the Pathfinder agent one-shot the Rocket Hideout’s B3F maze, a task that had stumped the primary agent for days.
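For reference, this is what the Pathfinder agent was approximating in-context — a classical breadth-first search over the same textual map data. To be clear, the agent reasoned in text rather than executing code; this sketch just shows the algorithm its routes were consistent with:

```python
# Reference BFS over known walkable tiles: the kind of shortest path the
# Pathfinder agent's in-context reasoning converged on.

from collections import deque

def bfs_path(start, goal, walkable):
    """Shortest 4-directional path from start to goal over `walkable`
    tiles, or None if the goal is unreachable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            path = []
            while pos is not None:         # walk back through predecessors
                path.append(pos)
                pos = prev[pos]
            return path[::-1]
        x, y = pos
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walkable and nxt not in prev:
                prev[nxt] = pos
                queue.append(nxt)
    return None
```

The interesting result wasn’t that BFS solves mazes — it’s that a language model, given only a textual map, produced routes matching what BFS would compute.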

This success felt like magic, a powerful demonstration of the reasoning strength of Gemini 2.5 Pro.

Of course, it wasn’t foolproof. On more convoluted maps that required long detours, even the Pathfinder agent would sometimes hallucinate shortcuts (e.g. “just walk through this wall!”) or convince itself that a row of impassable tiles was, in fact, a ledge it could jump down from. But overall it dramatically improved Gemini’s ability to handle complex navigation. I observed an interesting scaling difference here: Gemini 2.5 Pro (the large model) could reason out quite long paths, whereas the smaller Gemini 2.5 Flash model (which I tested in a brief side-run) often failed on the same tasks. The Flash model, while significantly faster, was considerably less capable and struggled with navigation, never managing to exit Mt. Moon in my local tests.

The success of the Pathfinder agent highlights a crucial point about evaluating modern AI. A model’s difficulty with interpreting raw game visuals reframes the challenge. By providing the AI with a textual, “fog-of-war” style map that records explored tiles, we bypass the unreliable vision system and present it with a different kind of test. The question shifts from “Can you see the path?” to “Given a structured map, can you interpret the data and deduce the correct path?” This is a complex task in its own right, requiring the model to reason about spatial relationships based on textual information. The clear performance gap between Gemini 2.5 Pro, which could solve complex mazes, and the smaller Gemini 2.5 Flash, which struggled with the same harness, demonstrates that this approach effectively measures the model’s core reasoning abilities. This principle of empowering the AI with better tools to isolate its reasoning capabilities was a foundational concept that we would later expand upon, even allowing the AI to program its own navigational solutions.

The other major in-game challenge was Strength boulder puzzles (think Sokoban-style puzzles where you push rocks onto switches). Gemini actually solved most of the Victory Road boulder puzzles on its own in Run 1, but it got stuck on one particularly challenging puzzle that required pushing a boulder from far away. The AI couldn’t make the logical leap that this specific, distant boulder was the correct one. I believe that given enough time, it could have eventually solved the puzzle through trial and error, but a trade-off was made for the sake of viewer entertainment.

image of victory road boulder puzzle with red arrow

The Victory Road boulder puzzle that stumped the AI. The correct boulder had to be pushed from a considerable distance (path shown in red) to reach the switch.

I decided to apply the same idea as the Pathfinder here: a dedicated Boulder Puzzle Strategist agent. This agent would, given the current layout of boulders, holes, and switches, simulate which boulders could reach which switches, determine the correct pairings, and then plan out the sequence of pushes.

I even let Gemini 2.5 Pro help write the prompt for this agent. I described the problem, saying something like, “write a prompt for an LLM to solve Sokoban-style puzzles, where boulders need to be pushed onto switches”, then let the model draft a prompt, which I refined. The result was an agent that could solve these multi-step puzzles in one go, sparing the main model from potentially hundreds of trial-and-error moves. Like the Pathfinder, it wasn’t perfect and would sometimes claim no solution existed, but it was a dramatic improvement. In effect, it compressed what might be countless hours of shoving a boulder around, resetting, and repeating, into a single burst of reasoning and a clean plan.
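The underlying search the Strategist agent had to emulate can be illustrated for the simplified single-boulder case: a BFS over (player, boulder) states, where walking into the boulder pushes it one tile if the space beyond is free. This is a reference sketch, not the agent’s mechanism — the real agent handled multiple boulders and switch pairings via in-context reasoning:

```python
# Simplified single-boulder push search: BFS over (player, boulder) states.
# A push succeeds only if the tile behind the boulder is walkable.

from collections import deque

DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def solve_single_boulder(player, boulder, switch, walkable):
    """Return the minimum number of steps to push the boulder onto the
    switch, or None if it is impossible."""
    start = (player, boulder)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        (px, py), b = state
        if b == switch:
            return dist[state]
        for dx, dy in DIRS:
            np = (px + dx, py + dy)
            if np not in walkable:
                continue
            nb = b
            if np == b:                    # stepping into the boulder pushes it
                nb = (b[0] + dx, b[1] + dy)
                if nb not in walkable:
                    continue               # push blocked by a wall or hole
            nxt = (np, nb)
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None
```

Even this toy version shows why the puzzles are hard for turn-by-turn play: a wrong push can make the state unrecoverable, so planning the full sequence up front beats trial and error.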

By the time the first full run of Pokémon Blue concluded, the harness had evolved into a sophisticated support system. Here’s a high-level look at the final architecture:

| Component | Function | Key Benefit |
| --- | --- | --- |
| Information Extraction | RAM data + annotated screenshots | Provides accurate, real-time game state (e.g., Pokémon stats, PP, inventory) and visual context, overcoming LLM vision limitations. |
| Critique System | Periodically prompts the AI to analyze its behavior and suggest improvements. | Facilitates self-correction and adaptive strategy refinement. |
| Summary System | Condenses 100 past actions into a summary, clearing context. | Prevents long-context looping; maintains long-term memory while keeping context manageable. |
| Goals System | Sets primary, secondary, and tertiary goals; re-inserted after each context clear. | Ensures long-term task coherence and prevents goal flip-flopping. |
| Map Memory | Fog-of-war minimap (textual XML) of explored tiles. | Provides persistent spatial awareness, enables exploration directives, prevents getting lost in mazes. |
| Pathfinder Agent | External Gemini 2.5 Pro instance for complex pathfinding. | Solves intricate mazes (e.g., spinner puzzles) through focused, simulated reasoning, overcoming LLM spatial-reasoning deficits. |
| Boulder Puzzle Strategist Agent | External Gemini 2.5 Pro instance for Strength puzzles. | Automates complex multi-step boulder puzzles, compresses context, improves reliability. |

Gemini 2.5 Pro, empowered by this scaffold, was able to navigate the entire game and overcome every obstacle without human hand-holding. On May 2, 2025, after roughly 813 hours of cumulative play, Gemini defeated the Elite Four and became the Pokémon League Champion of Kanto. Mission accomplished – and yet, this was just the beginning.

Victory and Validation

The Twitch chat, Discord, and Reddit communities were ecstatic when Gemini finally rolled the credits on Pokémon Blue. But the celebration wasn’t confined to the usual AI circles. To my surprise, the project caught the attention of some very high-profile onlookers. Google’s CEO, Sundar Pichai, had been quietly following the AI Pokémon saga. The day Gemini clinched the championship, Pichai triumphantly posted on X (Twitter): “What a finish! Gemini 2.5 Pro just completed Pokémon Blue! Special thanks to @TheCodeOfJoel for creating and running the livestream, and to everyone who cheered Gem on along the way.” Seeing the CEO of Google publicly thank me by name was surreal. (He even gave the project a shoutout during Google I/O 2025, playfully dubbing it “Artificial Pokémon Intelligence.”)

Other Google leaders were cheering from the sidelines too. Logan Kilpatrick, product lead for Google AI Studio, had been tweeting updates as Gemini progressed – at one point noting that Gemini earned its 5th Gym badge in just ~500 hours, whereas “the next best model only has 3 badges so far (though with a different agent harness)”. The “other model” was Anthropic’s Claude, still stuck earlier in the game. The media picked up on this AI showdown: outlets like TechCrunch and India Today ran pieces on “Gemini vs Claude” and the broader implications of using Pokémon as a benchmark for agentic AI. I even got credited in Google’s official Gemini 2.5 technical report – they devoted an entire section to Gemini Plays Pokémon as a case study in long-horizon reasoning, citing the project as an example of complex agentic tool use.

Amidst all the fanfare, I did previously caution viewers (on my stream’s FAQ) not to read too much into the “Gemini vs Claude” narrative. Yes, Gemini 2.5 Pro ultimately finished the game and Claude hadn’t yet, but the playing field wasn’t level. Gemini had the benefit of my custom harness with all these extra tools and information, whereas Claude’s setup was more bare-bones. “Please don’t consider this a benchmark for how well an LLM can play Pokémon,” I wrote on the stream page – direct model-to-model comparisons are tricky when the scaffolding around them differs. This experience, however, highlighted the need for a neutral, standardized platform for these kinds of evaluations. In the end, both AI agents needed significant help (in the form of those external tools and inputs) to handle a game like Pokémon. The real achievement, in my view, wasn’t Gemini “being smarter” than Claude – it was demonstrating that with the right support, an AI can tackle a complex, open-ended task over hundreds of hours, persist through setbacks, and eventually succeed.

That said, beating Pokémon Blue (twice!) was incredibly satisfying, and we stumbled upon some unintended “firsts” along the way:

  • During the second run, Gemini encountered an obscure glitch in the Seafoam Islands – essentially becoming the first AI to find a bug in Pokémon’s code on its own.

  • It had amusing hallucinations from its training data, at one point becoming convinced it needed to find “TEA” to give a thirsty guard (a mechanic from a different Pokémon game, not present in Blue).

  • We even saw what the community dubbed “model panic.” When its Pokémon were low on health, the AI’s reasoning would degrade, causing it to forget its own advanced tools and fixate on simple, often ineffective escape strategies.

Kalm Panik meme showing Gemini’s reaction to its party’s health

A single, healthy Pokémon was fine. But adding low-HP party members would trigger a panic, even though the healthy Pokémon was unchanged. This irrational fear often caused its reasoning to collapse. (Meme credit: mrcheeze_ on Discord)

As an aside, to fix the cross-game hallucinations, I re-prompted the AI to act like a new player in its second run. This solved one problem but created another: a persistent delusion that Full Heal restored HP. (To be fair, the names are confusing: Full Heal only cures status ailments, while Full Restore is the item that restores HP). This single error sparked a completely unnecessary 100-hour loop of losing to the Elite Four then going back to Victory Road to train some more. It was a lesson in the sometimes delicate art of prompt engineering — a single instruction can have complex and unpredictable downstream consequences.

By June 9, 2025, the fully autonomous second run (with the finalized harness and zero manual interventions from me) was complete – and much faster, about 406 hours of playtime, roughly half the time of the development run. Gemini had truly become a Pokémon Master, and I was ready for the next challenge.

Beyond Pokémon Blue: Giving Gemini More Autonomy

For the next phase, I set my sights on Pokémon Yellow – specifically a ROM hack called Pokémon Yellow Legacy that ups the difficulty. Yellow is very similar to Blue under the hood, which meant I could reuse a lot of the harness, but the mod made it interesting: all 151 Pokémon are catchable, trainers are tougher, and an optional “Hard Mode” enforces level caps and a strict no-switching battle style. Since Gemini had already beaten the base game twice, I enabled Hard Mode to present it with a fresh challenge. This Yellow run would serve as both an entertainment filler (something fun for viewers while I worked on bigger upgrades) and a testbed for new ideas.

For the Yellow Legacy run, we also sought to increase the strategic demands of combat. In a standard playthrough, it’s possible for battles to become a simple matter of grinding to a higher level than the opponent. The “Hard Mode” in this version of the game, however, introduces constraints that transform combat into a genuine test of tactical skill. With strict level caps preventing over-leveling and a “set” battle style that removes the advantage of switching after an opponent faints, brute-force approaches become ineffective. Instead, the AI must engage in sophisticated reasoning—carefully managing team composition, type matchups, and move selection to overcome opponents on an even footing. This turns every major battle into a high-stakes puzzle, creating a rigorous evaluation of the model’s strategic reasoning.

The core idea for this phase was to shift the balance of power and increase Gemini’s own agency in developing solutions. The Blue runs had established what strong scaffolding could do; for Yellow, the question evolved. Could the AI learn to win with less? The goal was to move beyond pre-built solutions and see if Gemini could compensate for a weaker harness by relying on its own emergent strategies.

To achieve this, I fundamentally altered the toolkit. I removed the specialized agents for pathfinding and boulder puzzles, along with most environmental hints like tile navigability and the strict exploration directive. In their place, I introduced a suite of open-ended meta-tools. Gemini could now define its own mini-agents with define_agent, execute scripts with run_code, place labeled map markers, and store long-term plans in a notepad. This gave Gemini the building blocks to identify a sub-problem, create a plan, and code its own solution from scratch—a major step toward true agentic behavior.

This new approach fundamentally changed the nature of the evaluation. For example, when faced with a complex navigational problem, Gemini could now write and execute its own Python code to generate a path. The test shifted from whether the model could use a pre-built tool to whether it understood a problem deeply enough to design and implement a solution on its own.

In practice, Gemini began using the agent-creation tools as a workaround to store and execute reusable scripts, as the built-in code execution tool couldn’t do so. To streamline this, I added a dedicated custom tool feature, which allowed Gemini to build a library of its own reusable functions. This change reserved the define_agent command for its original intent: creating agents for focused or complex, reasoning-heavy tasks like battle strategy. It was captivating to watch the AI effectively extend its own abilities by writing new prompts and code—a true glimpse into the future of AI autonomy.

Another experiment was the introduction of a “World Knowledge Graph.” The idea was to give Gemini a more robust sense of world geography by allowing it to log connections between areas. In practice, however, the agent often wasted time adding nodes and edges without checking if they already existed, and ultimately didn’t make effective use of the data for navigation. Since the feature wasn’t providing a significant benefit in its initial form, it was later removed. While ultimately not useful, this kind of iteration is still valuable in such an experimental project.

To provide transparency into this new phase of AI autonomy, I implemented several corresponding upgrades to the stream’s interface. A new UI panel was created to display the active agents Gemini had defined. For deeper insight, I set up a public Git repository where viewers could inspect the underlying code and notepad diffs, which were committed in real-time. The stream also received a visual overhaul, including a new avatar for Gemini and a floating “verbal response” bubble for its roleplay persona, giving the stream more character.

image of old vs new stream UI

The stream’s UI evolution: from a simple terminal and game screen (left) to a feature-rich interface with custom panels for agents, status, and a more polished design (right).

As I write this, the Pokémon Yellow Legacy run is ongoing, and Gemini is steadily earning badges with its enhanced self-driven harness. The groundwork here is also preparing me for the next big leap: Pokémon Crystal. The eventual goal isn’t just to beat the base game, but to tackle more unpredictable challenges like difficult romhacks or randomizers, many of which are built upon Crystal. The increased scope of Generation II games (a bigger world, more story, and new mechanics like the day/night cycle) means the harness will need significantly more work to adapt than it did for the jump from Pokémon Blue to Yellow Legacy. I’ve already launched a secondary Twitch channel for early testing of Gemini in Crystal, and it’s been both exciting and daunting to see the AI tackle an even larger world. The plan is to fully pivot the main project to Crystal (and Crystal-based romhacks) once Yellow Legacy wraps up. With the power of these new meta-tools and world modeling, I’m optimistic that Gemini can become the champion of Johto as well.

Project Timeline

The project was developed live on stream from the beginning, a journey that involved a significant amount of work across two main repositories. The backend harness, the core of the project, saw over 2,300 commits, with more than 63,000 lines of code added and 31,000 deleted. The frontend UI, while smaller, still required over 200 commits, with over 30,000 lines of code added and 12,000 deleted. This table gives a high-level overview of that evolution, showing how AI features and stream UI improvements were developed in parallel.

| Phase | Dates | AI Development Milestones | Stream UI Development Milestones |
| --- | --- | --- | --- |
| **Phase 1: From MVP to Perception** (Late March 2025) | Week 1 (Mar 25 - Mar 31) | Core loop (mGBA emulator, screenshot capture, button presses); `mGBASocketServer.lua` for RAM access; AI responsible for movement. | Initial stream UI: terminal left, game screen right. |
| **Phase 2: Memory, Strategy, and Self-Correction** (April - Mid-May 2025) | Week 2 (Apr 1 - Apr 7) | Goals system (Mar 28); Critique system (Apr 1); Summarization feature (long-term memory / context consolidation). | |
| | Week 3 (Apr 8 - Apr 14) | Map Memory for persistent spatial awareness; Pathfinder Agent (Apr 14). | First web-based UI (Apr 8). |
| | Weeks 4-5 (Apr 15 - Apr 28) | Groundwork for Twitch chat integration. | Richer UI (Apr 16 - Apr 27): dynamic team panel, badge case, multi-panel goals, live minimap (player position); syntax highlighting; pop-up displays for critiques/summaries. |
| | Weeks 6-7 (Apr 29 - May 12) | Boulder Puzzle Strategist (May 1). | Visualization (May 8 - May 20): “Who’s That Pokémon?” overlay (May 17), visual pathfinding indicator, Pokéball inventory display. |
| **Phase 3: The Yellow Legacy Run & The Autonomy Update** (Late May - Mid-June 2025) | Week 8 (May 26 - June 5) | Yellow Legacy run begins; new agent-based architecture: `define_agent`, `call_agent`, `delete_agent` tools (the AI writes its own agents); new `notepad_edit` tool. | UI refinement (May 27 - June 4): tabbed panel for team/inventory (June 3), token usage counter, polished animations. |
| | June 6 - June 14 | World Knowledge Graph implemented; Git repository for the AI’s custom agents/notepad (transparency); project pivots to Pokémon Crystal. | UI for agent architecture (June 7 - June 14): displays for active agents, code, notepad diffs; visual overhaul: Gem avatar, floating “chat” bubble (June 12). |

Final Thoughts

In the span of just a few months, this project evolved from a simple script into a real-world showcase of what modern AI models can do when you combine long context, multimodal perception, and creative tool use. It served as a powerful, practical lesson in the current state of the art. We saw that models like Gemini still have obvious weaknesses – they can get stuck in thought loops, misread simple visuals, or hallucinate objectives that don’t exist. But we also learned how to counteract many of those weaknesses with clever scaffolding. Sometimes, a breakthrough came not from a better model, but from a better-framed task, like the directive to “explore every tile,” or giving the AI a fresh context to think in, like the Pathfinder agent.

The scaffolding around an AI is as important as the model itself for complex tasks.

This was one of my core takeaways from the project. By incrementally giving Gemini the right tools and nudges, I watched it achieve something pretty astounding: it played a 30-hour RPG from start to finish, through planning, perseverance, and problem-solving – all on its own. What was once the stuff of science fiction is now a Twitch stream that anyone can tune into.

This deep dive has only scratched the surface of the countless hours of coding, debugging, and discovery that went into this project. Looking back, the journey from a late-night experiment to a project recognized by Google’s own CEO has been surreal. It’s a reminder that even a childhood game can be a frontier for exploring the future of AI.

I also have to give a huge thank you to the dedicated members of the Discord and Reddit communities, who meticulously kept logs and tracked Gemini’s progress. And of course, none of this would have been possible without the generosity of Google DeepMind, who provided the free, unlimited API access that powered the stream.

And for me, this is just the beginning. The long-term vision is to continue the public streams, tackle future Pokémon generations and other turn-based strategy games like Fire Emblem, and experiment with real-time platformers like Super Mario Bros. But as this project has shown, comparing agent performance is difficult without a level playing field. To that end, and to help build on these long-horizon evaluations in a standardized way, I’m establishing the ARISE Foundation (Agentic Research & Intelligence Systems Evaluation Foundation), which will open-source the harness for the Pokémon Blue run and be my focus for the foreseeable future. The road ahead is long, but I hope you’ll join me for it.

Follow my work on X (Twitter) to be a part of the journey.