A TikTok by @mattrwolfe — researched and verified by Depth
Open-source large language models are closing the gap with proprietary frontier models at a remarkable pace, exemplified by Z.ai's GLM-5 autonomously building a functional Game Boy Advance emulator over 24 hours.
Developer: Z.ai (formerly Zhipu AI / ZhipuAI), a Tsinghua University spinoff founded in 2019. The company rebranded to Z.ai in 2025 and completed a Hong Kong IPO on January 8, 2026, raising approximately HKD 4.35 billion (USD $558 million).
The company name in the transcript is slightly garbled — it is Z.ai (stylized), not "ZAI." The model family is the GLM (General Language Model) series, now in its fifth generation.
GLM-5 uses a Mixture-of-Experts (MoE) architecture with 744B total parameters and approximately 40–44B active parameters per token, representing roughly a 2x scale-up from its predecessor GLM-4.5 (355B total, 32B active). Pre-training data grew from 23T to 28.5T tokens.
Key architectural innovations:
- DeepSeek Sparse Attention (DSA): Borrowed from DeepSeek, this mechanism enables efficient long-context handling up to 200K tokens without the typical quadratic memory cost.
- Slime async RL framework: A novel asynchronous reinforcement learning infrastructure developed by Zhipu AI. Traditional RL training at this scale is notoriously inefficient; Slime enables post-training runs producing 3,000–6,000 messages per run (~60–100M output tokens), specifically honing long-range planning and tool use.
- MoE routing: 78 transformer layers, 256 experts per layer with 8 activated per token (~5.9% of total parameters active). The first three layers are dense; subsequent layers use sparse attention.
- Hardware independence: GLM-5 was trained entirely on 100,000 Huawei Ascend 910B chips using the MindSpore framework — zero dependency on NVIDIA hardware. This is both a technical milestone and a geopolitical statement, as Zhipu AI has been on the U.S. Entity List since January 2025, blocking access to H100/H200 GPUs.
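To make the routing numbers concrete, here is a minimal, illustrative sketch of top-k expert gating in pure Python. This is the generic MoE routing pattern, not Z.ai's implementation; the logits are synthetic.

```python
import math

def top_k_route(gate_logits, k=8):
    """Pick the top-k experts for one token and softmax their weights.

    gate_logits: one router score per expert (e.g. len == 256 for GLM-5).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 256 experts, 8 active per token: only these 8 experts run for this
# token; the other 248 are skipped entirely, which is what keeps a
# 744B-parameter model at ~40-44B active parameters per forward pass.
routes = top_k_route([math.sin(0.7 * i) for i in range(256)], k=8)
```

The same top-k-then-renormalize pattern appears in most open MoE implementations; only the load-balancing losses and shared-expert details differ between models.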
Benchmark performance (self-reported):
- SWE-bench Verified: 77.8% (#1 open-source; Claude Opus 4.5 leads at 80.9%)
- AIME 2026: 92.7% (just behind Claude Opus 4.5 at 93.3%)
- GPQA-Diamond: 86.0% (vs Claude Opus 4.5 at 87.0%)
- Humanity's Last Exam (with tools): 50.4% (beats Claude Opus 4.5 at 43.4%)
- Terminal-Bench 2.0: 56.2–60.7
- Vending Bench 2: $4,432 simulated balance (#1 open-source)
Known limitations:
- Inference speed: ~17–19 tok/s, noticeably slower than NVIDIA-backed competitors (~25+ tok/s)
- Self-hosting requires ~8× A100 80GB GPUs (substantial infrastructure)
- Fewer third-party integrations vs. OpenAI/Anthropic ecosystems
- Knowledge cutoff is not officially published
E01 Research (e01.ai) gained early access to GLM-5 to stress-test its long-task capabilities. They designed what they called the "Emulator Challenge": build a Game Boy Advance emulator from scratch in JavaScript, embedded in a 3D rendered scene, using a single agent with no parallelism.
There were two distinct test conditions — and the video conflates them:
"Easy mode" (with reference code): E01 gave GLM-5 the gbajs open-source GBA emulator source code as reference. GLM-5 read the architecture, understood the design, and reimplemented it independently. Result: working core emulator, ROM loading, 3D scene. Live demo: https://e01.ai/gba
"Zero reference" mode (no code, no web search): Only a system prompt and a GBA hardware specification document. This ran 24+ hours straight. Result: CPU instruction set core was completed, but the full emulator was still in progress — it was NOT a finished, playable emulator.
The working demo shown in the video (the one where ROMs load and games play) is from the "easy mode" run with reference code, not the pure 24-hour autonomous run. The 24-hour zero-reference run produced a partial result.
How the agent maintained state across context resets: The prompt defined a meta-loop (work → test → log → advance), persisting state in files (/notes/progress.md, /notes/decisions.md, /notes/blockers.md). GLM-5 made 700+ tool calls and 800+ context switches without degradation — prior-generation models failed by looping, forgetting goals, or halting on erroneous tool calls.
A notable human intervention: During the experiment, GLM-5 got stuck trying to generate a 3D model of the GBA console from scratch (a task better suited to a human sourcing an asset). A human intervened to unblock it. E01 Research noted that setting explicit "pause-and-ask" thresholds is important for long-running agents.
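The pattern generalizes: a long-horizon agent can externalize all of its state into files so that a fresh context window can resume cold. A minimal sketch of such a work → test → log → advance loop follows. This is illustrative, not E01's actual harness; only the notes-file convention comes from their write-up.

```python
import pathlib

NOTES = pathlib.Path("notes")

def load_state():
    """Rebuild agent state from disk — survives any context reset."""
    progress = NOTES / "progress.md"
    return progress.read_text() if progress.exists() else ""

def log_step(note):
    """Append to the persistent log so the next iteration can resume."""
    NOTES.mkdir(exist_ok=True)
    with (NOTES / "progress.md").open("a") as f:
        f.write(note + "\n")

def meta_loop(tasks, run_step, max_iters=1000):
    """work -> test -> log -> advance, one durable step per iteration.

    run_step(task, state) -> bool is the injected work+test phase.
    Returns the first blocked task (the "pause-and-ask" hook), or None.
    """
    for i, task in enumerate(tasks):
        if i >= max_iters:
            break
        ok = run_step(task, load_state())                    # work + test
        log_step(f"{'DONE' if ok else 'BLOCKED'}: {task}")   # log
        if not ok:
            return task   # surface the blocker to a human
    return None           # advanced through every task
```

Returning the blocked task instead of retrying forever is one concrete way to implement the "pause-and-ask" threshold E01 recommends.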
```shell
pip install z-ai-sdk
```

```python
from z_ai_sdk import ZAI

client = ZAI(api_key="YOUR_API_KEY")  # Get a key at api.z.ai

response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "user", "content": "Write a Python function to parse a GBA ROM header."}
    ],
)
print(response.choices[0].message.content)
```
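For comparison, the GBA cartridge header layout is publicly documented (GBATEK: title at 0xA0, game code at 0xAC, fixed byte 0x96 at 0xB2), so a human-written version of the function the prompt asks for is short. The sample bytes below are synthetic, not a real ROM.

```python
def parse_gba_header(rom: bytes) -> dict:
    """Parse the standard 192-byte GBA cartridge header (GBATEK layout)."""
    if len(rom) < 0xC0:
        raise ValueError("ROM too small to contain a GBA header")
    return {
        "title": rom[0xA0:0xAC].rstrip(b"\x00").decode("ascii", "replace"),
        "game_code": rom[0xAC:0xB0].decode("ascii", "replace"),
        "maker_code": rom[0xB0:0xB2].decode("ascii", "replace"),
        "fixed_value_ok": rom[0xB2] == 0x96,  # must be 0x96 on real carts
        "header_checksum": rom[0xBD],
    }

# Synthetic header for demonstration
rom = bytearray(0xC0)
rom[0xA0:0xAC] = b"METROIDFUS\x00\x00"
rom[0xAC:0xB0] = b"AMTE"
rom[0xB2] = 0x96
info = parse_gba_header(bytes(rom))
```

A useful smoke test for model output is whether it validates the fixed 0x96 byte, which most naive implementations skip.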
```shell
# FP8 quantized version
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8
```
```shell
docker pull lmsysorg/sglang:glm5-hopper     # For H100/H200
docker pull lmsysorg/sglang:glm5-blackwell  # For B100/B200
```
Visit: https://e01.ai/gba (the "easy mode" result with reference code)
The company name is wrong. The creator says "ZAI" — the correct name is Z.ai (formerly Zhipu AI). This matters for finding documentation, the API, and the GitHub org (zai-org).
The 24-hour run did NOT produce a working emulator. The video shows the "easy mode" result where GLM-5 was given the gbajs reference source code. The pure zero-reference 24-hour run only completed the CPU instruction set — the full emulator was still in progress. This is a significant omission that changes the impressiveness of the claim.
Human intervention was required. E01 Research documented that GLM-5 got stuck trying to generate a 3D GBA console model from scratch and a human had to intervene to unblock it. It was not fully autonomous.
GLM-5 was soft-launched as "Pony Alpha" on OpenRouter in early February 2026 before the official February 11 release — a stealth stress test that the AI community identified through benchmark analysis and GitHub PRs.
Geopolitical significance: GLM-5 was trained entirely on Huawei Ascend chips with no NVIDIA hardware — a direct response to U.S. export controls that placed Zhipu AI on the Entity List in January 2025.
Alternatives for long-horizon agentic coding: Qwen 3.5 (Alibaba, 397B MoE), DeepSeek-V3.2-Thinking, Kimi K2.5, and MiniMax M2.2 all launched in the same February 2026 "Spring Festival offensive" wave. Claude Opus 4.5 still leads on SWE-bench Verified (80.9% vs GLM-5's 77.8%).
Self-hosting is not consumer-grade. Running GLM-5 locally requires approximately 8× A100 80GB GPUs. Most users will need to use the API or OpenRouter.
The E01 Research blog post is the primary source — it is publicly available at https://blog.e01.ai/glm5-gameboy-and-long-task-era-64db7074a026 and contains the full technical methodology.
Confirmed that GLM-5 was released on February 11, 2026, by Z.ai (formerly Zhipu AI) — the company name 'ZAI' is a garbled rendering of 'Z.ai'; it is open-source under MIT license on HuggingFace.
Confirmed: E01 Research ran GLM-5 for 24+ hours, but the working emulator shown was from the 'easy mode' run with gbajs reference code provided; the pure 24-hour zero-reference run only completed the CPU instruction set and was still in progress — not a finished playable emulator.
Confirmed: the live demo at e01.ai/gba shows a working GBA emulator with ROM loading and a 3D-rendered scene, produced by the 'easy mode' run where GLM-5 was given the gbajs source code as reference.
The working demo required reference source code (gbajs), not just documents; the truly autonomous 24-hour run (no code, no web search) did not produce a working emulator, and at least one human intervention was required to unblock a stuck loop.
Confirmed by benchmarks: GLM-5 scores 77.8% on SWE-bench Verified vs Claude Opus 4.5's 80.9%, and beats proprietary models on Humanity's Last Exam (tool-augmented) and HMMT Nov. 2025.
Run a head-to-head benchmark test: assign the same agentic coding task (e.g., build a working CLI tool with file I/O, error handling, and tests) to GLM-5 via api.z.ai, Claude Opus 4.5, and GPT-4o, then compare output quality, token cost, and time-to-completion using identical prompts and tool access.
Self-reported benchmarks are unreliable; internal empirical data on your specific use cases is the only trustworthy basis for procurement or architecture decisions. The 5-8x cost differential makes this comparison high-stakes.
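A minimal harness for that head-to-head is sketched below. The `call_model` callable is injected so you can wire it to api.z.ai, Anthropic, or OpenAI with identical prompts and tool access; the stub and pricing for non-GLM models are placeholders.

```python
import time

def run_trial(model_name, call_model, prompt, price_in, price_out):
    """Run one model on one task and record cost and latency.

    call_model(model_name, prompt) -> (output_text, tokens_in, tokens_out)
    price_in / price_out are USD per million tokens.
    """
    start = time.perf_counter()
    output, tok_in, tok_out = call_model(model_name, prompt)
    return {
        "model": model_name,
        "seconds": time.perf_counter() - start,
        "cost_usd": tok_in * price_in / 1e6 + tok_out * price_out / 1e6,
        "output": output,
    }

# Stub call using GLM-5's published pricing ($1.00 in / $3.20 out per M)
def fake_call(model, prompt):
    return ("def cli(): ...", 1200, 4000)

trial = run_trial("glm-5", fake_call, "Build a CLI tool with tests.", 1.00, 3.20)
```

Run the same `run_trial` once per provider and diff the dictionaries; keeping the prompt and tool set byte-identical is what makes the comparison meaningful.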
Read the primary source E01 Research blog post at https://blog.e01.ai/glm5-gameboy-and-long-task-era-64db7074a026 and document the exact agent loop design (meta-loop structure, file-based state persistence, pause-and-ask thresholds) for potential reuse in your own long-horizon agent workflows.
The architectural pattern for maintaining state across 800+ context switches is immediately applicable to any long-running agentic task, regardless of which model you use. This is transferable knowledge.
Prototype a minimal long-horizon agentic task using GLM-5 via OpenRouter (openrouter.ai/z-ai/glm-5) with the file-based state persistence pattern: define a work→test→log→advance meta-loop, persist state in markdown files, and run a 2-4 hour autonomous coding or research task to validate real-world reliability before committing to infrastructure investment.
OpenRouter removes the need for API key setup with Z.ai directly and allows immediate experimentation. Validating the 700+ tool call reliability claim on your own tasks is essential before architectural commitment.
Audit your current AI spend and identify the top 3 workloads where you are paying Claude Opus 4.5 or GPT-4o rates, then calculate the cost delta if those workloads migrated to GLM-5 at $1.00/$3.20 per million tokens input/output.
At 5-8x cheaper with near-parity benchmark scores on most tasks, even a partial migration could represent significant savings. This creates a concrete business case for further evaluation.
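The arithmetic is simple enough to sanity-check in a few lines. The frontier-model rates below are placeholders for illustration; substitute your actual contracted pricing.

```python
def monthly_cost(m_tokens_in, m_tokens_out, price_in, price_out):
    """USD cost for one month's traffic; prices are USD per million tokens."""
    return m_tokens_in * price_in + m_tokens_out * price_out

# Example workload: 500M input + 150M output tokens per month
glm5 = monthly_cost(500, 150, 1.00, 3.20)       # GLM-5 published rates
frontier = monthly_cost(500, 150, 5.00, 25.00)  # placeholder frontier rates
savings = frontier - glm5
```

With these placeholder numbers the workload costs $980 vs $6,250 per month, a ~6.4x spread — inside the 5-8x differential cited above.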
Evaluate whether your organization's threat model permits using a model from a company on the U.S. Entity List (Zhipu AI / Z.ai, listed January 2025): consult legal/compliance on data residency, export control implications, and whether API calls to api.z.ai constitute a restricted transaction before deploying in production.
The geopolitical dimension is a real operational risk for U.S.-based organizations. This is a blocking issue that must be resolved before any production deployment, regardless of technical merit.
Set up a cost-benefit analysis for self-hosting GLM-5-FP8 using the vLLM configuration provided: get quotes for 8x A100 80GB GPU cloud instances (e.g., AWS p4d.24xlarge or Lambda Labs), calculate break-even point against API pricing at your projected token volume, and factor in the 17-19 tok/s inference speed penalty.
Self-hosting eliminates the Entity List compliance risk and provides data sovereignty, but the infrastructure cost is substantial. The break-even calculation will determine whether self-hosting is viable at your scale.
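A back-of-envelope version of that break-even calculation, under stated assumptions: the GPU hourly rate and blended token price are placeholders, and the throughput ceiling uses the 17-19 tok/s figure reported above for a single stream (batched serving will do better).

```python
def breakeven_m_tokens(gpu_hourly_usd, blended_usd_per_m_tokens):
    """Monthly token volume (millions) where self-hosting matches API cost."""
    monthly_infra = gpu_hourly_usd * 24 * 30   # always-on instance
    return monthly_infra / blended_usd_per_m_tokens

# Placeholder quote: $32/hr for 8x A100 80GB; blended $1.50/M tokens
m_tokens = breakeven_m_tokens(32.0, 1.50)

# Throughput ceiling at the reported ~18 tok/s (single stream):
max_output_tokens = 18 * 3600 * 24 * 30  # output tokens/month per stream
```

With these placeholders, break-even sits at ~15.4B tokens/month, while a single always-on stream emits only ~46.7M output tokens/month — so self-hosting only pencils out with heavy batching or when compliance, not cost, is the driver.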
Monitor the February 2026 'Spring Festival offensive' cohort — Qwen 3.5, DeepSeek-V3.2-Thinking, Kimi K2.5, and MiniMax M2.2 — by subscribing to their respective GitHub repos and HuggingFace model pages, and schedule a comparative evaluation in 30 days once community benchmarks and third-party evals stabilize.
GLM-5 is one of five major open-source releases in the same window. Committing to any single model before the community has stress-tested all five is premature; 30 days of external benchmarking will dramatically improve signal quality.
Replicate the E01 emulator experiment's 'easy mode' condition at small scale: give GLM-5 an open-source project's source code as reference and ask it to reimplement a specific module independently, measuring whether it can maintain coherent architecture decisions across a multi-hour session without human intervention.
The 'easy mode' result (with reference code) is the actually validated claim, not the 24-hour zero-reference run. Testing this specific condition — reference-assisted reimplementation — maps directly to real enterprise use cases like codebase migration, refactoring, and port work.