A TikTok by @mattrwolfe — researched and verified by Depth
Open-source large language models are closing the gap with proprietary frontier models at a remarkable pace, exemplified by Z.ai's GLM-5 autonomously building a functional Game Boy Advance emulator over 24 hours.
Developer: Z.ai (formerly Zhipu AI / ZhipuAI), a Tsinghua University spinoff founded in 2019. The company rebranded to Z.ai in 2025 and completed a Hong Kong IPO on January 8, 2026, raising approximately HKD 4.35 billion (USD $558 million).
The company name in the transcript is slightly garbled — it is Z.ai (stylized), not "ZAI." The model family is the GLM (General Language Model) series, now in its fifth generation.
GLM-5 uses a Mixture-of-Experts (MoE) architecture with 744B total parameters and approximately 40–44B active parameters per token, representing roughly a 2x scale-up from its predecessor GLM-4.5 (355B total, 32B active). Pre-training data grew from 23T to 28.5T tokens.
Key architectural innovations:
- DeepSeek Sparse Attention (DSA): Borrowed from DeepSeek, this mechanism enables efficient long-context handling up to 200K tokens without the typical quadratic memory cost.
- Slime async RL framework: A novel asynchronous reinforcement learning infrastructure developed by Zhipu AI. Traditional RL training at this scale is notoriously inefficient; Slime enables post-training runs producing 3,000–6,000 messages per run (~60–100M output tokens), specifically honing long-range planning and tool use.
- MoE routing: 78 transformer layers, 256 experts per layer with 8 activated per token (~5.9% of total parameters active). The first three layers are dense; subsequent layers use sparse attention.
- Hardware independence: GLM-5 was trained entirely on 100,000 Huawei Ascend 910B chips using the MindSpore framework — zero dependency on NVIDIA hardware. This is both a technical milestone and a geopolitical statement, as Zhipu AI has been on the U.S. Entity List since January 2025, blocking access to H100/H200 GPUs.
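To make the routing numbers concrete, here is a minimal, illustrative sketch of top-k expert gating in pure Python. This is the generic MoE routing pattern, not Z.ai's implementation; the logits are synthetic.

```python
import math

def top_k_route(gate_logits, k=8):
    """Pick the top-k experts for one token and softmax their weights.

    gate_logits: one router score per expert (e.g. len == 256 for GLM-5).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 256 experts, 8 active per token: only these 8 experts run for this
# token; the other 248 are skipped entirely, which is what keeps a
# 744B-parameter model at ~40-44B active parameters per forward pass.
routes = top_k_route([math.sin(0.7 * i) for i in range(256)], k=8)
```

The same top-k-then-renormalize pattern appears in most open MoE implementations; only the load-balancing losses and shared-expert details differ between models.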
Benchmark performance (self-reported):
- SWE-bench Verified: 77.8% (#1 open-source; Claude Opus 4.5 leads at 80.9%)
- AIME 2026: 92.7% (just behind Claude Opus 4.5 at 93.3%)
- GPQA-Diamond: 86.0% (vs Claude Opus 4.5 at 87.0%)
- Humanity's Last Exam (with tools): 50.4% (beats Claude Opus 4.5 at 43.4%)
- Terminal-Bench 2.0: 56.2–60.7
- Vending Bench 2: $4,432 simulated balance (#1 open-source)
Known limitations:
- Inference speed: ~17–19 tok/s, noticeably slower than NVIDIA-backed competitors (~25+ tok/s)
- Self-hosting requires ~8× A100 80GB GPUs (substantial infrastructure)
- Fewer third-party integrations vs. OpenAI/Anthropic ecosystems
- Knowledge cutoff is not officially published
E01 Research (e01.ai) gained early access to GLM-5 to stress-test its long-task capabilities. They designed what they called the "Emulator Challenge": build a Game Boy Advance emulator from scratch in JavaScript, embedded in a 3D rendered scene, using a single agent with no parallelism.
There were two distinct test conditions — and the video conflates them:
"Easy mode" (with reference code): E01 gave GLM-5 the gbajs open-source GBA emulator source code as reference. GLM-5 read the architecture, understood the design, and reimplemented it independently. Result: working core emulator, ROM loading, 3D scene. Live demo: https://e01.ai/gba
"Zero reference" mode (no code, no web search): Only a system prompt and a GBA hardware specification document. This ran 24+ hours straight. Result: CPU instruction set core was completed, but the full emulator was still in progress — it was NOT a finished, playable emulator.
The working demo shown in the video (the one where ROMs load and games play) is from the "easy mode" run with reference code, not the pure 24-hour autonomous run. The 24-hour zero-reference run produced a partial result.
How the agent maintained state across context resets: The prompt defined a meta-loop (work → test → log → advance), persisting state in files (/notes/progress.md, /notes/decisions.md, /notes/blockers.md). GLM-5 made 700+ tool calls and 800+ context switches without degradation — prior-generation models failed by looping, forgetting goals, or halting on erroneous tool calls.
A notable human intervention: During the experiment, GLM-5 got stuck trying to generate a 3D model of the GBA console from scratch (a task better suited to a human sourcing an asset). A human intervened to unblock it. E01 Research noted that setting explicit "pause-and-ask" thresholds is important for long-running agents.
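The pattern generalizes: a long-horizon agent can externalize all of its state into files so that a fresh context window can resume cold. A minimal sketch of such a work → test → log → advance loop follows. This is illustrative, not E01's actual harness; only the notes-file convention comes from their write-up.

```python
import pathlib

NOTES = pathlib.Path("notes")

def load_state():
    """Rebuild agent state from disk — survives any context reset."""
    progress = NOTES / "progress.md"
    return progress.read_text() if progress.exists() else ""

def log_step(note):
    """Append to the persistent log so the next iteration can resume."""
    NOTES.mkdir(exist_ok=True)
    with (NOTES / "progress.md").open("a") as f:
        f.write(note + "\n")

def meta_loop(tasks, run_step, max_iters=1000):
    """work -> test -> log -> advance, one durable step per iteration.

    run_step(task, state) -> bool is the injected work+test phase.
    Returns the first blocked task (the "pause-and-ask" hook), or None.
    """
    for i, task in enumerate(tasks):
        if i >= max_iters:
            break
        ok = run_step(task, load_state())                    # work + test
        log_step(f"{'DONE' if ok else 'BLOCKED'}: {task}")   # log
        if not ok:
            return task   # surface the blocker to a human
    return None           # advanced through every task
```

Returning the blocked task instead of retrying forever is one concrete way to implement the "pause-and-ask" threshold E01 recommends.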
```shell
pip install z-ai-sdk
```

```python
from z_ai_sdk import ZAI

client = ZAI(api_key="YOUR_API_KEY")  # Get a key at api.z.ai

response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "user", "content": "Write a Python function to parse a GBA ROM header."}
    ],
)
print(response.choices[0].message.content)
```
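For comparison, the GBA cartridge header layout is publicly documented (GBATEK: title at 0xA0, game code at 0xAC, fixed byte 0x96 at 0xB2), so a human-written version of the function the prompt asks for is short. The sample bytes below are synthetic, not a real ROM.

```python
def parse_gba_header(rom: bytes) -> dict:
    """Parse the standard 192-byte GBA cartridge header (GBATEK layout)."""
    if len(rom) < 0xC0:
        raise ValueError("ROM too small to contain a GBA header")
    return {
        "title": rom[0xA0:0xAC].rstrip(b"\x00").decode("ascii", "replace"),
        "game_code": rom[0xAC:0xB0].decode("ascii", "replace"),
        "maker_code": rom[0xB0:0xB2].decode("ascii", "replace"),
        "fixed_value_ok": rom[0xB2] == 0x96,  # must be 0x96 on real carts
        "header_checksum": rom[0xBD],
    }

# Synthetic header for demonstration
rom = bytearray(0xC0)
rom[0xA0:0xAC] = b"METROIDFUS\x00\x00"
rom[0xAC:0xB0] = b"AMTE"
rom[0xB2] = 0x96
info = parse_gba_header(bytes(rom))
```

A useful smoke test for model output is whether it validates the fixed 0x96 byte, which most naive implementations skip.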
```shell
# FP8 quantized version
vllm serve zai-org/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.85 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5-fp8
```
```shell
docker pull lmsysorg/sglang:glm5-hopper     # For H100/H200
docker pull lmsysorg/sglang:glm5-blackwell  # For B100/B200
```
Visit: https://e01.ai/gba (the "easy mode" result with reference code)
The company name is wrong. The creator says "ZAI" — the correct name is Z.ai (formerly Zhipu AI). This matters for finding documentation, the API, and the GitHub org (zai-org).
The 24-hour run did NOT produce a working emulator. The video shows the "easy mode" result where GLM-5 was given the gbajs reference source code. The pure zero-reference 24-hour run only completed the CPU instruction set — the full emulator was still in progress. This is a significant omission that changes the impressiveness of the claim.
Human intervention was required. E01 Research documented that GLM-5 got stuck trying to generate a 3D GBA console model from scratch and a human had to intervene to unblock it. It was not fully autonomous.
GLM-5 was soft-launched as "Pony Alpha" on OpenRouter in early February 2026 before the official February 11 release — a stealth stress test that the AI community identified through benchmark analysis and GitHub PRs.
Geopolitical significance: GLM-5 was trained entirely on Huawei Ascend chips with no NVIDIA hardware — a direct response to U.S. export controls that placed Zhipu AI on the Entity List in January 2025.
Alternatives for long-horizon agentic coding: Qwen 3.5 (Alibaba, 397B MoE), DeepSeek-V3.2-Thinking, Kimi K2.5, and MiniMax M2.2 all launched in the same February 2026 "Spring Festival offensive" wave. Claude Opus 4.5 still leads on SWE-bench Verified (80.9% vs GLM-5's 77.8%).
Self-hosting is not consumer-grade. Running GLM-5 locally requires approximately 8× A100 80GB GPUs. Most users will need to use the API or OpenRouter.
The E01 Research blog post is the primary source — it is publicly available at https://blog.e01.ai/glm5-gameboy-and-long-task-era-64db7074a026 and contains the full technical methodology.
Confirmed that GLM-5 was released on February 11, 2026, by Z.ai (formerly Zhipu AI) — the company name 'ZAI' is a garbled rendering of 'Z.ai'; it is open-source under MIT license on HuggingFace.
Confirmed: E01 Research ran GLM-5 for 24+ hours, but the working emulator shown was from the 'easy mode' run with gbajs reference code provided; the pure 24-hour zero-reference run only completed the CPU instruction set and was still in progress — not a finished playable emulator.
Confirmed: the live demo at e01.ai/gba shows a working GBA emulator with ROM loading and a 3D-rendered scene, produced by the 'easy mode' run where GLM-5 was given the gbajs source code as reference.
The working demo required reference source code (gbajs), not just documents; the truly autonomous 24-hour run (no code, no web search) did not produce a working emulator, and at least one human intervention was required to unblock a stuck loop.
Confirmed by benchmarks: GLM-5 scores 77.8% on SWE-bench Verified vs Claude Opus 4.5's 80.9%, and beats proprietary models on Humanity's Last Exam (tool-augmented) and HMMT Nov. 2025.
Run a head-to-head benchmark test: assign the same agentic coding task (e.g., build a working CLI tool with file I/O, error handling, and tests) to GLM-5 via api.z.ai, Claude Opus 4.5, and GPT-4o, then compare output quality, token cost, and time-to-completion using identical prompts and tool access.
Self-reported benchmarks are unreliable; internal empirical data on your specific use cases is the only trustworthy basis for procurement or architecture decisions. The 5-8x cost differential makes this comparison high-stakes.
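A minimal harness for that head-to-head is sketched below. The `call_model` callable is injected so you can wire it to api.z.ai, Anthropic, or OpenAI with identical prompts and tool access; the stub and pricing for non-GLM models are placeholders.

```python
import time

def run_trial(model_name, call_model, prompt, price_in, price_out):
    """Run one model on one task and record cost and latency.

    call_model(model_name, prompt) -> (output_text, tokens_in, tokens_out)
    price_in / price_out are USD per million tokens.
    """
    start = time.perf_counter()
    output, tok_in, tok_out = call_model(model_name, prompt)
    return {
        "model": model_name,
        "seconds": time.perf_counter() - start,
        "cost_usd": tok_in * price_in / 1e6 + tok_out * price_out / 1e6,
        "output": output,
    }

# Stub call using GLM-5's published pricing ($1.00 in / $3.20 out per M)
def fake_call(model, prompt):
    return ("def cli(): ...", 1200, 4000)

trial = run_trial("glm-5", fake_call, "Build a CLI tool with tests.", 1.00, 3.20)
```

Run the same `run_trial` once per provider and diff the dictionaries; keeping the prompt and tool set byte-identical is what makes the comparison meaningful.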
Read the primary source E01 Research blog post at https://blog.e01.ai/glm5-gameboy-and-long-task-era-64db7074a026 and document the exact agent loop design (meta-loop structure, file-based state persistence, pause-and-ask thresholds) for potential reuse in your own long-horizon agent workflows.
The architectural pattern for maintaining state across 800+ context switches is immediately applicable to any long-running agentic task, regardless of which model you use. This is transferable knowledge.
Prototype a minimal long-horizon agentic task using GLM-5 via OpenRouter (openrouter.ai/z-ai/glm-5) with the file-based state persistence pattern: define a work→test→log→advance meta-loop, persist state in markdown files, and run a 2-4 hour autonomous coding or research task to validate real-world reliability before committing to infrastructure investment.
OpenRouter removes the need for API key setup with Z.ai directly and allows immediate experimentation. Validating the 700+ tool call reliability claim on your own tasks is essential before architectural commitment.
Audit your current AI spend and identify the top 3 workloads where you are paying Claude Opus 4.5 or GPT-4o rates, then calculate the cost delta if those workloads migrated to GLM-5 at $1.00/$3.20 per million tokens input/output.
At 5-8x cheaper with near-parity benchmark scores on most tasks, even a partial migration could represent significant savings. This creates a concrete business case for further evaluation.
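The arithmetic is simple enough to sanity-check in a few lines. The frontier-model rates below are placeholders for illustration; substitute your actual contracted pricing.

```python
def monthly_cost(m_tokens_in, m_tokens_out, price_in, price_out):
    """USD cost for one month's traffic; prices are USD per million tokens."""
    return m_tokens_in * price_in + m_tokens_out * price_out

# Example workload: 500M input + 150M output tokens per month
glm5 = monthly_cost(500, 150, 1.00, 3.20)       # GLM-5 published rates
frontier = monthly_cost(500, 150, 5.00, 25.00)  # placeholder frontier rates
savings = frontier - glm5
```

With these placeholder numbers the workload costs $980 vs $6,250 per month, a ~6.4x spread — inside the 5-8x differential cited above.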
Evaluate whether your organization's threat model permits using a model from a company on the U.S. Entity List (Zhipu AI / Z.ai, listed January 2025): consult legal/compliance on data residency, export control implications, and whether API calls to api.z.ai constitute a restricted transaction before deploying in production.
The geopolitical dimension is a real operational risk for U.S.-based organizations. This is a blocking issue that must be resolved before any production deployment, regardless of technical merit.
Set up a cost-benefit analysis for self-hosting GLM-5-FP8 using the vLLM configuration provided: get quotes for 8x A100 80GB GPU cloud instances (e.g., AWS p4d.24xlarge or Lambda Labs), calculate break-even point against API pricing at your projected token volume, and factor in the 17-19 tok/s inference speed penalty.
Self-hosting eliminates the Entity List compliance risk and provides data sovereignty, but the infrastructure cost is substantial. The break-even calculation will determine whether self-hosting is viable at your scale.
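A back-of-envelope version of that break-even calculation, under stated assumptions: the GPU hourly rate and blended token price are placeholders, and the throughput ceiling uses the 17-19 tok/s figure reported above for a single stream (batched serving will do better).

```python
def breakeven_m_tokens(gpu_hourly_usd, blended_usd_per_m_tokens):
    """Monthly token volume (millions) where self-hosting matches API cost."""
    monthly_infra = gpu_hourly_usd * 24 * 30   # always-on instance
    return monthly_infra / blended_usd_per_m_tokens

# Placeholder quote: $32/hr for 8x A100 80GB; blended $1.50/M tokens
m_tokens = breakeven_m_tokens(32.0, 1.50)

# Throughput ceiling at the reported ~18 tok/s (single stream):
max_output_tokens = 18 * 3600 * 24 * 30  # output tokens/month per stream
```

With these placeholders, break-even sits at ~15.4B tokens/month, while a single always-on stream emits only ~46.7M output tokens/month — so self-hosting only pencils out with heavy batching or when compliance, not cost, is the driver.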
Monitor the February 2026 'Spring Festival offensive' cohort — Qwen 3.5, DeepSeek-V3.2-Thinking, Kimi K2.5, and MiniMax M2.2 — by subscribing to their respective GitHub repos and HuggingFace model pages, and schedule a comparative evaluation in 30 days once community benchmarks and third-party evals stabilize.
GLM-5 is one of five major open-source releases in the same window. Committing to any single model before the community has stress-tested all five is premature; 30 days of external benchmarking will dramatically improve signal quality.
Replicate the E01 emulator experiment's 'easy mode' condition at small scale: give GLM-5 an open-source project's source code as reference and ask it to reimplement a specific module independently, measuring whether it can maintain coherent architecture decisions across a multi-hour session without human intervention.
The 'easy mode' result (with reference code) is the actually validated claim, not the 24-hour zero-reference run. Testing this specific condition — reference-assisted reimplementation — maps directly to real enterprise use cases like codebase migration, refactoring, and port work.