@howard.mov7 A 4B parameter model just outperformed 235B. . . . Not by being bigger. By being integrated. There are 3 main ways AI can remember long-term: → Bake it into the weights (forgets) → Bolt on search like RAG (imprecise) → Build from internal states (couldn’t scale — until now) A new paper from EverMind cracked option 3. End-to-end memory scaled to 100M tokens. Barely any quality loss. Duct-taped systems always lose to integrated ones. Eventually. Follow for more AI research distilled into your feed. #ai #research #tech #software #memory
A Chinese AI startup, EverMind, claims to have solved the AI long-term memory problem with an end-to-end trainable internal memory system that scales to 100 million tokens, arguing that integrated systems outperform modular 'duct tape' solutions.
On March 18, 2026, EverMind, a Chinese AI startup incubated by Shanda Group's founder Tianqiao Chen, released a research paper titled 'Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens' (MSA). The paper is published on Zenodo (https://zenodo.org/records/19103670) and open-sourced on GitHub (https://github.com/EverMind-AI/MSA), with the arXiv identifier 2603.23516.
The MSA architecture addresses AI memory through four core innovations: a Memory Sparse Attention mechanism, Document-wise RoPE for extreme context extrapolation, KV cache compression with Memory Parallelism, and a Memory Interleave mechanism that supports complex reasoning. Unlike RAG, which retrieves with a fixed external similarity metric (such as cosine distance), MSA's router is co-optimized with the generation task during training via a supervised contrastive loss, directly addressing RAG's core pain point: the misalignment between the retrieval objective and the generation objective.
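The paper's exact loss is not reproduced here, but the idea of co-training a memory router with a contrastive objective can be sketched as follows. This is a minimal, hypothetical illustration (an InfoNCE-style loss over dot-product scores); the function names, the scoring rule, and the temperature are assumptions, not MSA's actual formulation:

```python
import math

def router_scores(query, memory_keys):
    # Dot-product relevance between a query vector and each memory-block key.
    return [sum(q * k for q, k in zip(query, key)) for key in memory_keys]

def contrastive_router_loss(query, memory_keys, positive_idx, temperature=0.1):
    """InfoNCE-style loss: trains the router to score the memory block that
    actually helped generation (positive_idx) above all other blocks, so the
    retrieval objective is aligned with the generation objective."""
    scores = [s / temperature for s in router_scores(query, memory_keys)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    log_prob_pos = (scores[positive_idx] - m) - math.log(sum(exps))
    return -log_prob_pos  # low when the positive block wins the softmax

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = contrastive_router_loss([1.0, 0.0], keys, positive_idx=0)
loss_misaligned = contrastive_router_loss([1.0, 0.0], keys, positive_idx=2)
```

Because the loss is computed against the block that benefits generation, gradient descent reshapes the router's scoring itself, rather than relying on a frozen similarity metric as RAG does.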
When the context length scales from 16K to 100M tokens, the model's performance degrades by less than 9%, demonstrating strong scalability. KV cache compression, combined with Memory Parallelism, enables 100M-token inference on two A800 GPUs. The system uses Qwen3-4B-Instruct as its backbone.
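The paper does not detail its Memory Parallelism scheme here, but the general pattern of splitting a large memory across devices can be sketched. In this hypothetical illustration, memory blocks are round-robin sharded across devices, each device ranks only its shard, and a cheap merge of per-device candidates recovers the global top-k:

```python
def shard_memory(blocks, num_devices):
    """Round-robin partition of memory blocks across devices (assumed scheme)."""
    shards = [[] for _ in range(num_devices)]
    for i, block in enumerate(blocks):
        shards[i % num_devices].append((i, block))
    return shards

def parallel_topk(score, blocks, num_devices=2, k=3):
    """Each device scores only its own shard; merging the per-device top-k
    candidates yields the global top-k without any device holding all blocks."""
    candidates = []
    for shard in shard_memory(blocks, num_devices):
        # In a real system these loops would run concurrently, one per GPU.
        local = sorted(shard, key=lambda ib: score(ib[1]), reverse=True)[:k]
        candidates.extend(local)
    return sorted(candidates, key=lambda ib: score(ib[1]), reverse=True)[:k]

# Toy run: 10 blocks whose relevance equals their value, sharded over 2 "devices".
top = parallel_topk(score=lambda b: b, blocks=list(range(10)), num_devices=2, k=3)
```

The design point is that per-device memory grows as total_blocks / num_devices, which is why sharding (plus KV compression) can bring a 100M-token memory within reach of two GPUs.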
Benchmark results show the 4B-parameter MSA model (average score 3.760) significantly outperforms RAG systems built on the identical Qwen3-4B foundation, including those with a reranker: +16.0% over standard RAG, +11.5% over RAG+rerank, and +14.8% over HippoRAG2. Against KaLMv2+Qwen3-235B and KaLMv2+Llama-3.3-70B (with and without reranking), MSA achieves the best score on 4 of 9 datasets, with relative gains of +7.2%, +5.0%, +10.7%, and +5.4% over those configurations respectively. This supports the video's claim that a 4B model beat systems running on 235B parameters.
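As a sanity check on the arithmetic, if "relative gain" means (MSA score − baseline) / baseline (an assumption; the paper may define it differently), the reported gains imply baseline averages a bit above 3.2 on the same scale:

```python
def implied_baseline(msa_score, relative_gain):
    """If gain = (msa - baseline) / baseline, then baseline = msa / (1 + gain).
    (Assumes gains are reported relative to each baseline's own average score.)"""
    return msa_score / (1.0 + relative_gain)

MSA_AVG = 3.760  # MSA's reported average score
gains = {"RAG": 0.160, "RAG+rerank": 0.115, "HippoRAG2": 0.148}
baselines = {name: round(implied_baseline(MSA_AVG, g), 3) for name, g in gains.items()}
# e.g. a +16.0% gain implies a standard-RAG average of roughly 3.24
```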
Setup & Implementation: The GitHub repository (https://github.com/EverMind-AI/MSA) provides the research code, though code and models are marked as 'Coming Soon'. EverMind also offers EverMemOS (https://github.com/EverMind-AI/EverOS), a production memory operating system. Setup requires Python 3.10+, Docker 20.10+, uv package manager, and 4GB RAM. Installation: git clone https://github.com/EverMind-AI/EverOS.git && cd EverOS && docker compose up -d && curl -LsSf https://astral.sh/uv/install.sh | sh && uv sync && cp env.template .env (configure LLM_API_KEY and VECTORIZE_API_KEY) && uv run python src/run.py. On the LoCoMo benchmark, EverMemOS achieved a 92.3% reasoning accuracy (evaluated by LLM-Judge).
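For readability, the one-line EverMemOS setup chain above can be written one step per line (same commands as stated in the source; not independently tested here):

```shell
git clone https://github.com/EverMind-AI/EverOS.git
cd EverOS
docker compose up -d                               # start backing services
curl -LsSf https://astral.sh/uv/install.sh | sh    # install the uv package manager
uv sync                                            # install Python dependencies
cp env.template .env                               # then set LLM_API_KEY and VECTORIZE_API_KEY
uv run python src/run.py
```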
EverMind's end-to-end trainable memory system scales to 100 million tokens, making the case that integrated architectures can fundamentally outperform modular approaches like RAG.