@howard.mov7 A 4B parameter model just outperformed 235B. . . . Not by being bigger. By being integrated. There are 3 main ways AI can remember long-term: → Bake it into the weights (forgets) → Bolt on search like RAG (imprecise) → Build from internal states (couldn’t scale — until now) A new paper from EverMind cracked option 3. End-to-end memory scaled to 100M tokens. Barely any quality loss. Duct-taped systems always lose to integrated ones. Eventually. Follow for more AI research distilled into your feed. #ai #research #tech #software #memory
A Chinese AI startup, EverMind, claims to have solved the AI long-term memory problem with an end-to-end trainable internal memory system that scales to 100 million tokens, arguing that integrated systems outperform modular 'duct tape' solutions.
On March 18, 2026, EverMind, a Chinese AI startup incubated by Shanda Group's founder Tianqiao Chen, released a research paper titled 'Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens' (MSA). The paper is published on Zenodo (https://zenodo.org/records/19103670) and open-sourced on GitHub (https://github.com/EverMind-AI/MSA), with the arXiv identifier 2603.23516.
The MSA architecture addresses AI memory through four core innovations: a Memory Sparse Attention mechanism, Document-wise RoPE for extreme context extrapolation, KV cache compression with Memory Parallelism, and a Memory Interleave mechanism that supports complex reasoning. Unlike RAG, which retrieves with a fixed external similarity metric (such as cosine distance), MSA's router is co-optimized with the generation task during training via a supervised contrastive loss, directly addressing RAG's core pain point: the misalignment between the retrieval objective and the generation objective.
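The paper's exact loss is not reproduced here, but the idea of co-training a memory router with a contrastive objective can be sketched as follows. This is a minimal, hypothetical illustration (an InfoNCE-style loss over dot-product scores); the function names, the scoring rule, and the temperature are assumptions, not MSA's actual formulation:

```python
import math

def router_scores(query, memory_keys):
    # Dot-product relevance between a query vector and each memory-block key.
    return [sum(q * k for q, k in zip(query, key)) for key in memory_keys]

def contrastive_router_loss(query, memory_keys, positive_idx, temperature=0.1):
    """InfoNCE-style loss: trains the router to score the memory block that
    actually helped generation (positive_idx) above all other blocks, so the
    retrieval objective is aligned with the generation objective."""
    scores = [s / temperature for s in router_scores(query, memory_keys)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    log_prob_pos = (scores[positive_idx] - m) - math.log(sum(exps))
    return -log_prob_pos  # low when the positive block wins the softmax

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = contrastive_router_loss([1.0, 0.0], keys, positive_idx=0)
loss_misaligned = contrastive_router_loss([1.0, 0.0], keys, positive_idx=2)
```

Because the loss is computed against the block that benefits generation, gradient descent reshapes the router's scoring itself, rather than relying on a frozen similarity metric as RAG does.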
When the context length scales from 16K to 100M tokens, the model's performance degrades by less than 9%, demonstrating strong scalability. KV cache compression, combined with Memory Parallelism, enables 100M-token inference on two A800 GPUs. The system uses Qwen3-4B-Instruct as its backbone.
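The paper does not detail its Memory Parallelism scheme here, but the general pattern of splitting a large memory across devices can be sketched. In this hypothetical illustration, memory blocks are round-robin sharded across devices, each device ranks only its shard, and a cheap merge of per-device candidates recovers the global top-k:

```python
def shard_memory(blocks, num_devices):
    """Round-robin partition of memory blocks across devices (assumed scheme)."""
    shards = [[] for _ in range(num_devices)]
    for i, block in enumerate(blocks):
        shards[i % num_devices].append((i, block))
    return shards

def parallel_topk(score, blocks, num_devices=2, k=3):
    """Each device scores only its own shard; merging the per-device top-k
    candidates yields the global top-k without any device holding all blocks."""
    candidates = []
    for shard in shard_memory(blocks, num_devices):
        # In a real system these loops would run concurrently, one per GPU.
        local = sorted(shard, key=lambda ib: score(ib[1]), reverse=True)[:k]
        candidates.extend(local)
    return sorted(candidates, key=lambda ib: score(ib[1]), reverse=True)[:k]

# Toy run: 10 blocks whose relevance equals their value, sharded over 2 "devices".
top = parallel_topk(score=lambda b: b, blocks=list(range(10)), num_devices=2, k=3)
```

The design point is that per-device memory grows as total_blocks / num_devices, which is why sharding (plus KV compression) can bring a 100M-token memory within reach of two GPUs.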
Benchmark results show the 4B-parameter MSA model (average score 3.760) significantly outperforms RAG systems built on the identical Qwen3-4B foundation, including those with a reranker: +16.0% over standard RAG, +11.5% over RAG+rerank, and +14.8% over HippoRAG2. Against KaLMv2+Qwen3-235B and KaLMv2+Llama-3.3-70B (with and without reranking), MSA achieves the best score on 4 of 9 datasets, with relative gains of +7.2%, +5.0%, +10.7%, and +5.4% over those configurations respectively. This supports the video's claim that a 4B model beat systems running on 235B parameters.
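As a sanity check on the arithmetic, if "relative gain" means (MSA score − baseline) / baseline (an assumption; the paper may define it differently), the reported gains imply baseline averages a bit above 3.2 on the same scale:

```python
def implied_baseline(msa_score, relative_gain):
    """If gain = (msa - baseline) / baseline, then baseline = msa / (1 + gain).
    (Assumes gains are reported relative to each baseline's own average score.)"""
    return msa_score / (1.0 + relative_gain)

MSA_AVG = 3.760  # MSA's reported average score
gains = {"RAG": 0.160, "RAG+rerank": 0.115, "HippoRAG2": 0.148}
baselines = {name: round(implied_baseline(MSA_AVG, g), 3) for name, g in gains.items()}
# e.g. a +16.0% gain implies a standard-RAG average of roughly 3.24
```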
Setup & Implementation: The GitHub repository (https://github.com/EverMind-AI/MSA) provides the research code, though code and models are marked as 'Coming Soon'. EverMind also offers EverMemOS (https://github.com/EverMind-AI/EverOS), a production memory operating system. Setup requires Python 3.10+, Docker 20.10+, uv package manager, and 4GB RAM. Installation: git clone https://github.com/EverMind-AI/EverOS.git && cd EverOS && docker compose up -d && curl -LsSf https://astral.sh/uv/install.sh | sh && uv sync && cp env.template .env (configure LLM_API_KEY and VECTORIZE_API_KEY) && uv run python src/run.py. On the LoCoMo benchmark, EverMemOS achieved a 92.3% reasoning accuracy (evaluated by LLM-Judge).
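For readability, the one-line EverMemOS setup chain above can be written one step per line (same commands as stated in the source; not independently tested here):

```shell
git clone https://github.com/EverMind-AI/EverOS.git
cd EverOS
docker compose up -d                               # start backing services
curl -LsSf https://astral.sh/uv/install.sh | sh    # install the uv package manager
uv sync                                            # install Python dependencies
cp env.template .env                               # then set LLM_API_KEY and VECTORIZE_API_KEY
uv run python src/run.py
```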
EverMind's end-to-end trainable memory system scales to 100 million tokens, making the case that integrated architectures can fundamentally outperform modular approaches like RAG.