Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving

LLMs have advanced rapidly, with soaring parameter counts, widespread adoption of mixture-of-experts (MoE) designs, and ever-longer context windows. Models such as DeepSeek-R1, Llama 4, and Qwen-3 now reach into the hundreds of billions to trillions of parameters, demanding enormous compute, memory bandwidth, and fast inter-chip communication. MoE improves efficiency but complicates expert routing, while context windows approaching or exceeding a million tokens strain attention computation and KV cache storage, which grows with the number of concurrent users. In real-world deployments, unpredictable input lengths, uneven expert activations, and bursty query traffic complicate serving further. Addressing these pressures requires rethinking AI infrastructure from the ground up through hardware–software co-design, adaptive orchestration, and elastic resource management.
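The expert-routing step is where MoE efficiency meets its serving challenges: each token is dispatched to only a few experts, and when those experts sit on different chips, uneven token-to-expert assignments become communication and load-balancing problems. The snippet below is a minimal, generic sketch of top-k gating in NumPy; the shapes, expert count, and gating scheme are illustrative assumptions, not details from the CloudMatrix paper.

```python
# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# Shapes and numbers are illustrative only.
import numpy as np

def topk_route(hidden, gate_w, k=2):
    """Route each token to its k highest-scoring experts.

    hidden : (num_tokens, d_model) token representations
    gate_w : (d_model, num_experts) learned gating weights
    Returns expert indices and normalized routing weights per token.
    """
    logits = hidden @ gate_w                           # (num_tokens, num_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]      # k best experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over the selected experts only
    top_logits -= top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

# Example: 4 tokens, 8 experts, each token dispatched to 2 experts
rng = np.random.default_rng(0)
idx, w = topk_route(rng.normal(size=(4, 64)), rng.normal(size=(64, 8)))
print(idx)  # which experts each token is sent to
print(w)    # how their outputs are mixed back together
```

In a distributed deployment, the `idx` assignments drive an all-to-all exchange of token activations between devices, which is exactly the communication pattern a peer-to-peer fabric is meant to accelerate.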

Recent progress in LLMs is shaped by three main trends: ever-growing parameter counts, sparse MoE architectures, and extended context windows. Models like Llama 4, DeepSeek-V3, and Google's PaLM push scale toward and beyond a trillion parameters, while MoE designs activate only a subset of experts per token, balancing efficiency against capacity. Meanwhile, context windows now span hundreds of thousands to millions of tokens, enabling long-form reasoning but straining compute and memory through large key-value (KV) caches. These advances place immense pressure on datacenters, demanding more compute, memory, and bandwidth while introducing challenges in parallelism strategy, workload heterogeneity, data convergence, and storage performance.
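A rough back-of-envelope calculation shows why long contexts strain memory: every layer stores a key and a value vector per token per KV head, so the cache grows linearly with both context length and the number of concurrent requests. The model dimensions below (layer count, KV heads, head size, FP16 storage) are assumptions chosen only for illustration, not the configuration of any model named above.

```python
# Back-of-envelope KV cache size:
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element * context length * batch
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, batch, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_len * batch

# Hypothetical model: 64 layers, 8 KV heads of dim 128, FP16 cache, 128k-token context
per_user = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, ctx_len=128_000, batch=1)
print(f"{per_user / 2**30:.1f} GiB per request")                       # roughly 31 GiB
print(f"{kv_cache_bytes(64, 8, 128, 128_000, 32) / 2**40:.2f} TiB "
      f"for 32 concurrent requests")                                    # roughly 1 TiB
```

At that scale the KV cache alone exceeds the memory of a single accelerator, which is why disaggregated, pooled cache storage becomes attractive.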

Huawei researchers introduced CloudMatrix, a new AI datacenter architecture designed to handle the rising demands of large-scale LLMs. Its first implementation, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs, all linked by a high-bandwidth, low-latency Unified Bus that enables fully peer-to-peer communication. This design allows flexible pooling of compute, memory, and network resources, making it ideal for MoE parallelism and distributed KV cache access. On top of this, CloudMatrix-Infer offers an optimized serving framework with peer-to-peer resource pools, large-scale expert parallelism, and hardware-aware optimizations like pipelining and INT8 quantization. Evaluations with DeepSeek-R1 show state-of-the-art throughput, efficiency, and scalability. 
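Among the hardware-aware optimizations mentioned is INT8 quantization on the Ascend 910C. The sketch below shows the generic recipe behind such schemes, symmetric per-output-channel INT8 weight quantization; it is not Huawei's implementation, and the scaling and rounding choices are assumptions made for illustration.

```python
# Generic symmetric per-channel INT8 weight quantization (illustrative, not Huawei's scheme).
import numpy as np

def quantize_int8(weights):
    """Quantize a (out_features, in_features) float weight matrix to int8.

    Returns the int8 weights plus one float scale per output channel
    so that weights ~= q.astype(float) * scales.
    """
    max_abs = np.abs(weights).max(axis=1, keepdims=True)   # one scale per output channel
    scales = np.maximum(max_abs, 1e-8) / 127.0             # avoid division by zero
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.default_rng(1).normal(size=(256, 512)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.5f}")
```

Per-channel scales keep the quantization error small enough that, as the article notes, accuracy on downstream benchmarks is typically preserved while memory traffic and matmul cost drop substantially.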

Huawei CloudMatrix is a new AI datacenter architecture built on peer-to-peer high-bandwidth interconnects and fine-grained resource disaggregation. Its first large-scale implementation, CloudMatrix384, integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode, all linked by the Unified Bus network, which enables direct all-to-all communication. This design lets compute, memory, and network resources be pooled seamlessly and scaled independently, so the supernode operates as one cohesive system. By avoiding the bottlenecks of traditional hierarchical network topologies, CloudMatrix384 is particularly effective for communication-heavy workloads such as large-scale MoE expert parallelism and distributed KV cache access, making it well suited to scalable LLM serving.
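Disaggregated serving of the kind described here typically splits a request's lifecycle: a prefill stage processes the prompt and produces the KV cache, which is then handed over (across the peer-to-peer fabric rather than through host memory) to a decode stage that generates tokens against it. The sketch below only illustrates that control flow; the class names, handle format, and methods are hypothetical placeholders, not CloudMatrix-Infer's actual interfaces.

```python
# Conceptual control flow for prefill/decode disaggregation with a shared KV cache pool.
# All classes and methods are placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class KVHandle:
    request_id: str
    location: str      # which cache pool shard / device holds the K/V tensors

class PrefillPool:
    def run(self, request_id: str, prompt_tokens: list[int]) -> KVHandle:
        # Process the full prompt once, write K/V tensors into the shared pool,
        # and return a lightweight handle instead of copying the cache back.
        return KVHandle(request_id, location="kv-pool-shard-0")

class DecodePool:
    def generate(self, handle: KVHandle, max_new_tokens: int) -> list[int]:
        # Attach to the cache referenced by the handle (ideally a direct
        # peer-to-peer read) and decode token by token.
        return [0] * max_new_tokens   # placeholder output

prefill, decode = PrefillPool(), DecodePool()
kv = prefill.run("req-42", prompt_tokens=list(range(1024)))
out = decode.generate(kv, max_new_tokens=64)
```

Keeping prefill, decode, and cache in separate pools lets each be scaled to its own bottleneck (compute-bound prefill, latency-bound decode, capacity-bound caching) instead of sizing one monolithic replica for all three.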

The researchers evaluate CloudMatrix-Infer on the DeepSeek-R1 model running on the CloudMatrix384 supernode. The system achieves a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second per NPU while keeping time per output token (TPOT) under 50 ms, outperforming comparable systems such as SGLang on NVIDIA H100 and DeepSeek's own serving stack on H800. Even under a stricter latency target of under 15 ms per token, it sustains 538 tokens per second per NPU in decoding. Moreover, INT8 quantization on the Ascend 910C preserves accuracy across 16 benchmarks, showing that the efficiency gains do not compromise model quality.
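These decode figures can be sanity-checked with simple arithmetic: a per-token latency budget caps the rate any single request can see, so aggregate per-NPU throughput has to come from serving many requests concurrently. The concurrency numbers derived below are illustrative back-of-envelope estimates, not figures reported in the evaluation.

```python
# Relating a time-per-output-token (TPOT) budget to per-NPU decode throughput.
def implied_concurrency(throughput_tok_s, tpot_ms):
    per_request_rate = 1000.0 / tpot_ms           # tokens/s a single request can receive
    return throughput_tok_s / per_request_rate    # concurrent decodes implied per NPU

# 1,943 tok/s per NPU under a 50 ms TPOT budget -> roughly 97 concurrent decodes
print(implied_concurrency(1943, 50))
# 538 tok/s per NPU under a 15 ms TPOT budget -> roughly 8 concurrent decodes
print(implied_concurrency(538, 15))
```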

In conclusion, Huawei CloudMatrix is a next-generation AI datacenter architecture designed to overcome the scalability limits of conventional clusters. Its first production-scale system, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a fully peer-to-peer supernode connected through a high-bandwidth, low-latency Unified Bus. To exploit this design, the study proposes CloudMatrix-Infer, which separates prefill, decode, and caching into independent resource pools, supports large-scale expert parallelism, and applies hardware-aware optimizations such as pipelining and INT8 quantization. Tested on DeepSeek-R1, it delivered higher throughput and better latency than NVIDIA-based baselines while preserving accuracy, underscoring its potential for large-scale AI deployments.


Check out the Technical Paper.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


