Huawei CloudMatrix: A Peer-to-Peer AI Datacenter Architecture for Scalable and Efficient LLM Serving

LLMs have advanced rapidly, with soaring parameter counts, widespread adoption of mixture-of-experts (MoE) designs, and ever-longer context windows. Models such as DeepSeek-R1, Llama 4, and Qwen-3 now reach into the hundreds of billions to trillions of parameters, demanding enormous compute, memory bandwidth, and fast inter-chip communication. MoE improves efficiency but complicates expert routing, while context windows approaching or exceeding a million tokens strain attention computation and KV cache storage, which grows with the number of concurrent users. In real-world deployments, unpredictable input lengths, uneven expert activations, and bursty query traffic complicate serving further. Addressing these pressures requires rethinking AI infrastructure from the ground up through hardware–software co-design, adaptive orchestration, and elastic resource management.
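The expert-routing step is where MoE efficiency meets its serving challenges: each token is dispatched to only a few experts, and when those experts sit on different chips, uneven token-to-expert assignments become communication and load-balancing problems. The snippet below is a minimal, generic sketch of top-k gating in NumPy; the shapes, expert count, and gating scheme are illustrative assumptions, not details from the CloudMatrix paper.

```python
# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# Shapes and numbers are illustrative only.
import numpy as np

def topk_route(hidden, gate_w, k=2):
    """Route each token to its k highest-scoring experts.

    hidden : (num_tokens, d_model) token representations
    gate_w : (d_model, num_experts) learned gating weights
    Returns expert indices and normalized routing weights per token.
    """
    logits = hidden @ gate_w                           # (num_tokens, num_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]      # k best experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over the selected experts only
    top_logits -= top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

# Example: 4 tokens, 8 experts, each token dispatched to 2 experts
rng = np.random.default_rng(0)
idx, w = topk_route(rng.normal(size=(4, 64)), rng.normal(size=(64, 8)))
print(idx)  # which experts each token is sent to
print(w)    # how their outputs are mixed back together
```

In a distributed deployment, the `idx` assignments drive an all-to-all exchange of token activations between devices, which is exactly the communication pattern a peer-to-peer fabric is meant to accelerate.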

Recent progress in LLMs is shaped by three main trends: ever-growing parameter counts, sparse MoE architectures, and extended context windows. Models like Llama 4, DeepSeek-V3, and Google's PaLM push scale toward and beyond a trillion parameters, while MoE designs activate only a subset of experts per token, balancing efficiency against capacity. Meanwhile, context windows now span hundreds of thousands to millions of tokens, enabling long-form reasoning but straining compute and memory through large key-value (KV) caches. These advances place immense pressure on datacenters, demanding more compute, memory, and bandwidth while introducing challenges in parallelism strategy, workload heterogeneity, data convergence, and storage performance.
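A rough back-of-envelope calculation shows why long contexts strain memory: every layer stores a key and a value vector per token per KV head, so the cache grows linearly with both context length and the number of concurrent requests. The model dimensions below (layer count, KV heads, head size, FP16 storage) are assumptions chosen only for illustration, not the configuration of any model named above.

```python
# Back-of-envelope KV cache size:
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element * context length * batch
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, batch, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_len * batch

# Hypothetical model: 64 layers, 8 KV heads of dim 128, FP16 cache, 128k-token context
per_user = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, ctx_len=128_000, batch=1)
print(f"{per_user / 2**30:.1f} GiB per request")                       # roughly 31 GiB
print(f"{kv_cache_bytes(64, 8, 128, 128_000, 32) / 2**40:.2f} TiB "
      f"for 32 concurrent requests")                                    # roughly 1 TiB
```

At that scale the KV cache alone exceeds the memory of a single accelerator, which is why disaggregated, pooled cache storage becomes attractive.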

Huawei researchers introduced CloudMatrix, a new AI datacenter architecture designed to handle the rising demands of large-scale LLMs. Its first implementation, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs, all linked by a high-bandwidth, low-latency Unified Bus that enables fully peer-to-peer communication. This design allows flexible pooling of compute, memory, and network resources, making it ideal for MoE parallelism and distributed KV cache access. On top of this, CloudMatrix-Infer offers an optimized serving framework with peer-to-peer resource pools, large-scale expert parallelism, and hardware-aware optimizations like pipelining and INT8 quantization. Evaluations with DeepSeek-R1 show state-of-the-art throughput, efficiency, and scalability. 
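Among the hardware-aware optimizations mentioned is INT8 quantization on the Ascend 910C. The sketch below shows the generic recipe behind such schemes, symmetric per-output-channel INT8 weight quantization; it is not Huawei's implementation, and the scaling and rounding choices are assumptions made for illustration.

```python
# Generic symmetric per-channel INT8 weight quantization (illustrative, not Huawei's scheme).
import numpy as np

def quantize_int8(weights):
    """Quantize a (out_features, in_features) float weight matrix to int8.

    Returns the int8 weights plus one float scale per output channel
    so that weights ~= q.astype(float) * scales.
    """
    max_abs = np.abs(weights).max(axis=1, keepdims=True)   # one scale per output channel
    scales = np.maximum(max_abs, 1e-8) / 127.0             # avoid division by zero
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.default_rng(1).normal(size=(256, 512)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.5f}")
```

Per-channel scales keep the quantization error small enough that, as the article notes, accuracy on downstream benchmarks is typically preserved while memory traffic and matmul cost drop substantially.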

Huawei CloudMatrix is a new AI datacenter architecture built on peer-to-peer high-bandwidth interconnects and fine-grained resource disaggregation. Its first large-scale implementation, CloudMatrix384, integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs into a single supernode, all linked by the Unified Bus network, which enables direct all-to-all communication. This design lets compute, memory, and network resources be pooled seamlessly and scaled independently, so the supernode operates as one cohesive system. By avoiding the bottlenecks of traditional hierarchical network topologies, CloudMatrix384 is particularly effective for communication-heavy workloads such as large-scale MoE expert parallelism and distributed KV cache access, making it well suited to scalable LLM serving.
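Disaggregated serving of the kind described here typically splits a request's lifecycle: a prefill stage processes the prompt and produces the KV cache, which is then handed over (across the peer-to-peer fabric rather than through host memory) to a decode stage that generates tokens against it. The sketch below only illustrates that control flow; the class names, handle format, and methods are hypothetical placeholders, not CloudMatrix-Infer's actual interfaces.

```python
# Conceptual control flow for prefill/decode disaggregation with a shared KV cache pool.
# All classes and methods are placeholders for illustration only.
from dataclasses import dataclass

@dataclass
class KVHandle:
    request_id: str
    location: str      # which cache pool shard / device holds the K/V tensors

class PrefillPool:
    def run(self, request_id: str, prompt_tokens: list[int]) -> KVHandle:
        # Process the full prompt once, write K/V tensors into the shared pool,
        # and return a lightweight handle instead of copying the cache back.
        return KVHandle(request_id, location="kv-pool-shard-0")

class DecodePool:
    def generate(self, handle: KVHandle, max_new_tokens: int) -> list[int]:
        # Attach to the cache referenced by the handle (ideally a direct
        # peer-to-peer read) and decode token by token.
        return [0] * max_new_tokens   # placeholder output

prefill, decode = PrefillPool(), DecodePool()
kv = prefill.run("req-42", prompt_tokens=list(range(1024)))
out = decode.generate(kv, max_new_tokens=64)
```

Keeping prefill, decode, and cache in separate pools lets each be scaled to its own bottleneck (compute-bound prefill, latency-bound decode, capacity-bound caching) instead of sizing one monolithic replica for all three.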

The researchers evaluate CloudMatrix-Infer on the DeepSeek-R1 model running on the CloudMatrix384 supernode. The system achieves a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second per NPU while keeping time per output token (TPOT) under 50 ms, outperforming comparable systems such as SGLang on NVIDIA H100 and DeepSeek's own serving stack on H800. Even under a stricter latency target of under 15 ms per token, it sustains 538 tokens per second per NPU in decoding. Moreover, INT8 quantization on the Ascend 910C preserves accuracy across 16 benchmarks, showing that the efficiency gains do not compromise model quality.
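These decode figures can be sanity-checked with simple arithmetic: a per-token latency budget caps the rate any single request can see, so aggregate per-NPU throughput has to come from serving many requests concurrently. The concurrency numbers derived below are illustrative back-of-envelope estimates, not figures reported in the evaluation.

```python
# Relating a time-per-output-token (TPOT) budget to per-NPU decode throughput.
def implied_concurrency(throughput_tok_s, tpot_ms):
    per_request_rate = 1000.0 / tpot_ms           # tokens/s a single request can receive
    return throughput_tok_s / per_request_rate    # concurrent decodes implied per NPU

# 1,943 tok/s per NPU under a 50 ms TPOT budget -> roughly 97 concurrent decodes
print(implied_concurrency(1943, 50))
# 538 tok/s per NPU under a 15 ms TPOT budget -> roughly 8 concurrent decodes
print(implied_concurrency(538, 15))
```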

In conclusion, Huawei CloudMatrix is a next-generation AI datacenter architecture designed to overcome the scalability limits of conventional clusters. Its first production-scale system, CloudMatrix384, combines 384 Ascend 910C NPUs and 192 Kunpeng CPUs in a fully peer-to-peer supernode connected through a high-bandwidth, low-latency Unified Bus. To exploit this design, the study proposes CloudMatrix-Infer, which separates prefill, decode, and caching into independent resource pools, supports large-scale expert parallelism, and applies hardware-aware optimizations such as pipelining and INT8 quantization. Tested on DeepSeek-R1, it delivered higher throughput and better latency than NVIDIA-based baselines while preserving accuracy, underscoring its potential for large-scale AI deployments.


Check out the Technical Paper.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


