Turns out using 100% of your AI brain all the

Feature If you’ve been following AI development over the past few years, one trend has remained constant: bigger models are usually smarter, but also harder to run.

This is particularly problematic in parts of the world where access to America’s most sophisticated AI chips is restricted – like, say, China.

But even outside of China, model builders are increasingly turning to mixture of experts (MoE) architectures along with emerging compression tech to drive down the compute requirements of serving large language models (LLMs). Nearly three years since ChatGPT kicked off the generative AI boom, it seems folks are finally starting to think about the cost of running these things.

To be clear, we’ve seen MoE models, like Mistral AI’s Mixtral, before, but it’s only in the last year or so the technology has really taken off.

Over the past few months, we’ve seen a wave of new open-weight LLMs from the likes of Microsoft, Google, IBM, Meta, DeepSeek, and Alibaba based on some kind of MoE architecture.

And the reason is simple: The architecture is a helluva lot more efficient than traditional “dense” model architectures.

Vaulting the memory wall

The basic idea, first described in the early ’90s in a paper titled “Adaptive Mixtures of Local Experts,” is that instead of one great big model trained on a bit of everything, work is routed to one or more of any number of smaller sub-models, or “experts.”

In theory, each of these experts can be optimized for a domain-specific task, like coding, mathematics, or writing. Unfortunately, few model builders go into much detail about the various experts that make up their MoE models, and the exact number varies from model to model. The important bit is that only a small portion of the model is in use at any given moment.

For example, DeepSeek’s V3 model is composed of 256 routed experts along with one shared expert. But only eight routed experts, plus the shared one, are activated per token.
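
To make the routing step concrete, here is a minimal sketch of a generic top-k gate in plain Python. It is not DeepSeek’s actual router: the random scores stand in for the output of a learned gating layer, the softmax weighting is one common choice among several, and the function and constant names are made up for illustration. Only the 256, eight, and one-shared-expert figures come from the example above.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # routed experts, figure quoted above
TOP_K = 8                 # routed experts activated per token
NUM_SHARED_EXPERTS = 1    # always active, bypasses the router


def route_token(router_scores, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and weight them with a softmax.

    Generic top-k gating sketch; real routers differ in scoring details.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax over the chosen scores only, so the weights sum to 1.
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Stand-in scores; in a real model these come from a learned layer.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
selected = route_token(scores)

# Share of experts that do any work for this token.
active_fraction = (TOP_K + NUM_SHARED_EXPERTS) / (NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS)
print(f"{len(selected)} routed experts chosen, {active_fraction:.1%} of experts active")
```

Note that the share of active experts is not the same as the share of active parameters, since attention layers and embeddings are used for every token regardless of routing.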

Because of this, MoE models don’t always match the quality of similarly sized dense models. Take Alibaba’s Qwen3-30B-A3B MoE model for example. It consistently fell behind the dense Qwen3-32B model in Alibaba’s own benchmark testing.

The loss in quality – at least if the benchmarks are to be believed – is pretty minor compared to the leap in efficiency gained from the MoE architecture. Fewer active parameters also mean the amount of memory bandwidth required to achieve a given level of performance is no longer proportional to the capacity needed to store the model weights.

In other words, MoE models may still need a ton of memory, but it doesn’t all have to be ultra-fast or ultra-expensive HBM anymore.

To illustrate this, let’s compare the system requirements for Meta’s largest “dense” model, Llama 3.1 405B, to Llama 4 Maverick, which is nearly as big, but uses a MoE architecture with 17 billion active parameters.

Factors like batch size, floating point performance, and the key-value cache all play into real-world performance, but we can at least get a rough sense of the minimum bandwidth requirements of a model by multiplying its size in gigabytes at a given precision (1 byte per parameter for 8-bit models) by the target tokens per second at a batch size of one.
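
That rule of thumb is simple enough to write down. The sketch below is a back-of-the-envelope calculator, not a performance model: it assumes every active weight is read from memory once per generated token and ignores the KV cache, batching, and compute limits. The function name is our own, and the 405 billion and 17 billion figures are the parameter counts mentioned above.

```python
def min_bandwidth_tbps(active_params_billions, bytes_per_param, tokens_per_second):
    """Rough lower bound on memory bandwidth (TB/s) at a batch size of one.

    Assumes each active parameter is read once per generated token.
    """
    gigabytes_per_token = active_params_billions * bytes_per_param
    return gigabytes_per_token * tokens_per_second / 1000.0


# 8-bit weights (1 byte per parameter), 50 tokens per second target.
dense_405b = min_bandwidth_tbps(405, 1, 50)
moe_17b_active = min_bandwidth_tbps(17, 1, 50)

print(f"Dense 405B:     {dense_405b:.2f} TB/s")
print(f"MoE 17B active: {moe_17b_active:.2f} TB/s")
```

For a dense model every parameter is active, so the bandwidth bound scales with total size. For a MoE model only the active parameters count toward bandwidth, even though all of them still have to sit in memory.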

To run an 8-bit quantized version of Llama 3.1 405B — more on quantization in a bit — you’d need more than 405 GB of vRAM and at least 20 TB/s of memory bandwidth in order to generate text at 50 tokens per second.

For reference, Nvidia’s HGX H100-based systems, which we’ll remind you were selling for $300,000 or more until recently, only had 640 GB of HBM3 and about 26.8 TB/s of aggregate bandwidth. If you wanted to run the full 16-bit model, you would have needed at least two of them.

By comparison, Llama 4 Maverick still consumes roughly the same amount of memory, but needs less than 1 TB/s of bandwidth to achieve the same performance. That’s because only 17 billion parameters worth of model experts are actually used to generate the output.

That means, on the same hardware, Llama 4 Maverick should generate text an order of magnitude faster than Llama 3.1 405B.

On the other hand, if performance isn’t as big a concern, you can now get away with running many of these models on cheaper, albeit slower GDDR6, GDDR7, or even DDR in the case of Intel’s latest Xeons.

Nvidia’s new RTX Pro Servers, announced at Computex this week, are primed to do just that. Rather than high-bandwidth memory (HBM), which is expensive, power-hungry, and requires advanced packaging to integrate, each of the eight RTX Pro 6000 GPUs found in the systems features 96 GB of GDDR7 memory, the same kind you’d find in a modern gaming card.

Combined, these systems offer up to 768 GB of vRAM and 12.8 TB/s of aggregate bandwidth — more than enough to run Llama 4 Maverick at several hundred tokens per second.
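
Turning the bandwidth estimate around gives a theoretical ceiling on single-user generation speed for a given box. The sketch below uses the aggregate bandwidth figures quoted in this article and assumes 8-bit weights with every active parameter read once per token. The names are ours, and real-world throughput will land well below these numbers once interconnect overhead, compute, and the KV cache come into play.

```python
# Aggregate memory bandwidth in TB/s, as quoted in the article.
SYSTEM_BANDWIDTH_TBPS = {"HGX H100": 26.8, "RTX Pro Server": 12.8}

# Billions of parameters read per generated token.
ACTIVE_PARAMS_BILLIONS = {"Llama 3.1 405B": 405, "Llama 4 Maverick": 17}


def ceiling_tokens_per_second(bandwidth_tbps, active_params_billions, bytes_per_param=1):
    """Upper bound on batch-1 tokens per second if memory bandwidth is the only limit."""
    gigabytes_per_token = active_params_billions * bytes_per_param
    return bandwidth_tbps * 1000 / gigabytes_per_token


for system, bandwidth in SYSTEM_BANDWIDTH_TBPS.items():
    for model, params in ACTIVE_PARAMS_BILLIONS.items():
        ceiling = ceiling_tokens_per_second(bandwidth, params)
        print(f"{system:>14} | {model:<16} | {ceiling:7.0f} tok/s ceiling")

# On identical hardware the ratio depends only on active parameter counts.
speedup = ACTIVE_PARAMS_BILLIONS["Llama 3.1 405B"] / ACTIVE_PARAMS_BILLIONS["Llama 4 Maverick"]
```

Under these assumptions the RTX Pro Server tops out at roughly 750 tokens per second on Maverick, and the MoE model comes out about 24 times faster than the dense one on the same machine, which squares with the “order of magnitude” and “several hundred tokens per second” figures above.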

Nvidia hasn’t shared pricing, but with the workstation edition of these cards currently retailing for around $8,500, we wouldn’t be surprised to find them selling for less than half of what an HGX H100 used to go for.

With that said, MoE doesn’t spell an end for HBM-stacked GPUs. We don’t expect we’ll see Llama 4 Behemoth — assuming it ever ships — running on anything short of a rack full of GPUs.

While the thing has fewer active parameters than Llama 3.1 405B, at 288 billion according to Meta, it’s got nearly 2 trillion parameters in total. There’s not a single conventional GPU server on the market today that can fit the full 16-bit model and what’ll inevitably be a million-plus token context window.
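
The capacity side of the equation is just as easy to sanity check. This sketch counts weights only, so it understates the real requirement by leaving out the KV cache and activations. It assumes a round 2 trillion parameters for Behemoth and uses the 640 GB HGX H100 figure from earlier in the article; the helper names are hypothetical.

```python
import math


def weight_footprint_gb(total_params_billions, bytes_per_param):
    """Memory needed for the weights alone; every expert must stay resident."""
    return total_params_billions * bytes_per_param


def servers_needed(footprint_gb, server_memory_gb):
    """Minimum server count to hold the weights, ignoring KV cache and overhead."""
    return math.ceil(footprint_gb / server_memory_gb)


HGX_H100_GB = 640  # 8 GPUs with 80 GB of HBM3 each

llama_405b_fp16 = weight_footprint_gb(405, 2)  # 16-bit means 2 bytes per parameter
behemoth_fp16 = weight_footprint_gb(2000, 2)   # assumes ~2 trillion parameters

print(f"Llama 3.1 405B at 16-bit: {llama_405b_fp16} GB, "
      f"{servers_needed(llama_405b_fp16, HGX_H100_GB)} HGX H100 systems")
print(f"Behemoth at 16-bit:       {behemoth_fp16} GB, "
      f"{servers_needed(behemoth_fp16, HGX_H100_GB)} HGX H100 systems")
```

Unlike bandwidth, capacity scales with total parameters rather than active ones, because the router can pick any expert for the next token.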

Are CPUs finally having their AI moment?

Depending on your use case, you may not need a GPU at all — something that might come in handy in regions where imports of high-end accelerators are restricted.

Back in April, Intel demoed a dual-socket Xeon 6 platform equipped with a full complement of 8800 MT/s MCRDIMMs, achieving a throughput of 240 tokens per second running Llama 4 Maverick at an average output latency of less than 100 ms per token.

Put more succinctly, the Xeon platform was able to maintain 10 tokens per second or better per user for roughly 24 concurrent users.
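
The arithmetic behind that figure is a two-liner. The sketch below treats the 100 ms latency as exact, though Intel quoted it as an upper bound, and assumes throughput is split evenly across users. The function name is our own.

```python
def concurrent_users(aggregate_tokens_per_second, ms_per_token):
    """Users served if each one receives a token every ms_per_token milliseconds."""
    per_user_tokens_per_second = 1000 / ms_per_token
    return aggregate_tokens_per_second / per_user_tokens_per_second


users = concurrent_users(240, 100)
print(f"{users:.0f} users at {1000 / 100:.0f} tokens per second each")
```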

Intel didn’t share batch 1 (single user) performance — and we can’t blame th
