Cerebras Launches AI Inference Solution, Claims It Is 20 Times Faster Than NVIDIA GPUs

California-based AI startup Cerebras today launched Cerebras Inference, claiming it to be the world’s fastest AI inference solution. In a blog post, Cerebras stated: "Cerebras Inference delivers 1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B, which is 20 times faster than NVIDIA’s (NVDA-US) GPU-based hyperscale cloud."

Cerebras Inference is powered by the company's third-generation Wafer-Scale Engine (WSE-3). Cerebras claims the solution runs at one-fifth the cost of GPU-based offerings and achieves its higher speed by eliminating memory bandwidth bottlenecks. Cerebras noted: "Cerebras tackles memory bandwidth limitations by building the world's largest chip and storing the entire model on-chip, thus eliminating the need for external memory and the slow pathways connecting external memory to computation."
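
The bandwidth argument lends itself to a quick back-of-envelope check. The sketch below is not from the article and all figures in it are illustrative assumptions; it simply shows why autoregressive decoding tends to be memory-bandwidth-bound: generating each token requires streaming every model weight through the compute units once, so the available weight-memory bandwidth sets a hard ceiling on single-stream tokens per second, a ceiling that keeping the weights in on-chip memory raises dramatically.

```python
# Illustrative back-of-envelope sketch (assumed figures, not measurements):
# for dense-model autoregressive decoding, each new token reads all weights
# once, so tokens/sec <= weight-memory bandwidth / model size in bytes.

def bandwidth_bound_tokens_per_sec(params_billions: float,
                                    bytes_per_param: float,
                                    bandwidth_gb_per_sec: float) -> float:
    """Upper bound on single-stream decode speed for a dense model."""
    model_size_gb = params_billions * bytes_per_param  # total weight bytes, in GB
    return bandwidth_gb_per_sec / model_size_gb

# Assumption: a 70B-parameter model in 16-bit weights (~140 GB) served from
# HBM at roughly 3,350 GB/s -> a ceiling on the order of ~24 tokens/s.
print(bandwidth_bound_tokens_per_sec(70, 2, 3_350))

# Assumption: the same model held entirely in on-chip memory with orders of
# magnitude more aggregate bandwidth (the approach Cerebras describes) lifts
# that ceiling into the thousands of tokens per second.
print(bandwidth_bound_tokens_per_sec(70, 2, 1_000_000))
```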

Micah Hill-Smith, co-founder and CEO of Artificial Analysis, said Cerebras has taken the lead in Artificial Analysis's AI inference benchmarks: "Cerebras provides speeds an order of magnitude faster than GPU-based solutions for Meta's Llama 3.1 8B and 70B AI models. We measured speeds exceeding 1,800 tokens per second on Llama 3.1 8B and over 446 tokens per second on Llama 3.1 70B."

Cerebras filed for an initial public offering earlier this month and is expected to go public in the second half of this year.