The AI Chip Race: Who Will Dominate with the Most Nvidia Processors?

Hannah Perry | November 25, 2024

In the rapidly evolving world of artificial intelligence, tech giants have adopted a new yardstick for supremacy: how many Nvidia chips they can amass in a single location. Over the past two years, companies running colossal data centers have competed fiercely to acquire Nvidia's advanced AI processors. Recently, several leading players have taken the race further, building expansive, billion-dollar super clusters of computer servers packed with unprecedented numbers of Nvidia chips.

The Rise of Supercomputers

One notable example is Elon Musk's xAI, which rapidly built a supercomputer named Colossus in Memphis containing a staggering 100,000 of Nvidia's Hopper AI chips. Similarly, Meta CEO Mark Zuckerberg said last month that his company is already training its most sophisticated AI models on a chip cluster he described as "bigger than anything I've seen reported for what others are doing." Just a year ago, clusters of tens of thousands of chips were considered extraordinarily large; for context, OpenAI used around 10,000 Nvidia chips to train the initial version of ChatGPT released in late 2022, according to analysts at UBS.

The Stakes of Scaling Up

The race to assemble ever-larger super clusters could help Nvidia sustain its remarkable growth trajectory. The company's quarterly revenue has surged from approximately $7 billion two years ago to over $35 billion today, making it the world's most valuable publicly traded company, with a market capitalization exceeding $3.5 trillion. By concentrating many chips in one location and linking them with ultra-fast networking, companies can train larger AI models more quickly. Questions remain, however, about whether simply increasing the size of these super clusters will consistently produce smarter chatbots and more realistic image-generation tools.
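
For a sense of why the networking matters as much as the chip count, here is a minimal back-of-envelope sketch of how a cluster's effective throughput might scale if communication overhead grows with size. Every constant in it is a hypothetical assumption for illustration, not a measurement of any real system.

```python
# Back-of-envelope: effective training throughput of a GPU cluster when
# each step pays a communication penalty that grows with cluster size.
# All constants here are hypothetical illustrations, not measured values.

def cluster_throughput_tflops(num_gpus: int,
                              per_gpu_tflops: float = 1_000.0,
                              comm_cost_per_gpu: float = 1e-6) -> float:
    """Toy model: parallel efficiency = 1 / (1 + k * N).

    Real efficiency depends on network topology, parallelism strategy,
    and bandwidth; this captures only the qualitative trend that bigger
    clusters lose a growing share of time to communication.
    """
    efficiency = 1.0 / (1.0 + comm_cost_per_gpu * num_gpus)
    return num_gpus * per_gpu_tflops * efficiency

for n in (10_000, 100_000, 200_000):
    exaflops = cluster_throughput_tflops(n) / 1e6  # 1 exaFLOP/s = 1e6 TFLOP/s
    print(f"{n:>7,} GPUs -> ~{exaflops:.1f} effective exaFLOP/s")
```

Under these toy assumptions, doubling the cluster never quite doubles effective compute, which is one way to frame the open question of whether ever-bigger clusters keep paying off.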

Investor Sentiment and Future Prospects

Nvidia's continued success depends substantially on how these expansive clusters perform. The trend not only promises a surge in demand for Nvidia's chips but also fuels its networking-equipment business, which is maturing into a multibillion-dollar annual revenue stream. On a recent analyst call, Nvidia CEO Jensen Huang expressed optimism that larger-scale computing infrastructure will continue to improve AI foundation models. He noted that while the largest clusters today use around 100,000 of Nvidia's current-generation chips, clusters built on its next-generation Blackwell chips will start at a similar scale.

The Competitive Landscape

Competition among tech giants such as xAI, Meta, OpenAI, Microsoft, and Google is heating up, with each striving to build substantial new computing facilities to advance its AI capabilities. Huang recently marveled at the speed with which Musk's Colossus cluster was erected and predicted even larger clusters on the horizon, suggesting that millions of GPUs will eventually be needed and asking how data centers should be architected to accommodate demand at that scale.

The Technical Challenges Ahead

Musk announced on the social-media platform X last month that his Colossus super cluster would soon grow from 100,000 to 200,000 chips in a single facility, with a potential 300,000-chip cluster planned for the following summer. Building these massive super clusters nonetheless carries significant financial risk. The Blackwell chips, set to ship in the coming months, are estimated to cost around $30,000 each, putting the chips alone in a 100,000-chip cluster at roughly $3 billion, before auxiliary infrastructure costs.
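
The chip-cost arithmetic is easy to reproduce. The sketch below uses the roughly $30,000-per-chip estimate cited above; the infrastructure multiplier is an assumption added purely for illustration, since the auxiliary costs are not priced here.

```python
# Rough cluster-cost arithmetic using the figures cited above.
# Only the per-chip price and chip count come from the article;
# the infrastructure multiplier is a hypothetical assumption.

CHIP_PRICE_USD = 30_000      # estimated Blackwell price (article figure)
NUM_CHIPS = 100_000          # cluster size discussed above

chip_cost = NUM_CHIPS * CHIP_PRICE_USD
print(f"Chips alone: ${chip_cost / 1e9:.1f}B")   # -> Chips alone: $3.0B

# Networking, power, cooling, and buildings are not priced in the
# article, so this multiplier is purely illustrative.
ASSUMED_INFRA_MULTIPLIER = 1.5
print(f"With assumed overhead: ${chip_cost * ASSUMED_INFRA_MULTIPLIER / 1e9:.1f}B")
```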

Issues of Reliability and Management

Building out these super clusters involves extensive hurdles. In July, for instance, Meta researchers reported that a cluster of more than 16,000 Nvidia GPUs suffered frequent unexpected failures while training an advanced Llama model. The close proximity of so many power-hungry chips also demands robust cooling, pushing operators toward liquid-cooling systems that pipe coolant directly to the chips. And the sheer scale of these clusters requires closer management oversight, because component failures become a constant fact of operation.
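
A toy reliability model shows why failures loom so large at this scale: if each GPU fails independently at some small constant rate, expected failures per day grow linearly with cluster size, and failure-free days quickly become rare. The per-GPU rate below is an assumption for illustration, not a figure from Meta's report.

```python
import math

# Toy reliability model: failures in a large cluster, assuming each GPU
# fails independently at a constant rate. The per-GPU rate below is an
# assumed illustration, not a number from Meta's report.

FAILURES_PER_GPU_PER_DAY = 1e-4   # assumption: ~one failure per GPU per ~27 years

def expected_daily_failures(num_gpus: int,
                            rate: float = FAILURES_PER_GPU_PER_DAY) -> float:
    """Expected failures per day grows linearly with cluster size."""
    return num_gpus * rate

def prob_failure_free_day(num_gpus: int,
                          rate: float = FAILURES_PER_GPU_PER_DAY) -> float:
    """P(zero failures in a day) under a Poisson model with mean N * rate."""
    return math.exp(-num_gpus * rate)

for n in (16_000, 100_000, 200_000):
    print(f"{n:>7,} GPUs: ~{expected_daily_failures(n):.1f} failures/day, "
          f"P(no failures today) = {prob_failure_free_day(n):.2%}")
```

Even a per-GPU rate that sounds negligible implies daily interruptions at 100,000-chip scale, which is why checkpointing and failure management dominate operations at this size.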

Conclusion: Is Bigger Always Better?

Industry insiders caution that while the ambition to build ever-larger clusters is real, it remains uncertain whether they will yield AI models that justify the steep investment. Mark Adams, CEO of Penguin Solutions, pointed to the complexity of managing such extensive chip clusters, noting that a multitude of operational failures can significantly reduce utilization rates. As companies race toward AI supremacy, the contest to amass the most Nvidia chips in one place is both a high-stakes gamble and a testament to the technological ambition shaping our future.