With RDMA over Converged Ethernet, or RoCE, ethernet switching is ready to replace InfiniBand as an interconnect for GPUs, says ethernet switch chip vendor Broadcom.
For some time now, specialists in the area of computer networking have been talking about a second network. The usual network is the one that connects client computers to servers, the LAN. The rise of artificial intelligence has created a network "behind" that network, a "scale-out" network to run AI tasks such as deep learning programs that must be trained on thousands of GPUs.
That has led to what switch silicon vendor Broadcom describes as a critical impasse. Nvidia, the dominant vendor of the GPU chips running deep learning, is also becoming the dominant vendor of networking technology to interconnect the chips, using the InfiniBand technology that it added when it acquired Mellanox in 2020.
The danger, some suggest, is that everything is tied up with one company, with no diversification and no way to build a data center where many chips compete.
"What Nvidia is doing, is saying, I can sell a GPU for a couple thousand dollars, or I can sell an equivalent to an integrated system for half a million to a million-plus dollars," said Ram Velaga, the senior vice president and general manager of the Core Switching Group at networking chip giant Broadcom, in an interview with ZDNet.
"This is not going well at all with the cloud providers," Velaga told ZDNet, referring to Amazon, Alphabet's Google, Meta, and others. That is because those cloud giants' economics are based on cutting costs as they scale computing resources, which dictates avoiding single-sourcing.
"And so now there's this tension in this industry," he said.
To address that tension, Broadcom says the solution is to follow the open networking path of ethernet technology, away from the proprietary path of InfiniBand.
Broadcom on Tuesday unveiled Tomahawk 5, the company's latest switch chip, capable of interconnecting a total of 51.2 terabits per second of bandwidth between endpoints.
"There's an engagement with us, saying, Hey, look, if the ethernet ecosystem can help address all the benefits that InfiniBand is able to bring to a GPU interconnect, and bring it onto a mainstream technology like ethernet, so it can be pervasively available, and create a very large networking fabric, it's going to help people win on the merits of the GPU, rather than the merits of a proprietary network," said Velaga.
The Tomahawk 5, available now, follows by two years Broadcom's prior part, the Tomahawk 4, which was a 25.6-terabit-per-second chip.
The Tomahawk 5 part aims to level the playing field by adding capabilities that had been the preserve of InfiniBand. The key difference is latency, the average time to send the first bit of data from point A to point B. Low latency has been InfiniBand's edge, and it becomes especially crucial when a GPU goes out to memory and back again, either to fetch input data or to fetch parameter data for large neural networks in AI.
A new technology called RDMA over Converged Ethernet, or RoCE, closes the gap in latency between InfiniBand and ethernet. With RoCE, an open standard wins out over the tight coupling of Nvidia GPUs and InfiniBand.
"Once you get RoCE, there's no longer that InfiniBand advantage," said Velaga. "The performance of ethernet actually matches that of InfiniBand."
"Our thesis is if we can out-execute InfiniBand, chip-to-chip, and you have an entire ecosystem that's actually looking for ethernet to be successful, you have a recipe to displace InfiniBand with ethernet and allow a broad ecosystem of GPUs to be successful," said Velaga.
The cloud computing giants such as Amazon "are insisting that the only way the GPU can be sold into them is with a standard NIC interface that can transmit over an ethernet," says Ram Velaga, general manager of Broadcom's Core Switching Group.
The reference to a broad ecosystem of GPUs is actually an allusion to the many competing silicon providers in the AI market who are offering novel chip architectures.
They include a raft of well-funded startups such as Cerebras Systems, Graphcore, and SambaNova, but they also include the cloud vendors' own silicon, such as Google's own Tensor Processing Unit, or TPU, and Amazon's Trainium chip. All those efforts might conceivably have more of an opportunity if compute resources were not dependent on a single network sold by Nvidia.
"The big cloud guys today are saying, We want to build our own GPUs, but we don't have an InfiniBand fabric," observed Velaga. "If you guys can give us an ethernet-equivalent fabric, we can do the rest of this stuff on our own."
Broadcom is betting that as the latency issue goes away, InfiniBand's weaknesses will become apparent, such as the number of GPUs that the technology can support. "InfiniBand was always a system that had a certain scale limit, maybe a thousand GPUs, because it didn't really have a distributed architecture," said Velaga.
In addition, ethernet switches can serve not only GPUs but also Intel and AMD CPUs, so collapsing the networking technology into one approach has certain economic benefits, suggested Velaga.
"I expect the fastest adoption of this market will come from GPU interconnect, and over a period of time, I probably would expect the balance will be fifty-fifty," said Velaga, "because you will have the same technology that can be used for the CPU interconnect and the GPU interconnect, and the fact that there's far more CPUs sold than GPUs, you will have a normalization of the volume." The GPUs will consume the majority of bandwidth, while CPUs may consume more ports on an ethernet switch.
In accord with that vision, Velaga points to capabilities aimed at AI processing, such as support for 256 ports of 200-gigabit-per-second ethernet, the most of any switch chip. Broadcom claims such a dense 200-gig port configuration is important to enable "flat, low latency AI/ML clusters."
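The 256-port figure follows from the chip's aggregate bandwidth. As a rough check, assuming the 51.2 terabits per second is simply divided evenly among ports of a single speed (a simplification, and the function name `port_count` is hypothetical rather than any Broadcom tool), the arithmetic can be sketched in Python:

```python
def port_count(switch_tbps: float, port_gbps: int) -> int:
    """Ports of one speed that fit in a switch's aggregate bandwidth.

    Assumes bandwidth is split evenly across identical ports, which is
    an illustrative simplification, not a vendor specification.
    """
    total_gbps = round(switch_tbps * 1000)  # terabits -> gigabits
    return total_gbps // port_gbps


# Aggregate bandwidth figures as quoted in the article.
for name, tbps in [("Tomahawk 4", 25.6), ("Tomahawk 5", 51.2)]:
    print(name, port_count(tbps, 200), "ports at 200 Gbps")
```

Under the same assumption, the prior 25.6-terabit-per-second Tomahawk 4 would work out to half as many 200-gigabit ports.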
Although Nvidia has a lot of power in the data center world, with sales of data center GPUs this year expected at $16 billion, the buyers, the cloud companies, also have a lot of power, and the advantage is on their side.
"The big cloud guys want this," said Velaga of the pivot to ethernet from InfiniBand. "When you have these massive clouds with a lot of buying power, they have shown they are capable of forcing a vendor to disaggregate, and that is the momentum that we are riding. All of these clouds really do not want this, and they are insisting that the only way the GPU can be sold into them is with a standard NIC interface that can transmit over an ethernet.
"That's already happening: you look at Amazon, that's how they're buying, look at Meta, Google, that's how they're buying."