Intel is now customizing its latest Xeon 6 server chips for use with Nvidia’s GPUs that dominate the AI landscape. The chipmaker’s new Xeon 6 chips, also called Granite Rapids, have been customized and validated specifically for server boxes with Nvidia’s latest and upcoming GPUs.
“Nvidia is the leader on the GPU side…so we’re partnering closely with them to make sure that people deploying MGX or HGX-based systems, we have a full suite of CPUs that have been qualified together with Nvidia for those systems,” said Ronak Singhal, senior fellow at Intel.
Amid financial struggles, Intel is repositioning its business around x86 CPUs. One way to sell more Granite Rapids chips is to ride the coattails of Nvidia’s red-hot GPUs.
“This is really just the beginning of some of the collaboration we’re doing with Nvidia over the course of the next year. You’ll see more from us as we talk about ways that we’ve optimized some of these SKUs very specifically for this use case,” Singhal said.
It’s a shocking reversal of roles. There was a time when Intel’s server chips ruled the roost and Nvidia’s GPU sales depended heavily on riding along with those CPUs. Now Intel is playing second fiddle to Nvidia.
Role of Xeon 6 with GPUs
Intel’s beefier Xeon 6 6900P chips, announced this week, have up to 128 cores, double the core count of its previous-generation Emerald Rapids and Sapphire Rapids chips.
The Granite Rapids chips are based on chiplets, which allows Intel to mix and match computing capabilities based on customer requirements. The CPU is important for preprocessing and “making sure that you’re not bottlenecking the GPU,” Singhal said.
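To make that division of labor concrete, here is a minimal sketch of CPU-side preprocessing feeding a GPU. It assumes a PyTorch/torchvision stack, which the article does not mention; the dataset and transforms are purely illustrative.

```python
# Illustrative only: host CPU cores decode and augment data so the GPU stays busy.
# Assumed stack: PyTorch + torchvision (not specified in the article).
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),   # CPU-side augmentation work
    transforms.ToTensor(),
])

dataset = datasets.FakeData(size=4096, transform=preprocess)  # stand-in dataset

# num_workers spreads decode/augment work across host CPU cores;
# pin_memory speeds host-to-GPU copies over PCIe.
loader = DataLoader(dataset, batch_size=256, num_workers=16,
                    pin_memory=True, persistent_workers=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)  # overlap copy with GPU compute
    # ... model forward/backward on the GPU would go here ...
    break
```

If the host CPU cannot keep this loop fed, the GPU idles between batches, which is the bottleneck Singhal describes.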
With its survival at stake, Intel has shown a lot more flexibility in customizing server chips. It recently announced it would customize Xeon 6 chips for Amazon Web Services and hinted at customizing chips for Google Cloud.
A majority of CPUs in data centers are Xeons, giving Intel an early advantage. Most enterprise workloads are also built on the x86 architecture.
Intel offers an alternative to Nvidia’s proprietary server system, which bundles Nvidia’s Grace CPUs, GPUs, and networking infrastructure.
Granite Rapids can be a host CPU for customers who want to build their own Nvidia infrastructure. Customers can configure their own multi-way boxes with Nvidia GPUs, memory, and I/O.
Granite Rapids – Not the Server Savior
Until earlier this year, Intel projected that Granite Rapids would give it momentum in servers, but that’s not the case.
The stronger server cycle didn’t materialize, and spending on AI build-outs is depressing the CPU market.
“Where we still haven’t completely gotten the business to a good place is on the data center side of CPU,” said Intel chief financial officer Dave Zinsner this month during an interview at Citi Global’s analyst day.
Intel’s position on CPUs is stronger than on GPUs – and the successor to Granite Rapids, called Diamond Rapids, could put Intel on top, Zinsner said.
“I think Granite [Rapids] is a meaningful step forward for us in terms of making us competitive. Diamond Rapids will definitely put us in a good place competitively. It’s just important we work our way through the roadmap to get us to where we want to be,” Zinsner said.
The Guts of Granite Rapids
The Granite Rapids chip is built from two main types of chiplets: compute tiles made on the Intel 3 process and I/O tiles made on the Intel 7 process.
The top-line 6900P with 128 cores has 12 memory channels. It supports DDR5 memory as well as a new module type called MR-DIMM (multiplexed-rank DIMM), which provides up to 2.3 times more memory bandwidth than the 5th Gen Xeon chips.
AI typically requires high memory capacity and bandwidth, which MR-DIMM addresses.
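A rough back-of-the-envelope calculation shows where a figure in that range can come from. The speeds and channel counts below are assumptions, not figures from the article: 12 channels of MR-DIMM at 8800 MT/s on the 6900P versus 8 channels of DDR5 at 5600 MT/s on a 5th Gen Xeon.

```python
# Back-of-the-envelope check on the "up to 2.3x" memory-bandwidth claim.
# Assumed figures (not from the article): 12 MR-DIMM channels at 8800 MT/s vs.
# 8 DDR5 channels at 5600 MT/s, with 8 bytes moved per transfer.
def peak_bw_gbs(channels: int, mts: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mts * bytes_per_transfer / 1000

xeon6_mrdimm = peak_bw_gbs(channels=12, mts=8800)   # ~844.8 GB/s
xeon5_ddr5   = peak_bw_gbs(channels=8,  mts=5600)   # ~358.4 GB/s

print(f"Xeon 6 + MR-DIMM:  {xeon6_mrdimm:.1f} GB/s")
print(f"5th Gen Xeon DDR5: {xeon5_ddr5:.1f} GB/s")
print(f"Ratio: {xeon6_mrdimm / xeon5_ddr5:.2f}x")    # ~2.4x, in line with Intel's claim
```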
Intel claims Granite Rapids offers two times more cores per socket, a 1.2 times performance improvement per core, and 1.6 times better performance per watt. The L3 cache is as large as 504MB.
Granite Rapids can run AI workloads on its own before offloading work to Nvidia GPUs. For example, its AMX (Advanced Matrix Extensions) units now support the FP16 data type.
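Software can check for that capability before deciding to keep an inference step on the CPU. The sketch below assumes a Linux host and uses the kernel’s /proc/cpuinfo feature flags; flag names such as `amx_fp16` depend on the kernel version.

```python
# Minimal sketch (Linux-only assumption): detect AMX tile and FP16 support
# by reading the CPU feature flags the kernel exposes in /proc/cpuinfo.
def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
has_amx_fp16 = {"amx_tile", "amx_fp16"}.issubset(flags)
print("AMX FP16 available:", has_amx_fp16)
```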
The chip has six UPI 2.0 links for socket-to-socket communication, each running at up to 24GT/s. The chip also supports the AVX-512 vector instruction set.
The Granite Rapids chips use the P-core, Intel’s performance core. That sets them apart from Sierra Forest, another Xeon 6 chip launched earlier this year that is built around the more efficient E-core, which consumes less power but isn’t as fast.
CXL 2.0 — More Memory, But Slow
Granite Rapids also supports CXL 2.0, which provides access to larger memory pools, be it DDR4, DDR5, or HBM.
But there are caveats – access to the memory pool is slower, which is a major disadvantage. It could be useless for AI workloads, which may stick to on-package HBM or local memory. Still, despite the higher latency, CXL provides access to a larger memory pool across a wider cluster, which isn’t always a disadvantage.
For example, data for less time-sensitive workloads could be offloaded to cheaper DDR4 attached over CXL. Intel calls the technology flat memory mode.
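Flat memory mode manages that placement in hardware. As a rough illustration of the same idea done in software today, the sketch below binds a lower-priority batch job’s allocations to a CXL-backed NUMA node with numactl. The node numbers and job name are hypothetical; on a real system, `numactl --hardware` shows which node the CXL memory appears as.

```python
# Illustrative sketch only: software-managed tiering with numactl, as a point of
# comparison to Intel's hardware-managed flat memory mode.
import subprocess

CXL_MEM_NODE = "2"   # hypothetical NUMA node exposed by a CXL memory expander
CPU_NODE = "0"       # hypothetical local CPU node

subprocess.run([
    "numactl",
    f"--cpunodebind={CPU_NODE}",    # run on local cores
    f"--membind={CXL_MEM_NODE}",    # but allocate from the slower CXL pool
    "python", "batch_job.py",       # hypothetical non-urgent workload
], check=True)
```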
“CXL gives me flexibility as to what memory type I might be using depending on the device that I’m using,” Singhal said.
The future of CXL 2.0 has been up in the air, lacking Nvidia’s support. But Singhal said he is now seeing large-scale deployments, which is a good sign for the technology’s adoption.
“Customers are able to get a lower spend on their memory side by taking advantage of our flat memory mode. We’re pretty excited about this,” Singhal said.
Granite Rapids Basics
The five new Xeon 6 6900P chips for two-socket systems have between 96 and 128 cores and draw up to 500 watts of power.
Base frequencies start at 2.0GHz, with turbo frequencies reaching up to 3.9GHz. Intel didn’t share pricing. Additional versions of the chip with fewer cores will come out in the first quarter of next year. A quick look at Granite Rapids HPC benchmarks is available in a recent HPCwire article: Granite Rapids HPC Benchmarks: I’m Thinking Intel is Back
Intel’s New AI Chip for GPU Poors
Intel is also offering its own Gaudi 3 AI accelerator, which was also announced this week. Intel is targeting Gaudi 3 at AI inferencing and isn’t going after the AI training market, which it has conceded to Nvidia, AMD, and Google with its TPUs.
Typically, an ASIC can take on only the workloads it was tuned for, unlike a GPU. But Gaudi 3 has default features and instructions that allow it to take on many AI workloads, though performance may not be optimal.
The AI chip has 64 Tensor processor cores and eight matrix multiplication engines. The chip was made using a 5-nanometer process. Gaudi 3 is behind GPUs on memory: it has 128GB of HBM2e memory, while Nvidia’s and AMD’s GPUs have graduated to HBM3E.
Gaudi 3 is available through IBM Cloud and will be used in servers from companies that include Dell and Supermicro.