What’s the best path forward for large-scale chip/system integration? Good question.
Cerebras has set a high bar with its Wafer Scale Engine 2 (WSE-2), which has 2.6 trillion transistors and 850,000 cores and was fabricated using TSMC’s 7nm process on a roughly 8” x 8” silicon footprint. A different approach is to use chiplet technology, in which various devices are mounted on a single silicon wafer using passive silicon-interconnect technology. This approach is more flexible, can scale, and offers significant cost advantages, contends a new paper by researchers from UCLA and the University of Illinois at Urbana-Champaign.
The team of researchers[i] has designed and is now prototyping a “2048-chiplet, 14336-core waferscale processor” and their paper summarizing the work provides a good look at the benefits and challenges of the chiplet approach. The researchers will present the work at the Design Automation Conference (DAC 21) in December.
“To the best of our knowledge, this is the largest chiplet assembly based system ever attempted. In terms of active area, our prototype system is about 10x larger than a single chiplet-based system from Nvidia/AMD etc., and about 100x larger than the 64-chiplet Simba (research) system from Nvidia,” write the researchers.
The underlying premise is familiar. The “proliferation of highly parallel workloads such as graph processing, data analytics, and machine learning is driving the demand for massively parallel high-performance systems with a large number of processing cores, extensive memory capacity, and high memory bandwidth,” they state.
So far, heterogeneous architectures, predominantly based on “discrete packaged processors connected using conventional off-package communication links,” have been the dominant solution for dealing with the new workloads. There’s also a crop of new chips and systems aimed at these workloads. Cerebras’ WSE-2 is one example.
The researchers argue that monolithic waferscale “chips cannot integrate components from heterogeneous technologies such as DRAM or other dense memory technologies. Moreover, in order to obtain good yields, redundant cores and network links need to be reserved on the waferscale chip.”
A chiplet strategy, they say, should be able to overcome some of these limits:
“A competing approach to building waferscale systems is to integrate pre-tested known-good chiplets (in this work, we call un-packaged bare-dies/dielets as chiplets) on a waferscale interconnect substrate. Silicon interconnect Fabric (Si-IF) is a candidate technology which allows us to tightly integrate many chiplets on a high-density interconnect wafer. Si-IF technology provides fine-pitch copper pillar based (10μm pitch) I/Os which are at least 16x denser than conventional μ-bumps used in an interposer based system, as well as ∼100μm inter-chiplet spacing. Therefore, it provides global on-chip wiring-like characteristics for inter-chiplet interconnects. Moreover, in a chiplet-based waferscale system, the chiplets can be manufactured in heterogeneous technologies and can potentially provide better cost-performance trade-offs.”
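To put the density claim in perspective, here is a rough back-of-the-envelope sketch. It assumes, and this is an assumption rather than something the paper excerpt states, that I/Os sit on a regular square grid and that “16x denser” refers to I/Os per unit area. Under those assumptions a 10μm pitch is 16x denser than a 40μm pitch, which would be the implied μ-bump pitch at the “at least 16x” boundary. The function names are illustrative only.

```python
import math

def io_density_per_mm2(pitch_um: float) -> float:
    """I/Os per square millimetre, assuming a regular square grid (assumption)."""
    ios_per_mm = 1000.0 / pitch_um  # 1 mm = 1000 um
    return ios_per_mm ** 2

def implied_pitch_um(reference_pitch_um: float, density_ratio: float) -> float:
    """Pitch that is `density_ratio` times less dense (by area) than the reference."""
    return reference_pitch_um * math.sqrt(density_ratio)

si_if_pitch_um = 10.0   # copper-pillar pitch quoted in the article
density_ratio = 16.0    # "at least 16x denser" quoted in the article

print(io_density_per_mm2(si_if_pitch_um))               # 10000.0 I/Os per mm^2
print(implied_pitch_um(si_if_pitch_um, density_ratio))  # 40.0 um
```

Because area density scales with the inverse square of the pitch, a 4x finer pitch is enough to account for the 16x figure.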
The figure shown below provides a good overview of the design.
As you would expect, the chiplet approach brings its own set of design challenges, which the team enumerated:
- “How should we deliver power to all the flip-chip bonded chiplets across the wafer?
- “How can we reliably distribute clock across such a large area?
- “How can we design area-efficient I/Os when a large number of fine-pitch copper pillar-based I/Os need to be supported per chiplet, and how do we achieve very high overall chiplet assembly and bonding yield?
- “What is the inter-chip network architecture and how do we achieve resiliency if a few chiplets fail?
- “What is the testing strategy when I/O pads have small dimensions and how do we ensure scalability of the testing schemes?
- “How can we design the chiplets and the substrate with the uncertainty and constraints of the manufacturing process?”
The paper walks through solution approaches and specific considerations for the overall architecture, compute chiplet, memory chiplet, and waferscale substrate selected. Also examined in detail are networking, power distribution, and testing infrastructure.
The team validated the system design and architecture by emulating a reduced-size multi-tile system on an FPGA platform. “We were successfully able to run various workloads including graph applications such as breadth-first search (BFS), single-source shortest path (SSSP), etc. on this system,” according to the paper.
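For readers unfamiliar with the workload class, the following is a minimal, single-threaded sketch of breadth-first search. It is not the researchers’ implementation, which targets many cores across tiles; the adjacency-list representation, the function name `bfs_levels`, and the toy graph are all assumptions made for illustration.

```python
from collections import deque

def bfs_levels(adjacency, source):
    """Return the hop distance from `source` to every reachable vertex."""
    levels = {source: 0}
    frontier = deque([source])
    while frontier:
        node = frontier.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in levels:  # first visit gives the shortest hop count
                levels[neighbor] = levels[node] + 1
                frontier.append(neighbor)
    return levels

# Hypothetical toy graph: vertex 5 has no incoming edge from the source's component.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: [0]}
levels = bfs_levels(toy_graph, 0)
print(levels)  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

Graph traversals like this have irregular memory access patterns, which is one reason they are often used to exercise systems that promise high memory bandwidth and many cores.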
It will be interesting to see how the prototype behaves.
Saptadeep Pal, one of the paper’s authors and a Ph.D. student at UCLA, told HPCwire, “A smaller silicon prototype is now up and running programs. The wafer scale prototype is currently being built. We are taking it one step at a time. The tapeout and system is a ‘first ever’ in many aspects and being at a university, the time and dollar cost of respins is very high. The full waferscale system will probably take a few more months.”
[i] Saptadeep Pal∗, Jingyang Liu†, Irina Alam∗, Nicholas Cebry†, Haris Suhail∗, Shi Bu∗, Subramanian S. Iyer∗, Sudhakar Pamarti∗, Rakesh Kumar†, and Puneet Gupta∗
∗Department of Electrical and Computer Engineering, University of California, Los Angeles
†Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign