Cerebras Systems has been making waves for a few years with its massive, dinner plate-sized Wafer Scale Engine (WSE) chips, which are aimed at helping organizations achieve their “most ambitious AI goals.” Now, Cerebras—in collaboration with French multinational TotalEnergies—has announced the development of a massively scalable stencil algorithm: a development made possible by the use of one of Cerebras’ CS-2 systems (pictured in the header).
“Stencil algorithms are at the core of many high-performance computing applications,” explained Matthias Cremon, a member of the technical staff at Cerebras Systems, in a blog post. “They are used to solve several partial differential equations, including fluid mechanics, weather forecast or seismic imaging.”
Stencil algorithms, Cremon explained, traditionally spend much of their time accessing input data but comparatively little time computing results from it. That makes them a poor fit for the hierarchical memory architectures of CPUs and GPUs, which excel at compute-bound problems; stencil algorithms are, instead, memory-bound. As a result, Cremon said, adding compute resources or raising clock speeds often does little to improve performance; memory-bound algorithms respond, instead, to improvements in the efficiency of data transfers.
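The memory-bound character Cremon describes comes down to arithmetic intensity: how many floating-point operations a kernel performs per byte it moves. A back-of-the-envelope estimate for a 25-point stencil over 32-bit floats makes the point (the operation and byte counts below are illustrative assumptions, not measurements from Minimod):

```python
# Rough arithmetic-intensity estimate for a 25-point stencil on float32 data.
# All counts are simplifying assumptions for illustration.

points = 25                      # values read per output point
flops = 2 * points               # one multiply + one add per stencil point
bytes_moved = (points + 1) * 4   # pessimistic: every read misses cache, plus one write

intensity = flops / bytes_moved  # FLOPs per byte
print(f"arithmetic intensity ~ {intensity:.2f} FLOPs/byte")
```

At well under one FLOP per byte, a kernel like this saturates memory bandwidth long before it saturates the arithmetic units, which is why faster data movement, rather than more compute, is what speeds it up.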
Staff from Cerebras and TotalEnergies developed a “novel way to implement a stencil algorithm on the Cerebras CS-2,” writing the algorithm in the Cerebras Software Language (CSL) and leveraging the “extremely large memory bandwidth” (20PB/s) of the WSE to maximize the efficiency of the data transfers in a test problem. The test problem in question: Minimod, a benchmark problem used by TotalEnergies to evaluate new hardware that is solved using a 25-point stencil. (TotalEnergies had previously evaluated the CS-2 using another stencil algorithm—in that case, a piece of seismic modeling code.)
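A 25-point stencil of the kind used in such benchmarks is commonly an eighth-order "star" in 3D: the center point plus four neighbors in each direction along each axis (1 + 3 × 2 × 4 = 25). The NumPy sketch below shows the shape of that computation; the coefficients and layout are placeholder assumptions, not Minimod's actual implementation:

```python
import numpy as np

HALO = 4  # eighth-order star stencil: 4 points on either side of the center

def interior(u, offset=0, axis=None):
    """Slice the interior of u, optionally shifted by `offset` along `axis`."""
    sl = [slice(HALO, -HALO)] * u.ndim
    if axis is not None:
        sl[axis] = slice(HALO + offset, u.shape[axis] - HALO + offset)
    return u[tuple(sl)]

def star_stencil_25(u, c0, c):
    """Apply a 25-point star stencil to a 3D array u.

    c0 is the center coefficient and c[0..3] the per-distance coefficients;
    both are placeholders here, not the benchmark's real values.
    """
    out = c0 * interior(u)
    for axis in range(3):                  # three axes...
        for r in range(1, HALO + 1):       # ...4 neighbors in each direction
            out = out + c[r - 1] * (interior(u, r, axis) + interior(u, -r, axis))
    return out
```

Every output point touches 25 inputs scattered across three axes, so the access pattern, not the arithmetic, dominates the cost on conventional memory hierarchies.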
The researchers also ran the benchmark problem on an Nvidia A100-based system (40GB variant), comparing the results accelerator-to-accelerator. The WSE-2 implementation held a time-to-solution of 0.075 to 0.076 seconds across all problem sizes, while the A100 implementation started at 0.79 seconds for the smaller problem sizes and climbed to 15.51 seconds as the problem size increased.
“For the largest size shown here, the WSE-2 outperforms the A100 by more than 220×,” Cremon said. “The weak scaling efficiency of the WSE-2 is virtually perfect, greater than 98 percent for all sizes. To seasoned HPC practitioners, these are both astonishing results.”
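Weak scaling holds the work per processing element constant as the problem grows, so efficiency is simply the ratio of the smallest-problem runtime to the largest. Plugging in the two WSE-2 times quoted above reproduces the figure Cremon cites:

```python
# Weak-scaling efficiency from the WSE-2 times reported in the article:
# time stays nearly flat as the problem (and the resources applied to it) grows.
t_smallest = 0.075   # seconds, smallest problem size
t_largest = 0.076    # seconds, largest problem size

efficiency = t_smallest / t_largest
print(f"weak-scaling efficiency ~ {efficiency:.1%}")  # ~ 98.7%
```

An efficiency this close to 100 percent means essentially no overhead is added as the computation spreads across more of the wafer.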
The researchers are now working on more complicated stencil- and machine learning-based applications for the WSE-2.
To learn more about this research, see Cremon's blog post and the accompanying paper on arXiv.