Making sense of ML performance and benchmark data is an ongoing challenge. In light of last week’s release of the most recent MLPerf (v1.1) inference results, now is perhaps a good time to review how valuable (or not) such ML benchmarks are and the challenges they face. Two researchers from Purdue University recently tackled this issue in a fascinating blog post on ACM SIGARCH, “An Academic’s Attempt to Clear the Fog of the Machine Learning Accelerator War.”
Tim Rogers and Mahmoud Khairy wrote, “At its core, all engineering is science optimized (or perverted) by economics. As academics in computer science and engineering, we have a symbiotic relationship with industry. Still, it is often necessary for us to peel back the marketing noise and understand precisely what companies are building, why their offering is better than their competitors, and how we can innovate beyond what industrial state-of-the-art provides.
“The more immediately relevant a problem is, the more funding and innovation the solutions will receive, at the cost of increased noise for both customers deciding where to place investments and researchers attempting to innovate without bias. In 2021, the raging machine learning accelerator war is now a well-established battleground of processors, ideas, and noise.”
With that, the researchers plunge into describing the two dominant approaches to solving ML problems (data parallelism and model parallelism) and how they map to different chip/system architectures and to different problem types and sizes.
Rogers and Khairy credit MLPerf for providing “a SPEC-like benchmark suite that we hope will eventually provide clarity on the best-performing machine learning systems,” but then acknowledge that many companies don’t participate for a variety of economic and other reasons. The larger question they pose is, “In the coming age of customized accelerators, do standardized benchmark suites make sense at all anymore? Broadly, this is likely to be a challenge, but in the context of GEMM-heavy machine learning workloads, we think standardization is both possible and necessary.”
They prepared a chart (it appears in the original post) comparing several accelerators, including Cerebras and Graphcore, which have not competed in MLPerf. Khairy dug through a variety of public documentation to pull together this “apples-to-apples comparison of 16-bit mixed-precision training.”
The meat of the post concerns data parallelism and model parallelism, and here Rogers and Khairy dig in and provide some comparison metrics. Here’s a summary description of the approaches, excerpted from the blog:
- Data Parallelism. “Generally both NVIDIA and Google rely on data parallelism to scale performance. In this context, data parallelism refers to replicating the network model on different accelerators and assigning a fraction of each training batch to each node (Figure 2). The relatively large DRAM capacity of the TPU and GPU makes network replication feasible, and the dedicated hardware for dense GEMM operations make large batch training efficient. Both vendors have also invested significant effort in making the reduction operation across nodes as efficient as possible. However, the key to exploiting data parallelism is achieving efficient training with a large batch size, which requires a non-trivial amount of hyper-parameter tuning.”
- Model Parallelism. “Cerebras and Graphcore take a different approach, focusing heavily on model parallelism. In model parallelism, each layer of a single model is mapped to hardware, and data flows between each layer, ideally to processing units on the same chip. To exploit this philosophy, both Graphcore and Cerebras devote a large portion of their silicon area to SRAM cells such that large models are placed entirely in SRAM. Cerebras achieves this with its wafer-scale chip, while Graphcore’s IPU servers consist of smaller chips linked via a ring interconnect.”
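To make the two definitions above concrete, here is a minimal pure-Python sketch. It is not taken from the Rogers and Khairy post: the toy model (`y = w * x` with a mean-squared-error loss), the function names, and the three-layer pipeline are all assumptions chosen for illustration, and the "workers" and "devices" are just loop iterations rather than real accelerators.

```python
# Illustrative sketch only: the toy model and every name here are assumptions.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the single weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Data parallelism: every worker holds a full copy of w and
    processes an equal slice of the training batch."""
    assert len(xs) % num_workers == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Stand-in for the reduction across nodes: average the local gradients.
    return sum(local_grads) / num_workers


def model_parallel_forward(layers, x):
    """Model parallelism: each entry in `layers` stands for a layer mapped to
    its own piece of hardware; activations flow from one layer to the next."""
    activation = x
    for layer in layers:
        activation = layer(activation)
    return activation


# Data parallelism: the averaged shard gradients match the full-batch gradient.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
full_batch = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, num_workers=2)

# Model parallelism: three hypothetical layers, one per "device".
layers = [lambda a: 2.0 * a, lambda a: a + 1.0, lambda a: a * a]
output = model_parallel_forward(layers, 3.0)
```

With equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient, which is why the cross-node reduction is the critical step in the data-parallel case. In the model-parallel case, the cost to watch is the hand-off of activations between layers, which is cheap only if the layers sit close together.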
They emphasize that it’s not, strictly speaking, an either-or question. GPUs and TPUs can exploit model parallelism, and Cerebras/Graphcore servers can exploit data parallelism.
“However, each offering has a particular type of parallelism where its design has an advantage. It seems that one open question academics can help answer is what the right balance between model and data parallelism should be. Data parallelism is difficult to scale, and excessive hyperparameter tuning gives a clear advantage to more prominent players with the time and resources to tune them. In contrast, model parallelism requires less tuning, but as soon as the model no longer fits on-chip or on-node, most of the advantages are lost,” they wrote.
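The "no longer fits on-chip" condition in the quote above comes down to simple arithmetic, sketched below. The on-chip capacity and the parameter counts are hypothetical placeholders, not figures for any real product, and the check counts only 16-bit weights; training also needs room for activations, gradients, and optimizer state, so a real estimate would be larger.

```python
# Back-of-the-envelope sketch; all capacities and model sizes are hypothetical.

BYTES_PER_PARAM_FP16 = 2  # 16-bit weights


def weight_bytes(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Memory needed to hold the model weights alone."""
    return num_params * bytes_per_param


def weights_fit_on_chip(num_params, on_chip_bytes):
    """True if the weights alone fit in the given on-chip memory."""
    return weight_bytes(num_params) <= on_chip_bytes


def min_chips_for_weights(num_params, on_chip_bytes):
    """Smallest number of chips whose combined on-chip memory holds the weights."""
    total = weight_bytes(num_params)
    return -(-total // on_chip_bytes)  # ceiling division on integers


HYPOTHETICAL_ON_CHIP_BYTES = 1 * 1024**3   # assume 1 GiB of on-chip SRAM
small_model = 100_000_000                  # assumed 100M parameters
large_model = 10_000_000_000               # assumed 10B parameters

small_fits = weights_fit_on_chip(small_model, HYPOTHETICAL_ON_CHIP_BYTES)
large_fits = weights_fit_on_chip(large_model, HYPOTHETICAL_ON_CHIP_BYTES)
chips_needed = min_chips_for_weights(large_model, HYPOTHETICAL_ON_CHIP_BYTES)
```

Under these assumed numbers, the smaller model fits on one chip while the larger one must be spread across many, at which point activations have to cross chip boundaries and the on-chip advantage the researchers describe starts to erode.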
Their blog is a quick read and worth the effort (apologies for excerpting so much material). They emphasize that their intent is not to predict the war’s winner but rather to predict which weapons will turn the tide: “Academically, the data versus model parallelism tradeoff is interesting. If models continue to grow, exploiting model parallelism on the same chip, wafer-scale or not, will be infeasible. Conversely, increasing the batch size to exploit data parallelism is difficult, and larger models won’t make this easier. It seems that a hybrid approach will be the solution, and hardware that can effectively adapt to different model/data parallelism levels will likely have the most impact.”
Link to ACM SIGARCH blog: https://www.sigarch.org/an-academics-attempt-to-clear-the-fog-of-the-machine-learning-accelerator-war/