When Dense Matrix Representations Beat Sparse

By James Reinders

September 9, 2019

In a world full of unintended consequences, a choice made to save memory on GPUs, accepting a known performance penalty on matrix operations, can end up costing both performance and memory.

As reported in a paper at ISC19, researchers[i] recently revisited the use of sparse matrix representations, a choice originally motivated by GPU memory constraints, and switched to dense matrices to take advantage of the larger memory capacities and scale-out capabilities of CPUs. The result was not only superior performance and scaling on CPUs but also, perhaps surprisingly, a smaller memory footprint: the memory saved by the sparse representation was outweighed by the extra memory consumed by the less efficient algorithm that came with it.

The researchers demonstrated the positive effects of their work in Horovod, an open source distributed deep learning framework for TensorFlow created by Uber Engineering. They also demonstrated that it scales out well, using supercomputers with large numbers of CPUs. Their work has been incorporated into Horovod 0.15.2 and later, allowing anyone to benefit from their approach. The researchers encourage others to think as they have, because they believe that this kind of rethinking applies to other frameworks, libraries, and models, such as BERT (Bidirectional Encoder Representations from Transformers).

The science – NMT

Neural machine translation (NMT), the use of neural networks to translate human language, is an area of active research with the goal of dramatically improving machine translation performance. Current state-of-the-art approaches have hit roadblocks due to excessive memory use (the researchers' scaling results on 8 nodes, discussed later in this article, show how poorly the original code scales even at such a small node count). The researchers reduced memory usage for transformer models by converting assumed-sparse tensors to dense tensors and then replacing the sparse gradient gather with a dense gradient reduction. NMT now reaches new heights by leaning on CPU capabilities, including superior memory capacity.

Being dense has its advantages

Dense matrix representations consume more memory than sparse representations for many real-world matrices. As a result, many deep learning and AI algorithms err on the side of using sparse matrix representations to cope with the small local memories available on GPUs. Unfortunately, while sparse representations often save memory, they come with a non-trivial performance penalty and added coding complexity for many matrix operations. This is markedly different from the practice of CPU programmers, who tend to err on the side of dense matrix representations because operations on them are straightforward to program and maintain.

Common wisdom questioned: GPUs like sparse, CPUs like dense

Originally, the researchers set out to undo the performance degradation associated with sparse matrix representations, which had been motivated by the GPU port of the code and were unnecessary for a CPU port. The researchers suspected that the matrices might not be as sparse as originally assumed (hence they emphasize “assumed sparse” in their discussions), and they knew that in such cases the memory savings are diminished and can easily be overwhelmed by the additional costs of matrix operations.
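To make that trade-off concrete, here is a minimal sketch in Python. It assumes a simple coordinate-style sparse layout in which every stored nonzero carries a row index and a column index alongside its value. This layout, the 4-byte sizes, the matrix shape, and the function names are illustrative assumptions, not the format used by the researchers' code. Under those assumptions, the sparse form stops saving memory once about a third of the entries are nonzero.

```python
def dense_bytes(rows, cols, value_bytes=4):
    """Memory for a dense matrix: one value per entry."""
    return rows * cols * value_bytes


def coo_bytes(rows, cols, density, value_bytes=4, index_bytes=4):
    """Memory for an assumed coordinate layout: value + (row, col) per nonzero."""
    nonzeros = round(rows * cols * density)
    return nonzeros * (value_bytes + 2 * index_bytes)


def break_even_density(value_bytes=4, index_bytes=4):
    """Density above which the assumed sparse layout is larger than dense."""
    return value_bytes / (value_bytes + 2 * index_bytes)


rows, cols = 1000, 512  # hypothetical matrix shape
for density in (0.01, 0.10, 0.50, 0.90):
    sparse = coo_bytes(rows, cols, density)
    dense = dense_bytes(rows, cols)
    print(f"density {density:.2f}: sparse/dense = {sparse / dense:.2f}")
print(f"break-even density: {break_even_density():.3f}")
```

A matrix that is only assumed to be sparse, but is in fact mostly full, lands well past that break-even point, so it pays the indexing overhead without getting the memory benefit.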

In the particular case they investigated, the distributed learning algorithm used an accumulation (gather) instead of a reduction operation because that is more practical with sparse matrix representations. However, this approach dramatically increases memory use, because it accumulates results instead of holding down the memory footprint of the results through reductions. In this case, the interplay of algorithm choice and memory layout, combined with the denseness of these assumed-sparse matrices, meant that moving to dense representations benefited both GPUs and CPUs in terms of memory footprint, while unleashing the full potential of CPU-based systems to scale out with a simpler algorithm (uncomplicated by the GPU-inspired use of sparse matrices).
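The difference between accumulating and reducing can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Horovod's implementation: the function names, the tensor sizes, and the choice of 180 touched rows out of 200 are all hypothetical. Each worker produces a gradient stored as (row index, row values) slices. A gather keeps every worker's slices, so the stored data grows with the number of workers. A dense reduction sums element-wise, so the result stays the size of one gradient.

```python
import random


def make_sparse_grad(rows, dim, touched_rows, rng):
    """One worker's gradient as (row_index, row_values) slices."""
    touched = rng.sample(range(rows), touched_rows)
    return [(r, [rng.random() for _ in range(dim)]) for r in touched]


def to_dense(slices, rows, dim):
    """Scatter-add slices into a full rows x dim gradient."""
    dense = [[0.0] * dim for _ in range(rows)]
    for r, values in slices:
        for j, v in enumerate(values):
            dense[r][j] += v
    return dense


def gather(per_worker_slices):
    """Accumulate by concatenation: every worker's slices are kept."""
    gathered = []
    for slices in per_worker_slices:
        gathered.extend(slices)
    return gathered


def reduce_dense(per_worker_dense):
    """Element-wise sum: the result is the size of a single gradient."""
    rows, dim = len(per_worker_dense[0]), len(per_worker_dense[0][0])
    total = [[0.0] * dim for _ in range(rows)]
    for dense in per_worker_dense:
        for i in range(rows):
            for j in range(dim):
                total[i][j] += dense[i][j]
    return total


def stored_numbers(workers, rows=200, dim=8, touched_rows=180, seed=0):
    """Count the numbers held after a gather and after a dense reduce."""
    rng = random.Random(seed)
    grads = [make_sparse_grad(rows, dim, touched_rows, rng) for _ in range(workers)]
    gathered = gather(grads)
    reduced = reduce_dense([to_dense(g, rows, dim) for g in grads])
    # Each gathered slice stores its row index plus `dim` values.
    gathered_count = sum(1 + len(values) for _, values in gathered)
    reduced_count = sum(len(row) for row in reduced)
    return gathered_count, reduced_count


for workers in (1, 2, 4, 8):
    g, r = stored_numbers(workers)
    print(f"{workers} workers: gather holds {g} numbers, dense reduce holds {r}")
```

With these made-up sizes, the gathered result is already slightly larger than the dense one on a single worker (because of the stored row indices) and grows eightfold on eight workers, while the reduced result does not grow at all.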

Unleashing CPU scaling

Once the researchers shifted to dense matrix representations, their new implementation opened the door to much improved scaling. Training that would take one month on a single node now takes slightly over 6 hours on 200 nodes (121 times faster). This result can significantly increase the productivity of NMT researchers by allowing the use of CPU-based HPC infrastructure. The researchers reported that their ability to maintain very high scaling efficiencies up to the 300-node level they tested suggests that continued scale-out is worthwhile beyond what they have tried thus far. That is certainly far better than the inability to scale effectively beyond 8 nodes when they started!

Even at only 8 nodes, the rapid decline in scaling of the original (sparse) approach dooms any high degree of scale-out, so runs at higher node counts would be a waste of money and compute resources. The new (dense) approach shows enough promise here that the researchers went on to demonstrate exceptional scaling results above 256 nodes.

Results — faster execution and smaller memory footprint

Their code using a dense representation achieved a more than 82x reduction (11446MB to 139MB) in the amount of memory required on a 64-node run. It also saw a more than 25x reduction in the time required for the accumulation operation (4321ms to 169ms).
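The two ratios follow directly from the reported figures. The short Python check below uses only the numbers quoted above; the variable names are just for illustration.

```python
# Figures reported for the 64-node run.
sparse_mb, dense_mb = 11446, 139   # memory, sparse gather vs. dense reduce
sparse_ms, dense_ms = 4321, 169    # accumulation time, sparse vs. dense

memory_ratio = sparse_mb / dense_mb
time_ratio = sparse_ms / dense_ms

print(f"memory reduction: {memory_ratio:.1f}x")  # 82.3x
print(f"time reduction:   {time_ratio:.1f}x")    # 25.6x
```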

Space and time for tensor accumulation (sparse gather vs. dense reduce)

Model training experiments were run on the Zenith cluster in the Dell EMC HPC & AI Innovation Lab, as well as the Stampede2 cluster at the Texas Advanced Computing Center (TACC) in Austin, Texas, both featuring Intel processors and Intel Omni-Path fabric. In both cases, the researchers used Python 2.7, with Intel’s MKL-optimized version of TensorFlow (1.12), and modifications to Horovod that are now available to everyone in versions 0.15.2 and later.

Each Zenith node consists of dual Intel Xeon Scalable Gold 6148/F processors, 192GB of memory, and an M.2 boot drive that houses the operating system and does not provide user-accessible local storage. Nodes are interconnected by a 100Gbps Intel Omni-Path fabric, and shared storage is provided by a combination of NFS (for HOME directories) and Lustre filesystems.

Work on Stampede2 used the Skylake (SKX) partition, which consists of 1,736 nodes. Each node is outfitted with dual Intel Xeon Scalable Platinum 8160 processors, 192GB of memory, and a 200GB internal SSD for the operating system and local /tmp. All nodes are interconnected with 100Gbps Intel Omni-Path fabric and connected to Lustre-based shared filesystems.

The researchers summarized their work in a paper at ISC19. The software changes that they discuss in their paper have been incorporated into Horovod 0.15.2 and later, giving other researchers the opportunity to apply their approach to any models that may benefit.

[i] Valeriu Codreanu and Damian Podareanu of SURFsara, Derya Cavdar, Can Karakus, and Victor Suthichai of Amazon, Alexander Sergeev of Uber, Vikram Saletore of Intel, and John A. Lockman III, Don D. Smith II, Quy Ta, Srinivas Varadharajan, Lucas A. Wilson, Rengan Xu, and Pei Yang of Dell EMC.

About the Author

James Reinders likes fast computers and the software tools to make them speedy. With over 30 years in High Performance Computing (HPC) and Parallel Computing, including 27 years at Intel Corporation (retired June 2016), he is also the author of nine books in the HPC field as well as numerous papers and blogs.
