Despite its relatively vague name, DiRAC – which stands for “Distributed Research utilizing Advanced Computing” – serves predominantly specialized research communities, with emphases on fields like cosmology and nuclear physics. Now, in partnership with AMD, one of DiRAC’s installations is upgrading in order to yield a clearer view of the cosmos.
DiRAC’s facilities are spread across four campuses (the Universities of Cambridge, Durham, Edinburgh and Leicester), with each campus’s hardware serving specific purposes. At the DiRAC installation at Durham University, which mostly serves cosmology research, large-memory nodes are the name of the game. Durham’s systems, nicknamed COSMA (for “cosmology machine”), have been iterating since 2001, with current users stretched across four systems: COSMAs 5, 6, 7 and, now, 8.
“[COSMA] is what we call a capability system,” explained Alastair Basden, the technical manager for DiRAC’s memory-intensive services at Durham University. “If you’re doing large-scale cosmological simulations of the universe, you need a lot of RAM. They can have run times of months, and then, after they’ve produced their data, which will be snapshots of the universe at lots of different time steps and different redshifts, then years are spent in processing and analysis.”
“For our current system [COSMA 7], we have about 18 GB of RAM per core,” he continued. “When you compare that with typical systems of two to four gigabytes, that’s a significant uplift. We do large-scale cosmological simulations of the universe, right from the start of the Big Bang up until the present day.”
By way of example, Basden cited the European machines that DiRAC researchers sometimes use when there isn’t enough computational time available on in-house systems. “We generally find that they’re not so good, because they’ve got much lower RAM per core,” he said. “Even with a much larger allocation of CPUs on those machines, we tend to get better results on COSMA, because it’s more designed for these simulations.”
For COSMA 8, Basden turned to AMD’s Epyc CPUs, having encountered them previously in the context of a smaller installation. His team benchmarked the second-generation “Rome” CPUs at Dell’s datacenter in Austin using a piece of cosmology simulation software called SWIFT. “It ran as we would hope,” he said. “In terms of core for core, it was basically the same, and when you’ve got more cluster cores [thanks to the higher core density], it’s a no-brainer.”
COSMA 8’s initial installation mainly comprises 32 nodes, each with 1 TB of RAM and dual 64-core AMD Epyc 7H12 Rome CPUs. Those nodes are complemented by twin login nodes, one “fat” node with 4 TB of RAM, two AMD GPU nodes (each sporting three MI50 GPUs), one Nvidia GPU node with 10 V100 GPUs and Intel Xeon CPUs, and two console nodes. Components are housed in Dell’s Cloud Service C-series chassis in a 2U form factor with custom CoolIT water cooling.
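The specs above imply the memory-per-core figures that drive COSMA's design. A quick back-of-the-envelope sketch (the node configurations are taken from the article; the fat node's core count is an assumption, since the article only states its RAM):

```python
# RAM-per-core implied by the COSMA 8 node specs.
# 1 TB is treated as 1024 GB; each dual-7H12 node has 2 x 64 = 128 cores.
nodes = {
    # name: (total RAM in GB, cores per node)
    "standard (dual Epyc 7H12, 1 TB)": (1024, 128),
    "fat (4 TB)": (4096, 128),  # assumed same dual-7H12 core count
}

for name, (ram_gb, cores) in nodes.items():
    print(f"{name}: {ram_gb / cores:.0f} GB per core")
```

The standard nodes work out to 8 GB per core, lower than COSMA 7's roughly 18 GB per core but with far more cores per node, and the assumed fat-node configuration would yield 32 GB per core for the most memory-hungry jobs.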
“This [initial installation] was primarily for testing, for getting code up to scratch, and small-scale simulations,” Basden said. “We wanted a large number of cores per node, because then it meant we could cut down on the amount of internode communication, but because there are parts of the code that don’t parallelize 100 percent, we also wanted high clock rates so that the lower-threaded parts of the code stood up well, which meant the 7H12 would be the best option.”
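Basden's reasoning here is essentially Amdahl's law: the serial (poorly parallelized) fraction of a code caps the speedup extra cores can deliver, so fast clocks on the serial parts still matter. A minimal sketch, using hypothetical serial fractions rather than measured SWIFT figures:

```python
# Amdahl's law: with serial fraction s, the speedup on n cores is
#   S(n) = 1 / (s + (1 - s) / n)
# The serial fractions below are illustrative, not measured values.

def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for s in (0.01, 0.05, 0.10):
    print(f"serial fraction {s:.0%}: "
          f"{amdahl_speedup(s, 128):.1f}x speedup on 128 cores")
```

Even a 5 percent serial fraction limits a 128-core node to roughly a 17x speedup, which is why a CPU that combines high core counts with high clock rates, like the 7H12, suits codes that don't parallelize perfectly.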
COSMA 8, which entered prototype service last October, is being used to assess future design possibilities. Basden explained that the AMD GPU nodes, for instance, were installed “for running AI workloads like TensorFlow,” but DiRAC is also planning to port code to them specifically to examine whether such nodes would be a viable solution for exascale computing.
The full version of COSMA 8, which will consist of 360 compute nodes, is currently undergoing installation with an expected service date of October 2021.
“This [new system] will help us to understand the nature of the universe, dark matter, dark energy and how the universe was formed,” Basden concluded. “It’s really going to help us drill down to a fundamental understanding of the world that we live in.”