As the largest-ever radio telescope, the Square Kilometre Array (SKA) will be a behemoth. As the name implies, the instruments of the massive radio telescope will span well over one square kilometer, using hundreds of dishes and hundreds of thousands of low-frequency aperture array telescopes spread across remote lands in Australia and South Africa that have as little human radio interference as possible. Costing billions of dollars, construction for the SKA was finally approved just six days ago. It won’t be finished until late in the 2020s, but researchers around the world are already preparing for what they anticipate will be an unprecedented deluge of radio astronomy data.
One such team hails from the Pawsey Supercomputing Centre in Perth, Western Australia, located a few hundred kilometers south of one of the SKA sites. In fact, Pawsey itself was “launched in 2014 with the goal of playing a critical role in the Square Kilometre Array project,” explained Ugo Varetto, Pawsey’s CTO, in a presentation at the virtual ISC 2021 conference held last week.
“We recently received 70 million dollars in funding to invest in replacing our computing, network and storage infrastructure,” Varetto said. “With the recently received funding, we acquired a brand new HPE Cray supercomputer named after a small marsupial endemic to areas close to Perth” – the computer is named Setonix – “[and] in a separate procurement, we acquired 60 petabytes of object storage which will soon be integrated with the HPE supercomputer in the future.”
This equipment is powering, in part, precursor projects to prepare for Pawsey’s work on the SKA in the coming years: data ingestion, data visualization, data lifecycle management, data sharing – in short, data work. This is because the data from radio telescopes – and particularly the anticipated data load from the SKA – place a serious burden on the computing centers that handle them.
“Although the [computational] focus [at Pawsey] has been on radio astronomy, the center today allocates about 80 percent of the computational resources to other domains,” Varetto said. “But 80 percent of the storage is currently filled with radio astronomy data.”
In the absence of the SKA, Pawsey works with the Murchison Widefield Array (MWA), also located north of Perth. In the middle of the MWA, there’s a datacenter inside a Faraday cage, which aggregates the signals from all the spider-like antennae that cover the MWA using a mix of FPGAs and GPUs (“one of the biggest FPGA deployments in the world,” Varetto said) before the aggregated signals are sent to Pawsey in Perth.
“Now,” Varetto said, “let’s finally talk about pipelines, which is the actual subject of the presentation.”
He means, of course, data pipelines. For the last seven years, Varetto explained, the general radio astronomy pipeline has been to receive, amplify, digitize and combine the signals, then move them into a hierarchical storage management system equipped with tape libraries, sans post-processing and calibration. “Calibrations might happen in different places and at different stages,” Varetto said, “but final calibration steps are normally required to, for example, calibrate the position of a source with respect to another source.”
“This model works quite well when the bandwidth is low and the observations are not too big – you are talking about gigabytes to a few terabytes range,” Varetto continued. “But the issue is that when bandwidth and data size [get larger and move toward petabytes of data], more processing needs to be moved into the pipeline because it cannot be easily performed by individual researchers. … All the processing pipelines including that analysis will need to be run on supercomputers.”
Ergo, he argued, functions like calibration and image reconstruction will need to be performed before the data is stored. Otherwise, the scale of the data – a step change from 4K-by-4K cubes to 50K-by-50K cubes – will require an HPC system for analysis.
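To put that step change in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that "4K" and "50K" mean 4,000 and 50,000 pixels per side and that each pixel is a 4-byte float; neither figure comes from the presentation, and the function names are purely illustrative.

```python
def plane_bytes(side_pixels: int, bytes_per_pixel: int = 4) -> int:
    """Size in bytes of one square image plane (a single frequency channel).

    bytes_per_pixel=4 assumes 32-bit floats, which is an assumption made
    for illustration, not a figure from the talk.
    """
    return side_pixels * side_pixels * bytes_per_pixel


def cube_bytes(side_pixels: int, n_channels: int, bytes_per_pixel: int = 4) -> int:
    """Size in bytes of a full cube: one plane per frequency channel."""
    return plane_bytes(side_pixels, bytes_per_pixel) * n_channels


# Assumed readings of "4K" and "50K" (4,000 and 50,000 pixels per side).
small_plane = plane_bytes(4_000)
large_plane = plane_bytes(50_000)

# The number of channels cancels out, so the ratio holds for whole cubes too.
growth = large_plane / small_plane

print(f"4K plane:  {small_plane / 1e6:.0f} MB")
print(f"50K plane: {large_plane / 1e9:.0f} GB")
print(f"growth:    {growth:.2f}x")
```

Under those assumptions, a single plane grows from tens of megabytes to roughly 10 gigabytes, a factor of about 156 before the number of frequency channels is even considered.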
To that end, he cited the Australian Square Kilometre Array Pathfinder (ASKAP) project, which serves as an example of future processing pipelines. ASKAP has integrated pre- and post-processing and currently, Varetto explained, requires a filesystem for communication between pipeline stages – though he said that in-memory processing would be required to address datasets and bandwidth on the scale of the eventual SKA.
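The difference between the two handoff styles can be illustrated with a toy Python sketch. This is not ASKAP's software: the stage names, the quantization step and the gain value are all hypothetical, chosen only to contrast writing an intermediate product to disk with streaming it between stages in memory.

```python
import json
import os
import tempfile
from typing import Iterable, Iterator, List


def digitize(samples: Iterable[float], step: float = 0.5) -> Iterator[float]:
    """Hypothetical stage: quantize each sample to the nearest multiple of step."""
    for s in samples:
        yield round(s / step) * step


def calibrate(samples: Iterable[float], gain: float = 2.0) -> Iterator[float]:
    """Hypothetical stage: apply a constant gain correction."""
    for s in samples:
        yield s * gain


def run_via_files(raw: List[float]) -> List[float]:
    """Stage 1 writes its full output to disk; stage 2 reads it back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "digitized.json")
        with open(path, "w") as f:
            json.dump(list(digitize(raw)), f)
        with open(path) as f:
            staged = json.load(f)
    return list(calibrate(staged))


def run_in_memory(raw: List[float]) -> List[float]:
    """Stages are chained as generators; no intermediate file is written."""
    return list(calibrate(digitize(raw)))


raw_samples = [0.3, 1.2]
file_result = run_via_files(raw_samples)
memory_result = run_in_memory(raw_samples)
print(file_result, memory_result)
```

Both paths produce the same result, but the file-based version has to materialize and re-read the whole intermediate product, which is the cost that grows with the dataset.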
“But also,” he continued, showing a long list of products, “[these analyses] require the support of very diverse software environments, [and it is] hard to impossible to support all of them at once on a single HPC system.”
So at Pawsey, the team is building something. “We’re building an environment that includes a software-configurable layer – namely Shasta, from HPE Cray – on top of a powerful, low-latency, high-bandwidth hardware – again from HPE Cray – and then a number of containers and orchestrators,” Varetto said. With this system, he explained, they can make development, deployment and maintainability of these diverse packages portable across environments.
Varetto further explained that in dealing with real-time data processing pipelines, the time taken to translate network packets across systems needs to be minimized. “In this case, it’s addressed by HPE’s new Slingshot interconnect, based on the Ethernet protocol,” he said.
Of course, there’s still some time before any of this technology is strictly necessary for the SKA.
“We won’t … be using the current systems at Pawsey to support the SKA project,” Varetto said. “But we need to learn how to dynamically run workloads at unprecedented [scales] in the meantime.”