Shining a Light on SKA’s Massive Data Processing Requirements

By Tiffany Trader

June 4, 2015

One of the many highlights of the fourth annual Asia Student Supercomputer Challenge (ASC15) was the MIC optimization test, which this year required students to optimize a gridding algorithm used in the world’s largest international astronomy effort, the Square Kilometre Array (SKA) project.

Gridding is one of the most time-consuming steps in radio telescope data processing. To reconstruct a sky image from the data collected by the radio telescope, scientists need to take the irregularly sampled data and map it onto a standardized 2-D mesh. The process of adding sampled data from the telescopes to a grid is called gridding. After this step, the grid can be Fourier transformed to create a sky image.
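To make the idea concrete, here is a minimal Python sketch of gridding, not the ASC15 test code or any SKA implementation. The function name `grid_samples`, the sample layout (u, v, value) and the small 3x3 weighting kernel are all assumptions made for illustration; production gridders use much larger, data-dependent convolution kernels.

```python
import math


def grid_samples(samples, grid_size, kernel):
    """Accumulate irregularly sampled complex values onto a regular 2-D grid.

    samples: iterable of (u, v, value), with u and v given in grid units.
    kernel:  square list of lists of weights with an odd side length
             (a hypothetical stand-in for a real convolution kernel).
    """
    grid = [[0j] * grid_size for _ in range(grid_size)]
    half = len(kernel) // 2
    for u, v, value in samples:
        # Snap the irregular coordinate to the nearest grid cell.
        cu = math.floor(u + 0.5)
        cv = math.floor(v + 0.5)
        # Spread the sample over the kernel footprint around that cell.
        for dv in range(-half, half + 1):
            for du in range(-half, half + 1):
                x, y = cu + du, cv + dv
                if 0 <= x < grid_size and 0 <= y < grid_size:
                    grid[y][x] += kernel[dv + half][du + half] * value
    return grid


# Illustrative input: two samples that fall near the same grid cell.
kernel = [
    [0.0625, 0.125, 0.0625],
    [0.125, 0.25, 0.125],
    [0.0625, 0.125, 0.0625],
]
samples = [(2.2, 3.7, 1 + 0j), (2.4, 4.1, 2 + 0j)]
grid = grid_samples(samples, 8, kernel)
```

Note that every sample touches nine grid cells here, and each touch is a read-modify-write for a single multiply-add. The resulting grid is what would then be handed to a Fourier transform.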

To say that radio astronomy is pushing the limits of data processing is an understatement. Consider that SKA is expected to produce more than 12TB of data per second, and nearly 50 percent of this astronomy data needs to be processed through gridding. In a 2012 paper, Netherlands Institute for Radio Astronomy (ASTRON) researcher John W. Romein placed SKA phase one image processing requirements in the petaflops range; the full-scale project will cross into exaflops territory.

Unlike the other five ASC15 test applications (LINPACK, NAMD, WRF-CHEM, Palabos and the surprise app, HPCC), which run on Inspur-provided racks with a maximum power consumption limit of 3,000 watts, the gridding app is run on a separate platform provided by the committee, consisting of one login node and 16 computing nodes. The 16 nodes are outfitted with two CPUs (Intel Xeon E5-2670 v3, 12-core, 2.30GHz, 64GB memory) and one MIC card (Intel Xeon Phi 7110P, 61 cores, 1.1GHz, 8GB memory) connected over InfiniBand FDR.

The gridding portion of the ASC15 challenge is worth 20 points out of a possible 100, and the team with the fastest run time is awarded the e-Prize, which comes with $4,380 in prize money. During the awards ceremony held Friday, May 22, the winner of this challenge was declared to be Sun Yat-sen University. This was not Sun Yat-sen’s first time being honored in an ASC competition. Last year, the team rewrote the LINPACK record by achieving a peak performance of 9.272 teraflops within the 3,000-watt power budget.

Sun Yat-sen University was victorious in this effort, but they were not alone in their ability to impress the judges, a panel of HPC experts that included ASTRON researcher Chris Broekema. Compute platform lead for the SKA Science Data Processor (essentially the HPC arm of SKA), Broekema shared with HPCwire that while the solutions that the students came up with were not entirely new ideas, the quality of the teams’ work exceeded his expectations.

The 16 teams that competed in the ASC15 finals were allowed to research the application in advance, but the way that they tackled the problem showed creativity and an understanding of the main issues involved in optimizing this I/O-bound algorithm. In fact, they managed to get fairly close to the state-of-the-art in just a couple of weeks, according to Broekema.

While the various teams employed different optimization techniques, Broekema said that the best results were the ones that completely reordered the way the data was handled and altered the structure of the different loops. This led to a result that was essentially one step short of the most successful optimization developed by the SKA community.

One of the primary challenges of this algorithm relates to memory accesses, something that was correctly identified by most of the teams. Gridding involves many memory reads and writes for very little compute. The current state-of-the-art in addressing this imbalance is to sacrifice compute for reduced memory accesses. Implementing this solution takes a while, and requires a complete rethink of the way you go through your data.

“Even though it’s a bit more expensive in terms of compute, the fact that it’s far more efficient in going through memory makes it a far more efficient implementation of gridding altogether,” Broekema explained.
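The trade-off can be illustrated with a deliberately simplified Python sketch. It is not the SKA community's optimization, which the article does not describe in detail; the function names and the (cell, value) input layout are hypothetical. The second function spends an extra comparison per contribution so that it can accumulate locally and write to the grid only when the target cell changes.

```python
def accumulate_naive(contributions):
    """Perform one grid read-modify-write per incoming contribution."""
    grid = {}
    writes = 0
    for cell, value in contributions:
        grid[cell] = grid.get(cell, 0j) + value
        writes += 1
    return grid, writes


def accumulate_buffered(contributions):
    """Accumulate in a local variable; flush only when the cell changes."""
    grid = {}
    writes = 0
    current, acc = None, 0j
    for cell, value in contributions:
        if cell != current:  # the extra work: one comparison per item
            if current is not None:
                grid[current] = grid.get(current, 0j) + acc
                writes += 1
            current, acc = cell, 0j
        acc += value
    if current is not None:  # flush whatever is left at the end
        grid[current] = grid.get(current, 0j) + acc
        writes += 1
    return grid, writes


# Illustrative input: consecutive contributions often hit the same cell.
contributions = [
    ((0, 0), 1 + 0j),
    ((0, 0), 2 + 0j),
    ((0, 1), 1j),
    ((0, 0), 1 + 0j),
]
naive_grid, naive_writes = accumulate_naive(contributions)
buffered_grid, buffered_writes = accumulate_buffered(contributions)
```

Both functions produce the same grid, but the buffered version writes to it less often whenever neighboring contributions share a cell. How much that saves depends entirely on the order in which the data is traversed, which is why reordering the data matters so much.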

According to the ASC15 committee, the application selected for the MIC optimization test should be “practical, challenging and interesting.” Asked why this application was a good fit for the contest, Broekema responded that the shortness of the code snippet engendered a much more detailed analysis of what’s happening in the actual code, compared to the other applications, which, being established and somewhat bulky code bases, can be very difficult for students to fully penetrate. While the snippet allowed for a more meaningful challenge in some ways, Broekema is already thinking about ways to fine-tune the test code to further enrich the student experience. He wants to make it more like real-world implementations so students can get a feel for how it is used in practice.

MIC optimization is one of many projects that Broekema and his colleagues are working on. Several of the SKA processing workloads, including the gridding algorithm, have been optimized for GPUs, he said, but they can work on other platforms as well, including MIC, FPGAs and ASICs. Each of these necessitates a different approach to data handling. A number of benchmarking efforts have already been completed and others are underway as the SKA ramps up to its 2017 Phase 1 launch.

Broekema’s next point drove home just how integral platform evaluation is to the greater SKA effort. “One of the undertakings of the SKA community in general is looking at the various platforms that are currently available and the various algorithms important to the work to see how they map on those platforms,” he said. “This isn’t confined to the Science Data Processor [the high-performance computing component of the SKA].”

“Before data is sent to the Science Data Processor, which does the gridding, Fourier transforming, etc., there’s the central signal processor, essentially the correlator, which involves a very large amount of fairly simple algorithms – correlation, filtering, and also Fourier transforms, probably on fixed integer size data – and those may well be done in FPGAs or ASICs, although it’s also possible to use accelerators like GPUs or Phi. So there’s a range of algorithms, correlators, Fourier transforms, gridding, convolutions, filters, etc., that are analyzed for different kinds of platforms, to see what is the best combination of platform and implementation.”

Asked whether FPGAs/ASICs wouldn’t be the best choice in terms of highest performance and performance-per-watt, Broekema said they are still very hard to program, which increases the risk of a late implementation. It’s also his opinion that the performance gap between GPUs and FPGAs is narrowing fairly quickly. It used to be several factors of discrepancy, but now it’s just a couple of dozen percent, he reported, and implementation times of months (with GPUs) rather than years (with FPGAs) are a great advantage as well.

After a slight pause, however, Broekema began laying out the factors that could turn the tide toward FPGAs, starting with Intel’s purchase of Altera on Monday. The February announcement that Altera FPGAs will be manufactured on Intel’s 14 nm tri-gate transistor technology was cited as another reason to believe that FPGAs will continue to maintain their energy-efficiency edge over GPUs. And the fact that the reconfigurable chips can now be programmed using OpenCL promises to ease one of their main weaknesses. Just how much having this OpenCL support changes the FPGA programming paradigm is something that the SKA HPC group will be exploring with a new pilot project.

In summary, Broekema characterized the boundary between different kinds of programmable accelerators as fuzzy, which is why they are taking a look at all of them. “FPGAs are getting easier to integrate,” he stated. “There’s the Xeon Phi, which has the advantage of being easier to program and looking more like a regular Xeon, but they are a little late to the party and performance is not optimal at the moment. We did benchmarks on DSPs as well, and found them to be even more difficult to program than FPGAs.”

With all this benchmarking, GPUs are currently the preferred accelerator within the SKA community and the one they have deployed in production environments.

While the research into different platforms is being carried out by and for the benefit of the radio astronomy community in preparation for the immense SKA radio telescope, the value does not end there. “There’s an obvious parallel with medical imaging,” Broekema told HPCwire. “The data from large MRI machines, they do fairly similar work; then there’s the multimedia sector, streaming video has very similar data rates,” he said.

More significant is the potential for shared lessons going forward as HPC and even general computing become ever more data-laden. Radio astronomy knows all about these extremely I/O-bound algorithms, where the data rates far exceed the compute element. The ratio between I/O and compute is set to skew even further in the future, according to Broekema, and not just in radio astronomy.

“The problems that we face now are probably indicative of the problems that everyone is going to face in the next few years,” he commented. “So in that sense, I believe that the problems that we solve are useful for pretty much the entire HPC community and possibly even computer science in general.”

The ASTRON scientist recalled an example of this cross-HPC pollination that occurred several years ago. The systems software team at Argonne National Lab built an operating system extension intended for high-performance computing on their Blue Gene systems, and the radio astronomy community co-opted it with great success for the Blue Gene that performed the data processing for LOFAR.

“Many of the optimizations that we come up with are equally valuable and equally useful for other HPC and other computer science applications,” he stated.
