A New Breed of Heterogeneous Computing

By Michael Feldman

April 18, 2012

With the introduction of add-on accelerators like GPUs, Intel’s upcoming MIC chip, and, to a lesser extent, FPGAs, the foundation of high performance computing is undergoing something of a revolution. But an emerging variant of this heterogeneous computing approach may upend the current accelerator model in the not-too-distant future. And it’s already begun in the mobile space.

In October 2011, ARM announced its “big.LITTLE” design, a chip architecture that integrates large, performant ARM cores with small, power-efficient ones. The goal of this approach is to minimize power draw in order to extend the battery life of devices like smartphones and tablets.

It works by mapping an application to the optimal cores based on performance demands and power availability. For mobile devices, big cores would be used for performance-demanding tasks like navigation and gaming, and the smaller cores for the OS and simpler tasks like social media apps. But when the battery runs low, the software can shunt everything to the low-power cores in order to keep the device operational. ARM is claiming that battery life can be extended by as much as 70 percent by migrating tasks intelligently.
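The mapping described above can be illustrated with a minimal Python sketch. It is not ARM's scheduler: the `Task` type, the `choose_cluster` function, and both thresholds are hypothetical names and values chosen only to show the shape of such a policy.

```python
from dataclasses import dataclass

# Both thresholds are illustrative assumptions, not ARM figures.
LOW_BATTERY = 0.15    # below this charge fraction, everything goes to little cores
DEMAND_CUTOFF = 0.6   # tasks at or above this demand level prefer a big core


@dataclass(frozen=True)
class Task:
    name: str
    demand: float  # estimated performance demand, from 0.0 (idle) to 1.0 (maximum)


def choose_cluster(task: Task, battery_fraction: float) -> str:
    """Return "big" or "little" for a task, given the remaining battery charge."""
    if battery_fraction < LOW_BATTERY:
        # Battery is low: keep the device alive by using only low-power cores.
        return "little"
    return "big" if task.demand >= DEMAND_CUTOFF else "little"


if __name__ == "__main__":
    for task in (Task("gaming", 0.9), Task("social media", 0.2)):
        for battery in (0.80, 0.10):
            print(f"{task.name} at {battery:.0%} battery -> {choose_cluster(task, battery)}")
```

A real implementation would also have to weigh migration cost and thermal limits, which this sketch ignores.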

ARM’s first incarnation of big.LITTLE pairs its large Cortex-A15 design with the smaller Cortex-A7, along with glue technology to provide cache and I/O coherency between the two sets of cores. Companies like Samsung, Freescale, and Texas Instruments, among others, are already signing up.

ARM didn’t invent the big core/little core concept though. This model has been kicked around in the research community for nearly a decade. One of the first papers on the subject was written in 2003 by Rakesh Kumar, along with colleagues at UCSD and HP Labs. He proposed a single-ISA heterogeneous multicore design, but in this case based on the Alpha microprocessors, a CPU line that, at the time, was being targeted to high-end workstations and servers.

He found that a chip with four different Alpha core microarchitectures had the potential to “increase energy efficiency by a factor of three… without dramatic losses in performance.” He also discovered that most of these gains would be possible with as few as two types of cores.

In a recent conversation, Kumar expressed the notion that the time may be ripe for single-ISA heterogeneous chips to find a home in the server arena, even in high performance computing. The driver, once again, is power, or the lack thereof. As server farms and supercomputers expand in size, electricity usage has become a limiting factor. Whether you’re scaling up or scaling out, everyone is now focused on more energy-efficient computers.

“The key insight was that even if you map an application to a little core, it’s not going to perform much worse than running it on a big core,” said Kumar, referring to his earlier research. “But you can save many factors of power.”

The problem with big powerful CPUs like the Xeon, Opteron, and Power is now well known. Although Moore’s Law is still working to expand transistor budgets at a good clip, clock frequencies are stagnant. That means performance and, especially, performance-per-watt are increasing more slowly. For these high-end server chips, essentially you have to spend four units of power to deliver one unit of performance on a per-core basis.

That’s a result of the superscalar nature of these big-core microarchitectures, which feature a lot of instruction level parallelism (ILP) and deep pipelines. Such a design reduces execution latency, but at a hefty price in wattage. As Kumar explains it, “It takes a lot of power and a lot of [die] area to squeeze that last 5 to 10 percent of performance.”

The implication is to just switch to smaller, power-efficient cores, with simpler pipelines and less ILP. If you can parallelize an application across many smaller, simpler cores, you get the best of both worlds: better throughput and higher energy efficiency. The problem is that for many applications, decent performance is contingent upon single-threaded performance as well. That has led to the adoption of the types of accelerator-based computing platforms mentioned at the beginning of this article, which pair a serial CPU chip with a throughput coprocessor.

What the big/little model brings to the table is having both types of cores on the same die. And perhaps more importantly, unlike the CPU-GPU integration that AMD is doing with their Fusion chips and what NVIDIA is planning to do with their “Project Denver” platform, the big/little model consolidates on a homogeneous instruction set.

That has a number of advantages, one of which is easier software development. With a common ISA, there is no need for the complex toolchain of multiple compilers, runtimes, libraries, and debuggers required to deal with two architectures. For supercomputing-type applications though, writing the application is likely to remain challenging, inasmuch as the developer still has to parallelize the code as well as explicitly map the serial work and throughput work to the appropriate cores. Unlike with mobile computing, for HPC, assigning tasks to cores would be more static, since maximizing throughput is the overriding goal.

But where performance has to be compromised because of power or resource constraints, a single ISA chip is a huge advantage. So at run-time, application threads can migrate across the different microarchitectures, as needed, to optimize for throughput, power or both. And since the cores share cache and memory, suspending a thread on one core and resuming it on another is a relatively quick and painless operation.

So, for example, a render server farm equipped with big/little CPUs could shuffle application threads to faster or slower cores depending upon the workload mix, available processor resources, and the turnaround time required. If a service level agreement (SLA) was in effect that allowed the rendering job to meet its deadline without maxing out on the big cores, the server farm could save on its electricity bill by utilizing more of the little cores.
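As a rough illustration of the render farm example, the Python sketch below picks the core type that finishes a job within its deadline while using the least energy. The `CoreType` fields, the `pick_core` function, and every throughput and wattage figure are made-up assumptions for the sake of the example, not measurements of any real chip.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CoreType:
    name: str
    frames_per_hour: float  # assumed rendering throughput of one core
    watts: float            # assumed power draw of one core under load


def pick_core(cores: Sequence[CoreType], frames: int,
              deadline_hours: float) -> Optional[CoreType]:
    """Return the lowest-energy core type that still meets the deadline.

    Returns None when no core type can finish the job in time.
    """
    feasible = [c for c in cores if frames / c.frames_per_hour <= deadline_hours]
    if not feasible:
        return None
    # Energy in watt-hours = power draw * hours needed to render all frames.
    return min(feasible, key=lambda c: c.watts * frames / c.frames_per_hour)


# Hypothetical numbers: the big core is twice as fast but draws four times the power.
BIG = CoreType("big", frames_per_hour=100.0, watts=40.0)
LITTLE = CoreType("little", frames_per_hour=50.0, watts=10.0)

if __name__ == "__main__":
    for deadline in (5.0, 3.0):
        choice = pick_core([BIG, LITTLE], frames=200, deadline_hours=deadline)
        print(f"deadline {deadline} h -> {choice.name if choice else 'infeasible'}")
```

With these assumed figures, a relaxed deadline lets the job run on the little core at half the energy of the big core, while a tight deadline forces it onto the big core.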

It should be noted that power savings can also be achieved by varying a microprocessor’s power supply voltage and clock frequency, otherwise known as voltage/frequency scaling. But as transistor geometries shrink, this technique tends to yield diminishing returns. And as even Intel has concluded, big/little cores (Intel calls them asymmetric cores) seem to deliver the best results.
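A small Python sketch can show why voltage/frequency scaling is attractive and where it runs out of room. It assumes the standard first-order model in which dynamic power is proportional to V² × f and ignores leakage; the function name is hypothetical. One commonly cited reason for the diminishing returns, assumed here rather than taken from the article, is that supply voltage has less room to drop at smaller geometries.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float) -> float:
    """Dynamic power relative to baseline under the first-order model P ~ V^2 * f.

    Both arguments are ratios against the baseline operating point
    (1.0 means unchanged). Leakage power is ignored in this sketch.
    """
    return volt_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Halving frequency and voltage together: roughly cubic savings.
    print(relative_dynamic_power(freq_scale=0.5, volt_scale=0.5))
    # Halving frequency when voltage cannot be lowered: only linear savings.
    print(relative_dynamic_power(freq_scale=0.5, volt_scale=1.0))
```

Under this model, once voltage can no longer be reduced, slowing the clock saves power only in proportion to the lost performance.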

The most likely architectures to adopt the big/little paradigm over the next few years are x86 and ARM. As mentioned before, ARM big.LITTLE implementations are already in the works for mobile computing, but with the unveiling of the 64-bit ARM architecture last year, and with companies like HP delving into ARM-based gear for the datacenter, big/little implementations of ARM servers could appear as early as the middle of this decade.

We may see x86-based big/little server chips even sooner. Intel, in particular, is in prime position to take advantage of this technology. For one thing, the chipmaker is the best in the business at transistor shrinking, which is an important element if you’re interested in populating a die with a useful number of big and little cores. It also has a huge stable of x86 core designs, from the Atom chip all the way up to the Xeon.

Also, since Intel has little in the way of GPU IP that can be used for computing, the company is most likely to rely on its x86 legacy for throughput cores. For example, it’s not too hard to imagine Intel’s big-core Xeon paired up with its little-core MIC chip in a future SoC geared for HPC duty. The same model, but with a different mix of x86 microarchitectures, could also be used to build more generic enterprise server processors, not to mention its own mobile chips.

Whether Intel intends to go down this path or not remains to be seen. But a recent patent the company filed regarding mixing asymmetric x86 cores in a processor suggests the chipmaker has indeed given serious thought to big/little products. And since both AMD and NVIDIA are pursuing their own heterogeneous SoCs, which by the way could also incorporate this technology, Intel is not likely to cede any advantage to its competitors.

The big/little approach won’t be a panacea for energy-efficient computing, but it looks like one of the most promising approaches, at least at the level of the CPU. The fact that it incorporates the advantages of a heterogeneous architecture, but with a simpler model, has much to recommend it. And while big/little CPUs may be seen as something of a threat to GPU computing, they can also be viewed as a complementary technology. What is certain is that the days of one-size-fits-all architectures are coming to a close.
