A New Breed of Heterogeneous Computing

By Michael Feldman

April 18, 2012

With the introduction of add-on accelerators like GPUs, Intel’s upcoming MIC chip, and, to a lesser extent, FPGAs, the foundation of high performance computing is undergoing something of a revolution. But an emerging variant of this heterogeneous computing approach may upend the current accelerator model in the not-too-distant future. And it has already begun in the mobile space.

In October 2011, ARM announced its “big.LITTLE” design, a chip architecture that integrates large, high-performance ARM cores with small, power-efficient ones. The goal of this approach is to minimize power draw in order to extend the battery life of devices like smartphones and tablets.

It works by mapping each application to the optimal cores based on performance demands and power availability. For mobile devices, the big cores would be used for performance-demanding tasks like navigation and gaming, and the smaller cores for the OS and simpler tasks like social media apps. When the battery runs low, though, the software can shunt everything to the low-power cores in order to keep the device operational. ARM claims that migrating tasks intelligently can extend battery life by as much as 70 percent.
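The mapping policy described above can be illustrated with a minimal sketch. The two task categories, the 20 percent battery threshold, and the function name choose_cluster are hypothetical assumptions chosen for illustration; ARM's actual big.LITTLE software and the OS scheduler support behind it are considerably more involved.

```python
from enum import Enum


class Cluster(Enum):
    BIG = "big"        # e.g. Cortex-A15 cores in ARM's first big.LITTLE pairing
    LITTLE = "little"  # e.g. Cortex-A7 cores


# Hypothetical threshold: below this battery level, everything runs on little cores.
LOW_BATTERY_PERCENT = 20


def choose_cluster(demanding: bool, battery_percent: float) -> Cluster:
    """Pick a core cluster for a task (illustrative policy, not ARM's)."""
    if battery_percent < LOW_BATTERY_PERCENT:
        # Low battery: shunt all work to the power-efficient cores.
        return Cluster.LITTLE
    # Otherwise, performance-hungry tasks (navigation, gaming) go to big cores,
    # and light work (OS housekeeping, social media apps) goes to little cores.
    return Cluster.BIG if demanding else Cluster.LITTLE


if __name__ == "__main__":
    print(choose_cluster(demanding=True, battery_percent=80))   # Cluster.BIG
    print(choose_cluster(demanding=True, battery_percent=10))   # Cluster.LITTLE
    print(choose_cluster(demanding=False, battery_percent=80))  # Cluster.LITTLE
```

The battery check comes first so that, once power runs short, no task can claim a big core regardless of its demands.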

ARM’s first incarnation of big.LITTLE pairs its large Cortex-A15 design with the smaller Cortex-A7, along with glue technology to provide cache and I/O coherency between the two sets of cores. Companies like Samsung, Freescale, and Texas Instruments, among others, are already signing up.

ARM didn’t invent the big core/little core concept, though. The model has been kicked around in the research community for nearly a decade. One of the first papers on the subject was written in 2003 by Rakesh Kumar, along with colleagues at UCSD and HP Labs. He proposed a single-ISA heterogeneous multicore design, in this case based on Alpha microprocessors, a CPU line that, at the time, was targeted at high-end workstations and servers.

He found that a chip with four different Alpha core microarchitectures had the potential to “increase energy efficiency by a factor of three… without dramatic losses in performance.” He also found that most of these gains were possible with as few as two types of cores.

In a recent conversation, Kumar suggested that the time may be ripe for single-ISA heterogeneous chips to find a home in the server arena, even in high performance computing. The driver, once again, is power, or the lack thereof. As server farms and supercomputers expand in size, electricity usage has become a limiting factor. Whether you’re scaling up or scaling out, everyone is now focused on more energy-efficient computers.

“The key insight was that even if you map an application to a little core, it’s not going to perform much worse than running it on a big core,” said Kumar, referring to his earlier research. “But you can save many factors of power.”
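Kumar's point can be made concrete with the relation energy = power × time. The sketch below compares a big and a little core on one task; the wattage and runtime figures are hypothetical, chosen only to illustrate a core that is somewhat slower but draws far less power, and are not measurements from Kumar's study.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    power_watts: float  # average power while running the task (hypothetical)
    runtime_s: float    # time to finish the task on this core (hypothetical)

    def energy_joules(self) -> float:
        # With constant power, energy is simply power multiplied by time.
        return self.power_watts * self.runtime_s

    def energy_delay(self) -> float:
        # Energy-delay product: penalizes designs that save energy only by being slow.
        return self.energy_joules() * self.runtime_s


big = Core("big", power_watts=4.0, runtime_s=1.0)
little = Core("little", power_watts=0.8, runtime_s=1.25)

for core in (big, little):
    print(f"{core.name}: {core.energy_joules():.2f} J, EDP {core.energy_delay():.2f} J*s")
```

With these assumed numbers the little core takes 25 percent longer but uses a quarter of the energy (1.00 J versus 4.00 J), and it still wins on energy-delay product (1.25 versus 4.00).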

The problem with big, powerful CPUs like the Xeon, Opteron, and Power is now well known. Although Moore’s Law is still expanding transistor budgets at a good clip, clock frequencies are stagnant. That means performance and, especially, performance-per-watt are increasing more slowly. For these high-end server chips, you essentially have to spend four units of power to deliver one unit of performance on a per-core basis.

That’s a result of the superscalar nature of these big-core microarchitectures, which feature extensive instruction-level parallelism (ILP) and deep pipelines. Such a design reduces execution latency, but at a hefty price in wattage. As Kumar explains it, “It takes a lot of power and a lot of [die] area to squeeze that last 5 to 10 percent of performance.”

The obvious implication is to switch to smaller, power-efficient cores with simpler pipelines and less ILP. If you can parallelize an application across many smaller, simpler cores, you get the best of both worlds: better throughput and higher energy efficiency. The problem is that for many applications, decent performance also depends on single-threaded performance. That has led to the adoption of the accelerator-based computing platforms mentioned at the beginning of this article, which pair a serial CPU chip with a throughput coprocessor.
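The tension between serial and parallel performance can be quantified with a well-known analytical model from Mark Hill and Michael Marty (2008), which extends Amdahl's Law to chips built from a fixed budget of n "base core equivalents". This model is not part of the article; it assumes, as Hill and Marty did, that a core built from r base-core resources delivers sqrt(r) times the single-thread performance, and the parameter values below are illustrative.

```python
import math


def perf(r: float) -> float:
    # Hill-Marty assumption: a core built from r base-core resources
    # delivers sqrt(r) times the single-thread performance (Pollack's rule).
    return math.sqrt(r)


def speedup_symmetric(f: float, n: int, r: int) -> float:
    # n/r identical cores of size r. The serial fraction (1 - f) runs on one
    # core; the parallel fraction f runs on all n/r cores.
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    # One big core of size r plus (n - r) little base cores on the same die.
    # The serial fraction runs on the big core; the parallel fraction uses every core.
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))


if __name__ == "__main__":
    f, n = 0.9, 16  # 90 percent parallel code, budget of 16 base-core equivalents
    configs = [
        ("16 little cores", speedup_symmetric(f, n, 1)),
        ("4 big cores", speedup_symmetric(f, n, 4)),
        ("1 big + 12 little cores", speedup_asymmetric(f, n, 4)),
    ]
    for label, s in configs:
        print(f"{label:<24} {s:.2f}")
```

Under these assumptions the script prints speedups of about 6.40, 6.15, and 8.75: the mixed big/little chip beats both homogeneous designs because the big core speeds up the serial fraction while the little cores supply throughput.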

What the big/little model brings to the table is having both types of cores on the same die. Perhaps more importantly, unlike the CPU-GPU integration that AMD is doing with its Fusion chips and that NVIDIA is planning with its “Project Denver” platform, the big/little model consolidates on a single instruction set.

That has a number of advantages, one of which is easier software development. With a common ISA, there is no need for the complex toolchain of multiple compilers, runtimes, libraries, and debuggers required to deal with two architectures. For supercomputing-type applications, though, writing the application is likely to remain challenging, inasmuch as the developer still has to parallelize the code and explicitly map the serial work and throughput work to the appropriate cores. Unlike in mobile computing, assigning tasks to cores in HPC would be more static, since maximizing throughput is the overriding goal.
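A static mapping of the kind described above might look like the following sketch. The core topology (CPU IDs 0-3 as big cores, 4-15 as little cores), the task names, and the function static_plan are all hypothetical; the point is only that latency-critical serial work is pinned to big cores while throughput work is spread across little cores, and that the plan does not change during the run.

```python
from itertools import cycle

# Hypothetical topology: CPU IDs 0-3 are big cores, 4-15 are little cores.
BIG_CORES = [0, 1, 2, 3]
LITTLE_CORES = list(range(4, 16))


def static_plan(serial_tasks, parallel_chunks):
    """Return {task_name: cpu_id}, fixed for the whole run.

    Serial, latency-critical tasks go to big cores; throughput work is
    spread round-robin over the little cores. Task names must be unique
    across both lists.
    """
    plan = {}
    for task, cpu in zip(serial_tasks, cycle(BIG_CORES)):
        plan[task] = cpu
    for chunk, cpu in zip(parallel_chunks, cycle(LITTLE_CORES)):
        plan[chunk] = cpu
    return plan


if __name__ == "__main__":
    plan = static_plan(["solver_driver"], [f"chunk{i}" for i in range(24)])
    print(plan["solver_driver"], plan["chunk0"], plan["chunk12"])
```

Because the ISA is shared, the same binary runs on every entry in the plan. On Linux, such a plan could be applied with Python's os.sched_setaffinity, which restricts a process to a given set of CPU IDs.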

But where performance has to be compromised because of power or resource constraints, a single-ISA chip is a huge advantage. At run-time, application threads can migrate across the different microarchitectures as needed to optimize for throughput, power, or both. And since the cores share cache and memory, suspending a thread on one core and resuming it on another is a relatively quick and painless operation.
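Run-time migration needs a policy for when to move a thread. One common pattern is a utilization-based controller with hysteresis, sketched below. The 0.80 and 0.30 thresholds, the power_capped flag, and the function next_placement are hypothetical assumptions for illustration, not a description of any shipping scheduler.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the gap between them provides hysteresis so a
# thread hovering near one value does not bounce between core types.
UP_THRESHOLD = 0.80    # utilization on a little core above this -> move to big
DOWN_THRESHOLD = 0.30  # utilization on a big core below this -> move to little


@dataclass
class ThreadState:
    on_big: bool = False


def next_placement(state: ThreadState, utilization: float, power_capped: bool) -> ThreadState:
    """Decide where a thread should run for the next scheduling interval."""
    if power_capped:
        # Power budget exhausted: trade performance for watts.
        return ThreadState(on_big=False)
    if not state.on_big and utilization > UP_THRESHOLD:
        return ThreadState(on_big=True)
    if state.on_big and utilization < DOWN_THRESHOLD:
        return ThreadState(on_big=False)
    return state  # stay put; migration is cheap but not free


if __name__ == "__main__":
    s = ThreadState()
    for u in [0.5, 0.9, 0.6, 0.2, 0.5]:
        s = next_placement(s, u, power_capped=False)
        print(u, "big" if s.on_big else "little")
```

Tracing the loop, the thread moves to a big core at 0.9, stays there at 0.6 because that is above the down threshold, and returns to a little core only at 0.2.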

For example, a render server farm equipped with big/little CPUs could shuffle application threads to faster or slower cores depending on the workload mix, available processor resources, and the required turnaround time. If a service level agreement (SLA) allowed the rendering job to meet its deadline without maxing out the big cores, the server farm could save on its electricity bill by using more of the little cores.
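The render-farm trade-off can be framed as a small optimization: meet the SLA's required frame rate while drawing as little power as possible. The per-core throughput and wattage figures below are hypothetical, and the sketch assumes frames are independent so throughput adds linearly across cores.

```python
import math

# Hypothetical per-core figures for a render node (not measured data).
BIG_FPH, BIG_WATTS = 10.0, 20.0       # frames per hour, watts
LITTLE_FPH, LITTLE_WATTS = 4.0, 4.0


def cheapest_mix(frames: int, deadline_h: float, max_big: int, max_little: int):
    """Return (big, little, watts) using the least power that still meets the
    deadline, or None if even every available core cannot finish in time."""
    required_fph = frames / deadline_h
    best = None
    for big in range(max_big + 1):
        # Fewest little cores that, together with `big` big cores, meet the rate.
        shortfall = required_fph - big * BIG_FPH
        little = max(0, math.ceil(shortfall / LITTLE_FPH))
        if little > max_little:
            continue
        watts = big * BIG_WATTS + little * LITTLE_WATTS
        if best is None or watts < best[2]:
            best = (big, little, watts)
    return best


if __name__ == "__main__":
    print(cheapest_mix(480, 8, max_big=4, max_little=12))  # relaxed deadline
    print(cheapest_mix(480, 6, max_big=4, max_little=12))  # tight deadline
```

With these assumed numbers, a relaxed 8-hour deadline is met with 2 big and 10 little cores at 80 W, while a 6-hour deadline forces all 4 big cores into service at 120 W. Since the little cores here deliver more frames per watt, the search leans on them whenever the deadline allows.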

It should be noted that power savings can also be achieved by varying a microprocessor’s supply voltage and clock frequency, otherwise known as voltage/frequency scaling. But as transistor geometries shrink, this technique tends to yield diminishing returns. And as even Intel has concluded, big/little cores (Intel calls them asymmetric cores) seem to deliver the best results.
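A first-order textbook model, which is a simplification and not a figure from the article, shows both why voltage/frequency scaling is attractive and why it runs out of room. With activity factor α, switched capacitance C, supply voltage V_dd, and clock frequency f, dynamic CMOS power is approximately:

```latex
P_{\text{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f,
\qquad
f \propto V_{dd} \;\Rightarrow\; P_{\text{dyn}} \propto V_{dd}^{3}
```

Scaling voltage and frequency together by a factor s < 1 therefore cuts dynamic power to roughly s³ of its original value (about half for s = 0.8) while performance falls only to about s. The catch is that V_dd cannot drop much below a floor set by the transistor threshold voltage, and as geometries have shrunk, nominal supply voltages have already moved closer to that floor, narrowing the usable range of s. Leakage power, which this formula ignores, does not scale away either.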

The most likely architectures to adopt the big/little paradigm over the next few years are x86 and ARM. As mentioned before, ARM big.LITTLE implementations are already in the works for mobile computing. But with the unveiling of the 64-bit ARM architecture last year, and with companies like HP delving into ARM-based gear for the datacenter, big/little ARM servers could appear as early as the middle of this decade.

We may see x86-based big/little server chips even sooner. Intel, in particular, is in prime position to take advantage of this technology. For one thing, the chipmaker is the best in the business at shrinking transistors, which is important if you want to populate a die with a useful number of big and little cores. It also has a huge stable of x86 core designs, from the Atom chip all the way up to the Xeon.

Also, since Intel has little in the way of GPU IP that can be used for computing, the company is most likely to rely on its x86 legacy for throughput cores. For example, it’s not too hard to imagine Intel’s big-core Xeon paired up with its little-core MIC chip in a future SoC geared for HPC duty. The same model, but with a different mix of x86 microarchitectures, could also be used to build more generic enterprise server processors, not to mention its own mobile chips.

Whether Intel intends to go down this path remains to be seen. But a recent patent the company filed regarding mixing asymmetric x86 cores in a processor suggests the chipmaker has given serious thought to big/little products. And since both AMD and NVIDIA are pursuing their own heterogeneous SoCs, which could also incorporate this technology, Intel is not likely to cede any advantage to its competitors.

The big/little approach won’t be a panacea for energy-efficient computing, but it looks like one of the most promising approaches, at least at the CPU level. The fact that it delivers the advantages of a heterogeneous architecture with a simpler programming model has much to recommend it. And while big/little CPUs may be seen as something of a threat to GPU computing, they can also be viewed as a complementary technology. What is certain is that the days of one-size-fits-all architectures are coming to a close.
