Heading into SC16, CENATE Flexes Its Growing Muscle

By John Russell

November 8, 2016

In September, the Center for Advanced Technology Evaluation (CENATE) at Pacific Northwest National Laboratory (PNNL) took possession of NVIDIA’s DGX-1, a GPU-based (Pascal P100) supercomputer. More on what they are doing with it later. Soon, IBM will deliver its Con Tutto memory technology. Data Vortex’s advanced switch technology is already in-house, along with products (and ideas) from a handful of other technology heavyweights. Now entering its second year, CENATE already has some potent equipment and ambitious ideas.

“We have established the lofty goal for us to even design some neuromorphic technologies that are doing machine learning natively and not as you can do machine learning for example on a GPU in which you sort of come in from behind and map machine workload to the architecture of the GPU,” says Adolfy Hoisie, PNNL’s chief scientist for computing and CENATE’s principal investigator and director. All things (time and money) being available, “We would like those neuromorphic systems chips, whatever, to actually cast them in silicon.”

This is perhaps getting ahead of the story. Launched last fall and funded by the Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR), CENATE is envisioned as a proving ground for advanced technology that is making its way into the market (the DGX-1 and Data Vortex platforms are good examples) and as a lens for keeping a selective eye on technologies further out. A core goal is to assess these technologies for DOE workloads and to influence emerging architectures, including those headed into leadership-class systems.

Adolfy Hoisie, CENATE’s director and PNNL’s chief scientist for computing

In CENATE’s taxonomy, the center has what Hoisie calls “four thrusts”:

  • Enablement of Test Beds. “Test beds don’t have a uniform definition of what they are because in CENATE even the notion of a technology pipeline, technology maturity pipeline [varies]. We are going to tackle technologies that range from very early concepts or blueprints all the way to pre-production machines.”
  • Extensive Instrumentation and Measurement. “For a national laboratory, we have unique instrumentation and measurement capability. This applies not only to measuring time, which is performance, but also to measurement of power and thermal effects, and we are contemplating venturing into reliability as well. At this time our lab has the capability to measure performance, power and thermal at very high resolutions and very high frequency for dynamic measurements. We measure these technologies both statically and dynamically.”
  • Technology Evaluation. “Not all these technologies apply equally well to all the application workloads that ASCR may be interested in. What evaluation means is to determine what applications map well onto various architectures that are being contemplated and studied within CENATE, and then go the extra step in trying to guide future development of those activities within the applications.”
  • Modeling and Simulation. “[This] is a big forte of ours. It is a very exciting capability because if we had machines from vendor X or from vendor Y and we have the capability to measure things, to validate the initial models, our modeling and simulation [expertise] allows us to ‘place’ this system into the future, not just doing simple statistical extrapolations but really guiding the architecture to maximize the positive impact on the applications.”

Work on any technology may cross all CENATE domains, and the evaluation of NVIDIA’s DGX-1, CENATE’s latest addition, is a good example.

“Firstly, we are interested in ways to accelerate computation. We are interested in Pascal as the next generation of GPU. We’ll see what we learn through measurements on Pascal, running benchmarks, and pushing it forward with what-if scenarios, [looking for] when Pascal goes from where it is right now to double precision, to increased memory, to possibly modification in the SIMD characteristics and so forth,” says Hoisie.

Part of the interest, not surprisingly, stems from DOE’s forthcoming Summit supercomputer, to be housed at the Oak Ridge Leadership Computing Facility (OLCF). Summit will be based on IBM Power technology and NVIDIA GPUs. Ian Buck, vice president of accelerated computing at NVIDIA, says the DGX-1, like Summit, is based on a strong-node architecture rather than a weak-node one, in which you have “a single node with maybe one or two cores, some memory, and then you replicated that en masse, hundreds of thousands, and you relied on the network for kind of MPI scalability to achieve performance at scale.”

The large infrastructure requirements (cabling, power, networks) inherently limited the weak-node approach and complicated programming, he argues. “The vision we have been pursuing for exascale and in general for supercomputing is building these strong-node systems like DGX-1, where we put a lot of horsepower into a single node and minimize the number of nodes you have to scale up to,” says Buck.

As on the OLCF website: “Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 3,400 nodes when it arrives in 2017. Like Titan, Summit will have a hybrid architecture, and each node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA’s high-speed NVLink. Each node will have over half a terabyte of coherent memory (high bandwidth memory + DDR4) addressable by all CPUs and GPUs plus 800GB of non-volatile RAM that can be used as a burst buffer or as extended memory. To provide a high rate of I/O throughput, the nodes will be connected in a non-blocking fat-tree using a dual-rail Mellanox EDR InfiniBand interconnect.”
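Those figures imply a large jump in per-node capability, which is the strong-node argument expressed in numbers. The sketch below is our own rough arithmetic, not an OLCF figure: it treats “more than five times” as exactly 5x and “approximately 3,400” nodes as exact, so the ratio it prints is only an estimate of how much more an average Summit node would deliver than an average Titan node.

```python
# Back-of-the-envelope per-node comparison from the OLCF figures quoted above.
# Assumptions: "more than five times" is taken as exactly 5x, and
# "approximately 3,400 nodes" is taken as exactly 3,400.
TITAN_NODES = 18_688
SUMMIT_NODES = 3_400
SYSTEM_SPEEDUP = 5.0


def per_node_ratio(system_speedup: float, old_nodes: int, new_nodes: int) -> float:
    """Average per-node performance of the new system relative to the old one."""
    return system_speedup * old_nodes / new_nodes


ratio = per_node_ratio(SYSTEM_SPEEDUP, TITAN_NODES, SUMMIT_NODES)
print(f"Each Summit node ~ {ratio:.1f}x a Titan node")
```

Because the system-level speedup is a floor (“more than five times”), the roughly 27x result should be read as a lower-end estimate under these assumptions.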

The DGX-1 is a mini-version of sorts, with eight P100 GPUs and an NVLink interconnect. Buck says NVIDIA is looking to CENATE for “insight into scalability with a strong node architecture and to help us define where the bottlenecks are in power and these workloads so we can better optimize.” There’s no shortage of questions. “Maybe we should be focusing more on 32-bit floating point and not 64-bit floating point for some of these workloads. Also programmability is a challenge. Hoisie and PNNL are users of PGI and OpenACC and may have ideas on how we can improve programmability and compiler ability.”

Power consumption, of course, is a major worry in the race to exascale, and it is an area where CENATE’s unique measurement abilities can help. “We are in the single[-digit] gigaflop per watt era now with these strong-node supercomputers,” says Buck. “We should get into the double-digit category relatively soon, and [to meet] the goals for exascale we’ve got to get upwards of 25 Gflops per watt.” Buck is interested to see what new ideas CENATE might offer.
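For a sense of scale, a simple conversion (our own arithmetic, not Buck’s) turns an efficiency figure into the power a machine sustaining one exaflop would draw. Only the 25 Gflops-per-watt value comes from the article; the 5 and 10 Gflops-per-watt values are illustrative stand-ins for the single-digit and double-digit eras, and the sketch ignores cooling and other facility overhead.

```python
# Power needed to sustain a given computation rate at a given efficiency.
# Assumption: efficiency values other than 25 Gflops/W are illustrative only;
# facility overhead (cooling, power conversion) is not included.
EXAFLOP = 1e18  # floating-point operations per second


def power_megawatts(gflops_per_watt: float, flops: float = EXAFLOP) -> float:
    """Megawatts required to sustain `flops` at `gflops_per_watt`."""
    watts = flops / (gflops_per_watt * 1e9)
    return watts / 1e6


for eff in (5, 10, 25):  # 25 is the exascale target Buck cites
    print(f"{eff:>3} Gflops/W -> {power_megawatts(eff):,.0f} MW")
```

At 25 Gflops per watt, an exaflop machine would need about 40 MW for computation alone; at 5 Gflops per watt the same machine would need about 200 MW, which is why efficiency dominates exascale planning.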

Besides learning more about overall acceleration, Hoisie says understanding and assessing the DGX-1’s machine learning capability is an important objective.

“We have developed here, not as part of CENATE but as part of PNNL, very important scalable algorithms for machine learning. Those libraries exist today and they are open source, and frankly we want to assess them on the DGX-1. NVIDIA represents this as a machine learning box, and we believe that there is some truth in that, but you know, as researchers in CENATE we are going to ask the question, ‘OK, let’s quantify that. What does it mean for a box to be a machine learning box?’” he says.

“Part of that is looking at how it compares to others, how does DGX-1 compare to Knights Landing [systems] for example. Again, these DGX-1 and KNL systems are offerings you can go and buy now, or they are very close to being in that stage (the sweet spot for CENATE). So we are going to look not only at where DGX-1 may go in the future, where Pascal may go in the future, and what that would do to our workloads, but also at how machine learning performs today on the DGX-1 and how it compares with other top-of-the-line systems,” Hoisie says. “We are also looking at new algorithms for machine learning.”

Hoisie expects CENATE’s modeling and simulation capabilities to be valuable in this portion of the DGX-1 work. “If we measure something on a current machine and we validate and calibrate a model, we get enough confidence to predict the performance of a different algorithm, sometimes drastically different, sometimes only marginally different.”

A third project with the DGX-1 involves distinguished PNNL researcher Ruby Leung and her team, who are working with portions of their climate modeling code. In particular, says Hoisie, they are looking at a portion of the code already running on GPUs and are benchmarking its performance on the DGX-1: “We are going to be able to say the DGX-1 is this much faster or it is not, or whatever the case may be, and look at what needs to be done to improve the performance as the specs of the machine allow.” Leung is the Chief Scientist of the Department of Energy’s Accelerated Climate Modeling for Energy (ACME) project.

There’s obviously lots going on inside CENATE. One important mission element is diffusing the knowledge it gleans into the broader HPC community. Given all of the IP involved and the center’s capacity to identify shortcomings as well as strengths, sharing information is tricky.

“We would like the researchers from all of the national laboratories and from academia to have the opportunity to access these resources, and we are very much committed to that; however, it’s not a simple exercise. Vendors are legitimately concerned about cross-pollination of ideas. We at the national laboratory are equipped to deal with that, but academia is less so,” says Hoisie. As a general rule, all of the national labs have NDAs “with all the vendors that are in the HPC orbit.”

For good or ill, Hoisie says, “If we saw something wrong, we wouldn’t go and publish a paper on that attempting to talk down that product; instead we would point it out for the benefit of the vendor and for ourselves the ways in which that particular architecture or architectural issue can be improved.”

Part of the effort to share learnings will occur at regular CENATE workshops. “We are planning to have the first CENATE symposium in early spring next year. We want it to have enough information [available] so we can discuss ideas with the users, [we] want to have systems all set up, machines on the floor that people can access, both locally and remotely, and then we are going to organize that. It is something very important for us to do.” The frequency of the meetings hasn’t yet been determined.

Clearly CENATE does have lofty goals, and its list of projects is growing. That said, Hoisie is quick to emphasize that CENATE has “no interest in hoarding technology,” wanting instead to use its resources in a focused way that produces more than incremental advances. One early-stage CENATE project is the development of a scalability test capability that combines optical and electronic technology for the network.

“Imagine that you have, for example, an InfiniBand network that connects a cluster of nodes – these are yet to be determined – and the network is also comprised of optical technology that allows you to basically re-cable the machine in seconds or less without literally having to move any cable. This allows us to create enclaves within the system that isolate jobs from the rest of the activity on the system, which is a problem that plagues, say, many of these very large scale machines in which the nodes are dedicated to a job but not a network path.”

The design of the system for this project is far along, says Hoisie, and the project is now in the procurement process. “It’s generated a lot of interest within the vendor community: Penguin Computing, Cray, DDN, Mellanox, and optical switch vendor Calient. These vendors are all so interested in this concept and its potential for future commercial uses that in some cases (Mellanox, for example) they are donating the entire InfiniBand gear for it. That’s an indication of CENATE’s growing success,” says Hoisie.

As has been the case for the past few years, the national labs do not have individual SC booths but are represented in the DOE booth. CENATE will have a presence there, and Hoisie expects several participating vendors (perhaps NVIDIA, for example) to also have CENATE materials or demos at their booths.
