What does 2020 look like to you? What did 2019 look like?
Lots happened but the main trends were carryovers from 2018 – AI messaging again blanketed everything; the roll-out of new big machines and exascale announcements continued; processor diversity and system disaggregation kicked up a notch; hyperscalers continued flexing their muscles (think AWS and its Graviton2 processor); and the U.S. and China continued their awkward trade war.
That’s hardly all. Quantum computing, though nicely funded, remained a mystery to most. Intel got (back) into the GPU (Xe) business, AMD stayed surefooted (launching its next-gen Epyc CPU on 7nm), and no one is quite sure what IBM’s next HPC move will be. TACC stood up Frontera (Dell), now the fastest academic computer in the world.
You get the idea. It was an eventful year…again.
For many participants 2019 presented a risky landscape in which the new normal (see trends above) started exacting tolls and delivering rewards. Amid the many swirling tornadoes of good and ill fortune dotting the HPC landscape, there was a big winner and (perhaps) an unexpected loser. Presented here are a few highlights from the year, some thoughts about what 2019 portends, and links to a smattering of HPCwire articles as a quick 2019 rewind. Apologies for the many important trends/events omitted (surging Arm, Nvidia’s purchase of Mellanox, etc.).
- It was Good to Be Cray This Year
Cray had a phenomenal, transformational year.
Dogged for years by questions about whether a stand-alone supercomputer company could survive in such a narrow, boom-bust market, Cray is thriving and no longer standing alone. Thank you HPE et al. Many questions remain following HPE’s $1.3B purchase of Cray in May, but no matter how you look at it, 2019 was Cray’s year.
As they say, to the victor go the spoils:
- Exascale Trifecta. Cray swept the exascale sweepstakes winning all three procurements (Aurora, with Intel at Argonne National Laboratory; Frontier with AMD at Oak Ridge National Laboratory; and El Capitan at Lawrence Livermore National Laboratory).
- Shasta & Slingshot. Successful roll-out of Cray’s new system architecture first announced in late 2018. This was the big bet upon which all else, or almost all else, rests. The company declared its product portfolio refresh complete and exascale era ready in October.
- AMD and Arm. Cray seems to be a full participant in the burgeoning of processor diversity. Case in point: its new collaboration with Fujitsu to develop a commercial supercomputer powered by the Fujitsu A64FX Arm-based processor, the same chip going into the post-K “Fugaku” supercomputer. It also has significant experience using AMD processors.
- Welcome to HPE. Fresh from gobbling SGI ($275M, ’16), HPE should be a good home for Cray, which will boost HPE’s ability to pursue high-end procurements and potentially speed the combined company’s development of next-generation technologies. HPE CEO Antonio Neri sizes the supercomputing/exascale sector at between $2.5 billion and $5 billion, and the sub-supercomputing HPC sector at $8.5 billion. Cray’s HPC storage business is another plus.
Cray’s heritage, of course, stretches back to 1972 when it was founded by Seymour Cray. The company has a leadership position in the top 100 supercomputer installations around the globe and is one of only a handful of companies capable of building these world-class supercomputers. Headquartered in Seattle, Cray has roughly 1,300 employees (at time of purchase) and reported revenue of $456 million in its most recent fiscal year, up 16 percent year over year.
It seems a reasonable guess that Cray’s good fortune was more than just chance. Given the HPE investment (price was 3X revenues, 20X earnings), DoE’s exascale procurement investments, and Cray’s stature in US supercomputing amid global tensions, it’s likely many forces, mutually-aware, helped coax the deal forward. In any case, it’s good for Cray and for HPC.
HPE CEO Antonio Neri has said Cray will continue as an entity and brand within HPE. Pete Ungaro, Cray’s former CEO, now becomes SVP & GM for HPC and AI at HPE. Lots of eyes will be on Neri and Ungaro as HPE moves forward. Will there be a senior leadership shakeout, or can Neri get his talented senior team to work together in ways that make sense? Absorbing SGI seemed to go well, although the brand seemed to vanish once inside.
At HPCwire we have been wondering what the new HPE strategy will be, what the broader HPE technology and product roadmap will become, and so on. Stay tuned.
- Is the Top500 Finally Topping Out?
My pick for the biggest loser of steam in 2019 may surprise. It’s the Top500, and maybe the occasional loss of steam is just part of the natural cycle. The November ’19 list was a bit of a yawn and perhaps not especially accurate. Summit (148 PF Rmax) and Sierra (96 PF Rmax) remained at the top. China’s Sunway (93 PF Rmax) and Tianhe-2A (61 PF Rmax) retained third and fourth. However, there were reports of systems from China that ran the Linpack benchmark and submitted results that would have put them atop the list but later withdrew them in an attempt to avoid additional blacklisting by the US.
It almost doesn’t matter. Not the trade war – that matters. However, handicapping world computing progress and leadership based on Top500 performance seems almost passé, as does treating the list as a showcase for startlingly new technologies. It is still a rich list with lots to learn from, but whether Linpack remains a good metric, whether the systems entered are really comparable, and whether taking top honors is worth the effort expended are tougher questions today. It would be interesting to get IBM’s candid take, given its success with Summit and Sierra on the Top500 and its less successful effort to turn smaller Summit look-alikes into broad commercial (system or processor) traction. Designing and standing up these giants isn’t trivial or cheap.
The list isn’t going away. We love lists. They have value. But the investment-reward ratio and the now potentially questionable bragging rights undermine the Top500’s value as anointer of the top dog in supercomputing. In some ways, the secondary lists (Green500 and HPCG) are more interesting and are crowding the spotlight. This is hardly a new gripe (mea culpa) but the critical mass of opinion may be shifting away from the value of the Top500. There was distinctly less buzz at SC this year around the latest list.
- AI in Science – the Next Exascale Initiative?
This year virtually all the major systems makers offered HPC-AI-aimed solutions – typically with one or two ‘supervisor’ CPUs and 4-8 accelerators. There are variations. Established chipmakers worked to beef up memory bandwidth, IO performance, and mixed precision capabilities. Multi-die packaging including the use of die with varying feature sizes in the same package started to take hold. Overall, these were continuations of existing trends with a more definite distinction emerging between training and inferencing platforms. One can see a menu of AI inference chips targeting specific applications coming.
On balance, the AI marketing drumbeat was even louder and more pervasive than last year.
More interesting were 1) efforts by the science community to start shaping a larger strategy to fuse AI with HPC and leverage the synergy, and 2) the flurry of AI chips in various stages of readiness coming from start-ups (Graphcore, Cerebras, NovuMind, Wave Computing, Cambricon, etc.). Many of these potential disrupters will find buyers. Intel, of course, just snapped up Habana Labs for $2B.
Let’s look at the new U.S. AI Initiative signed in September and ramping efforts to define a science strategy. The Department of Energy has held a series of AI for Science town halls led by Kathy Yelick (Lawrence Berkeley National Laboratory), Jeff Nichols (Oak Ridge National Laboratory), and Rick Stevens (Argonne National Laboratory) seeking input from a broad science constituency from academia, government, and industry. In theory, this could lead to a funded AI program modeled loosely on the current Exascale Initiative.
Their formal report was due by the end of 2019 but has been pushed slightly into January when we’ll get the first glimpse of the recommendations which are likely to encompass hardware, software, and application areas.
Stevens told HPCwire, “Clearly there’s huge progress in the internet space, but those Facebooks and Googles and Microsofts and Amazons and so on, those guys are not going to be the primary drivers for AI in areas like high-energy physics or nuclear energy or wind power or new materials for solar or for cancer research – it’s not their business focus. We recognize that the challenge is how to leverage the investments made by the private sector to build on those [advances] to add what’s missing for scientific applications — and there’s lots of things missing. And then figure out what the computing community has to do to position the infrastructure and our investments in software and algorithms and math and so on to bring the AI opportunity closer to where we currently are.”
Meanwhile test beds are sprouting up in various national labs and NDAs are being signed with most of the new AI chip crowd. ANL is a good example.
“We’re setting up an AI accelerator test bed [and] it’s going to be open to the whole community. The accelerator market is filled with companies, right. Our intent is to populate the test bed with many of these as we can get working hardware from. So this one’s [Cerebras chip] a slightly different situation. It’s not nominally aimed at the test bed, but it’s actually a working system for us to do our hardest AI problems on.”
After all the ‘AI’ groundwork done by hyperscalers and the enterprise community, it will be fascinating to watch what contributions the science community makes.
Is AI the new Exascale? We’ll see.
- IBM & Intel – Two Giants with Giant Challenges
That Intel and IBM are large, impressive companies with formidable reach is a given. It’s a massive mistake to underestimate their strengths or to dismiss them. But stuff happens. Markets shift. Technologies plateau. Longtime rivals and upstart competitors all clamor for a share of the pie. Both Intel and IBM are in the midst of massive pivots that encompass their HPC and other activities.
First, Intel. For decades it has been king of the microprocessor market – leveraging design and manufacturing technology leadership to claim a mid-to-high 90s percent market share in CPUs along with an extensive portfolio of other semiconductor products. The decline in Moore’s Law, the rise of heterogeneous computing architectures, product and process missteps (e.g. KNL and OmniPath, both discontinued, Lustre stewardship now ended), along with reinvigorated rivals (principally AMD and recently Arm) have sent shock waves through the company.
You may recall Bob Swan was elevated from interim to permanent CEO last January. Speaking at the Credit Suisse technology conference this December, Swan said he wants change. This quote is from Wccftech:
“We think about having 30 percent share in a $230 [silicon] TAM that we think is going to grow to $300B [silicon] TAM over the next four years. And frankly, I’m trying to destroy the thinking about having 90 percent (CPU market) share inside our company because I think it limits our thinking, I think we miss technology transitions. We miss opportunities because we’re in some ways pre-occupied with protecting 90 instead of seeing a much bigger market with much more innovation going on, both inside our four walls and outside our four walls.
“So we come to work in the morning with a 30 percent share, with every expectation over the next several years that we will play a larger and larger role in our customers’ success, and that doesn’t just (mean) CPUs. It means GPUs, it means AI, it does mean FPGAs, it means bringing these technologies together so we’re solving customers’ problems. So we’re looking at a company with roughly 30 percent share in a $288 silicon TAM, not CPU TAM but silicon TAM. We look at the investments we’ve been making over the last several years in these kind of key technology inflections: 5G, AI, autonomous, acquisitions, including Altera, that we think is more and more relevant both in the cloud but also AI the network and at the edge.”
There have been key executive changes. Rajeeb Hazra, corporate VP of Intel’s Data Center Group and GM for the Enterprise and Government Group, is retiring and his replacement has not yet been named. Gary Patton is leaving GlobalFoundries, where he was CTO and R&D SVP, to become an Intel corporate VP and GM of design enablement, reporting to CTO Michael Mayberry.
Don’t count Intel out. It has enormous technical and financial capital. At SC19, Intel debuted its new Xe GPU line with Ponte Vecchio as the top SKU aimed at HPC. There are plans for many variants aimed at different AI applications, with the first parts expected to reach market this summer in a consumer application. Ponte Vecchio will be in Aurora. Intel’s Optane persistent memory product line is showing early traction. Of course, the Xeon CPU family still dominates the landscape despite process hiccups; on the minus side, AMD is now aiming for double-digit market share and Arm is mounting a surge into the datacenter.
So the news on the product front is mixed.
That said, Intel is getting good marks for playing nicely with collaborators in its efforts on OpenHPC, the plug-and-play stack it championed that is now reasonably well established, and on the more nascent Compute Express Link (CXL) CPU-to-device interconnect consortium. How oneAPI fares will be interesting to watch. Intel has a lot riding on it for its GPU line, and presumably oneAPI will be how one ports apps to it.
Intel has its fingers in so many pies that foreseeing the company’s trajectory is no easy task.
IBM’s challenges are somewhat different. Its gigantic $34B purchase of Red Hat, which closed in July, seems to be working as IBM seeks to embrace all things cloud and many things open source and Linux. The new question, really, is how does HPC fit into IBM’s evolving worldview and strategic plan.
Big Blue made massive bets here in the past. The latest chapter roughly starts with its decision to get out of the x86 business by selling it to Lenovo (servers in ’13, PCs in ’05). It gambled on the IBM Power microprocessor line and on leading the OpenPOWER Foundation (with Nvidia, Mellanox, Tyan, and Google). The intent was to create an alternative to the x86 ecosystem.
Where Intel was playing a closed-architecture, one-size-fits-all game, argued IBM, it would take a more collaborative and open approach. No doubt Intel would dispute that characterization. At SC15, longtime IBM executive Ken King argued the Power/OpenPOWER upside was huge, citing an ambitious 20-30 percent market share target.
By SC16 many of the pieces were in place – OpenPOWER had 250-plus members, Power8+ with NVLink technology was out, work on OpenCAPI had started, IBM had launched the Minsky server with Power8, and Google/Rackspace announced plans for a Power9-based server supporting OCP. A confident King told HPCwire at SC16, “This year (2017) is about scale. We’re [IBM/OpenPOWER] only going to be effective if we get to 10-15-20 percent of the Linux socket market. Being at one or two percent won’t [do it].”
Skip a year and fast forward to ISC 2018 when Summit, the IBM-built supercomputer for the CORAL program, regained the top spot on the Top500 for the U.S. It was a stunning achievement. Summit has held the Top500 crown on the last four editions of the list and it is churning out impactful science.
Problem is, the market traction needed never adequately materialized. We won’t go into all the reasons but cost, the effort to port apps (at least early on), and competitor pricing were all factors. Also the rise of accelerators (GPUs mostly) made CPUs look a bit mundane – not a good thing given the development costs needed to keep advancing the Power CPU line. Likewise AI, a persistent rumor but hardly a thunderous echo when IBM made its Power/OpenPOWER bet, burst onto the scene with unexpected force. All this while IBM’s cloud effort lagged its hopes and garnered attention in the C-suite.
This year at SC19 IBM introduced no new Power-based systems and provided no update on Power10 chip plans. The OpenPOWER Foundation has been moved to the Linux Foundation. IBM didn’t win any of the exascale awards; this shocked many observers, given Summit’s success. Lastly, longtime IBMer Dave Turek, now vice president of high performance and cognitive computing, described a startlingly new IBM HPC-AI strategy that sounded unlike typical IBM practice. It emphasizes selling small systems – as small as a single node, said Turek – to a much larger customer base. It’s based on using IBM AI software to analyze host infrastructure and application performance and to speed those efforts with minimum intrusion on the customer’s existing infrastructure.
This is from a Turek interview with HPCwire:
“[What] we’re trying to do strategically is get away from this rip and replace phenomenon that’s characterized HPC since the beginning of time…So we take a solution. It’s a small Power cluster. It’s got the Bayesian software on it. You have an existing, I don’t know, a Haswell cluster, a few years old, running simulations and because your last dose of capital from your enterprise was five years ago, you can’t get another bunch of money. What do you do?
“What we would do is bring in our cluster, put it in the datacenter, and bring up a database and give access to that database from both sides. So [a] simulation runs, it puts output parameters into the database. I know something’s arrived, I go and inspect that, my machine learning algorithms analyze that, and it makes recommendations of the input parameters. So next simulation, rinse and repeat. And as you progress through this, the Bayesian machine learning solution gets smarter and smarter and smarter. And you get to get to your end game quicker and quicker and quicker.”
The results, he says, can be game-changing: “[As a customer] I put it in a four-node cluster adjacent to my 2,000-node cluster, and I make my 2,000-node cluster behave as if it was an 8,000-node cluster? How long do I have to think about this?” You get the picture.
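For readers who want the loop Turek describes made concrete, here is a toy sketch. It is emphatically not IBM’s software: the `simulate` function, the table layout, and the `recommend` step are all hypothetical stand-ins invented for illustration, and the recommender is a plain parabola fit rather than a Bayesian method. What it does show is the shape of the workflow: a simulation writes results into a shared database, a separate analysis step reads them and proposes the next input, and the cycle repeats until nothing new is suggested.

```python
import sqlite3


def simulate(x):
    # Hypothetical stand-in for an expensive simulation run.
    # Lower output is better; the best input happens to be x = 3.
    return (x - 3.0) ** 2 + 1.0


def recommend(points):
    # Toy stand-in for the machine-learning side (NOT a Bayesian method):
    # fit a parabola through the three best (input, output) pairs seen so
    # far and propose the input at its vertex.
    (x1, y1), (x2, y2), (x3, y3) = sorted(points, key=lambda p: p[1])[:3]
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0:
        return None  # degenerate fit, nothing new to suggest
    return x2 - 0.5 * num / den


def run_loop(initial_inputs, max_rounds=10):
    # The shared database that both "sides" can read and write.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE runs (x REAL, y REAL)")
    for x in initial_inputs:
        db.execute("INSERT INTO runs VALUES (?, ?)", (x, simulate(x)))

    for _ in range(max_rounds):
        rows = db.execute("SELECT x, y FROM runs").fetchall()
        x_next = recommend(rows)
        # Stop once the recommender has nothing new to propose.
        if x_next is None or any(abs(x_next - x) < 1e-9 for x, _ in rows):
            break
        # Next simulation uses the recommended input: rinse and repeat.
        db.execute("INSERT INTO runs VALUES (?, ?)", (x_next, simulate(x_next)))

    best = db.execute("SELECT x, y FROM runs ORDER BY y LIMIT 1").fetchone()
    count = db.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
    db.close()
    return best, count


if __name__ == "__main__":
    best, count = run_loop([0.0, 1.0, 5.0])
    print("best input and output:", best, "after", count, "simulation runs")
```

In this contrived case the objective is itself a parabola, so the recommender lands on the best input after a single suggestion. Real simulations are noisier and higher-dimensional, which is presumably where a genuinely Bayesian approach would earn its keep.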
Many wonder if more IBM changes aren’t ahead and what role HPC will have in Big Blue’s future.
- Quantum – The Haze is a Little Clearer but Solutions Not Nearer
What would technology be without a few kerfuffles? QC had a dandy around Google’s claim of achieving quantum supremacy, to which IBM took vigorous exception and which spawned a tart response on the anonymous Twitter feed Quantum Bullshit Detector. Launched last spring, QBD is generally on the lookout for what it detects as folly in QC. No doubt there’s lots to detect in QC. There’s an interesting overview of QBD written by Sophia Chen in Wired.
Here’s a quick reprise of Google’s quantum supremacy claim, published in Nature: Google reported it was able to perform a task (a particular random number generation) using its 53-qubit processor, Sycamore, in 200 seconds versus what would take on the order of 10,000 years on a supercomputer. The authors “estimated the classical computational cost” of running the supremacy circuits with simulations on Summit and on a large Google cluster. IBM demurred and argued it discovered a way to do the same task in two days (still hardly 200 seconds) on Summit.
In a sense, who cares? Many argue quantum supremacy is a silly idea to start with. Maybe. Practically, the Google engineering work was important, even if it didn’t constitute achieving quantum supremacy. Let’s move on.
Last year the haze around quantum computing thickened rather than thinned from the previous year. Many feel the smoky scene is even worse now. But maybe not. Last year we were expecting clearer answers. This year we know better. Here’s why:
- Are we there yet? No. Actually, definitively no. There’s broad public agreement from virtually all the key players (IBM, Rigetti, Google, Microsoft, D-Wave) that practical applications lie years, many years, ahead. Leave aside the inevitable hype generated by government attention & funding (i.e. the $1.25B National Quantum Initiative Act signed a year ago). Jim Clarke, who leads Intel’s effort, says eight years is a good bet for reaching Quantum Advantage – the time QC is able to do something sufficiently better than classical computers to warrant switching. He may be optimistic.
- The problem – We love the mystery but don’t understand it. Quantum computing is inherently mysterious and therefore fascinating. But most of the public announcements, including growing qubit counts, don’t really mean much to most of us and, as importantly, don’t mean much in terms of catapulting QC forward. Even as very coarse progress milestones, the litany of papers and new larger systems and collaborations doesn’t tell us much yet.
Just this week, IBM announced a three-year agreement with Japan, led by the University of Tokyo, to foster QC development. This is the third such IBM international agreement. Are they important? Yes. Will there be practical quantum computing at the end of the first three years of this latest IBM-Japan effort? No. In September, D-Wave revealed the name of its forthcoming 5000-qubit quantum annealing system, Advantage, picked to evoke the idea of Quantum Advantage. Will it deliver?
I love this comment from John Martinis, who heads Google’s QC hardware effort, on the challenge facing (semiconductor-based superconducting) QC tech: “Breaking RSA is going to take, let’s say, 100 million physical qubits. And you know, right now we’re at what is it? 53. So, that’s going to take a few years.” Indeed, Google’s 54-qubit Sycamore chip actually functioned as a 53-qubit device during the supremacy exercise because one of the control wires broke.
What’s clearer today than last year, and more publicly agreed to by most of the QC community, is there won’t be some sudden breakthrough that makes quantum computing a practical tool soon. There’s lots of interesting, important work happening. Hyperscalers are getting more involved a la the AWS three-prong effort (portal for third party QC tech & services; hardware research collaboration; consulting effort to ID potential apps). Intel’s new cryo-controller chip is also interesting. Maybe it will become a component supplier to the QC world. S/W tools are edging forward. But…
Point is, QC is far from ready – we should watch it, not with a jaded eye but a patient eye that screens out hype. There’s great QC technology being developed by very many organizations along what will be a long journey.
- Bits and Pieces from around the HPC Community
MLPerf, the AI benchmarking effort gaining traction, issued an inferencing suite to go with its training exercises; the first inferencing results came out in November with Nvidia claiming a good showing. Check them out. Has anyone heard more on Deep500? After holding a well-attended session at SC18 to discuss formative ideas, there was no sign of it at SC19. Deep500’s intent is sort of evident from the title: create an AI benchmarking tool and competition spanning small to very big systems.
Here are some well-earned kudos. Robert Dennard won SIA’s Robert Noyce Award last year. He is, of course, the father of Dennard Scaling, which sadly has run its course, and perhaps more importantly, the DRAM. Here’s a link to a nice tribute. Dennard has had great impact on our industry.
As always, there were significant personnel changes – the departure of Rajeeb Hazra, longtime HPC exec at Intel, is one. It will be interesting to see what he does next. Barry Bolding, a former CSO at Cray, joined AWS as director, global HPC. Someone HPCwire has leaned on for insight about life sciences in HPC, Ari Berman, was promoted to CEO of BioTeam consulting – his team advised on the latest design of Biowulf, NIH’s now 20-year-old constantly evolving HPC system. Gina Tourassi was appointed as director of the National Center for Computational Sciences, ORNL. Congrats! Trump announced plans to nominate Sethuraman “Panch” Panchanathan to serve as the 15th director of the NSF.
Quick look at market numbers. HPC server sales in the first half of 2019 totaled $6.7 billion, while 2018 sales grew 15 percent overall, according to Hyperion. That’s a strong picture of health. Maybe more impressive (or scary) is that 11 hyperscalers spent more than $1 billion apiece on IT infrastructure in 2018; three spent more than $5 billion and one, Google, broke the $10 billion spend barrier, according to Intersect360 Research. The concentration of buying power with the cloud community is astonishing.
HPE’s Spaceborne computer returned home after 615 days on the International Space Station. It was a 1 TFlops system built with OTS parts to see if they could withstand the radiation. They did, in the sense that although error rates were higher than normal on the ground, they were manageable and the system was able to do real work. Nvidia didn’t launch any new monster-size GPUs but it gobbled up interconnect (InfiniBand and Ethernet) specialist Mellanox for $6.9 billion (deal hasn’t closed yet). Nvidia research chief William Dally talked about the company’s R&D strategy at GTC19 – perhaps not surprisingly, productizing is the central tenet.
Here’s a good closing note. The venerable Titan supercomputer was retired on August 1st. Housed at OLCF, Titan ranked as one of the world’s top 10 fastest supercomputers from its debut as No. 1 in 2012 until June 2019. During that time, Titan delivered more than 26 billion core hours of computing time. When launched, it represented a new approach that combined 18,688 AMD 16-core Opteron CPUs with 18,688 Nvidia Kepler K20 GPUs. OLCF Program Director Buddy Bland recalls, “Choosing a GPU-accelerated system was considered a risky choice.” Job well done.
Happy holidays to all.