I’m not averse to making predictions about the world of High Performance Computing (and Supercomputing, Cloud, etc.) in person at conferences, meetings, casual conversations, etc.; however, it has been a while since I stuck my neck out and widely published my predictions for the year ahead in HPC. Of course, such predictions tend to be evenly split between inspired foresight and misguided idiocy. At least some of the predictions will have readers spluttering coffee in indignation at how wrong I am. But, where would the fun in HPC be if we all played safe? So, here goes for the @hpcnotes predictions for HPC in 2018 …
After spending much of 2017 being called out for ambitiously high pricing of Skylake for HPC customers, and following that with months of Xeon Phi confusion – eventually publicly admitting at SC17 that Knights Hill had been cancelled, while remaining unclear about the future of Phi overall – Intel seems to have continued into 2018 in the worst way, with news of kernel memory hardware bugs flooding the IT news and social media space. [NB: these bugs have now been confirmed to affect CPUs from AMD, ARM and other vendors too.] 2018 will also see widespread availability of AMD EPYC, Cavium ThunderX2, and IBM Power9 processors, so it seems Intel has a tough year ahead. The hardware bug is especially painful here as it negates the “Intel is the safe option” thinking. To be clear, HPC community consensus so far (including NAG’s impartial benchmarking work with customer codes) says Skylake is a very capable and performance-leading processor. However, Skylake has three possible letdowns: (1) a price substantially higher, relative to the benefits gained, than customers are comfortable with; (2) reduced cache per core compared with other CPUs; (3) dependence on a code’s saturation of the vector units to extract the maximum performance. In some early benchmarks, EPYC and TX2 are winning on both price and performance. My prediction is that Intel will meaningfully drop the Skylake price early in 2018 to pull back into a competitive position on price/performance.
AI and ML
Sorry, the media and marketing hype for AI/ML taking over HPC shows no sign of going away. Yes, there are many real use cases for AI and ML (e.g., follow Paige Bailey and colleagues for real examples); however, the aggressive insertion of AI and ML labels into every HPC-related conference agenda (taking over from the mandatory mentions of Big Data) doesn’t add a lot of value, I think. I’m not suggesting that the HPC community (users or providers) ignore AI/ML – indeed, I would firmly advocate that you add these to your portfolio. But, HPC is an exceptionally powerful and widely applicable tool in its own right – it doesn’t need AI/ML to justify itself. My prediction is that AI/ML will continue to hog a share of the HPC marketing noise unrelated to the scale of actual use in the HPC arena.
As noted above, 2018 sees credible HPC processors from AMD (EPYC), Cavium (ThunderX2) and other ARM chips, and IBM (Power9) surge into general availability. In my view, these are not (yet) competing with Intel Xeon; they are competing with each other to be the best of the rest. Depending on how Intel behaves (NB: this is not just about technology) and how well AMD/ARM/IBM and their system partners actually execute on promises, one of these might close out 2018 being a serious competitor to Intel’s dominance of the HPC processor space. Either way, I predict we will see at least one meaningful (i.e., competitively won, large scale, for production use) HPC deployment of each of these processors in 2018. I’m also going to add a second prediction to this section: a MIPS based processor option will start to gain headlines as a real HPC processor candidate in 2018 (not just in China).
In most cases, HPC is still cheaper and more capable through traditional in-house systems than via cloud deployments. No amount of marketing changes that. Time might change it, but not by the end of 2018. However, cloud as an option for HPC is not going away. It does present a real option for many HPC workloads, and not just trivial workloads. I am hopeful we are at the end of the era where the cloud providers hoped to succeed by trying to convince everyone that “HPC in-house” advocates were just dinosaurs. The cloud companies all show signs of adjusting their offerings to the actual needs of HPC users (technical, commercial and political needs). This means that an impartial understanding of the pros and cons of cloud for your specific HPC situation is going to be even more critical in 2018. I am certainly being asked to help address the question of HPC in the cloud by my consulting customers with increasing frequency. Azure has been ramping up efforts in HPC (and AI) aggressively over the last few months through acquisitions (e.g., Cycle Computing) and recruitments (e.g., Developer Advocate teams), and I’d expect AWS and Google to do likewise. My prediction is that all three of the major cloud providers (AWS, Azure, Google) will deliver substantially more HPC-relevant solutions in 2018, and at least one will secure a major (and possibly surprising) real HPC customer win.
Nvidia also got an unwelcome start to 2018 as they tried to ban (via retrospective changes to license conditions) the use of their cheaper GPUs in datacenter (e.g., HPC, AI, …) applications. Of course, it is no surprise that Nvidia would prefer customers to buy the much more expensive high-end GPUs for datacenter applications. However, it doesn’t say much for the supposedly compelling business case or sales success of the high-end GPUs if they have to force people off the cheaper products first. We (NAG) have done enough benchmarking across enough different customer codes to know that GPUs are flat-out the fastest widely available processor option for codes that can take effective advantage of highly parallel architectures. However, when the price of the high-end GPUs is taken into account, plus the performance left on the floor for the non-accelerated codes, then CPUs often look a better overall choice. Ultimately, adapting many codes to use GPUs (not just a selected few codes to show easy wins) is a big effort. So is adapting workflows to the cloud. With limited resources available, I think users will decide that investing effort in cloud porting is a better long-term return than GPUs. Yes – oddly, I think cloud, not CPUs, will be the pressure that limits the success of GPUs! My prediction is that Nvidia’s unfortunate licensing assertions, coupled with marginal gains in performance relative to total cost of ownership (TCO), plus scarcity of software engineering resources, will mean that fewer newly deployed on-site HPC systems are based around GPUs. On the other hand, I think use of GPUs in the cloud, for HPC, will grow substantially in 2018.
Yes, really. After all, exascale is within grasping distance now. We will see multiple systems at >0.1 EF in 2018. Exascale is being talked about in terms of when and which site first, rather than how and which country first. As exascale now seems likely to happen without all those disruptive changes that voices across the community foretold would be critical, computer science researchers and supercomputer center managers will need to start using the zettascale label to drive the next round of funding bids for novel technologies. There have already been a few small gatherings on zettascale, at least as far back as 2004 (!), but I predict 2018 will see the first mainstream meeting with a session focused on zettascale – perhaps at SC18?
The consumer world was wracked in 2017 by a range of large scale cybersecurity breaches. The government community has been hit badly in previous years too. Sadly, I see cybersecurity moving up the agenda in the HPC world. Not sad that it is happening, but sad that I think it will be forced to happen by one or more incidents. In general, HPC systems are fairly well protected, largely because they are expensive, capable assets and, in some cases, have regulatory criteria to meet. However, performance and ease-of-use for a predominantly research-led userbase have been the traditional strong drivers of requirements, often meaning the risk management decisions have been tilted towards a minimally compliant security configuration. (Security is arguably one area where HPC-in-the-cloud wins.) My prediction for 2018 is twofold: (1) there will be a major security incident on a high profile HPC system; (2) cybersecurity for HPC will move from a niche topic to a mainstream agenda item for some of the larger HPC conferences.
I saw HPC and related things such as AI, cloud, etc., gain lots of momentum in 2017. This included several technologies heralded in confidence finally coming to fruition, new HPC deployments across public and private sector customers, a notable uptick in our HPC consulting work, interesting personnel moves, and an overall excitement and enthusiasm in the HPC community that had been dulled recently. My final prediction is that 2018 will see this growth and energy in the HPC community gather pace. I look forward to new HPC sites emerging, to significant new HPC systems being announced, and to the growing attention on the broader aspects of HPC beyond FLOPS – people, business aspects, impact stories, and more.
I hope you enjoyed my HPC predictions for 2018. Please do engage with me via Twitter (@hpcnotes) or LinkedIn (www.linkedin.com/in/andrewjones) if you want to comment on my inspired foresight or misguided idiocy. I’ll be back with a follow-up article in a week or two on how you can exploit these predictions to your advantage.