With 2021 underway, we’re looking to the future of high performance computing and the milestones that are growing ever closer. Every year, HPCwire names its annual list of People to Watch to foster a dialogue about our industry and give our readers a personal look at the hard work, dedication, and contributions from some of the best and brightest minds in HPC. These research efforts, accomplishments and technologies are shaping our future, and these are the people who are making it happen.
We present the HPCwire People to Watch 2021:
Congratulations on being named an HPCwire Person to Watch for the second time! You recently took a new position at Oak Ridge National Lab as the Section Head for Advanced Computing Systems Research and founding director of the Experimental Computing Laboratory (ExCL). Section head is, I think, a new designation at ORNL and it looks like your responsibilities have increased. Perhaps you can describe your new role and the role of section heads more generally? What are the goals of this organizational change and what are your near-term goals in the new role?
Yes, as part of the Reimagining ORNL campaign, ORNL added the position of Section Head of Advanced Computing Systems Research in the Computer Science and Mathematics Division. I applied and interviewed for this position because of my excitement about the future of computing systems. The community is entering a period of disruption, and that atmosphere always makes for great research opportunities. Fortunately, I was selected for the position. My ACSR section has approximately 65 people in six groups: Beyond Moores, Architectures and Performance, Programming Systems, Intelligent Systems and Facilities, Software Engineering, and Applications Engineering. By the way, we are hiring!
As you know, heterogeneous computing architecture – combining CPUs and various accelerators – has emerged as the dominant direction for advancing supercomputers. The forthcoming exascale systems are good examples. How far can these extreme heterogeneous architectures take us? What will be the main drivers – advancing accelerators as is the case today, whether GPU or other specialized processing devices, or will memory or interconnect, optical for example, be more important? What will post-exascale machines look like?
These are the dominant research questions for our section. CMOS is an incredible technology so it will be with us for a long time, but the architectures will be workload specific. We believe that domain specific architectures using technologies such as chiplets, reconfigurable computing, and open-source hardware like RISC-V are what we’ll see over the next 10 years. That said, we are also working in quantum, neuromorphic, cryogenic, and other areas to understand their contributions.
Perhaps more importantly, one of the major questions (and risks) is performance portability and user productivity on these very diverse architectures. Portable programming systems for this ‘extreme heterogeneity’ are in some ways the most critical question for HPC. If each system is different and more complex, and it takes heroic efforts to (re)program applications to use it effectively, then fewer applications will be able to benefit from HPC. I really believe that we need a significant, national effort to address these software challenges. Otherwise, applications and ultimately science will suffer.
What advice do you have for young computational scientists starting out, thinking about directions taken in grad school and in their early careers? What are the must-have basic skills and what are the most promising, but less proven areas that represent real opportunities for making contributions in science and advancing early careers?
Listing numerous hot technologies is futile because the technologies are changing so frequently. I expect that some of the hot technologies for 2026 probably do not even exist right now. Rather, I think that it is important to have the right mindset. I would say that this mindset includes being an agile learner, knowing how to work as a team with other experts, and taking initiative when there are a lot of unknowns. Just think where we were with machine learning five years ago. The initial release for PyTorch was in 2016 and now it is one of the hottest technologies in computer science!
Generally speaking, what trends and/or technologies in high-performance computing (and related fields, such as AI) do you see as particularly relevant for the next five years?
OK. If I had to list a few hot technologies, it would be open hardware IP and toolchains (e.g., RISC-V, OpenROAD), graph neural networks, 5G/6G wireless, LLVM/MLIR, and sensors everywhere.
Outside the professional sphere, what activities, hobbies or travel destinations do you enjoy in your free time?
It has been a strange year, so like everyone else, our family has had to adapt. I enjoy traveling, photography, horticulture, and autocross.
Congratulations on being named a 2021 HPCwire Person to Watch, and congrats on joining Amazon! Tell us about your role at AWS, your areas of responsibility; what is most challenging and most rewarding?
Thank you for the honor.
We are a customer-obsessed company, and in the spirit of that philosophy, my role is to listen deeply to our customers and make sure we build the best tools to enable them to exceed their goals. At the same time, we think we have a unique opportunity to cloudify HPC and redefine how the world does technical computing and HPC in a modern way.
Most challenging – This is an industry which moves fast and slow simultaneously. It is critical that we deeply internalize those dynamics and help our customers wherever they are in their journey.
Most rewarding – I have spent over 30 years trying to democratize HPC and technical computing. I believe AWS will make that a reality.
AWS was an early leading innovator in providing HPC cycles and services in the cloud. Now that HPC cloud is more established (and more competitive), what is AWS’s differentiation as an HPC provider?
We see lots of growth in traditional HPC industries and workloads, along with those infused with AI and with emerging, born-in-the-cloud organizations across a broad set of segments. Our focus is to expand our technology and service portfolio, our breadth and depth of capacity, and our usability, and to provide market-leading price performance through our own AWS-designed, Arm-based Graviton2 processors.
How do you see the relationship between HPC and AI, both broadly and more specifically at AWS?
There is a lot of noise about the convergence of AI and HPC, but it seems to miss the point. It’s really about a growing set of capabilities and requirements as well as evolving methods and techniques. HPC is changing due to the influence of new methodologies, such as deep learning, which are accelerating and deepening insights in science, engineering, and analytics in every industry we serve, from Healthcare and Life Sciences and Agriculture to Engineering, EDA, and Investment Banking. AI is not a replacement for, but rather a complement to, traditional simulation and modeling and is enabling scientists, researchers, engineers, and analysts to more profoundly comprehend the world around them. It is helping them make better and more informed decisions and catalyzing new ideas and capabilities. Obviously, massive data coming off of devices and systems is a key element. But equally important is how HPC affects and influences AI. As models grow in scale and complexity, HPC infrastructure becomes increasingly important both to power the computational requirements and to provide the analysis around provenance and transparency of trained models and their data.
Generally speaking, what trends and/or technologies in high-performance computing (and related fields, such as AI) do you see as particularly relevant for the next five years?
We are deep into a renaissance across just about everything we knew to be true in the past. Individually, these trends are important – but their collective impact cannot be overstated. And I would be remiss if I didn’t mention how COVID is one of the most important overarching trends we are experiencing. While its long-term impact is still unknown, what I can say with certainty is that the role HPC will play in everything from global public health and climate change to food security and cyber terrorism will increase significantly.
There seems to be no stopping the pace of innovation and diversification across the silicon and accelerator networks. It’s exciting. This is so much bigger than workload optimization (which, by the way, is also an important trend), and more about a fundamental redefinition of the underlying computational infrastructure and systems design. The influence of cloud-native technologies, massive scale, software-defined approaches, and disaggregation is having a profound effect well beyond the cloud platform suppliers, shaping customers and technology providers across all segments.
As I mentioned above, AI is a mega trend which will affect every industry in myriad ways, but I would also add other novel technologies such as neuromorphic and Quantum to that conversation. While QC is early, we are already seeing very interesting results, even with NISQ systems.
Equally important is how and where we compute. The Cloud will have the most significant long-term impact on HPC and technical computing segments. It changes the economics and scale, opening up limitless possibilities. Today most cloud usage is lift and shift. But, over time, through leveraging cloud-native technologies, the opportunity to reimagine dramatically new ways to solve problems will catalyze all sorts of innovation. It is literally like a perfect storm. It’s fun. It’s why I joined AWS.
Outside the professional sphere, what activities, hobbies or travel destinations do you enjoy in your free time?
Family is everything. COVID has really put a fine point on that. We love the outdoors – hiking, skiing, and enjoying Martha’s Vineyard (one of our favorite places). And of course food – love to cook.
Alan Edelman PhD ’89 is an applied mathematics professor in the Department of Mathematics and the Computer Science and Artificial Intelligence Laboratory (CSAIL), and leads the MIT Julia Lab. His research includes high-performance computing, numerical computation, linear algebra, random matrix theory, and scientific machine learning. He is also chief scientist at Julia Computing. His computational thinking class has been widely viewed worldwide because of the unique way Julia combines computer science, mathematics, science, and engineering.
In 2004 he founded Interactive Supercomputing, and has consulted for Akamai, IBM, Pixar, and NKK Japan among other corporations. He teamed up with Jeff Bezanson PhD ’15, Stefan Karpinski, and Viral B. Shah to create the free and open-source Julia programming language, which solved the long-standing two-language problem in computing. Julia is as easy as Python and as fast as C. Julia is used at MIT, Stanford, BlackRock, US Federal Reserve, Federal Aviation Administration, NASA, Pfizer, AstraZeneca, Moderna, and many others for solving large-scale scientific problems.
The Julia creators received the 2019 James H. Wilkinson Prize for Numerical Software. Edelman has also received the Householder Prize, the Chauvenet Prize, and the Charles Babbage Prize. He received the 2019 IEEE Computer Society Sidney Fernbach Award for outstanding breakthroughs in high-performance computing, linear algebra, and computational science and for contributions to the Julia programming language. Edelman is a fellow of ACM, SIAM, AMS, and IEEE.
Congratulations on being named a 2021 HPCwire Person to Watch, and congrats on the promotion! How do you and other managers at Dell view 2020, taken as a whole?
Thank you. While COVID made 2020 a challenging year for all of us, we’ve seen the amazing impact HPC has made in the fight against this disease. The pivot to work from home drove a lot of demand for technology, and we expect hybrid work environments to be the new normal for many.
As you articulate your strategic roadmap, what are the top 3-5 things you want to accomplish in the next 12 months to hit your strategic roadmap or increase your market share presence?
We are going to focus on high-growth verticals at the intersection of AI and HPC such as healthcare and life sciences, financial services, and government research labs and defense. We are also investing in cloud. We see this as a big growth area for HPC, and you will see a push from us to expand our presence in cloud.
Everyone’s talking about AI, machine learning and deep learning these days. What unique capabilities and advantages would you say Dell brings to organizations starting on their AI journeys? How do you differentiate yourselves from other systems providers competing for leadership in the AI market?
We have an extensive and sophisticated HPC & AI Innovation Lab where customers and partners can freely test applications and systems. It’s connected to our worldwide network of Customer Solution Centers with AI Experience Zones, and a network of HPC & AI Centers of Excellence so customers and partners can design, build and test systems and solutions for their specific environments and needs.
We have deep relationships with, and investments in, the HPC/AI community, including AI-focused startups, so that we can partner and collaborate on the cutting edge of AI-optimized technologies and solutions.
All of these speak to our mission of democratizing, optimizing and advancing HPC and AI so together, we can accelerate time to discovery and innovation.
How does Dell view the exascale era and what is Dell’s role in it?
Exascale is coming and right now, it’s just specialty systems. There are many problems, simulation and data-driven, that will require exascale capabilities. The TACC Frontera and ENI HPC5 systems—both in the Top 10 in the world—show we care deeply about large-scale systems, and we focus on bringing the power of exascale to everyone. The Cambridge Open Exascale Lab (a collaboration with Dell Technologies and Intel) aims at doing just that – developing exascale technologies and making them more accessible.
Generally speaking, what trends and/or technologies in high-performance computing (and related fields, such as AI) do you see as particularly relevant for the next five years?
We’ll see continued growth of optimized CPUs and accelerators as people try to squeeze in even more total performance, performance per watt, and performance per dollar. Driven by a thirst for more compute power for AI, simulation, and analytics, we’re seeing a lot of investment in silicon.
As everyone’s trying to get more and more performance out of the hardware, we’ll also see more direct liquid cooling. The constant thirst for computational power requires more electric power and generates more heat, resulting in the need for more liquid cooling. It will become more mainstream, and we already offer liquid-cooled servers, racks, and modular data centers.
We expect to see more optimization through software-defined configurations, composability that includes more virtualization and containers, and even some hardware composability, e.g., from Gen-Z and SmartNICs.
More will embrace hybrid cloud and as-a-service models, where on-premises remains fundamental, but augmented services can provide additional flexibility to meet demand. Project APEX is our strategy for delivering a radically simplified as-a-Service and cloud experience to our customers and partners. This spans PCs and IT infrastructure and it’s all from one trusted partner—unmatched in the industry.
Outside the professional sphere, what activities, hobbies or travel destinations do you enjoy in your free time?
I’m an avid boater and golfer. I also love to travel and have a lot of favorite destinations including Japan.
Congratulations on being named the General Chair of SC21 and a 2021 HPCwire Person to Watch! It’s an interesting year to be chairing SC21, with the conference ostensibly returning to an in-person event following the virtual SC20 conference. Could you tell us a bit about your history with the conference, and how you see its trajectory as we (hopefully) move toward a post-pandemic world?
I first attended SC98 when it was in Orlando, soon after I started working at LLNL. I have been on the Technical Papers Committee numerous times (on about half of the area committees). I have served in several other roles including as the SC17 Exhibits Chair and as the SC19 Finance Chair. I am proud to be involved in the best computer science conference in the world.
For SC21, let me start by saying that I expect that the vaccinations will enable us to have an outstanding in-person event in St. Louis in November. We have some really great things planned already, so if at all possible, all HPCwire readers should join us there.
Next, like everything else in the world, the conference will forever be significantly different from how it was prior to the pandemic. No longer can we view the risk that the conference will not be held in person as too remote to bother planning for (so we are planning for that possibility, although, again, I think it is unlikely). More importantly, SC21 has a unique opportunity to establish how in-person conferences should work. For example, SC21 will have a virtual platform that we will use to augment the in-person experience as well as to support participation by those unable to join us in St. Louis due to travel restrictions or health concerns. We will add unique video content that will only be available through that platform and that will provide more access to the leaders in our field than has been possible with previous conferences. We will use Q&A formats that foster inclusivity and that ensure the most interesting questions are heard. We have many other ideas for how to improve the conference and are actively planning how to implement them.
SC21’s tagline is “Science & Beyond,” which certainly seems appropriate on the heels of 2020. Why was this theme chosen for this year in particular?
While I do see the relationship to recent events, we chose that tagline by November 2019 before the pandemic was upon us. The theme is intended to capture the importance of large-scale computing not only to science (an area that is front and center for many SC attendees) but also to just about every aspect of our lives. Not only is it critical for policy decisions, as the pandemic has reinforced (cf. last year’s agent-based finalist for the Gordon Bell Special Prize) but it also has a role in the humanities and the social sciences. Our plan is to highlight these topics at SC21 as well as its continued contributions to the advancement of science.
It’s also been a busy year for LLNL. The lab deployed some new supercomputers and clusters and, like all of the national labs, hosted a lot of COVID-19 research while impacted by the virus itself. Could you tell us about your experience as CTO of a national lab during a pandemic?
I have spent more time in Livermore in the last year than in any prior year of my life. However, I have spent less time on the LLNL site in the last year than in any of the last 23 years of my life (like so many people, we have been in maximum telecommuting mode and much of my job can be done remotely).
Nonetheless, we continue to get a tremendous amount of work done. Not everyone in Livermore Computing (LC) has been working remotely and others have been on-site just enough to get those systems stood up. The last year has made me appreciate even more the outstanding team that we have and how lucky I am to work with them. As you indicated, we stood up a Top 100 system (Ruby) and we also expanded our Corona system with many AMD GPUs and CPUs, enabling it to have a peak capability of over 10 PFs (double precision). We had a team selected as a finalist for the Gordon Bell Special Prize for HPC-Based COVID-19 Research for significantly advancing our ability to train AI models, in order to assist in efforts that are using many of our new systems to improve our understanding of the virus.
Oh yeah, we have been moving forward aggressively with our plans to field El Capitan, the first NNSA exascale system, for acceptance in 2023. I could spend many interviews talking about nothing else but that, although it can easily be lost in all of our other work.
Heterogeneous computing architecture, largely dependent upon accelerators (GPUs mostly), has become the dominant approach to supercomputing (with the notable exception of Top500 leader Fugaku) and is the backbone of the U.S. exascale program. Where do you see computer architecture headed? What will be the follow-on to today’s dominant heterogeneous (CPU plus accelerator) landscape?
Let me start by saying that the dominant architecture today, as evidenced by ORNL’s Summit system and our Sierra system, is not particularly heterogeneous. Sure, compute nodes have two different types of compute devices (GPUs and CPUs), but the overall systems consist of thousands of compute nodes that are exactly the same – so they are really homogeneous!
Now, did I mention that we integrated a Cerebras CS-1 into our Lassen system and a SambaNova SN10-8R into Corona during the pandemic? Those are significant, novel AI accelerators that are now functioning as part of our two most capable unclassified systems. The key for us is not that they are outstanding systems in their own right (which they are) but rather that their integration into the larger systems enables us to pursue the two most significant directions for the future of large-scale systems.
First, LLNL is aggressively pursuing cognitive simulation, which builds AI/deep-learning models into physics-based simulations to optimize the overall workflow. We have been actively using the models as part of our uncertainty quantification efforts for a few years now (the recent, most successful NIF shot ever benefitted from some of that work). More importantly, take a look at the Best Paper Award winner from SC19 – that work begins to show how we can build AI into the simulation itself. We are now actively determining just how fine-grained we can use those models within our simulations and our expectation is that accelerators like we have deployed will enable their use at the individual time step level.
Second, for several reasons (cost not being the least of them) we would not want to deploy accelerators like those on every compute node. So instead, we are pursuing heterogeneous system architectures that no longer consist of the same compute node deployed as many times as we can afford. Rather, the system architecture is similar to the disaggregated systems used in cloud resources with different types of compute resources. The difference is that we are enabling them to be used in a relatively tightly coupled fashion by deploying them on the same high-speed interconnect and exploring more advanced networking topologies as exemplified by the Rabbit modules that we will deploy in El Capitan. This system model allows us to deploy whatever overall mix of devices will best serve the workload for which we are designing the system. We are developing that system model and the key software such as the Flux resource manager to make this type of architecture work.
We are also exploring critical directions for applications to exploit such systems. The key will be to architect that software for maximum asynchrony, which will facilitate using the right hardware for each part of the workflow. However, it will also involve new challenges for numerics as well as for the system software.
Overall, I think we are on the cusp of one of the most exciting – and challenging – eras in the history of our field.
Generally speaking, what trends and/or technologies in high-performance computing (and related fields, such as AI) do you see as particularly relevant for the next five years?
I guess I already answered this question. You will have to forgive me – I tend to get carried away once I start talking about things that I find exciting.
Outside the professional sphere, what activities, hobbies or travel destinations do you enjoy in your free time?
In my free time, I work for Queen’s University, Belfast in Northern Ireland.
Seriously, I play bridge frequently and I am an avid solver of puzzles. I also really like traveling, which I usually get to do for work on a regular basis, but the pandemic has stopped that. I can’t wait to get back on the road…
Congratulations on being named a 2021 HPCwire Person to Watch! This was a big year for Altair with several acquisitions, including Univa. Could you give a recap of 2020’s major business milestones and explain how these acquisitions have strengthened Altair’s position?
This year has been one of exceptional advancement and investment in terms of how Altair helps customers manage their HPC resources, both on-premises and in the cloud. In addition to acquiring Univa and Ellexus, we released the most significant update to Altair PBS Professional to date.
By combining Ellexus’ advanced analytics for HPC with the telemetry provided by Altair PBS Works, the Univa product range, and the Altair Accelerator suite, we now offer unmatched insight into job-level telemetry. We’re providing customers in all industries the tools to support a broader spectrum of HPC workloads, and to optimize their enterprise computing investments across new dimensions like input/output (I/O) profiles, storage, software licenses, hybrid cloud resources, and more.
Additionally, Altair continues to expand in industries including finance, life sciences, and oil and gas. We are also experiencing considerable growth in the semiconductor sector. Our HPC investments have augmented our technology and our team of technologists, further advancing the first-rate, commercial-grade experience Altair is known to provide customers across the globe.
While 2020 was filled with unexpected challenges for everyone, it’s been an incredible year for our enterprise computing business. We have seen demand for the optimization of HPC environments increase and expect the trend to continue, and even accelerate, as business operations resume a new “normal.”
You founded and have led a successful technology company for 35 years, taking Altair public in 2017 — what are your guiding principles for maintaining, innovating and growing the company?
Altair’s entrepreneurial spirit and technical prowess have propelled our growth during the last 36 years. We prioritize seeking technology and business firsts, from establishing a simulation-driven design method that has transformed the product development lifecycle to pioneering a flexible, patented units-based licensing model that revolutionized how our customers use software by lowering barriers to adoption and creating broad engagement.
We believe taking calculated risks and bringing outstanding technology to the market will reap the greatest rewards and result in our long-term, sustainable success. Altair leads the pack in R&D investments – exceeding 25 percent of our revenue – to stay on the cutting edge of technology. Our growing list of partnerships – with companies including NVIDIA, HPE, Intel, Google, AWS, Oracle, and Microsoft – and acquisitions (we have acquired 30 companies or strategic technologies, including 22 in the last five years) give us a differentiated, open-architecture solution portfolio.
Altair has been putting the pieces in place for our customers to leapfrog their competitors through HPC optimization for decades. Since acquiring the PBS Works workload management suite from NASA in 2003, we have been building the world’s most comprehensive HPC optimization toolset through development and acquisition.
We believe the evolution toward a smart, connected everything is changing the world. No one is better positioned than Altair to lead that evolution through the convergence of simulation, HPC, and artificial intelligence (AI) solutions.
What does 2021 have in store for Altair?
2021 will see the continued integration of our HPC optimization and workload management portfolio. As our customers apply HPC in new ways, CIOs, CTOs, and technical stakeholders are finding diverse HPC requirements converging on the same infrastructures. For example, this year customers representing some of the world’s largest HPC systems have already started implementing the new full-spectrum scheduling functionality available in PBS Professional to handle both long-running, multi-core jobs and incredibly high volume, high-frequency HPC jobs with the same workload management solution.
We expect new dimensions of HPC scheduling, such as storage-aware scheduling, to continue to gain traction in 2021. Customers have been telling us they have an increased need to manage storage more intelligently. We plan to expose more of our customer base to the optimization potential of considering I/O as part of their workload management strategy.
Lastly, we expect 2021 to be a milestone year for the convergence of HPC, AI, and simulation. The next phase of Altair One™, which provides collaborative access to simulation and data analytics technology plus scalable HPC and cloud resources in a single platform, is due out later this year. With best-in-class workload management and data analytics technology built right into the platform to make our software tools more powerful than ever, Altair One is the convergence of Altair’s unique expertise in action.
How does Altair help enable HPC in the cloud?
Altair has a number of solutions for helping customers get to the cloud: cloud bursting for hybrid computing environments, rapid scaling technology to align compute cost with demand, access portals for end users and admins, I/O profiling tools and migration automation technology, and turnkey, fully managed cloud appliances.
The Altair One platform is also essential to Altair’s cloud strategy, enabling a modular cloud journey for our customers – including the option to host everything from licensing and data to software and compute infrastructure in the cloud. We see customers land everywhere on that spectrum, from spinning up virtual appliances to augment on-prem resources for special projects to managing an entire cloud-only HPC infrastructure – all via Altair One.
As a cloud-agnostic leader in the workload management sector, we find the expertise of our engineers to be just as important to our clients as the technology. Our teams understand the nuances of different cloud providers and can help our customers navigate that complex landscape. Our relationships with all the leading cloud providers are fundamental to our success.
How is Altair responding to the need for AI and data science solutions?
Altair has been leading the application of data analytics and AI to improve business outcomes. Our simulation and AI-driven approach to innovation is powered by our broad portfolio of high-fidelity, physics-based solvers, best-in-class technology for optimization and HPC, and end-to-end platform for developing AI solutions.
In 2018, we acquired Datawatch, and as a result we now offer best-in-class stream processing and visualization and the most complete data preparation and automation solution on the market.
On the HPC front, we are helping customers build infrastructure to support AI workloads, for example through support of Kubernetes via our scheduling solutions. We also explore the application of AI to improve HPC tools – PBS Professional has a “simulate” feature that enables you to make projections about workloads and predict how long projects may take to complete in different infrastructural scenarios, for example on-premises versus in the cloud.
By connecting Internet of Things (IoT) technology with AI in a combined platform, we are streamlining the entire data pathway from sensors and other edge devices to autonomous control systems, digital twins, and operational decision-making. We give engineers real-time access to performance predictions and optimization for design and operational decision-making through digital twins using AI and simulation.
What trends and/or technologies in high-performance computing (and related technologies, like AI and cloud) do you see as particularly relevant across the next five years?
HPC optimization will become more multi-dimensional, with organizations going beyond scheduling for jobs and CPU and increasingly accounting for GPUs, cloud and hybrid cloud resources, storage, I/O profiles, and licenses.
Major new business systems will incorporate continuous intelligence using real-time context data to improve decisions. The lines between design, manufacturing, and operations will continue to blur as innovators seize new opportunities to apply AI and simulation data at every stage of the product lifecycle.
Exascale computing will enable new insight in areas including weather prediction, climate modeling, healthcare discovery and precision medicine, aeronautics, and space science.
Cloud adoption will continue to increase, especially as organizations apply new system telemetry and AI to automate migration.
Organizations will apply machine learning (ML) and AI to HPC to optimize for both time and cost to solution.
I love to spend as much time as possible with family. I also like to work out and golf.
Hi Jim, congrats on your new position as CTO & President of Tenstorrent and being named an HPCwire Person to Watch for the second time! Tell us about your role at Tenstorrent, your areas of responsibility, and what drew you to the company.
Thank you for this opportunity.
As CTO, I’m working on new technology at Tenstorrent. Following our roadmap, we have a chip (Grayskull) currently starting production. We are powering up on our second-generation part and designing our 3rd and 4th generation of processors as we speak. I’m spending my time working on all of these parts and system designs around them.
As President, I’ve been working with our growing team on business strategy. We’ve gained significant traction with various companies, system builders and their customers, which we can now start translating into revenue.
I was the first investor at Tenstorrent. Ljubisa Bajic (Tenstorrent Founder and CEO) and I go way back. We worked together at AMD and I was always impressed by his approach to building AI silicon. He knows how GPUs work, how the software works, and he also knows the math behind AI, which is a rare combination. That’s why I was interested in investing with him.
Personally, I think the AI revolution is bigger than the Internet. Joining Tenstorrent is a great way for me to contribute to it, and so far it’s been super fun.
With so many startups engaged in designing and commercializing AI silicon, what sets Tenstorrent apart?
There are a few different things to consider. First, and it took us a while to realize this, you have to get all the basics right at a very deep level: memory, compute and network bandwidth together with programmability.
We’ve talked to a number of customers who are frustrated about the current state of AI silicon at its core.
The second thing I really like is the approach to the software. It begins with a unique compiler and software strategy, with hardware designed around it properly.
Some AI chip companies build chips with lots of GFLOPS or TFLOPS, and then they design the software later.
But Tenstorrent has always been different. We build hardware in collaboration with software right from the start.
The original software team consists of people who worked at Altera on FPGA compilers and CAD tools, which are both very complicated problems; we have people from AI and also people who work on HPC computers. There’s a big presence of talent in Toronto from companies and institutions like Intel, Nvidia, AMD and the University of Toronto.
How does the Tenstorrent approach differ in terms of architecture, and in its combination of software and hardware? What is “Software 2.0” and how is it important?
What sets Tenstorrent apart is networking, data transformation and math engines of the software stack that work in sync with the hardware.
When you look at the Tenstorrent processor, it looks like an array of math processors, which is pretty common. There’s actually a real matrix multiplier and convolutional engine, so you don’t have to write programs to emulate that kind of math. The Tenstorrent engine does it naturally. It makes the number of programs you have to write for high performance lower because it runs the AI idioms of matrix multiply and convolution natively.
Then there are two units we call “Unpacker” and “Packer”, which are data transformation engines. Rather than writing programs to move bytes around, we have hardware that does it in a very straightforward way and presents a common data format into the math engine, which simplifies the programming.
And finally, networking is built into the Tenstorrent technology from the ground up. When all compute engines do their work, they have to send data somewhere – they send the data packet to the other engine.
We use the same on-chip and off-chip protocol to connect multiple chips together.
I first heard about Software 2.0 from Andrej Karpathy, the director of AI and Autopilot at Tesla, who coined the term.
His idea was that we’re going from a world where you write programs to modify data to where you build neural networks and then program them with data to do the things you want. So modern computers are literally programmed with data.
It means a very different way of thinking about programming in many places where AI has had so much success. I think in the Software 2.0 future, 90% of computing will be done that way.
There will always be some computing that runs standard C programs but more and more of the actual cycles will be done in AI hardware running what we think of as Software 2.0.
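As a rough illustration of the contrast described above, here is a minimal sketch in plain Python. It is not Tenstorrent or Tesla code; the task (Celsius to Fahrenheit conversion), the function names and the training settings are all hypothetical choices made for the example. The first function is written by hand in the "Software 1.0" style. The second version starts with no knowledge of the rule and recovers it from example data by gradient descent, so the "program" ends up being a pair of learned numbers.

```python
# Software 1.0: the programmer writes the rule explicitly.
def to_fahrenheit_v1(celsius: float) -> float:
    return celsius * 9.0 / 5.0 + 32.0


# Example data: (celsius, fahrenheit) pairs. In the Software 2.0 framing,
# this data is what "programs" the model. The values are illustrative.
SAMPLES = [(-40.0, -40.0), (-10.0, 14.0), (0.0, 32.0),
           (20.0, 68.0), (37.0, 98.6), (100.0, 212.0)]

SCALE = 100.0  # assumed input scaling so gradient descent is well conditioned


def train_linear(samples, steps=2000, lr=0.1):
    """Fit y = w * (x / SCALE) + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for celsius, fahrenheit in samples:
            x = celsius / SCALE
            err = w * x + b - fahrenheit
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Software 2.0: the behaviour lives in the learned weights, not in the code.
def predict_v2(celsius: float, w: float, b: float) -> float:
    return w * (celsius / SCALE) + b


if __name__ == "__main__":
    w, b = train_linear(SAMPLES)
    print("learned weights:", w, b)
    print("hand-written:", to_fahrenheit_v1(25.0))
    print("learned:     ", predict_v2(25.0, w, b))
```

A real neural network has millions or billions of such weights and far less tidy data, but the division of labor is the same: the code stays generic and the data determines what it computes.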
What is the status of Grayskull and Wormhole and what markets and use cases do these chips address?
We’ve started our first production run of Grayskull, which we’re sampling to our customers. Our chip goes on a PCIe card and we have 75 W, 150 W and 300 W form factors. People can buy and plug them into their server infrastructure. We’ve released our inference software, and in a month or so, we are going to release training software. It’s built for a broad variety of AI applications, both training and inference.
Wormhole is our 2nd generation part that is going to take Tenstorrent to the next level because it has native networking between chips and lets us scale from a single chip to many chip systems just using our own network. This greatly improves bandwidth between chips and lowers the cost of building a system.
What excites you most about being a computer architect right now?
I’m sort of amazed by this but I’ve been building and designing computers for 40 years. The complexity of the computers that we build today is so far past what we did or even considered hard 40 years ago.
The reason we can build these computers is that modern tools and software have gotten so much better. You can think of an idea, write down RTL, synthesize it and build it into a chip with a really small team.
People at one point thought there’d be so many transistors and things would be so complicated we wouldn’t be able to build silicon because it’d be too expensive. But the opposite is true. Tenstorrent built Grayskull and Wormhole as a very small team of really great people. They took a very clever approach to modularity and design. We have a relatively small number of units that we put together to make a very complex chip. The amount of change I’ve seen in the last 5 or 10 years of computer design is probably greater than the previous 20.
We’ve been through a lot of revolutions. I think the AI revolution is going to be the biggest one so far.
I like to be active and fairly physical – I kitesurf, snowboard. I like to run and work out. I find it’s almost meditative, especially when I’m working on a hard problem. I get the problem loaded up in my head and I go run or snowboard for four hours. Somehow or other, it sorts itself out.
I like to travel. I went to Egypt with my kids a couple of years ago, it was great. I went to Serbia last year, we had a really great time there before Serbia got shut down due to the pandemic. I often go to Hawaii to surf, and I really enjoy the beach. The last year has been tough on travel so we’ll see about next year.
Congratulations on being named a 2021 HPCwire Person to Watch! Benchmarking has long played a prominent role in HPC, e.g. the Top500, the Green500, etc. In recent years a variety of efforts to benchmark deep learning and machine learning technology have emerged, including, for example MLPerf.org. What’s your sense of the value of these efforts, will we see an analog of the Top500 for DL/ML, such as the Deep500 effort you’ve worked on, or will there be a proliferation of smaller programs such as MLPerf.org?
Thank you! When the Top500 was started, there was a general agreement that a LINPACK-style problem would be a major driver for progress in scientific computing. It took decades before the problem sizes grew too big and it stopped being representative of average application performance. Machine learning models today are changing at a much faster pace – just in the last decade, we moved from fully-connected to convolutional to now transformer models, with numerous architectural steps in between. A benchmark designed with any single model would be outdated within months. Deep500, an effort led by Tal Ben-Nun in my lab, is a framework to benchmark deep learning systems, architectures, and workloads. We often describe it as “500 ways to run your deep learning workload” – fixing a single benchmark for a ranking is hard. MLPerf and AI500 are well established – but they will need to evolve with the workloads to stay relevant. I hope that we can get historical information similar to what the Top500 list provides us with!
The Scalable Parallel Computing Laboratory (SPCL at ETH), which you direct, has a wide compass. Can you give us a sense of its mission, what some of its key current projects are, and how important industry is as a collaborator? Do you see its emphasis as primarily on research or on technology creation-and-transfer, or both?
SPCL covers all topics around compute performance – we focus on high-performance (1) deep learning and climate applications, (2) in-network processing, and (3) accelerator programming models. The goal is to make big steps in each of these areas. We love industry collaboration – this is where much of our inspiration comes from and where we, as a research institution, hope to transfer our discoveries back. Big cloud providers are an especially interesting opportunity today and in the future.
Heterogeneous computing architecture, largely dependent upon accelerators (GPUs mostly), has become the dominant approach to supercomputing yet the current Top500 leader (Fugaku) has a slightly different approach. Where do you see computer architecture headed? What will be the follow-on to today’s dominant heterogeneous (CPU plus accelerator) landscape?
Fugaku offers very high memory bandwidth, similar to GPU systems, which seems to be the most significant advance over other CPU-based architectures. In addition, it uses Cray-style vectors (ARM SVE), which, similar to the SIMT (GPU) and the more limited SIMD (x86 SSE) models, utilize on-chip parallelism.
In the near future, computer architecture is moving towards higher specialization, particularly optimized for artificial intelligence workloads. Matrix-multiplication units (a.k.a. tensor cores) will be added to CPUs very soon and more generic extensions such as our (indirection) Stream Semantic Registers will likely follow. I hope that the scientific computing community can benefit from those developments by understanding how to utilize low-precision datatypes and (sparse) tensor operations efficiently.
Besides the architectural trends mentioned above, I see two additional developments that may revolutionize the field: (1) Mega-datacenter cloud providers will realize that HPC is not far from their main business of selling compute at low cost and push into the market. Economy of scale will consume all low and medium workloads and may also attract some larger ones. (2) I sincerely hope that HPC will become much more accessible – for example, through the usage of high-productivity languages such as Python and reproducibility frameworks such as Jupyter notebooks.
I am a passionate mountain runner and I’m fortunate to have research and technical reading as my hobbies. My favorite place to be is New Mexico and I wish I could spend more time there.
Congratulations on being named a 2021 HPCwire Person to Watch, and congrats on the promotion! How do you and other managers at HPE view 2020, taken as a whole?
Thank you! I am so honored to be leading this incredibly talented team. Our High Performance Computing (HPC) and Mission Critical Solutions (MCS) business is a strong and strategic growth business for Hewlett Packard Enterprise, and an area where we have clear differentiation through our solution portfolio and commitment to R&D. I’m looking forward to our next chapter of growth and continuing to drive the best outcomes for our customers.
It goes without saying that the world massively changed in 2020, more than we ever could have imagined.
2020 was also a year that tested our resiliency and how quickly we could respond and adapt to the “new world”. If there was one big takeaway from it all, it was that we truly banded together to become a greater force for good on all fronts. We worked closely with our customers and partners, along with others in the government, academic and scientific communities, to address various needs. I am proud of the teams across HPE that stepped up to support a number of critical issues.
Here are just a few:
- Turned a cruise ship into makeshift hospitals with reliable networking in just five days
- Accelerated vaccine discovery with the use of high performance computing that reduces the drug development cycle from months to weeks, saving researchers hundreds of thousands of dollars
- Signed an Open COVID Pledge, which granted free access to all of HPE’s patented technologies for the purpose of diagnosing, preventing, containing, and treating COVID-19
- Quickly and efficiently supported a sudden, growing remote workforce worldwide by transitioning onsite, enterprise networks to secure, easy-to-manage virtual solutions
- Began bridging the digital divide by bringing high-speed wireless connectivity – a critical and necessary resource for the delivery of remote education, health and mental health services, and job opportunities – to rural areas across the U.S. as a member of the American Connection Broadband Coalition
Do you foresee the future strategy for enterprise installations as a hybrid of on-premises and cloud solutions, and what would that look like?
We are seeing more enterprises adopt a hybrid IT model, which spans traditional on-premises environments to everything as-a-service. I anticipate this trend will continue as organizations adopt new workload types that require higher performance, scale, and most importantly, agility to efficiently transition to new models. We also believe it’s important to ensure a consistent experience no matter the hosting environment for data and apps.
We are already addressing technologies as-a-service through HPE GreenLake, which combines an agile, elastic, pay-per-use cloud experience with our high-performing, secure servers and systems.
We recently announced HPE GreenLake Cloud Services for High Performance Computing (HPC), which we will make generally available this spring, to allow our customers to easily adopt HPC solutions through an on-demand platform. Through this hybrid model, we’re enabling any enterprise to gain fully managed HPC services allowing them to run compute and data-intensive workloads and train new AI and machine learning models to achieve outcomes faster.
How do you see the relationship between HPC and AI, both broadly and more specifically at HPE?
Over the last few years, there has been a growing trend of converging HPC and AI to tackle various types of workloads. By combining AI, machine learning or analytics with HPC workloads such as modeling and simulation, scientists, engineers and researchers can significantly accelerate time-to-insight and advance R&D across a number of industries.
We truly believe this approach can offer many benefits and open new opportunities for next-generation use of AI. For example, by applying AI or machine learning algorithms to simulations, researchers can improve image-resolution in complex simulations such as hurricanes, or gain accuracy in whittling down variables for potential drug candidates when conducting research for a vaccine or drug treatment.
We also see HPC being used to train AI and machine learning models. One of our customers, Pawsey Supercomputing Centre in Australia, focuses on a range of complex scientific research and engineering. Researchers benefit from HPC capabilities that enable key workloads, such as computational fluid dynamics, to develop models used to simulate fish swimming styles, including how fish increase speed and change direction. They then apply these capabilities in machine learning algorithms to train neural networks on underwater robotics.
We think HPC will continue playing a critical role in optimizing AI and machine learning efforts. It will continue to be used for everything from accelerating insights in supercomputing-led research to developing and training models that can be applied to innovations that improve our everyday lives and our understanding of the world around us.
How does HPE’s leading position in exascale computing benefit the wider scientific and commercial computing community?
It is not just about having the biggest and fastest computer on the planet. Exascale computing represents a whole new era of opportunities we never could have imagined five years ago.
So we’re collaborating closely with the scientific and commercial computing community, such as the U.S. Department of Energy’s Exascale Computing Project and the European High Performance Computing Joint Undertaking.
Together we share a vision that exascale computing will have tremendous impact on the global community and solve some of the world’s toughest challenges. The exascale magnitude of power and scale will help advance our approaches to research and speed insights to answer questions we never even knew to ask.
Exascale’s massive performance will also allow researchers and engineers to tackle new types of workloads, especially those involving AI and machine learning. That will greatly increase accuracy in simulations or better train models that people can use in R&D or in developing products to take to market.
More importantly, and I think many of us in the community would agree on this, the biggest, broader benefit to government, academia and commercial sectors is the new opportunity for science it will enable.
Almost every industry you can think of already benefits from supercomputing and that’s largely attributed to scientific breakthroughs that lead to innovation across sectors. For example, being able to simulate at deep molecular levels across chemicals and biological components helps discover new materials to improve safety, efficiency, and sustainability of products we depend on every day. Whether it’s materials to help advance car tires, airplane wings or even sunscreen lotion, these innovations all map back to scientific breakthroughs made possible by supercomputers. Being able to model and simulate at the exascale level will increase those discoveries.
While I can’t tell you exactly how the HPC industry will look in the next five years, I can tell you current and upcoming trends.
HPC customers of all sizes are looking for many of the capabilities and innovations enjoyed by contemporary cloud native infrastructure so they can accelerate the integration of AI and ML into their solutions for faster and more accurate insights. HPE is again leading this market with the announcement last December of our HPC aaS solution through HPE GreenLake. These pre-configured services, which will be released starting this spring, speed deployment of HPC projects by up to 75 percent while enabling customers to get the benefits of cloud usage and HPE’s unique technologies including the HPE Cray Programming Environment.
HPC is becoming more broadly adopted by enterprises. We have deployed our HPC systems for a number of scientific and commercial workload needs, spanning across various organizations such as large national laboratories, oil and gas companies, and AI software start-ups.
By bringing fully managed services to HPC, we think it will further accelerate adoption for more mainstream enterprise use. Supporting new, diverse architectures across CPUs, GPUs, FPGAs and other types of accelerators will also continue to drive initiatives involving complex compute and data needs.
As mentioned earlier, HPC can also be used to optimize AI, machine learning, and analytics. HPC systems are ideally suited to target deep learning workloads, and overall great for ingesting and processing large amounts of data that can be visualized into insights.
We see AI and machine learning playing a big role in driving adoption for HPC. Our customers are already deploying HPE Apollo systems to use HPC capabilities for training and optimizing a range of inferencing needs to benefit broader use of AI applications.
The growing use of purpose-built storage and networking solutions for HPC will also tackle a new wave of demanding workloads, including in next-generation supercomputing such as with exascale-class systems that will require additional support for optimal performance and efficiency.
These technologies will be essential to support growing data needs. Networking technologies can also address demands for higher speed and congestion control for data-intensive workloads, which is what we are targeting with HPE Slingshot. Advanced storage features will also be a growing trend to meet high-performance storage requirements of any size with significantly fewer drives, and we are addressing this with the Cray ClusterStor E1000 storage system.
I am really looking forward to traveling and seeing people in person again, once it is safe enough to do so. In my free time, I love to spend as much time as I can outdoors with my family.
Margaret Martonosi is the Hugh Trumbull Adams ’35 Professor of Computer Science at Princeton University, where she has been on the faculty since 1994. She is also Director of Princeton’s Keller Center for Innovation in Engineering Education, and an A. D. White Visiting Professor-at-Large at Cornell University. From August 2015 through March 2017, Martonosi was a Jefferson Science Fellow within the U.S. Department of State.
Martonosi’s research interests are in computer architecture and mobile computing. Her work has included the widely-used Wattch power modeling tool and the Princeton ZebraNet mobile sensor network project for the design and real-world deployment of zebra tracking collars in Kenya. Her current research focuses on computer architecture and hardware-software interface issues in both classical and quantum computing systems.
Martonosi is a Fellow of IEEE and ACM. Her papers have received numerous long-term impact awards including: 2015 ISCA Long-Term Influential Paper Award, 2017 ACM SIGMOBILE Test-of-Time Award, 2017 ACM SenSys Test-of-Time Paper award, and the 2018 (Inaugural) HPCA Test-of-Time Paper award. Other notable awards include the 2018 IEEE Computer Society Technical Achievement Award, 2010 Princeton University Graduate Mentoring Award, the 2013 NCWIT Undergraduate Research Mentoring Award, the 2013 Anita Borg Institute Technical Leadership Award, and the 2015 Marie Pistilli Women in EDA Achievement Award. In addition to many archival publications, Martonosi is an inventor on seven granted US patents, and has co-authored two technical reference books on power-aware computer architecture. Martonosi completed her Ph.D. at Stanford University.
Congratulations on being named a 2021 HPCwire Person to Watch! Both the NIH and scientific HPC applications certainly took center stage over the past year. As the point person for HPC within the NIH, can you tell us a bit about your experience helping such a critical institution navigate the pandemic?
Thank you, I am thrilled to be named a 2021 HPCwire Person to Watch. HPCwire is a great forum to learn the latest and greatest on the intersection of data, computing, AI, and modeling and simulations. I’m equally thrilled to help lead NIH in our data and computing strategies. These fields, and the industries behind AI and computing, change rapidly, and so most of my role at NIH focuses on coordinating and seeding change to take advantage of new opportunities – such as Cloud Platform Interoperability and Research Auth Services.
The advanced technologies that accelerated research during the pandemic are receiving greater-than-ever attention and funding. On the whole, are we witnessing a paradigm shift? How persistent are these changes in the way we conduct medical and pharmaceutical research, particularly at the federal level?
I would say that the COVID-19 pandemic has shone a bright light on a number of challenges that we are rapidly working to address – such as finding patient data across multiple data platforms, harmonizing clinical data from many contributors, and streamlining access to those data. I would anticipate that the work we are doing because of COVID will have long-lasting and positive impacts on research well into the future.
Another big trend in HPC over the past year has been large-scale collaboration between institutions and the streamlining of those processes. Can you talk about the roles that you and the NIH have played in fostering those collaborations and how they play into the proliferation of advanced technologies?
I sincerely commend the HPC Community on coming together to provide HPC resources to address COVID-19, through the COVID-19 High Performance Computing (HPC) Consortium. Here at NIH we have pulled together most of our Institutes and Centers to address COVID in a number of impactful ways including RADx – the Rapid Acceleration of Diagnostics to speed innovation in the development, commercialization, and implementation of technologies for COVID-19 testing. And recently we launched the PASC initiative – a new initiative to identify the causes and means of prevention and treatment of Post-Acute Sequelae of SARS-CoV-2 infection. Underpinning these efforts are the data that are a result of the research efforts, clinical trials, and clinical observations. Making these data harmonized, findable and accessible is an important goal that we are addressing through activities such as the N3C (National COVID Cohort Collaborative) and new activities in PASC for a data resource core and RADx to develop a data hub for sharing testing and related de-identified data with the research community.
An interesting challenge is to harness the ability of AI to address research questions that require bringing clinical, electronic health related data, and basic biological data together for new discoveries such as addressing health equality and fairness, developing transparent models with confidence and reliability, understanding temporal disease progression, predicting adverse reactions, and finally providing a fertile ground for new investigator-initiated questions that require these data and algorithms.
These grand challenges are not just data challenges; they are computing challenges, privacy challenges, ethical challenges, and new challenges for educators and researchers. I would like to see if we can shift the paradigm of research to integrate ideas from industry, academia, and communities.
I have a lot of fun in and outside of my professional life. For example, I love gardening and each year I add a new rose bush to my expanding collection. I love beer and have been a home brewer for over 20 years. My last batch was a modified Dodgy Wanker – it’s my take on an English Pale Ale. Pre-COVID I loved to travel with my husband and two adult kids (18 and 21). One special place that we return to every few years is Cape Hatteras village on the Outer Banks of North Carolina. My husband and I met and married in graduate school and our favorite time together during grad school was camping in the various campgrounds on the Outer Banks. There is probably nothing quite as wonderful as a cold beer, a pot of fresh caught crabs (yes I do crab!) and a really great mesh tent to keep out the many mosquitos.
Congratulations on being named a 2021 HPCwire Person to Watch, and congrats on the recent promotion to VP at Intel. Tell us about your new role, your areas of responsibility; what is most challenging and most rewarding?
Thank you very much for this honor. I am humbled to be recognized as a 2021 HPCwire Person to Watch.
I lead a team responsible for all AI, HPC and data center Accelerator Solutions within the sales organization at Intel. Our team works with a variety of customers to deploy Intel-based technologies from the intelligent edge to the data center. In this role, I have the pleasure to collaborate with our customers to create and deploy world-changing technologies that enrich the lives of every person on earth.
The most challenging aspect of my job is also the most enjoyable and rewarding – continuing to innovate to meet our partners and customers’ needs. Our customer obsession is paramount in everything we do. Our team is dedicated to executing flawlessly to our customer commitments.
In your role, you have a direct impact on HPC and AI. What are your expectations for 2021 with regard to these areas?
We see the AI and HPC market continuing to accelerate in 2021. The convergence of HPC, AI and analytics provides a tremendous opportunity for our customers to make significant scientific discoveries. 2020 was a watershed moment for high-performance computing, as we saw the scientific community around the world tap into HPC systems for things like drug discovery, treatments, and genomics research. I anticipate even more significant findings will take place this year that will have a positive impact on all of us. I’m also excited for the work being done in exascale computing. Achieving exascale computing has been a long journey for the industry, and the work that we’re doing on the hardware and software side is making it a reality.
With more Xe GPUs coming out, it’s surely an exciting time to be managing datacenter accelerator solutions at Intel. Can you give us an abbreviated update of what this accelerated solutions portfolio encompasses? And what’s the status of the Xe roll-out?
Intel’s Xe architecture represents a full portfolio of GPUs that scale in performance to support a broad range of market segments and workloads – from integrated graphics in PCs to the most demanding data center applications such as HPC and AI. Intel’s Xe architecture can be categorized into four microarchitectures. They are Xe-LP optimized for low power, Xe-HP optimized for high performance, Xe-HPG optimized for enthusiasts and gaming, and Xe-HPC optimized for HPC/AI acceleration.
Three products based on the Xe-LP microarchitecture are in market today for client and are inclusive of the Intel Server GPU that enables high-density, low-latency Android cloud gaming and high-density media transcode/encode for real-time over-the-top (OTT) video streaming. We have also made significant progress this last month with bring-up and validation of Ponte Vecchio. With over 100B transistors in a single package, Ponte Vecchio is a petaflop-scale AI computer that fits in the palm of your hand. Our CEO Pat Gelsinger highlighted Ponte Vecchio at his Intel Unleashed presentation the other week, and excitement continues to build, as it will help power Argonne National Laboratory’s Aurora exascale computer – driving foundational scientific breakthroughs, innovation and discovery.
How have your customers reacted to the delays for 7nm? What can you say about Intel’s intentions?
As we’ve said before, 7nm is progressing well. Customers continue to rely on Intel to solve their most challenging business needs with innovative solutions. We are committed to being the leader in every category in which we compete and executing flawlessly to our commitments to our customers.
What trends and/or technologies in high-performance computing (and related technologies, like AI and cloud) do you see as particularly relevant across the next five years?
I see two critical inflection points in the industry: pervasive, affordable AI in every application using technologies such as Intel Deep Learning Boost, and the use of large, in-memory data access using technologies such as Intel Optane PMEM instead of splitting compute/data across nodes or swapping from local and/or remote storage, resulting in a huge increase in productivity. In the longer term, low-power neuromorphic computing can revolutionize AI at the edge, and quantum computing can provide the exponential increase in computing that can drive exciting new revolutionary advances in materials and science. I would especially like to call out the extensive effort that Intel and our ecosystem are investing in open-standards, heterogeneous, scalable software stacks that work from the intelligent edge to cloud.
Beyond work, my passions revolve around sports (weekend golf and cricket), supporting my wife in achieving her weight-lifting goals, and spending time with my two beautiful children and furry four-legged family members.
Congratulations on being named a 2021 HPCwire Person to Watch! EuroHPC made the most of 2020, with a slew of major system announcements landing toward the end of the year. Could you talk about the trajectory for the Joint Undertaking as these pre-exascale machines start to become reality and the exascale era nears?
First of all, let me thank you for selecting me as an ‘HPCwire Person to Watch in 2021’. I am honoured!
Last year, two years after its establishment, the European High Performance Computing Joint Undertaking became independent from the Commission which helped to set it up. Since September 2020, we have signed procurement contracts for seven world-class supercomputers that will contribute to the EU Digital Decade agenda and strengthen Europe’s digital autonomy.
Vega, the petascale system that is being built in Slovenia in collaboration with the Institute of Information Science in Maribor, was delivered on 10 March and will be operational in early April. It will be the first EuroHPC-JU supercomputer to come online, less than half a year after contract signature and despite Covid-19 constraints.
The Vega system is the first of eight supercomputers to be installed in Europe funded under the current EuroHPC-JU Regulation. We also expect the supercomputers in Luxembourg, Czech Republic and Bulgaria to come online soon.
I took over the management of the EuroHPC JU in September 2020 and have also been busy getting to know the members of the Joint Undertaking, which are the Participating States, the Private Members and the Commission. The Joint Undertaking will only function if all these stakeholders are in agreement. This is not always easy considering the different interests they all have, but it is critical if EuroHPC-JU is to achieve its objectives.
Looking forward, the new Regulation that is currently being discussed at EU level will be a further step up in our efforts for a world-class ecosystem in Europe. The proposal will allocate €8 billion of EU funding to continue building a world-leading European HPC ecosystem.
With these funds, we will not only develop a world-class exascale and post-exascale HPC infrastructure, but also deploy a quantum computing and quantum simulation infrastructure, and make these resources accessible to public and private users across Europe. We will also support the development of technologies and applications to underpin the supercomputing ecosystem, and exploit the synergies of HPC with AI, big data, and cloud technologies. We will improve the awareness, knowledge, and training in HPC. All in all, we will ensure the development of top-of-the-range HPC infrastructures and technologies in Europe for the next decade, to maintain Europe’s position in the global race towards exascale, post-exascale and quantum computing capabilities.
The pandemic has presented both challenges and opportunities for the European HPC community. How did EuroHPC transform throughout the pandemic, and how, if at all, do you think COVID-19 has affected the future of the Joint Undertaking?
Generally, the pandemic has indeed been a challenge in the last year, affecting supply chains, projects, events and in general, the way we work.
As in many organizations, the pandemic most certainly accelerated the digital transformation of EuroHPC-JU. The JU became an autonomous institution in the midst of the pandemic, and I have yet to physically meet many of my staff, and we have yet to all come to the EuroHPC-JU offices or meet in the same room. This means that all processes and procedures for the JU have been established within the constraints of the pandemic and using technology to the utmost possible extent. While I cannot wait for the pandemic to be over, and for us all to be able to meet physically, I am confident that we will retain many of the working practices we have built up due to the pandemic, and that the EuroHPC-JU will be a stronger and more agile organization because we were set up and started our operations during the pandemic.
It has also been impressive to see how the HPC community has pulled together to undertake the great research to isolate the COVID-19 virus and identify drugs that can treat COVID symptoms. As none of the EuroHPC-JU machines were operational at the time, EuroHPC cannot claim any of the credit but was able to witness this work up close.
EuroHPC is aiming to build a homegrown hardware ecosystem for Europe while maintaining competitive, leadership standing. How is the JU balancing those dual priorities?
EuroHPC-JU will continue to foster research and development of European HPC technology that can compete on the global HPC market. The ambition is to lead in a number of technology areas where Europe feels it has fallen behind and to continue to strengthen its capacities further in areas where Europe is strong. By pooling European and national resources together, we are building a European HPC ecosystem which will be composed of a mix of European and non-European technologies to allow us to provide the best possible HPC ecosystem for our user communities. For example, Europe has a very strong ecosystem of HPC applications which ultimately deliver the value of HPC. By following a co-design approach there is great potential to develop competitive hard- and software that will align with the requirements of the European public and private sector.
Through the new MFF programmes for 2021-2027, with Horizon Europe, the Digital Europe Programme and the Connecting Europe Facility funds, EU Member States will together reach the next supercomputing frontier. These programmes will allow the Union to equip itself with a world-class federated, secure and hyper-connected supercomputing and quantum computing service and data infrastructure, and to develop the necessary technologies, applications and skills for reaching exascale capabilities and a quantum computing innovation ecosystem.
In the next few years, Europe’s leading role in the data economy, its scientific excellence, and its industrial competitiveness will increasingly depend on its capability to develop key HPC technologies and its present excellence in HPC applications. To make this happen, a pan-European strategic approach and coordination by EuroHPC-JU are essential.
The trend to look at performance per watt is here to stay and will only become even more relevant as we move into exascale systems. I also believe that we will see more accelerators specific to particular applications or algorithms being deployed on HPC systems in the years to come. The challenge will be to cope with the further increase in the complexity of the programming model that such accelerators will introduce to a model that is already very complicated. My hope is that developments on the software side will start to address this complexity and make HPC more accessible to non-expert users.
The link between data and HPC will also be increasingly important in the future. Larger and larger amounts of data are constantly being generated, and as a result, the nature of computing is changing, with an increasing number of data-intensive critical applications. HPC will be key to processing and analysing this growing volume of data, and to making the most of it for the benefit of citizens, businesses, researchers and public administrations.
I very much relax by enjoying outdoor activities, mainly hiking with my family and friends, and when the opportunity presents itself, sailing. Luxembourg is a fantastic place to live when it comes to hiking; unfortunately, the same cannot be said for my passion for sailing.
Since joining EuroHPC-JU, there hasn’t been much free time; however, when things calm down, I hope to get back to my two hobbies, electronics and model trains, which I have conveniently combined. I have equipped my trains with homegrown electronics using microcontrollers with Wi-Fi and my own software to control the trains. My trains are unfortunately still in moving boxes, but when they finally get unpacked, it will be high time for an upgrade, and I am really looking forward to restarting this work.
Hi Bill, congratulations on being named a 2021 HPCwire Person to Watch! You joined Google last year to lead their HPC strategy. Please give a brief overview of your role and what you’ve been working on.
Since joining, I’ve spent a fair amount of time simply discovering and learning. Google is a unique company, and I wanted to take the time to get to know its diverse products, technologies, and people. I’ve of course only begun to scratch the surface, but I now feel better equipped to understand the key challenges HPC users face with cloud and the opportunity that Google Cloud has to better serve the HPC community.
An important part of my role is to help drive Google Cloud’s HPC strategy and customer success, and I’m lucky to work with teams inside and beyond Google. I work closely with our product and engineering teams to ensure we have the right roadmap to address existing and emerging HPC use cases. I also spend time bringing the HPC user perspective and requirements into the broader Google Cloud product portfolio. I collaborate with our partnership teams to forge and deepen partnerships within the HPC ecosystem. And, of course, I spend a lot of time with customers, sharing our HPC vision and incorporating their feedback into our plans.
In the last few years, the tier one cloud providers have stepped up their adoption of HPC technologies (and talent!). Why is HPC important to Google, and what is Google Cloud’s differentiation as a provider of HPC in the cloud?
HPC’s impact in tackling the world’s most challenging problems is undeniable, yet it still remains available to relatively few. Google Cloud hopes to address that and bring the power of HPC to everyone via a simple, compatible, and open Cloud. We are focused on both the needs of today’s HPC workloads and on opening new horizons, via powerful Cloud capabilities, such as AI.
To meet the unique needs of HPC workloads, Google Cloud offers several specific machine types, such as compute-optimized instances, which have fixed virtual-to-physical core mapping and OS-visible NUMA architecture, critical to many HPC workloads. We also offer machine types tailored for memory-intensive HPC workloads and GPU-accelerated workloads. We have improved MPI scalability through tuned MPI libraries, HPC-optimized machine images, available 100 Gbps networking, and new placement policies that co-locate the compute instances of an application, workload, or job.
Beyond infrastructure, we are making many open source contributions and forming key partnerships in the HPC ecosystem. Our goal is to simplify the deployment of compatible environments on Google Cloud, enabling hybrid HPC environments where applications and workloads run unmodified. We have a number of enhancements planned to make HPC even easier, faster, and more affordable.
It’s notable that Google Cloud is built on the same infrastructure and technologies that power Google’s globally available services, used by 1 billion+ users every day. The extreme demands of our services have driven innovations in scalability, availability, networking, and security that are now available to HPC users worldwide. Google’s private network is among the best, providing superior quality of service, end-to-end encryption, and low latency that enables teams to effectively and confidently collaborate around the world.
And, of course, Google Cloud is helping bring the power of AI to the HPC community.
How do you see the relationship between HPC and AI, both broadly and more specifically at Google?
I think it’s well understood that much of AI can be considered an HPC workload, in that it benefits from high-performance infrastructure. At the same time, AI is a powerful new tool for the HPC community, with scientists, engineers, and others seeking ways to apply AI to gain deeper insight into their HPC simulations, improve their productivity, and even directly accelerate simulations.
Google pioneered the popular TensorFlow machine-learning framework and has broad strengths in AI and analytics. We make these capabilities available, along with highly-tuned infrastructure, to HPC users via Google Cloud. We often hear that HPC users and even HPC centers don’t have a need for round-the-clock AI training. As such, specialized on-premise AI hardware can often lie underutilized. The cloud provides an ideal environment to implement HPC and AI, since the resources can be tailored to the workload, adjusting dynamically as the workflow progresses. With inexpensive cloud archival storage and automated lifecycle management, HPC users no longer face the difficult decision to discard large data sets that might yield future insights.
What new/emerging technologies are you most closely tracking? What trends and/or technologies in high-performance computing (and related fields, such as AI) do you see as particularly relevant for the next five years?
I was actually trained in quantum physics, so I have a keen interest in the fascinating and rapidly-advancing field of quantum computing. While it will likely be many years before the full power of quantum computing is realized, it is interesting to watch the near-term applications being developed. Now is also a good time to consider how quantum computers will be deployed and accessed. Much like supercomputers, I see quantum computers as specialized resources, available primarily through shared HPC centers or the public cloud.
Nearer term, I am watching the rapid advances in silicon packaging and interconnect technology. Just as these advances have allowed the integration of high-bandwidth DRAM on CPUs, I see the potential for new product classes that combine the strengths of CPUs, data-parallel accelerators, and networking. These could drive significant advances in both programmability and performance for HPC.
I’d be remiss if I didn’t admit cloud is an obvious and major trend on my mind. I see a future where the line between workstation and supercomputer disappears, with compute seamlessly scaling to address the problem at hand. We have the exciting opportunity to bring the power of HPC to a whole new class of users, propelling their productivity to new heights.
Outside of work, I enjoy skiing with friends, woodworking, and traveling with my wife, Lisa. We plan to visit Spain, Portugal, and France as soon as it’s safe.