People to Watch 2018

Dan Stanzione
Executive Director
Texas Advanced Computing Center

Dr. Stanzione is the Executive Director of the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. A nationally recognized leader in high performance computing, Stanzione had served as Deputy Director since June 2009 and assumed the Executive Director post on July 1, 2014.

He is the principal investigator (PI) for several leading projects including a multimillion-dollar National Science Foundation (NSF) grant to deploy and support TACC’s Stampede supercomputer over four years. Stanzione is also the PI of TACC’s Wrangler system, a supercomputer designed specifically for data-focused applications. He served for six years as the co-director of CyVerse, a large-scale NSF life sciences cyberinfrastructure in which TACC is a major partner. In addition, Stanzione was a co-principal investigator for TACC’s Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione previously served as the founding director of the Fulton High Performance Computing Initiative at Arizona State University and served as an American Association for the Advancement of Science Policy Fellow in the NSF’s Division of Graduate Education.

Stanzione received his bachelor’s degree in electrical engineering and his master’s degree and doctorate in computer engineering from Clemson University, where he later directed the supercomputing laboratory and served as an assistant research professor of electrical and computer engineering.

HPCwire: You have one of the most diverse and successful HPC programs in academia with, by our latest count (and please confirm), 15 HPC systems at TACC, including the world-class Stampede2 and the petascale Lonestar 5, as well as experimental cloud and visualization infrastructure. What guides your vision and execution?

Dan Stanzione: Yes, that’s accurate. I think what we’ve seen, and part of the reason we have so many systems, is that there really isn’t one system that fits all the potential use cases out there in what we broadly call HPC or large-scale scientific computing. We have machines where we’re more concerned about data, machines where we’re more concerned about interactive use rather than keeping utilization high, people who want to work in the cloud, and people who would rather be optimized for performance at all costs. So really, we’ve built this diversity of systems first and foremost because of user demand for different classes of systems that meet different kinds of user needs, rather than having every user use one kind of system. A system might be different from a hardware perspective or just different in its policies and how we run it: on some you want long-running jobs, but if you have long-running jobs you can’t do interactive work as well. So we wanted to strike that balance across what the needs are.

The second thing is that there are so many great technologies out there, and unless we have some of them in house we’re never going to have a good enough understanding to make good decisions about them. So we have more experimental platforms, like FPGAs, which aren’t necessarily ready for the general case of HPC, but they’re not going to get there if we don’t try them and expose users to them. Keeping abreast of that technology landscape is really the other reason. And then, of course, a minor reason is funding. We have some systems that are federally funded, so we make them open to everybody, and some we can do cost recovery on for industry partners, which is hard to do on a federally funded one. So there are probably one or two more systems just because of the way all of the funding gets sorted out.

HPCwire: What are some of the trends you are seeing in scientific computing? How do you support both traditional HPC workflows and long-tail science?

Dan Stanzione: So the first answer is that we have more than one system, and we build solutions for that. One trend: we’re seeing lots of new apps and new ways of doing things, but they almost always come with new usage modalities and new programming models. For instance, we’ve been doing a lot of work with geospatial data lately for people who want to do large-scale machine learning or look at geology-type questions, where they might be using more Python code. So we’re seeing a divergence in the user bases too, between the more traditional fields, where we can push toward the biggest scale and do traditional kinds of performance optimization, and the sorts of users who are maybe a little less sophisticated in terms of their computational background but still need systems at scale to support them. That divergence is actually sort of problematic, because there’s no one path to say, “Alright, we’re going to try and get ten million cores and that’s going to solve everybody’s problems.” It solves some people’s problems, but not everyone’s.

Another trend is that we’re building more and more web-based platforms that support these things and hide the complexity of the architecture. I am concerned that the whole community isn’t doing enough in terms of performance optimization where we could; we’re living with very low yields versus what the systems could do if all the codes were in better shape. We’re largely building platforms where people can use HPC through web services and through things like Jupyter, where they want to do interactive Python. We’re seeing a lot of growth in those sorts of areas, particularly in the informatics spaces – health, bio, economics – all those things that are informatics driven.

The rise of machine learning and the mix of techniques would be the other trend – ultimately I think scaling that up will affect scalable HPC. So those are the trends, and the notion of platforms and separate systems is how we’re doing the long-tail and the large-scale at the same time.

HPCwire: What do you hope to see from the HPC community in the coming year?

Dan Stanzione: There are things I’m concerned about and things I’m hopeful we’re going to do. My top concern is the overall complexity of the whole ecosystem. In HPC we’re driven in a lot of ways by commodity processors and commodity technologies – and that means Linux is getting way more complicated, the application stacks are way more complicated, these rich multicore processors are getting way more complicated, the firmware systems that run them are getting way more complicated, and then we’re building bigger and more complicated supercomputers on top of all of those pieces. So I’m worried about frailty, security, and whether we can really manage this complexity well.

I’m starting to see people using machine learning as control systems to figure out where to place tasks on processors, because we just don’t have the ability to engineer or deterministically get an answer any more. There’s sort of this crazy open-loop control system that we don’t really understand. I’m worried that that complexity is going to make it harder for us to get great uptimes and reliability and to give users the experiences that they want.

At the same time, I think we need to embrace some areas where we can simplify things: maybe not as many layers of memory hierarchy as we’re pushing towards, maybe taking a new look at how we manage firmware on servers, maybe getting rid of the vast majority of the features that are only used at the edges in special cases – sort of starting over with a simpler ecosystem to make what we’re doing more sustainable.

As far as the industry goes, I really do feel like there are so many things we’re doing where we could perform better and scale better on the systems that we have. Machine learning is a great example right now. There are so many potentially useful things we can do with AI, but we’re thinking of AI more or less at desktop scale, or even in the cloud – we’re not really thinking about how to scale those up to big problems. If we want our AI to be really, really intelligent we need to do this stuff at scale, and I think there’s even more potential to do great things there.

Finally, with all of these techniques we’re building, I think we’ve only scratched the surface of how we can translate them into different problem areas across society, into manufacturing and a lot of fields of basic science. In healthcare there’s so much more we can do with this data to get better outcomes and to reduce costs. Another one that matters in Texas right now is both planning for and responding to disasters – aiding in building a more resilient infrastructure, in mitigating potential flood zones, potential drought zones, all of these natural disasters that we’re facing. The last hurricane we had cost 100 million dollars in Texas alone. So there’s so much we can bring computing, and data in general, to bear on making good data-driven decisions on everything from land use, to building codes, to planning drainage zones, to planning for floods, to building walls and infrastructure to divert water, and so on. I think there’s huge potential for us to take what we’ve been doing very well in HPC and bring it into many more domains that are immediately impactful on people’s lives. I’d like to see us continue to grow and make those translations.

HPCwire: Outside of the professional sphere, what can you tell us about yourself – personal life, family, background, hobbies, etc.? Is there anything about you your colleagues might be surprised to learn?

Dan Stanzione: I don’t know why I have such trouble answering this question, but not really. I have an old sports car, and I have some odd hobbies like reading about economics and science policy, but aside from that I sort of live for this. I’m sort of a political junkie.

Antonio Neri, HPE
Bob Picciano, IBM
Brett Tanzer, Microsoft
Dan Stanzione, TACC
Diane Bryant, Google
Doug Kothe, ECP
Earl Joseph, Hyperion
Frederick Kohout, Cray
Ian Colle, Amazon
Jysoo Lee, KAUST
Marc Hamilton, NVIDIA
Ralph McEldowney, DoD
Rick Stevens, ANL
Scott Aylor, AMD
Trish Damkroger, Intel
