How NASA Is Meeting the Big Data Challenge
As the scientific community pushes past petaflop into exascale territory, it is imperative that the tools to support ever more data-intensive workloads keep pace. Nowhere is this more true than at the storied NASA research complex. With 100 active missions supporting cutting-edge science, NASA knows more than most about compute- and data-driven challenges.
A recent paper from Piyush Mehrotra and L. Harper Pryor with NASA’s Advanced Supercomputing (NAS) Division sheds light on how NAS has assisted the diverse workflow of its users, including discovery, access, transportation, management, and dissemination of big data, as well as providing the tools to transform data into insight and knowledge.
“As NASA’s flagship site for computational science and engineering at scale, NAS supports a user base that is at the forefront of data intensive and data driven science,” write Mehrotra and Pryor. “Our users’ codes use and generate very large datasets and analyzing these datasets to extract knowledge is a fundamental part of their workflows.”
To get a better understanding of the kinds of challenges faced by their user population, NAS officials went directly to their user base. They then grouped the challenges by the main elements of the workflows, i.e., “discovery of data and tools, access to and movement of data, storage and management of data, algorithms/tools for performing the analysis/analytics and finally dissemination of the results.”
Discovery hinges on data, which is a challenge for NASA given the sheer volume of that data and the distributed nature of its storage archives. Users require tools that support large-scale data movement. There is also the looming need to develop platforms that meet the computational and analytic requirements of the coming exascale era.
With user interviews and several studies to guide them, NAS officials added several initiatives to their architecture roadmap. The paper’s authors describe two of these that address user needs:
1) higher level support for scientific workflows to make the challenges of working with big data and big compute more transparent to the user, and
2) tighter integration of compute engines with analytic engines.
The first of these directly relates to the implementation of the NASA Earth Exchange last year. The NASA Earth Exchange (NEX) is a collaborative research platform that brings together advanced supercomputing, earth system modeling, workflow management, and NASA remote-sensing data. It enables users to explore and analyze large earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and share results. To support data-driven workflows, NEX uses VisTrails on Pleiades, NASA’s flagship supercomputer. ParaView is also available as a companion tool to VisTrails. The system will support wide-area workflows encompassing NASA and other agencies, including USGS, NOAA and DOE.
“Our vision is to provide an environment capable of capturing the workflow so that it can be shared with colleagues who can then repeat the experiment and/or tweak the input data/algorithms to generate new knowledge,” write the authors.
The second initiative aims to integrate analytic capability – more specifically visualization – with compute capability. This speeds up what was traditionally a sequential process. In the past, visualization was a post-processing activity that could only be performed after the computation phase. Now NASA’s visualization engine (hyperwall) has been integrated via the same InfiniBand fabric as the Pleiades supercomputer, so that the two share storage resources in their Lustre filesystem. Data streams can be directed from computation nodes to the visualization nodes via the InfiniBand I/O fabric while the code is running. The intermediate data can be examined concurrently with execution (to steer computation) or stored for later analysis. The benefit is temporal fidelity at much lower storage cost.
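The difference between post-processing and concurrent analysis can be illustrated with a small producer/consumer sketch. This is a toy under stated assumptions, not NAS’s implementation: the function names (`simulate`, `analyze`, `run`), the scalar “solver” and the threshold check are all hypothetical, and an in-memory queue stands in for the InfiniBand link between compute and visualization nodes.

```python
import queue
import threading


def simulate(steps, out):
    """Toy solver (hypothetical): advance a scalar state and stream every step."""
    state = 1.0
    for step in range(steps):
        state *= 1.5                # stand-in for one expensive time step
        out.put((step, state))      # hand off the intermediate result right away
    out.put(None)                   # sentinel: the run has finished


def analyze(inp, threshold, flagged):
    """Consumer: inspect each step as it arrives rather than after the run."""
    while True:
        item = inp.get()
        if item is None:
            break
        step, state = item
        if state > threshold:
            # A real system might render a frame or send a steering signal here.
            flagged.append(step)


def run(steps=10, threshold=10.0):
    # Bounded queue: the solver blocks if the analysis falls too far behind,
    # so intermediate data never has to be written out in full.
    channel = queue.Queue(maxsize=4)
    flagged = []
    consumer = threading.Thread(target=analyze, args=(channel, threshold, flagged))
    consumer.start()
    simulate(steps, channel)
    consumer.join()
    return flagged


if __name__ == "__main__":
    print(run())  # steps at which the state exceeded the threshold
```

Because every time step passes through the consumer, nothing has to be subsampled before analysis, which is one way to read the storage and temporal-fidelity benefit described above.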
Going forward, NAS aims to continue optimizing the data workflow, using knowledge of the data to guide this process. “We don’t want to touch all of the data if we don’t have to,” the authors write. “We know a lot about the structure of the data that might be used to steer the computation toward the subsets of the data that are applicable to the query – and not use the subsets we know are not relevant. This is the good news side…the bad news side is that there is a lot of complexity hiding behind the data and this complexity is critical to using it properly.”
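A rough sketch of what “not touching all of the data” could look like in practice is to filter a metadata catalog before any file is opened. Everything here is an illustrative assumption: the `Granule` record, its bounding-box and year fields, and the sample catalog are invented for the example and do not describe NAS’s actual data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    """Hypothetical catalog entry: metadata only, no pixel data."""
    name: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    year: int


def overlaps(g, lat_range, lon_range):
    """True if the granule's bounding box intersects the query box."""
    return (g.lat_min <= lat_range[1] and g.lat_max >= lat_range[0]
            and g.lon_min <= lon_range[1] and g.lon_max >= lon_range[0])


def select_granules(catalog, lat_range, lon_range, years):
    """Return only the granules relevant to the query; the rest are never read."""
    return [g for g in catalog
            if g.year in years and overlaps(g, lat_range, lon_range)]


# Invented sample catalog for illustration.
CATALOG = [
    Granule("tile_a_2010", 30, 40, -125, -115, 2010),
    Granule("tile_b_2010", 40, 50, -125, -115, 2010),
    Granule("tile_a_2011", 30, 40, -125, -115, 2011),
    Granule("tile_c_2010", -10, 0, 20, 30, 2010),
]

if __name__ == "__main__":
    hits = select_granules(CATALOG, (32, 38), (-122, -118), {2010})
    print([g.name for g in hits])  # only these would be opened and processed
```

The pruning is only as good as the metadata behind it, which is the “bad news side” the authors point to.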
An example of this complexity is remote sensing of atmosphere and land temperatures from space. A satellite does not really measure temperature; it measures radiance, and getting a temperature reading from it requires a lot of knowledge about the sensor itself. Or take a satellite that is nominally in a sun-synchronous orbit: what if the orbit has drifted, they ask. With all this information and metadata being crucial for the discovery challenge, the task at hand is making it all more accessible to the user. A good place to start, according to the authors, is determining what approaches (representation, tools and algorithms) best support the orchestration of metadata. And as always, they emphasize the importance of “never los[ing] sight of the fact that our product is the scientific and engineering knowledge that we extract from big data.”
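To make the radiance example above concrete, the sketch below converts a spectral radiance to a brightness temperature by inverting Planck’s law at a single wavelength. It is an idealized illustration rather than the authors’ method: it assumes a perfect blackbody and a monochromatic sensor, and the 11 µm wavelength is an arbitrary choice for the example. A real retrieval would also need the sensor’s spectral response, calibration, surface emissivity and atmospheric correction, which is exactly the kind of metadata the paper describes.

```python
import math

# Physical constants in SI units (exact values under the 2019 SI definitions).
H = 6.62607015e-34    # Planck constant, J*s
C = 2.99792458e8      # speed of light, m/s
K_B = 1.380649e-23    # Boltzmann constant, J/K


def planck_radiance(wavelength_m, temperature_k):
    """Blackbody spectral radiance per unit wavelength, W / (m^2 * sr * m)."""
    a = 2.0 * H * C**2 / wavelength_m**5
    b = H * C / (wavelength_m * K_B * temperature_k)
    return a / math.expm1(b)


def brightness_temperature(wavelength_m, radiance):
    """Invert Planck's law: the blackbody temperature that yields this radiance."""
    a = 2.0 * H * C**2 / wavelength_m**5
    return H * C / (wavelength_m * K_B) / math.log1p(a / radiance)


if __name__ == "__main__":
    wavelength = 11e-6                          # assumed thermal-infrared channel
    radiance = planck_radiance(wavelength, 288.0)
    print(brightness_temperature(wavelength, radiance))
```

The round trip recovers the input temperature only because the forward model and its inverse share the same assumptions; change the wavelength or the sensor model on one side and the “temperature” changes with it.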