The trend in high-performance supercomputer design has evolved from capability computing, which provides maximum compute power for complex, scalable science applications, to capacity computing, which uses efficient, cost-effective computing power to solve a small number of large problems or a large number of small problems. The next generation of supercomputers will be optimized for workflow computing, using system software and hardware infrastructure to schedule and coordinate an ensemble of applications that solve problems too complex to be modeled with a single application. Whereas traditional HPC is dominated by batch-scheduled jobs that run to completion or failure, workflow computing requires frameworks that enable runtime interactions between applications and users. IBM developed the IBM Data Broker [open sourced at github.com/IBM/data-broker] to facilitate the runtime exchange of data between applications deployed in a workflow.
One emerging type of workflow is Intelligent Simulation, in which machine learning techniques inform complex simulations so that they execute more efficiently and reach a solution sooner. IBM and Lawrence Livermore National Laboratory partnered to develop an intelligent simulation workflow that more efficiently schedules and executes multi-scale simulations of cell membrane models relevant to studying RAS-initiated cancers.
As part of this project, the IBM Data Broker was deployed on the Sierra supercomputer at very large scale and used to exchange data between multi-scale simulations written in C and multiple machine learning and analytics applications written in Python. In the course of standing up this workflow, the teams developed a better understanding of the bandwidth and latency requirements for an optimal workflow solution at scale. Because system failures are a reality in any large-scale cluster, the ability to save and restore a stateful workflow is critical. Checkpoint/restart mechanisms have traditionally mattered for long-running simulations, and they are equally important to stateful workflow computing solutions.
Figure 1 contrasts the IBM Data Broker approach with traditional POSIX file I/O. Although file I/O can be used to exchange data at runtime, a complex workflow that combines simulation and AI needs lower latency and higher bandwidth, which makes the Data Broker the better fit.
The IBM Data Broker was conceived with several fundamental design goals:
- A simple programming interface that applications use to put data into and get data from a distributed data store
- A data store that resides in memory, on SSD, or on hard disk, avoiding the latency and bandwidth constraints of exchanging workflow data through POSIX file I/O
- Support for defining and managing multiple namespaces
- The ability to use either software or hardware-accelerated object stores
- Access across on-premises and external systems, including public clouds
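To make the first and third goals concrete, here is a minimal sketch of what a namespace-aware put/get interface might look like. It is a toy, single-process, in-memory model written for illustration only: the class `ToyBroker`, its method names, and the plain dictionary semantics are all assumptions, not the Data Broker client API, and it makes no claim about how the real library stores or distributes data.

```python
class ToyBroker:
    """Hypothetical in-memory stand-in for a namespace-aware key-value broker."""

    def __init__(self):
        self._data = {}      # namespace name -> {key: value}
        self._attached = {}  # namespace name -> number of attached clients

    def create(self, name):
        """Create a namespace; the creator counts as its first attached client."""
        if name in self._data:
            raise ValueError(f"namespace {name!r} already exists")
        self._data[name] = {}
        self._attached[name] = 1

    def attach(self, name):
        """Attach another client to an existing namespace."""
        if name not in self._data:
            raise KeyError(f"no such namespace: {name!r}")
        self._attached[name] += 1

    def detach(self, name):
        """Detach one client from a namespace."""
        if self._attached.get(name, 0) == 0:
            raise KeyError(f"not attached to namespace: {name!r}")
        self._attached[name] -= 1

    def destroy(self, name):
        """Remove a namespace once no other client is still attached."""
        if self._attached[name] > 1:
            raise RuntimeError(f"namespace {name!r} is still in use")
        del self._data[name]
        del self._attached[name]

    def put(self, name, key, value):
        """Store a value under a key inside one namespace."""
        self._data[name][key] = value

    def get(self, name, key):
        """Return the value stored under a key inside one namespace."""
        return self._data[name][key]


# A simulation creates a namespace and publishes a frame ...
broker = ToyBroker()
broker.create("membrane")
broker.put("membrane", "frame/0001", b"\x00\x01\x02")

# ... and an analytics application attaches to it and reads the frame.
broker.attach("membrane")
frame = broker.get("membrane", "frame/0001")
print(frame)
```

Keys are scoped to their namespace, so two workflows can reuse the same key names without colliding.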
Figure 2 shows the high-level architecture of the Data Broker. The picture highlights a user library that provides the client API, which currently supports both C/C++ and Python. In addition, a system library enables support for different backend key-value stores without recompiling the application. The current version supports Redis and GASNet, and a hardware-accelerated backend is being prototyped.
Applications, as shown in Figure 3, use the client API to connect to the backend key-value store and to create, destroy, attach to, and detach from namespaces shared by one or more of the applications in the workflow. The applications use a simple key-value API to put data into and get data from the Data Broker. The Data Broker system library interfaces with the backend distributed key-value store to locate the data across the distributed data store. For example, when Redis is used as the data store, the hash slot is computed as CRC16(key) modulo 16384 (2^14). Other backend data stores may use different algorithms to distribute keys. In any case, the IBM Data Broker hides the location of the hash slot and key from the application.
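The Redis hash slot calculation can be sketched in a few lines of Python. This is an illustration of the Redis Cluster key-to-slot rule, not code taken from the Data Broker: the function names `hash_tag` and `key_slot` are made up for this example. Redis Cluster specifies the XMODEM variant of CRC16, which Python's standard `binascii.crc_hqx` computes when seeded with 0, and it hashes only the text between the first `{` and the following `}` when that text is non-empty.

```python
import binascii

REDIS_CLUSTER_SLOTS = 16384  # 2**14 hash slots in a Redis Cluster


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that Redis Cluster actually hashes.

    If the key contains a non-empty "{...}" section, only the text between
    the first "{" and the next "}" is hashed; otherwise the whole key is.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key) -> int:
    """Compute the hash slot for a key: CRC16 (XMODEM) modulo 16384."""
    if isinstance(key, str):
        key = key.encode("utf-8")
    return binascii.crc_hqx(hash_tag(key), 0) % REDIS_CLUSTER_SLOTS


print(key_slot("foo"))
# Keys that share a hash tag land in the same slot.
print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

An application using the Data Broker never performs this calculation itself; the sketch only shows the kind of work the system library does on its behalf.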
The IBM Data Broker is still in the early stages of development, and additional functionality is being added to provide a more robust save/restore system and to reduce the latency of data exchanges. The initial work has been open sourced at github.com/IBM/data-broker.
With contributions from Carlos Costa, Claudia Misale and Lars Scheidenbach
IBM Cognitive and Cloud Solutions
Data Centric Systems
IBM T. J. Watson Research