In the world of HPC, demanding compute workloads and massive data volumes typically go hand in hand. In the semiconductor and electronics industry, computer aided engineering (CAE) and electronic design automation (EDA) workloads have driven the need for computing and data management on a massive scale. Another case in point is genomic sequencing in life sciences, where secondary analysis pipelines composed of multiple steps generate between 500 MB and 1 TB of raw genomic data per step. More recently, the emergence of artificial intelligence has brought machine learning and deep learning workloads that require extremely large volumes of data to train models.
As we discussed in the article “Making Clouds Fly”, multicloud computing is increasingly being adopted by organizations to provide additional computing capacity to supplement on-premises resources. Managing the data for workloads in a multicloud environment is critical to both performance and cost, and there are a number of challenges to consider, including the following:
- Data not being available on the compute resources when it is needed
- Data transfers coupled to the workload, leading to poor utilization as CPUs sit idle waiting for transfers
- Difficulty understanding the state of data transfers for a given workload
- Wasted bandwidth and storage on duplicate transfers of data, for example:
  - A single user copying the same data repeatedly
  - Multiple users working on the same project, each transferring the same data
Therefore, when implementing a multicloud strategy, placing jobs on the appropriate compute resources, whether on-premises or in the cloud, is only part of the equation. Workloads require data to process, so data “awareness” in the workload management system is of paramount importance to ensure effective use of the multicloud infrastructure. When choosing a workload management solution for a multicloud infrastructure, some things to look for are:
- Ability to move data out of band from workloads
- Caching mechanism to avoid redundant movement of data
- Configurable data transfer mechanism
- Intelligent workload scheduling, factoring in data availability
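The last point, data-aware scheduling, can be illustrated with a short sketch (plain Python, not LSF code): given the files a job requires and the contents of each cluster's data cache, forward the job to the cluster that would need the fewest new transfers. All cluster names and file sets below are hypothetical.

```python
# Sketch of data-aware placement: prefer the cluster whose cache
# already holds the most of the job's required files.

def pick_cluster(required_files, cluster_caches):
    """Return the cluster needing the fewest new transfers."""
    def missing(cluster):
        # Number of required files not yet in this cluster's cache
        return len(required_files - cluster_caches[cluster])
    return min(cluster_caches, key=missing)

caches = {
    "onprem":  {"genome.fa", "reads_1.fq"},
    "cloud-a": {"genome.fa", "reads_1.fq", "reads_2.fq"},
    "cloud-b": set(),
}
job_needs = {"genome.fa", "reads_1.fq", "reads_2.fq"}

best = pick_cluster(job_needs, caches)
print(best)  # cloud-a already caches all three files
```

A real scheduler would weigh cache contents against queue depth, transfer bandwidth, and cost, but the core idea is the same: factor data availability into the placement decision.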
The IBM Spectrum LSF family provides advanced capabilities for data management in cloud environments. One approach to leveraging cloud infrastructure is to deploy an LSF cluster on-premises and one or more LSF clusters on cloud resources. LSF can be configured so that jobs are forwarded from the on-premises cluster to the remote clusters when there are insufficient resources to run them locally.
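Job forwarding between clusters is typically configured through send and receive queues. A minimal sketch of an lsb.queues entry on the on-premises cluster is shown below; the queue and cluster names are hypothetical, and the exact parameters may vary by LSF release, so consult the LSF documentation for your version.

```
Begin Queue
QUEUE_NAME   = send_q
# Forward jobs to the receive queue on the cloud cluster
SNDJOBS_TO   = receive_q@cloud_cluster
End Queue
```

The corresponding queue on the cloud cluster would declare which clusters it accepts jobs from (for example, via RCVJOBS_FROM).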
LSF Data Manager handles the data required by jobs by maintaining a data cache in each LSF cluster. When submitting a job, the user can specify its data requirements. When deciding where to forward a job, LSF compares the job's data requirements with what is already cached in the remote clusters. Once forwarded, the job is not dispatched until the required data has arrived in the remote cache. LSF provides a transparent mechanism for the job to access its data, regardless of whether it runs on-premises or in a remote cluster. By deduplicating transfers and moving data out of band from the jobs themselves, overall resource utilization can be improved.
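From the user's perspective, this looks roughly like the following (illustrative only; the paths and script names are hypothetical, and running these commands requires an LSF cluster with LSF Data Manager configured):

```shell
# Declare the job's data requirement at submission time; LSF stages
# the file into the data cache of the cluster that runs the job.
bsub -data "/proj/genomes/sample01.fastq" ./align.sh

# Inspect the data cache to see staged files and transfer status.
bdata cache
```

The job itself simply reads the file through the transparent access mechanism, without needing to know which cluster it landed on.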
To learn more about the data management capabilities in IBM Spectrum LSF, visit the IBM Knowledge Center.