Enabling distributed workflows is a challenge the HPC and research communities have struggled with for decades. At the center of the issue is consolidating data, categorizing it, making it searchable, and then managing access. This is easier said than done. Organizations usually look to front-end applications, data movers and cloud-based services as solutions; however, advances in object storage are making it easier than ever to enable distributed workflows and accelerate time to discovery.
The first step to providing efficient data management is visibility, and that starts with consolidating data. One of the top priorities for many research organizations is enabling efficient data access over HTTP, specifically via S3. The primary challenge is that many existing devices and research applications don’t support S3 directly. Therefore, many storage vendors have developed interfaces that can ingest data via traditional protocols like SMB and NFS and then expose that data via S3. The caution here is that solutions that aren’t natively based on HTTP (e.g., file-system-based solutions) must convert data before transmitting it over HTTP. For many HPC use cases and high-performance workflows, this translation takes too long to satisfy performance requirements. So, if S3 access is at the top of your priority list, look for an object-storage solution that delivers high throughput natively via S3. And if cost is a consideration for your organization, you will want a solution that does not require expensive solid-state drives (SSDs).
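To make the "native HTTP" point concrete, here is a minimal sketch of what S3-style access looks like from a client's perspective: the object is fetched directly with an HTTP GET, and range reads pull back just a slice of a large dataset. The endpoint, bucket, and object names below are hypothetical placeholders, and the request is only constructed, not sent.

```python
import urllib.request

# Hypothetical endpoint and bucket names -- substitute your own.
ENDPOINT = "https://s3.example-lab.org"
BUCKET = "climate-runs"

def s3_get_request(key, byte_range=None):
    """Build (but do not send) an HTTP GET for an S3 object.

    Native S3 access means the object is served directly over HTTP --
    there is no file-system-to-HTTP translation step in the data path.
    """
    req = urllib.request.Request(f"{ENDPOINT}/{BUCKET}/{key}", method="GET")
    if byte_range:
        # A Range header lets clients fetch a slice of a large dataset
        # without downloading the whole object.
        req.add_header("Range", f"bytes={byte_range}")
    return req

# Fetch only the first mebibyte of a (hypothetical) NetCDF file.
req = s3_get_request("2024/temps.nc", byte_range="0-1048575")
print(req.full_url)             # https://s3.example-lab.org/climate-runs/2024/temps.nc
print(req.get_header("Range"))  # bytes=0-1048575
```

In practice a client library such as boto3 builds and signs these requests for you; the sketch simply shows why a natively HTTP-based store can serve them without a conversion step.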
Categorizing and Searching Data
Once data is on a single storage platform, the next challenge is to categorize the data and make it searchable. For many research organizations funded by grants or that have long budget approval cycles, purchasing a robust asset management solution for hundreds to thousands of users isn’t an option.
Fortunately, many object-storage solutions have integrated the ability to customize metadata. The best-of-breed solutions do this in a way that doesn’t require maintaining or managing an additional database. When metadata is stored with the data itself, the data becomes searchable programmatically via an Application Programming Interface (API) or through a web-based user interface (UI). If data categorization and search are at the top of your priority list, make sure you choose an object-storage solution that doesn’t require the maintenance of an additional database and that uses an open-source NoSQL solution like Elasticsearch for easy integration with data visualization applications.
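The pattern described above can be sketched in a few lines: user-defined metadata travels with each object (much like S3's `x-amz-meta-*` headers), so a programmatic search needs no separately managed database. This is an illustrative in-memory model, not any vendor's API; the object keys and metadata fields are made up.

```python
# Each object carries its own custom metadata; no external database.
store = {}  # object key -> {"data": ..., "metadata": {...}}

def put_object(key, data, metadata):
    """Store an object together with its user-defined metadata."""
    store[key] = {"data": data, "metadata": dict(metadata)}

def search(**criteria):
    """Return keys of objects whose metadata matches every criterion."""
    return [
        key for key, obj in store.items()
        if all(obj["metadata"].get(k) == v for k, v in criteria.items())
    ]

# Hypothetical research assets, tagged at ingest time.
put_object("scan-001.dcm", b"...", {"project": "neuro", "modality": "MRI"})
put_object("scan-002.dcm", b"...", {"project": "neuro", "modality": "CT"})
put_object("run-17.nc",    b"...", {"project": "climate"})

print(search(project="neuro", modality="MRI"))  # ['scan-001.dcm']
```

A production system would back `search` with an index such as Elasticsearch, but the contract is the same: tag once at ingest, then query by attribute from an API or UI.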
Providing & Managing Secure Access to Data
Once data is consolidated, categorized and searchable, the next challenge is to provide secure access to hundreds, thousands, or maybe even millions of users in a way that is easy to manage and control. While many workflows still use FTP or manual processes to manage access (like tracking who has access to what in a spreadsheet), this is not a scalable model and will ultimately hinder time to discovery while consuming valuable resources. For secure access both internally and externally, you need an object-storage solution that integrates tenant management (for storage metering and easy access control on your server) with file sharing via an API and web-based UI.
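One common mechanism behind controlled external sharing is the time-limited signed link, the idea underlying S3 presigned URLs: the server signs an object path plus an expiry, so a recipient can fetch the object without holding credentials, and the link dies on its own. The sketch below, with a made-up signing key, endpoint, and object names, shows the shape of that check; real S3 implementations use the more elaborate AWS Signature Version 4 scheme.

```python
import hashlib
import hmac
import time

# Hypothetical tenant signing key and endpoint -- placeholders only.
SECRET = b"tenant-signing-key"
ENDPOINT = "https://s3.example-lab.org"

def share_url(bucket, key, expires_in, now=None):
    """Return a time-limited URL a recipient can use without credentials."""
    expires = int(now if now is not None else time.time()) + expires_in
    payload = f"GET\n{bucket}/{key}\n{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{ENDPOINT}/{bucket}/{key}?Expires={expires}&Signature={sig}"

def verify(url, now=None):
    """Check the signature and expiry before serving the object."""
    path, query = url[len(ENDPOINT) + 1:].split("?")
    params = dict(p.split("=") for p in query.split("&"))
    expires = int(params["Expires"])
    payload = f"GET\n{path}\n{expires}".encode()
    # compare_digest avoids leaking the signature via timing differences.
    good = hmac.compare_digest(
        hmac.new(SECRET, payload, hashlib.sha256).hexdigest(),
        params["Signature"],
    )
    return good and (now if now is not None else time.time()) < expires

# A one-hour share link, checked before and after it expires.
url = share_url("climate-runs", "2024/temps.nc", expires_in=3600, now=1_700_000_000)
print(verify(url, now=1_700_000_000))         # True: valid and unexpired
print(verify(url, now=1_700_000_000 + 7200))  # False: link has expired
```

Because every link is derived from a per-tenant key, revoking a tenant or rotating its key invalidates that tenant's outstanding links, which is exactly the kind of centralized control a spreadsheet-based process cannot offer.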
Learn How the JASMIN Project Manages Data
The JASMIN facility is a “super-data-cluster” at the UK’s STFC Rutherford Appleton Laboratory that delivers infrastructure for data analysis. The team chose an object-storage solution to expand storage capacity while intelligently managing both the data and data access on JASMIN. Learn more about how JASMIN benefits from using object storage.