A Blueprint for Centralized Research Data Storage and Sharing

By Becky Yeager, Thomas Hauser, Peter Ruprecht, and Dan Milroy, University of Colorado, Boulder

March 3, 2014

The University of Colorado Boulder PetaLibrary storage system was recently deployed by the CU Research Computing (RC) group to address the increasing challenges that researchers face regarding large-scale data storage and data management. The PetaLibrary, in part funded by the National Science Foundation, provides a variety of services to campus researchers including high-performance short-term storage, long-term archive storage, and the ability to share data with collaborators at CU-Boulder and across the country.

The PetaLibrary offers several petabytes of data storage using an expandable and modular hardware design. Currently, more than a dozen research groups store over 100 TB of data on the PetaLibrary system. Researchers and data scientists in disciplines ranging from the humanities to biology, as well as the University of Colorado Boulder Libraries (CU-Boulder Libraries), are using the PetaLibrary storage services. These researchers all have one thing in common: the need for large-scale, low-cost data storage. Usage of the PetaLibrary is expected to double in the next few months.

The two main categories of service offered to customers of the PetaLibrary are Active storage, for data that needs to be accessed frequently, and Archive storage, for data that is accessed infrequently. Active data is always stored on disk and is accessible to researchers on compute resources managed by RC. Archive storage consists of a two-level hierarchical storage management (HSM) solution, with disk storage for data that is more likely to be accessed and tape for data that is less likely to be accessed. The HSM configuration was developed by RC in collaboration with a consultant from Re-Store LLC to produce a cost-effective solution that transfers data automatically between disk and tape. For data whose importance warrants multiple copies, options are available for replication to separate tape cartridges or even to a disk-based storage system in a remote datacenter.

Disk storage for the PetaLibrary resides on scalable high-density DDN SFA10K and IBM DCS3700 RAID-6 systems.  These are grouped into GPFS clusters for high performance and reliability.  The tape storage system consists of an IBM TS-3584 library with four LTO-6 drives.  We use Tivoli Storage Manager to move data to and from tape.  TSM’s HSM module, plus a number of custom scripts, enables policy-based migration of files between the GPFS filesystem and the tape storage.
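
To illustrate what policy-based migration between disk and tape can look like, here is a minimal sketch in Python. It is not the PetaLibrary's actual policy and does not use any TSM or GPFS interface. The `FileRecord` type, the `select_for_migration` function, the 80%/60% watermarks, and the 30-day idle threshold are all hypothetical values chosen for illustration. The idea is simply that once the disk tier fills past a high-water mark, the files that have gone longest without access are picked for migration until usage falls back to a low-water mark.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical description of one file on the disk tier."""
    path: str
    size_bytes: int
    days_since_access: int


def select_for_migration(files, capacity_bytes,
                         high_water=0.80, low_water=0.60,
                         min_idle_days=30):
    """Return the files to move from disk to tape.

    All thresholds are illustrative assumptions, not PetaLibrary settings.
    Nothing is selected until disk usage exceeds the high-water mark.
    After that, files idle for at least min_idle_days are selected,
    longest-idle first, until usage drops to the low-water mark.
    """
    used = sum(f.size_bytes for f in files)
    if used <= high_water * capacity_bytes:
        return []

    target = low_water * capacity_bytes
    candidates = sorted(
        (f for f in files if f.days_since_access >= min_idle_days),
        key=lambda f: f.days_since_access,
        reverse=True,
    )

    selected = []
    for f in candidates:
        if used <= target:
            break
        selected.append(f)
        used -= f.size_bytes
    return selected
```

In a real HSM deployment the selected files would typically be replaced on disk by small stubs and recalled from tape transparently on access; that step is outside the scope of this sketch.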

Large-scale storage on its own is only useful if the associated network infrastructure is designed with large data transfers in mind. Therefore RC, in collaboration with CU’s Office of Information Technology, has deployed a ScienceDMZ, funded by an NSF CC-NIE grant. The core of this science network can perform at 80 Gbps, and data on the PetaLibrary is accessed through secure, high-performance file transfer programs. With a fast science network, data can be easily retrieved and sent directly to each researcher’s desktop. To facilitate web-mediated transfers, the PetaLibrary uses tools provided by Globus. Globus makes robust file transfer capabilities, traditionally available only on expensive, special-purpose software systems, accessible to any researcher with an Internet connection and a laptop. It also facilitates sharing data between collaborators both on and off campus. The current ScienceDMZ is a dedicated 10 Gbps Ethernet layer-2 network that serves as critical infrastructure for a number of data transfer services provided by RC to the CU-Boulder campus community. The NSF-funded improvements to the ScienceDMZ include upgraded border routers with 100 Gbps and OpenFlow capabilities, up to 80 Gbps for the DMZ core, performance monitoring, and security monitoring.
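
A quick back-of-the-envelope calculation shows why link speed matters at this scale. The sketch below is illustrative only: the `transfer_hours` function and its `efficiency` parameter are hypothetical, a terabyte is assumed to be 10^12 bytes, and the default assumes the link runs at its full nominal rate, which real transfers rarely sustain. Under those assumptions, moving 100 TB takes roughly 22 hours at 10 Gbps and under 3 hours at 80 Gbps.

```python
def transfer_hours(size_tb, link_gbps, efficiency=1.0):
    """Idealized time in hours to move size_tb terabytes over a link.

    Assumes 1 TB = 1e12 bytes and 1 Gbps = 1e9 bits per second.
    efficiency (0 to 1] is a hypothetical factor for the fraction of
    the nominal rate that a transfer actually achieves.
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for gbps in (10, 80):
        print(f"{gbps} Gbps: {transfer_hours(100, gbps):.1f} hours")
```

Passing a lower `efficiency`, for example 0.5, gives a more conservative estimate for links shared with other traffic.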

Clients of the PetaLibrary have been pleased with the services they have received so far. The CU-Boulder Libraries was one of the early adopters of the PetaLibrary services. As one of the larger users, the CU-Boulder Libraries uses the PetaLibrary to build digital collections in a variety of media types for research and study. According to digital initiatives librarian Holley Long, “The CU-Boulder Libraries digitizes audio, video, images, text and soon 3D objects, according to nationally-accepted archival standards.” Large-scale storage is important to these projects because of the initial size of the uncompressed files, often as large as 120 GB per hour of digitized video. In 2014 the estimated production capacity for the library’s digital collections could exceed 80 TB (https://content.cu.edu/digitallibrary/cuAuraria.html).

The University of Colorado Museum of Natural History is using the services provided by the PetaLibrary to store digitized copies of its entire collection. The collection comprises 4.5 million objects, including the oldest documented Navajo textile, the Aiken bird collection, and Colorado’s largest collection of bees, along with the metadata associated with each object. The metadata for each distinct object includes notes on who found it, where it was found, when it was located, what it is, and pictures of what it looks like. Every object in the museum has its own interesting backstory, one that comes to life when an object is viewed in relationship with its complex metadata. Because we now live in a digital age, the museum is attempting to democratize its exhibits (http://cumuseum.colorado.edu/research/databases). This means that every visitor to the museum will have the opportunity to view the collection in its entirety in a digital format.

Being able to digitally store its entire collection provides the museum with the best of both worlds. Pat Kociolek, Director of the Museum of Natural History, describes the importance of the PetaLibrary to the museum's archives: “It allows the museum the opportunity to make these digital dreams come to life. Visitors can physically view individual items, and when our work is complete visitors will also be able to access the entire collection online. Digital collections also allow remote visitors such as teachers, scientists, and students the chance to browse the collection even if they are unable to visit the museum in person.” As the data needs of the museum grew beyond 100 TB, it could no longer rely on local storage resources. The PetaLibrary became an important resource for the museum staff, allowing them to archive and keep safe the digital resources developed to serve all of their constituents. As a centralized facility on campus, the PetaLibrary can provide the museum with the security it needs to store these items and to share them widely.

On the CU-Boulder campus, researchers are producing large amounts of data in areas as diverse as digital humanities, simulation studies, and global climate modeling. Researchers on campus need ways to preserve this data and to make it accessible to others. Transparency and the ability to share data and resources with others are important parts of any research plan. The PetaLibrary provides the campus with a centralized location to consolidate this data, and the means to share it with others through Globus Connect Server.

The PetaLibrary is an important part of the evolving data management ecosystem on campus. It allows researchers to use a high-speed network to move data in and out of storage, across campus and around the nation. The Globus software suite makes it easy to transfer data sets and to share them securely with collaborators. A common practice has been for researchers to store data sets on PCs in labs or on USB-connected drives. The PetaLibrary, by contrast, provides the security of enterprise storage systems, with redundant disk arrays in data centers with environmental and access controls, at a comparable cost thanks to the subsidies of the NSF grant.

The future vision for the PetaLibrary is to expand the storage capabilities of the system, to enable tools that will help with metadata management and data discovery, and to enhance sharing options on campus and through public-facing data portals. The PetaLibrary is an important new service at the forefront of the campus discussion on how to deal with the challenges of research data. It is helping to address the current research needs of the campus and the growth anticipated in these areas. A newly created faculty committee is discussing how to bring additional services to researchers, including data curation, metadata management, and data management planning, at a reasonable and sustainable cost.
