Data-intensive science is not a new phenomenon, as the high-energy physics and astrophysics communities can certainly attest, but today more and more scientists face steep data and throughput challenges fueled by soaring data volumes and the demands of global-scale collaboration. With data generation outpacing network bandwidth improvements, moving data digitally from point A to point B, whether for processing, storage or analysis, is by no means a solved problem, as evidenced by the persistence, even the revitalization, of sneakernets.
Even for those scientists fortunate enough to have access to the highest-speed networks, such as the 100 Gigabit Ethernet research and education infrastructure Internet2, it takes a certain level of expertise to maximize data transfer performance. Recognizing that these advanced networking capabilities were not always fully exploited, a group of Clemson University researchers has come up with a way to optimize transfers for everyone.
Not surprisingly, the work is coming out of the Clemson genetics and biochemistry department, which has had a front-row seat to the past decade's data deluge. In a news writeup, Clemson's Jim Melvin observes that while high-energy physics is often cited as the poster child for data-intensive science, genomics is catching up. And as in the computational physics community, long-distance data sharing and collaboration is essential for life science researchers.
To minimize data transfer times across the Internet2 backbone and the attached campus networks, the Clemson scientists developed an open-source software platform called Big Data Smart Socket (BDSS). As described in the Clemson media release, "the groundbreaking software takes advantage of specialized infrastructure such as parallel file systems, which distribute data across multiple servers, and advanced software-defined networks, which allow administrators to build, tune and curate groups of researchers into a virtual organization."
“What used to take days now takes hours – or even less,” said Alex Feltus, associate professor in genetics and biochemistry in Clemson University’s College of Science. The software runs on any computer and although it was designed to optimize the transfer of large bioscience data sets, Feltus says the same methods will work for any large modern data sets.
As users generate data transfer requests, BDSS rewrites each request in a more optimal form, adding parallelism all the way down to the hard drives and enabling faster, more efficient data transfers.
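To make the idea concrete, here is a minimal sketch, in Python, of what "rewriting a request for parallelism" can look like in general: a single whole-file transfer is split into byte ranges, each fetched by its own worker stream and reassembled at the destination. This is an illustration of the technique, not BDSS's actual code; the function names (`rewrite_request`, `parallel_transfer`) are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def rewrite_request(size, streams):
    """Rewrite one transfer of `size` bytes as `streams` byte ranges."""
    chunk = size // streams
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(streams)]
    ranges[-1] = (ranges[-1][0], size)  # last range absorbs the remainder
    return ranges

def fetch_range(path, start, end):
    """One parallel stream: read only its assigned byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return start, f.read(end - start)

def parallel_transfer(src, dst, streams=4):
    """Copy src to dst using several concurrent byte-range streams."""
    size = os.path.getsize(src)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        parts = pool.map(lambda r: fetch_range(src, *r),
                         rewrite_request(size, streams))
    with open(dst, "wb") as out:
        for start, data in sorted(parts):
            out.seek(start)
            out.write(data)

# Demo: copy a 1 MB file via 4 parallel streams and verify the result.
payload = os.urandom(1 << 20)
with tempfile.TemporaryDirectory() as d:
    src, dst = os.path.join(d, "src"), os.path.join(d, "dst")
    with open(src, "wb") as f:
        f.write(payload)
    parallel_transfer(src, dst, streams=4)
    with open(dst, "rb") as f:
        assert f.read() == payload
```

On a real long-haul network the payoff comes from the streams traversing independent TCP connections and, on a parallel file system, independent storage servers; the local-file demo above only shows the request-splitting logic itself.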
“We’ve found the right buffer size, number of parallel data streams and the optimal parallel file system to perform the transfers,” said Feltus, who is director of the Clemson Systems Genetics Lab. “It’s very important that end-to-end data movement – and not just network speed – is optimized. Otherwise, bottlenecks on the sending or receiving side can slow transfers to a crawl. Our BDSS software enables researchers to receive data – optimized for the architecture of their own computer systems – far more quickly than before. Previously, researchers were having to move rivers of information through small pipes at the sending and receiving ends. Now, we’ve enhanced those pipes, which vastly improves information flow.”
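The buffer-size tuning Feltus mentions usually starts from the bandwidth-delay product (BDP): a sender must keep roughly bandwidth × round-trip time of data in flight to fill a long, fast path. The sketch below illustrates that standard calculation; the numbers and the split across streams are illustrative assumptions, not figures from the Clemson work.

```python
def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bytes of buffering needed to keep a path of the given
    bandwidth (Gb/s) and round-trip time (ms) fully utilized."""
    bytes_per_sec = bandwidth_gbps * 1e9 / 8
    return int(bytes_per_sec * rtt_ms / 1e3)

# Example: a 10 Gb/s path with a 50 ms coast-to-coast RTT needs
# 62,500,000 bytes (~62.5 MB) of in-flight data to stay full.
total = bdp_bytes(10, 50)

# Split across 8 parallel streams, each stream's TCP buffer only
# needs about 1/8 of the total, which is one reason parallel
# streams help when per-connection buffers are capped.
per_stream = total // 8
```

In practice the per-stream figure would be applied via the standard socket options (`SO_SNDBUF`/`SO_RCVBUF`), subject to the operating system's configured maximums.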
Read the Clemson announcement and find links to related papers here.
A brief video offers additional details: