February 08, 2008
Version 1.0 offers new generation of distributed data management power
Feb. 7 -- In the Information Age, the freedom to easily generate and share digital forms of information is driving life-changing advances in science and medicine, dramatic expansions in communications, big gains in business productivity, and a new flowering in video, music, and other cultural expressions.
At the same time, the digital data we all love is growing explosively. In 2006, humanity produced 161 exabytes of digital data -- that's 161 billion billion bytes, or 12 stacks of books stretching from the Earth to the Sun -- more data than our capacity to store it.
This deluge of data is bringing with it unprecedented challenges in organizing, accessing, sharing, and preserving digital information. To meet these challenges, the Data-Intensive Computing Environments (DICE) group at the San Diego Supercomputer Center (SDSC) at UC San Diego has released version 1.0 of iRODS, the Integrated Rule-Oriented Data System, a powerful new open-source approach to managing digital data.
"iRODS is an innovative data grid system that incorporates and moves beyond ten years of experience in developing the widely used Storage Resource Broker (SRB) technology," said Reagan Moore, director of the DICE group at SDSC. "iRODS equips users to handle the full range of distributed data management needs, from extracting descriptive metadata and managing their data to moving it efficiently, sharing data securely with collaborators, publishing it in digital libraries, and finally archiving data for long-term preservation."
The most powerful new feature, for which the Integrated Rule-Oriented Data System is named, is an innovative "rule engine" that lets users easily accomplish complex data management tasks. Users can automate the enforcement of data management policies, or "virtualize" them, by applying rules that control the execution of all data access and manipulation operations. Rather than having to hard-code these actions or workflows into the software, any group can use the user-friendly rules to customize the iRODS system for its specific data management needs.
For example, when astronomers take new photographs in a sky survey and enter them into a data collection, the researchers can set up iRODS rules to automatically extract descriptive information and record it in the iRODS Metadata Catalog (iCAT), replicate a copy to another repository for backup, create a thumbnail for a Web-based gallery, and run an analysis program to identify related images.
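The idea of rules that fire automatically when data enters a collection can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the iRODS rule language or its API: the `RuleEngine` class, the `on_ingest` event name, and the two actions are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    """Hypothetical stand-in for a file entered into a collection."""
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)
    replicas: List[str] = field(default_factory=list)


Action = Callable[[DataObject], None]


class RuleEngine:
    """Toy rule engine: maps an event name to an ordered list of actions."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Action]] = {}

    def add_rule(self, event: str, action: Action) -> None:
        self._rules.setdefault(event, []).append(action)

    def fire(self, event: str, obj: DataObject) -> None:
        # Run every action registered for this event, in registration order.
        for action in self._rules.get(event, []):
            action(obj)


def extract_metadata(obj: DataObject) -> None:
    # Assumption: the only "descriptive information" recorded is the file name.
    obj.metadata["filename"] = obj.path.rsplit("/", 1)[-1]


def replicate_to_backup(obj: DataObject) -> None:
    # Assumption: replication is modeled by noting a second location.
    obj.replicas.append("backup-site:" + obj.path)


engine = RuleEngine()
engine.add_rule("on_ingest", extract_metadata)
engine.add_rule("on_ingest", replicate_to_backup)

image = DataObject(path="/survey/2008/field_042.fits")
engine.fire("on_ingest", image)
print(image.metadata, image.replicas)
```

The point of the pattern is that the policy (which actions run on ingest) lives in the registered rules rather than in the code that stores the file, so a different group could register a different set of actions without changing the engine.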
An organization's archivist can configure iRODS rules to identify and retain a collection of digital records for five years, and then move them to another site or destroy them. And if someone requests these records, the archivist can confirm that the current digital copy is indeed an authentic copy of the original. iRODS rules are being developed that will validate the trustworthiness of digital repositories.
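A retention rule of this kind reduces to two checks: whether the retention period has elapsed, and whether the current copy still matches the original. The following Python sketch shows one way those checks could look. It assumes authenticity is verified by comparing a SHA-256 checksum recorded at ingest, which the announcement does not specify, and the function names are hypothetical rather than part of iRODS.

```python
import hashlib
from datetime import date

RETENTION_YEARS = 5  # the five-year period from the example above


def retention_expired(ingested: date, today: date,
                      years: int = RETENTION_YEARS) -> bool:
    """Return True once `years` full years have passed since ingest."""
    try:
        expiry = ingested.replace(year=ingested.year + years)
    except ValueError:
        # Ingested on Feb 29 and the expiry year is not a leap year.
        expiry = ingested.replace(year=ingested.year + years, day=28)
    return today >= expiry


def checksum(data: bytes) -> str:
    """SHA-256 digest of the record, stored when the record is ingested."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(current: bytes, recorded_checksum: str) -> bool:
    """Compare the current copy against the checksum recorded at ingest."""
    return checksum(current) == recorded_checksum


record = b"minutes of the 2008 board meeting"
recorded = checksum(record)

print(retention_expired(date(2008, 2, 7), date(2013, 2, 7)))
print(is_authentic(record, recorded))
```

Once `retention_expired` returns true, a policy could move the records to another site or destroy them. That action is left out here because it depends entirely on the organization's own policy.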
Users can apply the growing set of existing rules or write new ones. Rules can also be developed as community-wide policies to manage data.
"One reason policy-based data management is important is that it lets communities integrate across different types of collection structures," said Moore. "What this means is that iRODS lets one community talk to any other community independent of what data management system the other community is using. No matter which technology you pick you aren't isolated."
iRODS is designed to be flexible, growing seamlessly from small to very large needs.
"You can start using it as a single user who only needs to manage a small stand-alone data collection," said Arcot Rajasekar, who leads the iRODS development team. "The same system lets you grow into a very large federated collaborative system that can span dozens of sites around the world, with hundreds or thousands of users and numerous data collections containing millions of files and petabytes of data -- it's a true full-scale distributed data system." A petabyte is one million gigabytes, about the storage capacity of 10,000 of today's PCs.
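The storage figures quoted here are easy to check with decimal (SI) units. The short Python calculation below confirms the petabyte-to-gigabyte ratio, shows the per-PC capacity implied by the 10,000-PC comparison, and restates the 161 exabytes produced in 2006 in petabytes. The variable names are illustrative only.

```python
# Decimal (SI) storage units, in bytes.
GIGABYTE = 10**9
PETABYTE = 10**15
EXABYTE = 10**18

# "A petabyte is one million gigabytes"
gigabytes_per_petabyte = PETABYTE // GIGABYTE

# "about the storage capacity of 10,000 of today's PCs"
# implies this much disk per PC, in gigabytes.
implied_pc_capacity_gb = PETABYTE // 10_000 // GIGABYTE

# The 161 exabytes produced in 2006, expressed in petabytes.
data_2006_in_petabytes = 161 * EXABYTE // PETABYTE

print(gigabytes_per_petabyte)   # 1000000
print(implied_pc_capacity_gb)   # 100
print(data_2006_in_petabytes)   # 161000
```

In other words, the PC comparison assumes roughly 100 gigabytes of disk per machine.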
At SDSC alone, iRODS and its predecessor SRB technology are already managing one petabyte of data and two hundred million files for 5,000 users.
"It's an advantage that the new iRODS system is open source," added Rajasekar. "This is bringing in collaborators from the US and as far away as France, the UK, Japan, and Australia who are contributing code, so iRODS will quickly add more features."
"We also find that users like the open source approach and have more confidence in adopting the new technology. Open source software makes it possible to assemble a larger development team and interact with a wider range of user communities. This increases user confidence that the iRODS system will be around in the future."
Currently the iRODS team is working with partners to help a number of projects apply the technology, including the National Archives and Records Administration (NARA), the Ocean Observatories Initiative (OOI), the National Science Digital Library, the Temporal Dynamics of Learning Center (TDLC), the UC Humanities, Arts and Social Sciences (HASS) grid, the Testbed for the Redlining Archives of California's Exclusionary Spaces (T-RACES) project, and numerous others.
Version 1.0 of iRODS is supported on Linux, Solaris, Macintosh, and AIX platforms, with Windows coming soon. The iRODS Metadata Catalog (iCAT) will run on either the open source PostgreSQL database (which can be installed via the iRODS install package) or Oracle. And iRODS is easy to install -- just answer a few questions and the install package automatically sets up the system.
Under the hood, the iRODS architecture stores data on one or more servers, which may be widely separated geographically; keeps track of system and user-defined information describing the data with the iRODS Metadata Catalog (iCAT); and offers users access through clients (currently a command line interface and Web client, with more to come). As directed by iRODS rules, the system can process data where it is stored using applications called "micro-services" executed on the remote server, making possible smaller and more targeted data transfers.
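The benefit of processing data where it is stored can be illustrated with a toy model in Python. This is a sketch of the general idea only, not the iRODS micro-service interface: `StorageServer`, its methods, and the `count_lines` service are hypothetical names, and the "server" is an in-memory object rather than a remote machine.

```python
from typing import Callable, Dict


class StorageServer:
    """Toy model of a remote server that holds data and runs small services."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}
        self._services: Dict[str, Callable[[bytes], bytes]] = {}

    def put(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def register(self, service_name: str,
                 service: Callable[[bytes], bytes]) -> None:
        self._services[service_name] = service

    def fetch(self, path: str) -> bytes:
        # Conventional approach: ship the whole object to the client.
        return self._objects[path]

    def run(self, service_name: str, path: str) -> bytes:
        # Server-side approach: process in place, ship only the result.
        return self._services[service_name](self._objects[path])


def count_lines(data: bytes) -> bytes:
    """Example service: return the number of newline characters as text."""
    return str(data.count(b"\n")).encode()


server = StorageServer("remote-site")
server.put("/logs/run1.txt", b"sample line\n" * 10_000)
server.register("count_lines", count_lines)

full_copy = server.fetch("/logs/run1.txt")          # 120,000 bytes transferred
summary = server.run("count_lines", "/logs/run1.txt")  # 5 bytes transferred

print(len(full_copy), len(summary), summary)
```

Both paths answer the same question (how many lines the file has), but the second moves only the answer across the network instead of the file itself.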
"Because it's a second generation effort, iRODS isn't like a new, untested product since we have the knowledge from years of experience with dozens of projects using the SRB," said iRODS software architect Mike Wan. "iRODS includes the familiar functions from the SRB, so people can jump in and easily start using the new system."
Added iRODS senior software engineer Wayne Schroeder, "For a 1.0 release it has a large number of features -- we already knew where we were going as we developed it, and this has made it cleaner and faster."
To help users get started with iRODS, the DICE group is offering several tutorials and workshops in the US and internationally. Following on the very popular Society of American Archivists (SAA) workshop at SDSC last summer, there will be two SAA sessions this summer, with additional tutorials in the US, Europe, and Asia.
The DICE team plans to continue supporting the widely used SRB system well into the future. But as SRB users decide to upgrade, the team is developing a seamless migration path to the more capable and faster iRODS system. As part of this effort, the iRODS team has already migrated one million files in a NARA digital data collection from an SRB data grid to an iRODS data grid.
"We migrated not only the data files but also the metadata, access controls, and directory structure," said Moore. "This is an important demonstration that users can migrate collections to different choices of data grid technology without any problem."
In addition to Moore, Rajasekar, Wan, and Schroeder, group members who contributed to the iRODS system include: Sheau-Yen Chen, Lucas Gilbert, Chien-Yi Hou, Arun Jagatheesan, George Kremenek, Sifang Lu, Richard Marciano, Dave Nadeau, Antoine de Torcy, and Bing Zhu. Other collaborators in the iRODS project include the French Institut National de Physique Nucléaire et de Physique des Particules (IN2P3), the UK e-Science Data Management Group at Rutherford Appleton Laboratory, and the High Energy Accelerator Research Organization, KEK, in Japan.
iRODS is funded by NARA and the National Science Foundation (NSF). More information, the iRODS software download, and documentation are available at http://irods.sdsc.edu/. For more information on SRB, visit http://www.sdsc.edu/srb/.
-----
Source: San Diego Supercomputer Center