HPCwire



An Open Source Run-Time for Distributed HPC



OpenRTE (www.open-rte.org) is an open source project designed to provide a portable distributed computing run-time environment for HPC workloads. Ralph Castain of Los Alamos National Laboratory leads its development; he has also long conducted research in large-scale distributed computing as a research scientist at Colorado State University. HPCwire got a chance to speak with Ralph about OpenRTE. In this Q&A, he describes what it is, how it works, and what it could mean to the HPC community.

HPCwire: What are the origins and goals of OpenRTE?

Castain: OpenRTE began as a sub-project within the Open MPI initiative, which aims to create a free, open source, peer-reviewed, production-quality, complete MPI-2 implementation. Open MPI strives to provide an extremely high-performance, competitive system by directly involving the HPC community (vendors, third-party researchers, users, etc.) in external development and feedback.

Building an MPI layer capable of meeting those objectives required the development of an equally ambitious run-time environment that would be capable of providing the necessary infrastructure. OpenRTE evolved from that effort. We recognized early on that we had an opportunity to provide an infrastructure that could transparently support extended features for MPI -- something that would transcend the traditional MPI environment of running an application on a single cluster. For example, we could use the infrastructure we were developing to transparently "stitch" together multiple applications, or to extend applications to span multiple clusters.

The goals of OpenRTE therefore became:

(a) Provide a seamless, transparent distributed computing environment that allows users to execute their applications in a single cluster or on multiple clusters, and/or to integrate individual applications together, all without changing their application code. Our current computing environments require that the user recompile their code, and perhaps even make changes in the source code itself, when moving from a cluster to a Grid or some other venue. In addition, connecting multiple applications generally requires that the integration be done at the source code level. Our objective is to make this as transparent as possible. Users should be able to execute the same code on a cluster or a Grid without changes, and be able to connect applications at run-time -- again, without changing the source code.

(b) Create a robust, production-quality platform upon which high-performance computing applications can execute. The run-time must do its job and then get out of the way -- it cannot impact critical timing loops. At the same time, it has to provide a rock-solid foundation: the run-time must never fail; instead, it must provide error messages and, if necessary, exit gracefully. That foundation includes support for responding to system faults so that the computing application can continue executing.

(c) Create an extensible system based on a component architecture that allows developers to "overload" any OpenRTE function, thus enabling the research needed to support new, enhanced features in a production environment. We recognized up front that OpenRTE's goals are ambitious, and that, as the research literature demonstrates, there are multiple ways of implementing just about every major functional block in the system. We wanted to create an architecture for OpenRTE that would make it easy for researchers to replace one of those functional blocks with their own implementation -- without having to write all the other blocks required to make a system operational. Thus, a researcher interested in distributed data storage -- such as that found in OpenRTE's registry -- can "overload" that functional block with their own implementation, and then test the results in a production environment without having to write code to launch and monitor processes.
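The "overload" idea in goal (c) can be illustrated with a minimal sketch. It is written in Python for brevity and is not OpenRTE's actual interface: `ComponentTable`, the block names, and the stand-in functions are all hypothetical. It assumes only that each functional block is looked up by name at call time, so replacing one block leaves the others untouched.

```python
from typing import Callable, Dict


class ComponentTable:
    """Hypothetical table mapping a functional block name to its selected implementation."""

    def __init__(self) -> None:
        self._selected: Dict[str, Callable[..., str]] = {}

    def register(self, block: str, impl: Callable[..., str]) -> None:
        # A later registration for the same block replaces ("overloads") the earlier one.
        self._selected[block] = impl

    def call(self, block: str, *args: str) -> str:
        # Callers name the block, never a specific implementation.
        return self._selected[block](*args)


# Stand-ins for the implementations a system might ship with (hypothetical).
def default_store(key: str, value: str) -> str:
    return f"default-store:{key}={value}"


def default_launch(app: str) -> str:
    return f"default-launch:{app}"


# A researcher's replacement for the data storage block only (hypothetical).
def research_store(key: str, value: str) -> str:
    return f"research-store:{key}={value}"


table = ComponentTable()
table.register("registry", default_store)
table.register("launcher", default_launch)

# Overload just the registry block; the launcher is reused as-is.
table.register("registry", research_store)

store_result = table.call("registry", "node0", "up")
launch_result = table.call("launcher", "my_app")
```

Here the researcher supplies one function and inherits the rest of the system, which is the property the goal describes.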
 
HPCwire: Can you describe the architecture of OpenRTE and how it works?

Castain: OpenRTE consists of four major subsystem blocks, all built upon a component architecture. Sitting at the core of the system is a publish/subscribe general purpose registry (GPR) that is used to synchronize events across the system. The GPR is designed to asynchronously notify subscribers of events such as data being entered or changed in the registry, and to transmit along with that notification whatever data the subscriber requests. This is the underlying mechanism supporting, among other things, the exchange of communication connection data among processes.
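A publish/subscribe registry of the kind described above might be sketched as follows. This is an illustration under stated assumptions, not the GPR's real API: the `Registry` class, its method names, the key names, and the addresses are hypothetical, and notifications are delivered synchronously here for simplicity, whereas the GPR is described as notifying asynchronously.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A callback receives the key that changed plus the data the subscriber asked for.
Callback = Callable[[str, Dict[str, str]], None]


class Registry:
    """Hypothetical publish/subscribe key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._subs: Dict[str, List[Tuple[List[str], Callback]]] = defaultdict(list)

    def subscribe(self, trigger_key: str, wanted_keys: List[str], callback: Callback) -> None:
        # Remember which keys this subscriber wants sent along with each notification.
        self._subs[trigger_key].append((list(wanted_keys), callback))

    def put(self, key: str, value: str) -> None:
        # Store the entry, then notify every subscriber to this key.
        self._data[key] = value
        for wanted, callback in self._subs.get(key, []):
            payload = {k: self._data[k] for k in wanted if k in self._data}
            callback(key, payload)


received: List[Tuple[str, Dict[str, str]]] = []
reg = Registry()

# One process publishes its contact information (hypothetical address).
reg.put("proc0.contact", "tcp://10.0.0.1:5000")

# A subscriber asks to be told when proc1 appears, and to receive both contacts.
reg.subscribe(
    "proc1.contact",
    ["proc0.contact", "proc1.contact"],
    lambda key, data: received.append((key, data)),
)

reg.put("proc1.contact", "tcp://10.0.0.2:5000")
```

After the second `put`, `received` holds one notification carrying both contact entries, which mirrors how connection data could be exchanged among processes.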

Subsystems in the Resource Management block make heavy use of the GPR to discover computing resources that may be available to the user, allocate those resources for use by a particular application, map the application's processes to specific allocated resources, and launch the processes to begin execution. In this block, the GPR is primarily used as an intermediate storage medium -- the resource discovery service, for example, places entries on the registry with information identifying the resources it has found, their state of operation, how much of their capability, if any, has been reserved for this user, etc. This information is then used by the allocation service to determine if additional resources need to be requested for the user and from where they might come. The mapper service then takes the allocated resources and maps processes to them so that the launch service can start the application.
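The discover, allocate, map, and launch sequence can be sketched with the registry acting as intermediate storage between stages. Everything below is a simplified assumption for illustration: the function names, the node records, and the fill-each-node-in-order mapping policy are hypothetical, and a plain dictionary stands in for the GPR.

```python
from typing import Any, Dict, List

# A plain dict stands in for the registry used as intermediate storage.
registry: Dict[str, Any] = {}


def discover(reg: Dict[str, Any]) -> None:
    # Record the resources found and their state (hypothetical sample data).
    reg["nodes"] = [
        {"name": "node0", "state": "up", "slots": 2},
        {"name": "node1", "state": "down", "slots": 2},
        {"name": "node2", "state": "up", "slots": 1},
    ]


def allocate(reg: Dict[str, Any], procs_needed: int) -> None:
    # Keep only usable nodes and note whether more resources must be requested.
    usable = [n for n in reg["nodes"] if n["state"] == "up"]
    reg["allocation"] = usable
    reg["shortfall"] = max(0, procs_needed - sum(n["slots"] for n in usable))


def map_processes(reg: Dict[str, Any], procs_needed: int) -> None:
    # Assumed policy: fill each allocated node's slots in order.
    mapping: Dict[int, str] = {}
    rank = 0
    for node in reg["allocation"]:
        for _ in range(node["slots"]):
            if rank == procs_needed:
                break
            mapping[rank] = node["name"]
            rank += 1
    reg["map"] = mapping


def launch(reg: Dict[str, Any], app: str) -> List[str]:
    # A real launcher would start processes; this sketch only describes them.
    return [f"start {app} rank {r} on {n}" for r, n in sorted(reg["map"].items())]


discover(registry)
allocate(registry, procs_needed=3)
map_processes(registry, procs_needed=3)
commands = launch(registry, "my_app")
```

Each stage reads what the previous stage left in the registry and writes its own result, so no stage calls another directly.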
