Two articles that appeared last week on Oracle Grid Engine (aka Sun Grid Engine) speculated about its future at Oracle: Douglas Eadline’s “The State of Oracle/Sun Grid Engine” in Linux Magazine, and Nicole Hemsoth’s “Oracle Placing GridEngine on New Track” here in HPC in the Cloud. These articles encouraged me to contribute my own thoughts to the discussion, with a particular focus on Grid Engine’s contribution to Clouds.
Over the past years, the Sun Grid Engine team and the Grid Engine open source community have continuously added features to the already powerful distributed resource management system that Sun acquired in 2000 together with the German/US-based company Gridware. Recently, the software passed into Oracle’s hands, and many have speculated that Oracle will drop the HPC-related bits and pieces that came with the acquisition of Sun. Some might doubt that Oracle management will ever be able to recognize the value of HPC technology for mainstream IT, and especially the value of resource managers like Grid Engine for any kind of distributed resource environment.
More and more of these features went far beyond the narrow focus of HPC: policy and priority management; scalability to hundreds of thousands of jobs, large and small; user authentication and access control; resource assignment across persistent services; management of software build, test, and verification; data management; alignment of resource usage with business policies; an accounting and reporting module ideally suited to the Cloud’s pay-per-use model; and many more. These features are very useful for minimizing cost and maximizing the business value of an organization’s computing resources and software assets, not only in HPC.
Then, about three years ago, discussions and development began in the Grid Engine open source community on resource elasticity in clusters, culminating in the Service Domain Manager, which can provide any cluster size on the fly according to an application’s requirements. Through the Service Domain Manager software, an administrator can configure service-level objectives that govern service levels in the managed clusters.
In addition, the Service Domain Manager software is able to remove unused resources from managed clusters and place them in a spare pool of resources. Resources in this spare pool optionally can have power management applied to reduce a cluster’s overall power consumption during off-peak periods.
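To make the idea concrete, here is a minimal sketch of how a service-level objective might drive hosts between a cluster and a spare pool. It is an illustration only and not the Service Domain Manager's actual API: the `Cluster` class, the `max_pending_per_host` objective, and the `rebalance` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of a managed cluster (not an SDM data structure)."""
    name: str
    hosts: list = field(default_factory=list)
    pending_jobs: int = 0
    max_pending_per_host: int = 4  # assumed service-level objective


def hosts_needed(cluster):
    """Smallest host count that keeps pending jobs per host within the objective."""
    # Ceiling division without importing math.
    return -(-cluster.pending_jobs // cluster.max_pending_per_host)


def rebalance(cluster, spare_pool):
    """Move hosts between the cluster and the spare pool; return the unmet host count."""
    target = hosts_needed(cluster)
    # Release hosts the cluster no longer needs into the spare pool.
    while len(cluster.hosts) > target:
        spare_pool.append(cluster.hosts.pop())
    # Pull hosts back from the spare pool while the cluster is short.
    while len(cluster.hosts) < target and spare_pool:
        cluster.hosts.append(spare_pool.pop())
    return target - len(cluster.hosts)
```

In this sketch, hosts sitting in `spare_pool` are the ones that could be powered down during off-peak periods, and a positive return value signals demand that local resources cannot satisfy.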
Should no free resources be available locally, the Service Domain Manager software can also provision resources from a compute cloud provider, such as Amazon’s Elastic Compute Cloud (EC2), and add them to an overloaded cluster (a feature known as Cloud Bursting). This Cloud Connectivity allocates nodes on the Cloud (EC2) on demand, providing full elasticity: the compute resources allocated through it can grow from zero to whatever is needed and covered by the user’s budget, fully policy controlled and with no user intervention required. It includes secure communication via OpenVPN, which is part of both the EC2 AMI and the OGE instance running on the user’s laptop or desktop. Beyond that, Grid Engine now offers deep integration with other technologies commonly used in the cloud, such as Apache Hadoop, a powerful tool designed for deep analysis and transformation of very large data sets, or the UniCloud environment from UnivaUD. In fact, because of Grid Engine’s standard APIs, integration with almost any other management tool seems possible.
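The cloud-bursting policy described above (local spare resources first, then cloud nodes up to what the budget covers) can be sketched in a few lines. This is a hedged illustration under simple assumptions, not Grid Engine or EC2 code: the function `plan_burst` and its parameters, including a flat per-node hourly price and an hourly budget, are hypothetical.

```python
def plan_burst(shortfall, spare_hosts, hourly_price, hourly_budget):
    """Decide where missing hosts come from.

    Returns a tuple (from_spare, from_cloud, unmet). All inputs are
    assumptions of this sketch: a flat price per cloud node per hour
    and a fixed hourly budget set by the user.
    """
    # Prefer local spare hosts, which cost nothing extra.
    from_spare = min(shortfall, spare_hosts)
    remaining = shortfall - from_spare

    # Number of cloud nodes the budget covers for one hour.
    if hourly_price > 0:
        affordable = int(hourly_budget // hourly_price)
    else:
        affordable = remaining

    from_cloud = min(remaining, affordable)
    return from_spare, from_cloud, remaining - from_cloud


# Example: 10 hosts short, 3 local spares, 0.50 per node-hour, budget 2.00 per hour.
print(plan_burst(10, 3, 0.5, 2.0))  # (3, 4, 3)
```

With no shortfall the plan requests nothing from the cloud, which corresponds to the "from 0 to whatever is needed" elasticity mentioned above; the budget cap is what keeps the growth policy controlled.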
With all this in mind, I suggest calling OGE the ‘Oracle Resource Engine’, where ‘Resource’ includes hardware, whether in-house, in Grids, or in Clouds, as well as workloads, applications, data, and users, both individual and in real or virtual organizations. This goes far beyond HPC, and therefore I suggest that this Oracle Resource Engine will survive another 10 years, as it survived the previous 10 years despite (or because of) Sun Microsystems.