Over the past 12 months, we have studied major grid projects to better understand how to design, build, manage and operate large community grids successfully, drawing on the experience of early adopters and on case studies and lessons learned from these projects. For this purpose, we selected and analyzed the UK e-Science Programme, the US TeraGrid, NAREGI in Japan, ChinaGrid, the European EGEE, and the German D-Grid initiative.
More details can be found in the corresponding report; please see the weblink at the end of this article. The report explains what a grid is and how it functions, lists the benefits of grid computing for research and industry, explains the business and services side of grids, discusses the grid projects investigated, offers a look into the future of grids, and finally compiles a list of lessons learned and recommendations for those who intend to build grids in the near future. This first part of the article summarizes some key statistics of the grid initiatives investigated, discusses the lessons learned in detail, and summarizes the recommendations offered by the grid project leaders. The second part of the article, in next week's GRIDtoday, will present some additional information about these six grid initiatives.
Major Grid Initiatives
Our research is based on information from project Web sites, project reports, interviews with representatives of all these grid initiatives, and our own hands-on experience in helping to build the German D-Grid. The major focus of our research and of the interviews was on strategic directions, applications, government and industry funding, national and international cooperation, strengths and weaknesses of the grid projects as described by the interviewees, sustainability of the resulting grid infrastructure, commercial services, and the future of e-Science. All information provided reflects the state of Fall 2006 and is therefore already somewhat out of date.
In the following, we briefly summarize six of the major grid projects around the world and present statistics and history. More information can be found in the report mentioned above or collected from the Web (as I did). First, the following table presents the different phases of the projects, their funding (in $M), the approximate number of experts involved, and the type of users (from research or industry):
Initiative        Time          Funding ($M)   People   Users
UK e-Science-I    2001 - 2004   180            900      Res.
UK e-Science-II   2004 - 2006   220            1100     Res. Ind.
TeraGrid-I        2001 - 2004   90             500      Res.
TeraGrid-II       2005 - 2010   150            850      Res.
ChinaGrid-I       2003 - 2006   3              400      Res.
ChinaGrid-II      2007 - 2010   15             1000     Res.
NAREGI-I          2003 - 2005   25             150      Res.
NAREGI-II         2006 - 2010   40             250      Res. Ind.
EGEE              2004 - 2006   40             900      Res.
EGEE-II           2006 - 2008   45             1100     Res. Ind.
D-Grid            2005 - 2008   32             220      Res.
D-Grid-II         2007 - 2009   35             440      Res. Ind.
Lessons Learned
In the following, we summarize the most important results and lessons learned from the grid projects analyzed and from the interviews:
Most of the successful projects in the early days had a strong focus on just one topic (middleware OR application) or on a few selected aspects and requirements. They were pragmatic and mostly application and user driven, with a focus on the development of standard and commodity components, open source, and results that were easy to understand and to use. Application-oriented and grid-enabled workflows and the separation of the middleware and application layers helped the projects to deliver more sustainable results, and usability and integration became relevant. It seems to be very important that application scientists collaborate closely with computer scientists. Professional service centers proved successful. In the UK, for example, the National Grid Service (NGS), the Grid Operations Support Centre (GOSC) and the Open Middleware Infrastructure Institute (OMII) are extremely important factors in guaranteeing the sustainability of the project results.
However, there were also problems and challenges, especially with the early initiatives and projects:
There was a lot of hype, especially in 2001 and 2002, and expectations of the projects and their results were therefore too high. Projects which focused on both applications and infrastructure faced a high risk. Almost all projects in the early days developed their own infrastructure because the middleware of those days (e.g. Globus, Condor, SRB, with new releases every 6 to 12 months) turned out to be immature. Middleware developed in these projects was often proprietary. In the early days, integrating individual projects into a larger community or environment was not yet possible. Later projects either focused on the infrastructure with the applications as a driver, or focused on the application using existing core grid building blocks. One of the main reasons for failure was a sudden change in 2003 from the classical, more proprietary grid technologies to standard web services. In addition, missing software engineering methods and especially low usability resulted in low acceptance of project results. The user point of view is paramount: a “build it and they will come” approach will not work. It is important to work with the user communities to ensure that the resulting system is of a general nature and not limited in scope to a small number of applications.
Much of the grid middleware currently promoted is really intended for research and demonstrations and needs significant effort before it is suitable for large-scale production usage. Standards are evolving slowly, and it is likely that initiatives to improve interoperability between existing grids will produce meaningful results of benefit to the user communities on a shorter time scale. The experience gained with this interoperability work will help identify the highest-priority points for standardization as well as a meaningful way to test whether candidate standards can be implemented and deployed.
It is challenging (but important) to establish an environment of constructive competition such that good ideas and performance are recognized and rewarded. There are still many areas where the “captive user” approach is viewed as a competitive advantage.
Recommendations
In this section, we summarize the major results and conclusions from the lessons learned, and present recommendations especially for those who intend to start or fund a new grid initiative. Some of the recommendations seem trivial, but are still worth mentioning. They all result from our analysis and findings and from the evaluation of the interviews:
In any grid project, during development as well as during operation, the core grid infrastructure should be modified or improved only at long intervals, and only if necessary, because applications and users depend on this infrastructure. Continuity and sustainability, especially for the infrastructure part of grid projects, are extremely important. Therefore, additional funding should also be available after the end of the project, to guarantee service and support as well as continuous improvement and adjustment to new developments. Close collaboration in the grid development phase between the grid infrastructure developers and the application developers is mandatory if the applications are to utilize the core grid services of the infrastructure and application silos are to be avoided.
For new grid projects, we recommend a close collaboration among grid-experienced computer scientists who build the (generic) grid infrastructure and the driving users who define their set of requirements for the grid infrastructure services. Application communities shouldn't start developing a core infrastructure from scratch on their own, but should — together with grid-experienced computer scientists — decide on using and integrating existing grid building blocks to avoid building proprietary application silo architectures and to focus more on the real applications.
In their early stage, grid projects need enough funding to get past the early-adopter phase into a mature state with a rock-solid grid infrastructure that other communities can join easily. We currently estimate this funding phase to be on the order of three years, with more funding in the beginning for the grid infrastructure, and more funding later for the application communities. Such grid infrastructure funding should include Centers of Excellence for building, managing and operating grid centers, for middleware tools, for application support, and for training. In this way, parallel developments that re-invent wheels can be avoided and funding can be spent efficiently.
After a generic grid infrastructure has been built, projects should focus first on one or only a few applications or specific services, to avoid complexity and re-inventing wheels. The use of software components from open-source and standards initiatives is highly recommended to enable interoperability, especially in the infrastructure and application-oriented middleware layers. For the same reason, it is important to apply sound software engineering methods, especially when implementing protocols and developing standard interfaces.
New application grids (community grids) should utilize the (existing) components of a generic grid infrastructure to avoid re-inventing wheels and building silos. The infrastructure building blocks should be user-friendly to enable easy adoption by new (application) communities. In addition, the infrastructure group should offer installation, operation, support and training services. Centers of Excellence should specialize in specific services, e.g. middleware development and maintenance, integration of new communities, grid operation, training, utility services, etc. In the case of more complex projects, e.g. those consisting of an integration project and several application or community projects, a strong management board should steer coordination and collaboration among the projects and the working groups. The management board (Steering Committee) should consist of the leaders of the different sub-projects. Success, especially in early-stage technology projects, depends strongly on the personality and leadership capabilities of the leaders.
We recommend implementing a utility computing paradigm only in small steps, starting by moderately enhancing existing service models, and testing utility models and accounting and billing concepts first as pilots. Experience in this field and with its mental, legal and regulatory barriers is still missing. Very often, today's government funding models are counter-productive when it comes to establishing new and efficient forms of utility services. Today's funding models in research and education are often project based and thus not ready for a utility approach in which resource usage is paid for as you go. Old funding models first have to be adjusted accordingly before a utility model can be introduced successfully.
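To make the idea of a small accounting and billing pilot concrete, here is a minimal sketch in Python. It is not taken from any of the grid projects discussed; the resource classes, the prices per CPU-hour, and the project names are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical prices per CPU-hour for each resource class (illustrative only).
RATES = {"standard": 0.10, "large-memory": 0.25}


def bill(usage_records, rates=RATES):
    """Sum pay-as-you-go charges per project.

    Each usage record is a tuple (project, resource_class, cpu_hours).
    """
    charges = defaultdict(float)
    for project, resource_class, cpu_hours in usage_records:
        charges[project] += cpu_hours * rates[resource_class]
    # Round to two decimals, as one would on an invoice.
    return {project: round(total, 2) for project, total in charges.items()}


# Hypothetical usage collected during a pilot phase.
pilot_usage = [
    ("climate", "standard", 100.0),
    ("climate", "large-memory", 20.0),
    ("genomics", "standard", 40.0),
]
invoice = bill(pilot_usage)
for project, amount in sorted(invoice.items()):
    print(project, amount)
```

A pilot of this kind can run alongside the existing project-based funding model, so that users see what their usage would cost before any real billing is introduced.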
Finally, participation of industry should be industry-driven. A push from the outside, even with government funding, does not seem promising. Success will come only from natural needs, e.g. through already existing collaborations between research and industry, as a first step. For several good reasons, industry in general is still in a wait state when it comes to building and applying global grids, as demonstrated by the moderate success so far of existing industrial global grid initiatives around the world. We recommend working closely with industry to develop appropriate funding and collaboration models which take into account the different technological, mental and legal requirements when adjusting the existing research-community-oriented approaches, ideally starting with already existing and successful research-industry collaborations. If there are good reasons to create your own grid (on a university campus or in an enterprise) rather than join an existing one, it is better to start with cluster-based cycle scavenging and, once the users and their management are convinced of the value of sharing resources, to extend the system to multiple sites.
Try to study, copy and/or use an existing grid if possible, and connect your own resources once you are convinced of its value. There is much useful experience to learn from partners. Keep up with what your peers have done and are doing. Focus on understanding your user community and their needs. Invest in a strong communication and participation channel to the leaders of that community to keep them engaged. Instrument your services so that you collect good data about who is using which services and how. Analyze this data and learn from watching what is really going on, in addition to what users report as happening. Plan for an incremental approach and allow plenty of time for talking through issues and plans. Social effects dominate in non-trivial grids.
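The advice to instrument services and analyze who is using what can be sketched as follows. This is a minimal illustration in Python, not the method of any of the projects above; the event fields, the user names, and the service names are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    user: str           # hypothetical user identifier
    service: str        # hypothetical service name, e.g. "job-submit"
    cpu_seconds: float  # resources consumed by this call


def summarize(events):
    """Return per-service call counts, distinct users, and CPU seconds."""
    calls = Counter()
    users = defaultdict(set)
    cpu = defaultdict(float)
    for event in events:
        calls[event.service] += 1
        users[event.service].add(event.user)
        cpu[event.service] += event.cpu_seconds
    return {
        service: {
            "calls": calls[service],
            "users": len(users[service]),
            "cpu_seconds": cpu[service],
        }
        for service in calls
    }


# Hypothetical events, as a service might log them.
events = [
    UsageEvent("alice", "job-submit", 120.0),
    UsageEvent("bob", "job-submit", 30.0),
    UsageEvent("alice", "data-transfer", 5.0),
    UsageEvent("alice", "job-submit", 60.0),
]
report = summarize(events)
for service, stats in sorted(report.items()):
    print(service, stats)
```

Comparing such a summary with what users say in interviews is one simple way to see where reported and actual usage differ.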
Acknowledgement:
This report has been funded by the Renaissance Computing Institute RENCI at the University of North Carolina in Chapel Hill. I want to thank all the people who have contributed to this report and who are listed in the report on http://www.renci.org/publications/reports.php.
About the Author:
Wolfgang Gentzsch is heading the German D-Grid Initiative. He is an adjunct professor at Duke and a visiting scientist at RENCI at UNC Chapel Hill, North Carolina. He is Co-Chair of the e-Infrastructure Reflection Group and a member of the Steering Group of the Open Grid Forum.