Grid Initiatives Part 1

By Wolfgang Gentzsch, D-Grid, Duke University, and RENCI

January 29, 2007

Over the past 12 months, we have studied major grid projects to better understand how to successfully design, build, manage and operate large community grids, drawing on the experience of early adopters and on case studies and lessons learned from these projects. For this purpose, we selected and analyzed the UK e-Science Programme, the US TeraGrid, NAREGI in Japan, ChinaGrid, the European EGEE, and the German D-Grid initiative.

More details can be found in the corresponding report; please see the Web link at the end of this article. The report explains what a grid is and how it functions, lists the benefits of grid computing for research and industry, explains the business and services side of grids, discusses the grid projects investigated, offers a look into the future of grids, and finally compiles a list of lessons learned and recommendations for those who intend to build grids in the near future. This first part of the article summarizes some key statistics of the grid initiatives investigated, discusses the lessons learned in detail, and summarizes the recommendations offered by the grid project leaders. The second part, in next week's GRIDtoday, will present additional information about these six grid initiatives.

Major Grid Initiatives

Our research is based on information from project Web sites and project reports, on interviews with representatives of all these grid initiatives, and on our own hands-on experience in helping to build the German D-Grid. The major focus of our research and of the interviews was on strategic directions, applications, government and industry funding, national and international cooperation, strengths and weaknesses of the grid projects as described by the interviewees, sustainability of the resulting grid infrastructure, commercial services, and the future of e-Science. All information carries a time stamp of Fall 2006 and may therefore already be out of date.

In the following, we briefly summarize six of the major grid projects around the world and present their statistics and history. More information can be found in the report mentioned above or collected from the Web (as we did). The following table presents the different phases of the projects, their funding (in $M), the approximate number of experts involved, and the type of users (research or industry):

Initiative         Time          Funding ($M)   People   Users
UK e-Science-I     2001 - 2004   180             900     Res.
UK e-Science-II    2004 - 2006   220            1100     Res., Ind.
TeraGrid-I         2001 - 2004    90             500     Res.
TeraGrid-II        2005 - 2010   150             850     Res.
ChinaGrid-I        2003 - 2006     3             400     Res.
ChinaGrid-II       2007 - 2010    15            1000     Res.
NAREGI-I           2003 - 2005    25             150     Res.
NAREGI-II          2006 - 2010    40             250     Res., Ind.
EGEE               2004 - 2006    40             900     Res.
EGEE-II            2006 - 2008    45            1100     Res., Ind.
D-Grid             2005 - 2008    32             220     Res.
D-Grid-II          2007 - 2009    35             440     Res., Ind.

Lessons Learned

In the following, we summarize the most important results and lessons learned from the grid projects analyzed and from the interviews:

Most of the successful projects in the early days had a strong focus on just one topic (middleware or applications) or on a few selected aspects and requirements. They were pragmatic and mostly application- and user-driven, with an emphasis on standard and commodity components, open source, and results that are easy to understand and to use. Application-oriented, grid-enabled workflows and the separation of the middleware and application layers helped these projects deliver more sustainable results, and usability and integration became relevant. It appears to be very important that application scientists collaborate closely with computer scientists. Professional service centers also proved successful: in the UK, for example, the National Grid Service (NGS), the Grid Operations Support Centre (GOSC) and the Open Middleware Infrastructure Institute (OMII) are extremely important factors in guaranteeing the sustainability of the project results.
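To make the layering point concrete, here is a minimal sketch (all class and function names are hypothetical and not taken from any of the projects discussed above) of an application-level workflow that talks only to a thin middleware interface for data staging and job submission, so that application logic stays separate from the grid plumbing:

```python
# Minimal sketch (hypothetical API): an application workflow that only talks to a
# thin middleware layer, keeping application logic separate from grid plumbing.

from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """A middleware-neutral job description used by the application layer."""
    executable: str
    arguments: List[str]
    input_files: List[str]


class GridMiddleware:
    """Stand-in for the middleware layer (data staging, job submission).

    In a real community grid this would wrap an existing grid building block
    rather than being re-implemented by the application community.
    """

    def stage_in(self, files: List[str]) -> None:
        print(f"staging in {files}")

    def submit(self, job: Job) -> str:
        job_id = f"job-{abs(hash(job.executable)) % 10000}"
        print(f"submitted {job.executable} as {job_id}")
        return job_id

    def wait(self, job_id: str) -> int:
        print(f"waiting for {job_id}")
        return 0  # exit code


def analysis_workflow(grid: GridMiddleware, dataset: str) -> None:
    """Application-level workflow: pre-process, simulate, post-process."""
    for step in ("preprocess", "simulate", "postprocess"):
        job = Job(executable=step, arguments=[dataset], input_files=[dataset])
        grid.stage_in(job.input_files)
        job_id = grid.submit(job)
        if grid.wait(job_id) != 0:
            raise RuntimeError(f"step {step} failed")


if __name__ == "__main__":
    analysis_workflow(GridMiddleware(), "climate-run-42.dat")
```

Because the workflow depends only on the middleware interface, the underlying grid services can evolve without forcing changes in the application layer.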

However, there were also problems and challenges, especially with the early initiatives and projects:

There was a lot of hype, especially in 2001 and 2002, and thus expectations for the projects and their results were too high. Projects that focused on both applications and infrastructure faced a high risk. Almost all early projects developed their own infrastructure, because the middleware available at the time (e.g., Globus, Condor, SRB, with new releases every six to twelve months) turned out to be immature, and the middleware developed in these projects was often proprietary. In the early days, integrating individual projects into a larger community or environment was not yet possible. Later projects either focused on the infrastructure, with the applications as a driver, or focused on the applications, using existing core grid building blocks.

One of the main reasons for failure was the sudden change in 2003 from the classical, more proprietary grid technologies to standard Web services. In addition, missing software engineering methods and, especially, low usability resulted in low acceptance of project results. The user's point of view is paramount; a “build it and they will come” approach will not work. It is important to work with the user communities to ensure that the resulting system is of a general nature and not limited in scope to a small number of applications.

A lot of the grid middleware currently promoted is really intended for research and demonstrations, and needs significant effort to become suitable for large-scale production usage. Standards are evolving slowly, and it is likely that initiatives to improve interoperability between existing grids will produce meaningful results of benefit to the user communities on a shorter time scale. The experience gained from this interoperability work will help identify the highest-priority points for standardization, as well as a meaningful way to test whether candidate standards can be implemented and deployed.

It is challenging (but important) to establish an environment of constructive competition such that good ideas and performance are recognized and rewarded. There are still many areas where the “captive user” approach is viewed as a competitive advantage.

Recommendations

In this section, we summarize the major results and conclusions from the lessons learned, and present recommendations especially for those who intend to start or fund a new grid initiative. Some of the recommendations may seem trivial, but they are still worth mentioning. They all result from our analysis and findings and from the evaluation of the interviews:

In any grid project, during development as well as during operation, the core grid infrastructure should be modified or improved only in long cycles, and only when necessary, because applications and users depend on this infrastructure. Continuity and sustainability, especially for the infrastructure part of grid projects, are extremely important. Therefore, additional funding should also be available after the end of the project to guarantee service and support as well as continuous improvement and adjustment to new developments. During the grid development phase, close collaboration between the grid infrastructure developers and the application developers is mandatory if the applications are to utilize the core grid services of the infrastructure and avoid application silos.

For new grid projects, we recommend close collaboration between the grid-experienced computer scientists who build the (generic) grid infrastructure and the driving users who define their requirements for the grid infrastructure services. Application communities should not start developing a core infrastructure from scratch on their own; instead, together with grid-experienced computer scientists, they should decide on using and integrating existing grid building blocks, in order to avoid building proprietary application silo architectures and to focus more on the real applications.

In their early stages, grid projects need enough funding to get over the early-adopter phase and into a mature state with a rock-solid grid infrastructure, so that other communities can join easily. We currently estimate this funding phase to be on the order of three years, with more funding in the beginning for the grid infrastructure and more funding later for the application communities. Such grid infrastructure funding should include Centers of Excellence for building, managing and operating grid centers, for middleware tools, for application support, and for training. In this way, parallel developments that reinvent the wheel can be avoided and funding can be spent efficiently.

After a generic grid infrastructure has been built, projects should first focus on one or only a few applications or specific services, to avoid complexity and reinventing the wheel. Using software components from open-source and standards initiatives is highly recommended to enable interoperability, especially in the infrastructure and application-oriented middleware layers. For interoperability reasons, a focus on software engineering methods is important, especially for the implementation of protocols and the development of standard interfaces.
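As an illustration of why standard interfaces matter, the following sketch (hypothetical names, not the API of any real middleware) shows a single agreed-upon job-submission interface with per-middleware adapters, so that application code stays middleware-neutral and different grids remain interoperable:

```python
# Minimal sketch (all names hypothetical): one standard job-submission interface
# with adapters for two different middleware stacks.

from abc import ABC, abstractmethod


class JobSubmitter(ABC):
    """The agreed-upon standard interface used by all application communities."""

    @abstractmethod
    def submit(self, executable: str, resource: str) -> str:
        """Submit a job and return a middleware-specific job identifier."""


class GlobusStyleSubmitter(JobSubmitter):
    """Adapter for a Globus-like middleware (illustrative only)."""

    def submit(self, executable: str, resource: str) -> str:
        # A real adapter would translate to the middleware's own job language here.
        return f"gram://{resource}/{executable}"


class UnicoreStyleSubmitter(JobSubmitter):
    """Adapter for a UNICORE-like middleware (illustrative only)."""

    def submit(self, executable: str, resource: str) -> str:
        return f"unicore:{resource}:{executable}"


def run_on_grid(submitter: JobSubmitter, executable: str, resource: str) -> str:
    # Application code depends only on the standard interface, not on the back end.
    return submitter.submit(executable, resource)


if __name__ == "__main__":
    for backend in (GlobusStyleSubmitter(), UnicoreStyleSubmitter()):
        print(run_on_grid(backend, "bioinformatics-pipeline", "cluster.example.org"))
```

The adapter layer is where standard protocols and interfaces pay off: new back ends can be added without touching the application communities' code.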

New application grids (community grids) should utilize the existing components of a generic grid infrastructure to avoid reinventing the wheel and building silos. The infrastructure building blocks should be user-friendly to enable easy adoption by new application communities. In addition, the infrastructure group should offer installation, operation, support and training services. Centers of Excellence should specialize in specific services, e.g., middleware development and maintenance, integration of new communities, grid operation, training, utility services, etc. In the case of more complex projects, e.g., those consisting of an integration project and several application or community projects, a strong management board should steer coordination and collaboration among the projects and the working groups. The management board (steering committee) should consist of the leaders of the different sub-projects. Success, especially in early-stage technology projects, is strongly proportional to the personality and leadership capabilities of these leaders.

We recommend implementing a utility computing paradigm only in small steps, starting with moderate enhancements of existing service models and first testing utility models and accounting and billing concepts as pilots. Experience in this field, and with its mental, legal and regulatory barriers, is still missing. Very often, today's government funding models are counterproductive when establishing new and efficient forms of utility services. Funding models in research and education are often project-based and thus not ready for a utility approach in which resource usage is paid for on a pay-as-you-go basis. Old funding models first have to be adjusted accordingly before a utility model can be introduced successfully.
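The following is a minimal sketch of what such an accounting and billing pilot might look like; the usage-record fields and the rates are illustrative assumptions only, not figures from any of the projects discussed above:

```python
# Minimal sketch of a pay-as-you-go accounting pilot (fields and rates are assumed).

from dataclasses import dataclass
from collections import defaultdict
from typing import Dict, List


@dataclass
class UsageRecord:
    user: str
    cpu_hours: float
    storage_gb_days: float


def monthly_bill(records: List[UsageRecord],
                 cpu_rate: float = 0.05,      # currency units per CPU-hour (assumed)
                 storage_rate: float = 0.01   # currency units per GB-day (assumed)
                 ) -> Dict[str, float]:
    """Aggregate usage records per user and apply simple pay-as-you-go rates."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.user] += r.cpu_hours * cpu_rate + r.storage_gb_days * storage_rate
    return dict(totals)


if __name__ == "__main__":
    pilot = [
        UsageRecord("alice", cpu_hours=1200, storage_gb_days=300),
        UsageRecord("bob",   cpu_hours=150,  storage_gb_days=2000),
        UsageRecord("alice", cpu_hours=400,  storage_gb_days=0),
    ]
    for user, amount in monthly_bill(pilot).items():
        print(f"{user}: {amount:.2f} currency units")
```

Even a pilot this simple forces the hard questions into the open: what is metered, who pays, and how the resulting charges fit (or do not fit) the existing project-based funding models.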

Finally, the participation of industry should be industry-driven. A push from the outside, even with government funding, does not seem promising. Success will come only from genuine needs, for example through already existing collaborations between research and industry, as a first step. For several good reasons, industry in general is still in a wait state when it comes to building and applying global grids, as demonstrated by the moderate success so far of existing industrial global grid initiatives around the world. We recommend working closely with industry to develop appropriate funding and collaboration models that take into account the different technological, mental and legal requirements when adjusting the existing research-community-oriented approaches, ideally starting with already existing and successful research-industry collaborations. If there are good reasons to create your own grid (on a university campus or in an enterprise) rather than join an existing one, it is better to start with cluster-based cycle scavenging and, once the users and their management are convinced of the value of sharing resources, extend the system to multiple sites.

Try to study, copy and/or use an existing grid if possible, and connect your own resources once you are convinced of its value. There is much useful experience to learn from partners, so keep up with what your peers have done and are doing. Focus on understanding your user community and their needs, and invest in a strong communication and participation channel to engage the leaders of that group. Instrument your services so that you collect good data about who is using which services and how. Analyze this data and learn from watching what is really going on, in addition to what users report as happening. Plan for an incremental approach and for plenty of time to talk out issues and plans. Social effects dominate in non-trivial grids.
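As a simple illustration of such instrumentation, the sketch below (hypothetical log format and service names) records which user calls which service and then summarizes the usage data:

```python
# Minimal sketch of service instrumentation: count calls per service and per user
# from a usage log. The log format and service names are illustrative assumptions.

import csv
import io
from collections import Counter

# In practice these events would be appended by each grid service; here we use a
# small in-memory CSV purely for illustration.
USAGE_LOG = """timestamp,user,service
2006-10-01T09:12,alice,job-submission
2006-10-01T09:15,bob,data-transfer
2006-10-01T10:02,alice,job-submission
2006-10-02T11:40,carol,metadata-catalog
2006-10-02T12:05,alice,data-transfer
"""


def summarize_usage(log_text: str) -> None:
    """Count calls per service and per user from the instrumentation log."""
    per_service: Counter = Counter()
    per_user: Counter = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        per_service[row["service"]] += 1
        per_user[row["user"]] += 1
    print("calls per service:", dict(per_service))
    print("calls per user:   ", dict(per_user))


if __name__ == "__main__":
    summarize_usage(USAGE_LOG)
```

Summaries like this show which services are actually used, which complements, and sometimes contradicts, what users report in surveys and meetings.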

Acknowledgement:

This report has been funded by the Renaissance Computing Institute (RENCI) at the University of North Carolina at Chapel Hill. I want to thank all the people who contributed to this report; they are listed in the report at http://www.renci.org/publications/reports.php.

About the Author:

Wolfgang Gentzsch heads the German D-Grid Initiative. He is an adjunct professor at Duke University and a visiting scientist at RENCI at UNC Chapel Hill, North Carolina. He is co-chair of the e-Infrastructure Reflection Group and a member of the Steering Group of the Open Grid Forum.

 
