Cloud Control: Outsourcing an HPC Cluster

By Scott Clark

August 5, 2010

So far in this series of posts, we have discussed the following:

– IT is not a core competency of the business, so we should look to outsource it if we can do so without jeopardizing the business.

– We should look to cloud computing to bring costs under control and deliver cost efficiencies over time, not as an immediate cost-reduction exercise.

– In order to outsource IT, we must trust the suppliers and vendors involved, which means developing relationships, not better bludgeoning weapons. And we have already done an extremely similar divestiture in our past, so we have a model that says it can be done successfully.

Now we need to talk about what an organization would need to look like in order to properly manage the outsourcing of your HPC cluster. We should assume that all technical and operational capabilities necessary to run the infrastructure are included in the outsource: the supplier is expected to provide the entire technical function and carry out all operational duties. That is not to say the customer is off the hook technically; just the opposite. The customer needs to assemble a small team of technically savvy, business-minded (specific to the core product of your company) individuals to measure and manage the outsource. This team needs to be very strong technically, both to vet and gauge available technologies for potential use and to identify flaws in delivered solutions or in the methodology behind them. The size of the team will depend on the size of the company (and therefore the size of the outsource).

Functionally, the outsource management team is the control point for the outsourcing of your infrastructure. Through this group, you maintain control over your infrastructure and can therefore have full trust in your outsource partner (because you know exactly what you want, and you know how to measure whether you are getting it). The intent of this team is to stay abreast of the constantly changing needs of the business, understand the continuously evolving capabilities of technology, and combine the two to understand how the company should be leveraging technology to maximize business benefit and control costs. With that combined awareness, you can hold the outsource accountable for delivering an appropriate solution to your company's needs.

This is not to say that all responsibility falls to the customer outsource team. The supplier will need to have a disciplined focus on the specific space in which your company does business, and be innovating their solution to specifically solve the problems of that industry. If they do not, they will probably not be a cost-competitive, viable supplier in the long term.

Many functions fall under the customer outsource team. And remember, this team needs to remain small in order to avoid paying too much for your solution. The outsource team will run a constant improvement loop (sketched in code after the list):

1. Quantitatively measure the current solution

2. Analyze the costs and benefits of the current solution

3. Assess best practices

4. Revise the current solution

5. Loop back to 1
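To make the loop concrete, here is a minimal sketch in Python of how the cycle might be structured. Every metric, threshold, and dollar figure is an invented illustration, not a prescription from any real framework or contract.

```python
# Hypothetical sketch of the outsource team's review loop.
# All names, metrics, and numbers are illustrative assumptions.

def measure(solution):
    """Step 1: quantitatively measure the current solution."""
    return {"cost_per_core_hour": solution["monthly_cost"] / solution["core_hours"],
            "availability": solution["availability"]}

def analyze(metrics, targets):
    """Step 2: compare measured values against agreed targets."""
    return {k: metrics[k] - targets[k] for k in targets}

def assess_and_revise(solution, gaps):
    """Steps 3 and 4: apply best practices where targets are missed."""
    if gaps["cost_per_core_hour"] > 0:
        # e.g., renegotiate pricing or right-size the cluster
        solution["monthly_cost"] *= 0.95
    return solution

solution = {"monthly_cost": 120_000.0, "core_hours": 1_500_000, "availability": 0.9992}
targets = {"cost_per_core_hour": 0.075, "availability": 0.999}

for quarter in range(4):  # Step 5: loop back to step 1
    gaps = analyze(measure(solution), targets)
    solution = assess_and_revise(solution, gaps)
    print(f"Q{quarter + 1} cost/core-hour: {measure(solution)['cost_per_core_hour']:.4f}")
```

The point of the sketch is the shape of the process, not the numbers: measurement feeds analysis, analysis drives revision, and the cycle never terminates.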

There are several technical responsibilities in which the outsource team will participate jointly. The supplier should be doing most of this work for the customer, but how do you know whether the data they present is accurate or appropriate for your solution? When in doubt, the outsource team will generate its own data and share it with the supplier to derive a more accurate solution. To that end, the outsource team will do some amount of, but not every facet of, the following (a sample benchmarking sketch follows the list):

1. Technical and cost benchmarking

2. Technical advisory / liaison (IT industry to customer business)

3. Technical architecture – designing applications and services that are appropriate for the company's consumption
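As one example of that independent verification, here is a minimal sketch of a write-throughput microbenchmark the team might run to sanity-check a supplier's storage numbers. The mount point, file size, and block size are assumptions for illustration; a production benchmark would use a dedicated tool such as fio and many more samples.

```python
# Minimal storage write-throughput check; path and sizes are
# illustrative assumptions, not a supplier's actual mount point.
import os
import time

def write_throughput_mbps(path, size_mb=256, block_kb=1024):
    block = os.urandom(block_kb * 1024)          # one block of random data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb * 1024 // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                     # force data to stable storage
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

if __name__ == "__main__":
    mbps = write_throughput_mbps("/tmp/bench.dat")   # hypothetical target path
    print(f"measured write throughput: {mbps:.1f} MB/s")
```

Data like this, gathered by the customer and shared with the supplier, turns "are we getting what we pay for?" from an argument into a measurement.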

Many responsibilities of the outsource team fall into the relationship-management arena. This team will be the primary point of contact and control between the customer and the supplier, and I can't say enough how important a positive relationship with the supplier is to the quality of the product you consume and the price you pay for it. The outsource team will be responsible for communicating current and future requirements to the supplier, many of which will take the form of service level agreements (SLAs), which we will talk about in a moment. The outsource team will also be responsible for how technology is consumed by the customer company, making sure the company is getting the appropriate solution from the supplier at an appropriate price, with appropriate constraints, limitations, and boundaries.

Another very important responsibility of the outsource team is to maintain flexibility, from both a quality-of-solution and a cost perspective. Here, staying standards-based is very important. It is not an absolute requirement; there may be proprietary solutions that solve a problem much more efficiently or cost-effectively. What you need to consider in that case is this: when the vendor thinks they have you locked in, and starts raising prices because they believe you can't get out of their solution, what is your plan for countering them? So, where possible, use industry standards so that you can move from vendor to vendor without losing time, money, or critical features. Where that is not possible, have a plan for using one vendor's proprietary solution while remaining able to migrate to another vendor's solution without impact, so that you maintain your negotiating position.

Finally, there is the new component of infrastructure management: the outsource team will need to learn how to define and measure service level agreements (SLAs). The definition stage has several components. What is the service level expectation (that is, what defines the success and failure criteria)? A single solution will often have many different components. Storage is an example: is there enough capacity, do we get enough IOPS, and is there enough throughput? These are all different measurements, but each is critical to a storage infrastructure for HPC.

How will the service level be measured, and how often? We have all seen improper SLA measurements, where IT informs the engineer that the environment has 99.997% availability while the engineer knows there were several outages that left him or her non-productive for days at a time. So do you measure availability at the component level or the solution level? How frequently do you poll for availability? Is availability even the right measure? This is all part of the definition; a sketch of the gap between component-level and solution-level availability follows below.

Then, what happens when a failure criterion is met? This is where a lot of work is happening in the industry. It is not sufficient to refund a month's colo fees when a power outage cost the company six weeks' worth of work. There is a cost to failure, and it is usually very specific to the industry: an outage on a cluster at an EDA company has different implications than an outage on a scientific computing cluster at a university. The recourse needs to be negotiated based on impact. Does this at all sound familiar? Any insurance people reading this? One of the solutions the industry is exploring is having insurance policies stand behind the supplier. Finally, we need to look at how service levels are re-assessed over time: as the technology evolves, so should the service levels.
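To illustrate how "99.997% availability" can coexist with days of lost engineering time, here is a minimal sketch with invented numbers. Each component looks excellent in isolation, but if their outages don't overlap, the solution the engineer actually depends on is down whenever any one of them is down.

```python
# Illustrative sketch (invented numbers) of component-level versus
# solution-level availability as seen over a polling window.

polls = 100_000  # e.g., one poll per minute for roughly 70 days

# Hypothetical failed polls per component:
downtime = {"compute": 3, "storage": 5, "network": 4, "license server": 288}

for name, down in downtime.items():
    print(f"{name:14s} availability: {100 * (polls - down) / polls:.3f}%")

# If the outages never overlap, the solution is down whenever any
# single component is down, so the engineer sees the sum of them all:
solution_down = sum(downtime.values())
print(f"{'solution':14s} availability: {100 * (polls - solution_down) / polls:.3f}%")

# And a single two-day outage at one-minute polls is 2,880 failed
# polls all by itself:
print(f"two-day outage alone: {100 * (polls - 2880) / polls:.3f}%")
```

Which of these numbers lands in the SLA report is exactly the kind of definitional question the outsource team has to settle up front.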

The fabless semiconductor industry is fairly mature in its process for outsourcing the fabrication function. It has cost models and laws (Rock's Law for the cost of a fab over time) that aid decision processes, a collaborative (the FSA) for arriving at better processes, and an established track record showing this can be accomplished very successfully and with cost benefit. The HPC cloud industry needs to mature in the same way. That will just take time.
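For reference, Rock's Law observes that the capital cost of a semiconductor fab roughly doubles every four years, i.e. cost(t) = cost_0 * 2^(t/4). A quick sketch of that growth, with an assumed round-number starting cost of $3B purely for illustration:

```python
# Rock's Law: fab cost roughly doubles every four years.
# The $3B starting figure is an assumed value for illustration.
cost_0_billion = 3.0
for years in range(0, 21, 4):
    cost = cost_0_billion * 2 ** (years / 4)
    print(f"year {years:2d}: ~${cost:.1f}B")
```

It is precisely this kind of simple, shared cost model that the HPC cloud industry still lacks.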

 
