Cloud Control: Outsourcing an HPC Cluster

By Scott Clark

August 5, 2010

So, thus far in this series of posts, we have discussed the following issues:

– IT is not a core competency of the business, so we should look to outsource if we can outsource without jeopardizing the business.

– We should look to cloud computing to bring costs under control and to deliver cost efficiencies over time, not as an immediate cost reduction activity.

– In order to outsource IT, we must trust the suppliers and vendors involved, which means developing relationships, not better bludgeoning weapons. And we have already done an extremely similar divestiture in our past, so we have a model to look at that says it can be done successfully.

Now we need to talk about what an organization needs to look like in order to properly manage the outsourcing of its HPC cluster. We should assume that all technical and operational capabilities necessary to run the infrastructure are included in the outsource. The supplier is expected to provide the entirety of the technical function and carry out all operational duties. That is not to say that the customer is off the hook technically; just the opposite. The customer needs to assemble a small team of technically savvy, business-minded individuals (business-minded with respect to your company’s core product) to measure and manage the outsource. This team needs to be very strong technically in order to vet and gauge available technologies for potential use, and to identify flaws in the solutions delivered or in the methodology behind them. The size of the team depends on the size of the company (and therefore the size of the outsource).

Functionally, the outsource management team is the control point for the outsourcing of your infrastructure. Through this group, you maintain control over your infrastructure, and therefore can have full trust in your outsource partner (because you know exactly what you want, and you know how to measure whether you are getting it). The intent of this team is to stay abreast of the constantly changing needs of the business, understand the continuously evolving capabilities of technology, and combine the two to understand how the company should be leveraging technology to maximize benefit to the business and control costs. With that combined awareness, you can hold the outsource accountable for delivering an appropriate solution to your company’s needs.

This is not to say that all responsibility falls to the customer outsource team. The supplier will need to have a disciplined focus on the specific space in which your company does business, and to be innovating its solution specifically to solve the problems of that industry. If it does not, it will probably not be a cost-competitive, viable supplier in the long term.

You will see many functions that fall under the customer outsource team. And remember, this team needs to remain small in order to avoid paying too much for your solution. There will be a constant loop for the outsource team to:

1. Quantitatively measure the current solution

2. Analyze the costs and benefits of the current solution

3. Assess best practices

4. Revise the current solution

5. Loop back to 1

There will be several technical responsibilities that the outsource team shares with the supplier. The supplier should be doing most of this work for the customer, but how do you know whether the data they present is 100% accurate or appropriate for your solution? When in doubt, the outsource team will generate its own data and share that data with the supplier to derive a more accurate solution. To that end, the outsource team will do some amount of, but not every facet of, the following:

1. Technical and cost benchmarking

2. Technical advisory / liaison (IT industry to customer business)

3. Technical architecture – designing applications and services that are appropriate for the company’s consumption

There are many responsibilities of the outsource team that will fall into the relationship management arena. This team will be the primary point of contact and control between the customer and the supplier, and I can’t say enough how important a positive relationship with the supplier is to the quality of the product you consume and the price you pay for that product. The outsource team will be responsible for communicating current and future requirements to the supplier, and many of those will take the form of Service Level Agreements (SLAs), which we will talk about in a moment. The outsource team will also be responsible for how technology is being consumed by the customer company. The outsource team needs to make sure the company is getting the appropriate solution from the supplier at an appropriate price with appropriate constraints / limitations / boundaries.

Another very important responsibility of the outsource team will be to maintain flexibility from a quality-of-solution as well as a cost perspective. Here, staying standards-based is very important. It is not an absolute requirement; there may be proprietary solutions that solve a problem much more efficiently or cost-effectively. What you need to consider in that case is this: when the vendor thinks they have you locked in and starts raising the price because they think you can’t get out of their solution, what is your plan for defeating them? So, where possible, use industry standards so that you can move from vendor to vendor without losing time, money, or critical features. Where that is not possible, what is your plan for using one vendor’s proprietary solution while remaining able to migrate to another vendor’s solution without impact, so that you maintain your negotiating position?

Finally, there is the new component of infrastructure management. The outsource team will need to learn how to define and measure service level agreements (SLAs). The definition stage will have several components.

What is the service level expectation (which defines success and failure criteria)? This will sometimes have many different components for a single solution. An example would be storage: is there enough capacity, do we get enough IOPS, and is there enough throughput? All of these are different measurements, but all are critical to a storage infrastructure for HPC.

How will this service level be measured, and how often? We have all seen improper SLA measurements where IT informs the engineer that the environment has 99.997% availability, but the engineer knows that there were several outages that left him or her non-productive for days at a time. So do you measure component-level availability or solution-level availability? How frequently do you poll for availability? Is availability the right measure? This is all part of the definition.

And then, what happens when a failure criterion is met? This is where a lot of work is happening in the industry. It is not sufficient to refund the month’s colo fees when a power outage cost the company six weeks’ worth of work. There is a cost to failure, and it is usually very specific to the industry. An outage on a cluster for an EDA company has different implications than an outage on a scientific computing cluster for a university. The recourse needs to be negotiated based on impact. Does this sound at all familiar? Any insurance people reading this? Well, one of the solutions the industry is exploring is having insurance policies behind the supplier.

Finally, we need to look at how service levels are re-assessed over time. As the technology evolves, so should the service levels.
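To make the component-level versus solution-level question concrete, here is a minimal sketch in Python. It is an illustration, not a measurement method from any particular supplier. The names (`Outage`, `component_availability`, `solution_availability`), the 30-day window, the 50-component cluster, and the outage times are all hypothetical. It also assumes the engineer is unproductive whenever any required component is down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    component: str
    start_hour: float  # hours since the start of the measurement window
    end_hour: float


def downtime_hours(intervals):
    """Total length of the union of (start, end) intervals, so overlaps count once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def component_availability(outages, components, window_hours):
    """Mean availability across components, each measured on its own."""
    per_component = []
    for name in components:
        spans = [(o.start_hour, o.end_hour) for o in outages if o.component == name]
        per_component.append(1.0 - downtime_hours(spans) / window_hours)
    return sum(per_component) / len(per_component)


def solution_availability(outages, window_hours):
    """Availability as the engineer sees it: down whenever any component is down."""
    spans = [(o.start_hour, o.end_hour) for o in outages]
    return 1.0 - downtime_hours(spans) / window_hours


# Hypothetical 30-day window on a cluster of 48 nodes plus storage and a scheduler.
WINDOW_HOURS = 720.0
components = ["storage", "scheduler"] + [f"node{i}" for i in range(48)]
outages = [
    Outage("storage", 10.0, 34.0),    # storage down for 24 hours
    Outage("scheduler", 30.0, 54.0),  # scheduler down for 24 hours, overlapping
]

print(f"component-level: {component_availability(outages, components, WINDOW_HOURS):.2%}")
print(f"solution-level:  {solution_availability(outages, WINDOW_HOURS):.2%}")
```

With these made-up numbers, the averaged component figure comes out near 99.87% while the solution-level figure is about 93.89%, because the two outages together kept the whole environment unusable for 44 hours. Which of the two numbers the SLA is written against is exactly the kind of decision that belongs in the definition stage.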

The fabless semiconductor industry is fairly mature in its process for outsourcing the fabrication function. They have cost models and laws (Rock’s Law for cost of a fab over time) that help decision processes, they have a collaborative (FSA) for arriving at better processes, and they have an established track record showing that this can be accomplished very successfully and with cost benefit. The HPC Cloud industry needs to mature. That will just take time.

 
