HPC and the Colocation Datacenter – a Bridge Too Far?

By Clive Longbottom, Quocirca

April 7, 2017

In this guest commentary, industry analyst Clive Longbottom offers a European perspective on the current capability of colocation datacenters to meet the growing requirements of HPC users.

A more standardised HPC platform approach is bringing HPC projects within the financial reach of more organizations. But this still leaves a dilemma: how can organizations cost-justify building dedicated datacenter facilities for platforms that may become surplus to requirements in just a year or two?

The obvious alternative is to turn to a colocation datacenter provider. However, in the UK and in many areas of Europe, the reality may not be quite so simple, as fit-for-purpose colocation facilities tend to be few and far between. HPC users will struggle to find colocation providers capable of meeting their specific and increasingly IoT-driven big data processing and analytical demands, especially when it comes to powering and cooling highly dense and complex platforms.

Is the only answer therefore to either accept the risk of going colo, or continue building expensive in-house datacenters?

Perhaps for some, particularly in the not-for-profit science research sector, a best-of-both-worlds alternative is already available in which HPC resources are shared. Certainly in the UK and some other European countries, such government-backed solutions are on offer. For example, only last month it was announced that six UK universities are each to host HPC centers with $25 million of funding from the Engineering and Physical Sciences Research Council, a UK government body. The aim is to bridge the gap between the computing capabilities currently available to researchers in many UK universities and the state-of-the-art HPC resources accessible via the UK National Supercomputing Service (ARCHER).

But for many commercial organizations, whether in financial services, manufacturing, retail, oil & gas, pharmaceuticals or elsewhere, the hard choice remains whether to self-build or buy space in colo datacenters. For those where self-build has been ruled out for reasons of sheer capital expense and where HPC project timescales are just too short to warrant a dedicated facility, colocation seems inevitable. This then presents a further dilemma. There are many colos to choose from, but the majority have insufficient power and cooling for HPC densities and inadequate backup and auxiliary power services to meet continuity requirements.

Faced with such constraints, some HPC users turn to the general public cloud as a scale-out option. However, public cloud is generally unsuitable for true HPC workloads despite cloud computing’s premise of elasticity for providing additional at-will compute resource for specific workloads.

Cloud may be fine for standard workloads where the amount of CPU, storage or network resource necessary for a specific workload is generally quite definable. However, HPC is considerably more complex: it needs different CPU and GPU server capabilities, highly engineered interconnects between all the various systems and resources, and storage latencies maintained in the low milliseconds, microseconds or even nanoseconds. All of this requires highly specialised workload orchestration that is not available on general public cloud platforms.

Attempting to create a true HPC environment on top of a general public cloud is therefore untenable. So yet again, organizations tend to find themselves back at square one, deciding on or reverting to a self-build solution, or making the best of what colo has to offer. A real Catch-22 situation!

Key colocation considerations for HPC users:

If the consensus is to take the colocation option, the following decision criteria may serve as a useful guide:

Power

Hyper-dense HPC equipment needs high power densities, far more than the average colocation facility in Europe currently provides. The power draw of a ‘standard’ platform rarely exceeds 8kW per rack, and the average for colocation facilities is closer to 5kW. A dense HPC platform will typically draw around 12kW per rack and in some cases 30kW or more. Can the colocation facility provide that extra power now, not just promise it for the future? Will it charge a premium price for routing more power to your system? Furthermore, do the multi-cabled power aggregation systems required include sufficient power redundancy?

Careful consideration must therefore be given to future-proofing when it comes to power availability to avoid the potential for unplanned downtime, or the disruption and cost involved in the event of migration/de-installation should the facility become power-strapped. Clearly, PUE and carbon emissions credentials will also need evaluation from a cost, carbon tax and CSR perspective.
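To make the density gap concrete, the short Python sketch below estimates how many racks a platform would have to be spread across at different per-rack power limits. The per-rack figures (5kW, 8kW, 12kW and 30kW) come from the text above; the platform size of ten 30kW racks and the function name racks_required are purely illustrative assumptions, and real deployments are also constrained by interconnect cable lengths and cooling.

```python
import math


def racks_required(total_load_kw: float, facility_kw_per_rack: float) -> int:
    """Racks needed to host a load when each rack is capped at a power limit."""
    if total_load_kw < 0 or facility_kw_per_rack <= 0:
        raise ValueError("load must be >= 0 and the per-rack limit must be > 0")
    return math.ceil(total_load_kw / facility_kw_per_rack)


# Hypothetical platform: ten racks drawing 30 kW each (an assumption, not a
# figure from the article).
platform_kw = 10 * 30.0

for limit_kw in (5.0, 8.0, 12.0, 30.0):
    print(f"{limit_kw:>4.0f} kW per rack -> {racks_required(platform_kw, limit_kw)} racks")
```

Under these assumptions, a platform designed for ten racks would need several times that footprint in a facility limited to 5kW per rack, which is rarely practical for tightly coupled HPC systems.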

Back Up

There will always be some form of immediate failover power supply in place, which is then replaced by auxiliary power from diesel generators. However, such immediate power provision is expensive, particularly when there is a continuous high draw, as is required by HPC. UPS and auxiliary power systems must be capable of supporting all workloads running in the facility at the same time, along with overhead and enough redundancy to deal with any failure within the emergency power supply system itself. Colocation facilities looking to move up from general-purpose applications and services to supporting true HPC environments do not necessarily accommodate this.
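As a rough illustration of the sizing logic described above, the following Python sketch counts the UPS modules needed to carry the full load plus overhead, with spare modules for redundancy (often called N+1 when one spare is used). All of the numbers, and the function name ups_modules_needed, are hypothetical; a real design would follow the provider's own engineering figures.

```python
import math


def ups_modules_needed(it_load_kw: float, overhead_fraction: float,
                       module_kw: float, spare_modules: int = 1) -> int:
    """Modules required to carry IT load plus overhead, plus redundant spares."""
    if module_kw <= 0:
        raise ValueError("module capacity must be > 0")
    # Total demand = IT load plus facility overhead (cooling, lighting, losses).
    total_kw = it_load_kw * (1 + overhead_fraction)
    # N modules carry the load; the spares cover failures within the UPS itself.
    return math.ceil(total_kw / module_kw) + spare_modules


# Illustrative assumptions: 300 kW of IT load, 40% overhead, 100 kW modules.
print(ups_modules_needed(300.0, 0.4, 100.0))                    # one spare
print(ups_modules_needed(300.0, 0.4, 100.0, spare_modules=2))   # two spares
```

The same calculation applies to generator capacity; the point is that the redundancy is counted on top of the simultaneous full load, not instead of it.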

Cooling

With HPC requiring highly targeted cooling, simple computer room air conditioning (CRAC) or free air cooling systems (such as swamp or adiabatic coolers) may not have the capabilities required.

Even where a modern HPC system uses in-row cooling, and so relies less on in-facility cooling, removing the heat generated in an effective manner may be a problem. Hot- and cold-aisle cooling systems are increasingly inadequate for addressing the heat created by larger HPC environments, which will require specialized and often custom-built cooling systems and procedures.

This places increased emphasis on ensuring there are on-site engineering personnel on hand with demonstrable knowledge of designing and building bespoke cooling systems, such as direct liquid cooling, for highly efficient heat removal and for avoiding on-board hot spots. This will reduce the problems of high temperatures without excessive air circulation, which is both expensive and noisy.

Fiber Connectivity/Latency

Consider the availability of diverse, high-speed, on-site fiber cross connects. Basic public connectivity solutions will generally not be sufficient for HPC systems, so look for providers that have specialized connectivity solutions.

The HPC platform may be working well; all access devices may be working; the public internet may be working. However, what if the link between the organization or the public internet and the colocation facility goes down and there is no capability for failover? As many problems with connectivity come down to physical damage, such as that caused by cables being broken during roadworks, ensuring that connectivity is through multiple diverse connections from the facility is crucial.

Other areas where a colocation provider should be able to demonstrate capabilities include specialized connections to public clouds, such as Microsoft Azure ExpressRoute and AWS Direct Connect. These bypass the public internet to enable more consistent and secure interactions between the HPC platform and other workloads the organization may be operating.

Location

Last but not least, the physical location of the datacenter will impact directly on rack space costs and power availability. In the case of colocation, there are often considerable differences in rack space rents between regional facilities and those based in or around large metro areas such as London. Perhaps of more concern to HPC users, the availability and reliability of power supply will likely vary from region to region. The majority of facilities are not directly connected to the grid and sit several pylon hops from substations. Some facilities in power-strapped areas are already pushed to supply 4kW per rack.

Fortunately, the ever decreasing cost of high speed fiber is providing more freedom to build modern colo facilities much further away from metro areas but without incurring the latency issues of old. Examples here include locations such as the NGD mega data facility in South Wales, where renewable power is in abundant supply (180 MW) and is directly connected to the national grid; and of course some of the emerging facilities in the Nordic region where hydroelectric power is plentiful and low cost.
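For a sense of scale on the latency point, the Python sketch below estimates the propagation delay that distance alone adds over fiber. It assumes light travels through the fiber at the vacuum speed of light divided by a refractive index of about 1.468 (a typical value for single-mode fiber, not a figure from this article), and it ignores switching, routing and any non-direct cable routes. The 250km distance is a hypothetical example.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum


def round_trip_ms(distance_km: float, refractive_index: float = 1.468) -> float:
    """Round-trip propagation delay over a fiber path, in milliseconds."""
    if distance_km < 0 or refractive_index < 1:
        raise ValueError("distance must be >= 0 and refractive index >= 1")
    speed_in_fiber = SPEED_OF_LIGHT_KM_PER_S / refractive_index  # km/s
    one_way_s = distance_km / speed_in_fiber
    return 2 * one_way_s * 1000.0


# Hypothetical 250 km fiber route between a metro site and a regional facility.
print(f"{round_trip_ms(250.0):.2f} ms round trip")
```

On these assumptions, propagation adds roughly 1ms of round-trip delay per 100km of fiber, which is why distance to the facility matters far less for user access than the quality of the interconnects inside it.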

In summary, look closely enough and commercial HPC users will find a few fit-for-purpose colocation choices already available in the UK and Europe. Provided, that is, they carefully evaluate the ability of their would-be partners to guarantee the power and backup contingencies required for the duration of the project, with high levels of redundancy on tap should needs suddenly change and to mitigate the risk of unplanned downtime. Ensuring the engineering team is capable of understanding and delivering bespoke rack configurations and specialized cooling environments is also a major prerequisite.

About the Author

Clive Longbottom is the founder and research director of Quocirca, the UK-based pan-European market analyst firm. Clive covers areas as diverse as storage, servers, operating systems, IT platforms, datacenters, systems management, on-line services, big data and analytics.

Trained as a chemical engineer, Clive understands that everything within a business is predicated on process, and that the only point of technology is in making sure that the processes run efficiently and smoothly. As a research engineer for Johnson Matthey he worked on several projects, including anti-cancer drugs, efficient NOx/SOx burners and a long period working on primary energy generation via fuel cells.
