Cloud Use Cases Group Takes SLAs to Task

By Doug Tidwell

September 17, 2010

Service Level Agreements (SLAs) are crucial to any nontrivial use of the cloud. Once an organization understands its requirements, it needs a guarantee from its cloud provider that those requirements will be met. Cloud consumers trust cloud providers to deliver some or all of their infrastructure services, so it is vital to define what those services are, how they are delivered and how they are used.

In addition to the prose that defines the relationship between the consumer and provider, an SLA contains Service Level Objectives (SLOs) that define objectively measurable conditions for the service. The consumer must weigh the terms of the SLA and its SLOs against the goals of their business to select a cloud provider.

Simply put, an SLA is the foundation of the consumer’s trust in the provider. A well-written SLA codifies the provider’s reputation.

A Community Effort Targeted at SLAs

Since mid-2009, the open, free Cloud Computing Use Cases Group has led a conversation about cloud computing. The community has collaborated on a paper that documents the ways consumers are using the cloud today and how they hope to use it tomorrow. 

The success of cloud computing will rest on the cloud being open and having effective controls in place to mitigate risk. The open discussion group, made up of individuals from business and academia from around the world, worked together to produce a white paper on Cloud Computing Use Cases in support of open cloud computing.

The first three editions of the paper covered the basic requirements for cloud computing, the needs of developers and the security controls cloud vendors must have in place. The following are some of the group’s findings.

Characteristics of SLAs

An SLA contains several things:

♦ A set of services the provider will deliver

♦ A complete, specific definition of each service

♦ The responsibilities of the provider and the consumer

♦ A set of metrics to determine whether the provider is delivering the service as promised

♦ An auditing mechanism to monitor the service

♦ The remedies available to the consumer and provider if the terms of the SLA are not met

♦ How the SLA will change over time
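
These elements map naturally onto a structured record. The sketch below is purely illustrative; the field names and types are assumptions made for this article, not part of any standard SLA schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ServiceLevelObjective:
    """One objectively measurable condition, e.g. 'a request completes in under 3 seconds'."""
    metric: str               # what is measured (response time, availability, ...)
    target: str               # the measurable threshold, stated unambiguously
    measurement_window: str   # the period over which the metric is evaluated

@dataclass
class ServiceLevelAgreement:
    services: List[str]                      # the set of services the provider will deliver
    service_definitions: Dict[str, str]      # a complete, specific definition of each service
    provider_responsibilities: List[str]
    consumer_responsibilities: List[str]
    objectives: List[ServiceLevelObjective]  # metrics for judging whether the service is delivered as promised
    auditing_mechanism: str                  # how the service is monitored
    remedies: List[str]                      # remedies available if the terms are not met
    change_process: str                      # how the SLA will change over time
```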

The marketplace features two types of SLAs: off-the-shelf agreements and agreements negotiated between a provider and a consumer to meet that consumer’s specific needs. Off-the-shelf agreements are usually far less expensive, but it is unlikely that any consumer with critical data and applications will be able to use them. Therefore, the consumer’s first step in approaching an SLA (and the cloud in general) is to determine how critical their data and applications are.

Business Level Objectives

Debating the terms of an SLA is meaningless if the organization has not defined its business level objectives. A consumer must select providers and services based on the goals of the organization. Defining exactly what services will be used is worthless unless the organization has defined why it will use the services in the first place.

Consumers should know why they are using cloud computing before they decide how to deploy and use the cloud.

Service Level Objectives

An SLO defines a characteristic of a service in precise, measurable terms. Here are some sample SLOs:

♦ The system should never have more than 10 pending requests.

♦ Throughput for a request should be less than 3 seconds.

♦ Data streaming for a read request should begin within 2 seconds.

♦ At least five instances of a VM should be available 99.99999% of the time, with at least one instance available in each of a provider’s three data centers.

Precise, measurable terms make it possible to verify that the SLA is being met.
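
Because each SLO is stated in measurable terms, it can be checked mechanically. The following sketch is hypothetical; the thresholds come from the sample SLOs above, but the observation names and the idea of pulling them from a monitoring system are assumptions:

```python
# Hypothetical measurements, e.g. pulled from a monitoring system.
observations = {
    "pending_requests": 7,            # current queue depth
    "request_response_seconds": 2.4,  # time to complete a request
    "stream_start_seconds": 1.1,      # time until data streaming begins
    "available_vm_instances": 5,      # VM instances currently reachable
}

# Each SLO is a name plus a predicate over the observations.
slos = {
    "no more than 10 pending requests": lambda o: o["pending_requests"] <= 10,
    "request completes in under 3 s":   lambda o: o["request_response_seconds"] < 3.0,
    "streaming starts within 2 s":      lambda o: o["stream_start_seconds"] <= 2.0,
    "at least 5 VM instances available": lambda o: o["available_vm_instances"] >= 5,
}

for name, check in slos.items():
    status = "met" if check(observations) else "VIOLATED"
    print(f"{name}: {status}")
```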

Service Level Management

It is impossible to know whether the terms of the SLA are being met without monitoring and measuring the performance of the service. Service Level Management is how that performance information is gathered and handled. Measurements of the service are based on the Service Level Objectives in the SLA.

A cloud provider uses Service Level Management to make decisions about its infrastructure. For example, a provider might notice that throughput for a particular service is barely meeting the consumer’s requirements. The provider could respond to that situation by reallocating bandwidth or bringing more physical hardware online. The goal of Service Level Management is to help a provider make intelligent decisions based on its business objectives and technical realities.

A cloud consumer uses Service Level Management to make decisions about how it uses cloud services. For example, a consumer might notice that the CPU load on a particular VM is above 90%. In response, the consumer might start another VM. However, if the consumer’s SLA says that the first 100 VMs are priced at one rate, and more VMs are priced at a higher rate, the consumer might decide not to create another VM and incur higher charges. As with the provider, Service Level Management helps consumers make (and possibly automate) decisions about the way they use cloud services.
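
As a concrete illustration of the consumer-side decision just described, a scale-up rule might be automated along these lines. The 90% threshold and 100-VM pricing tier come from the example above; the function name and structure are hypothetical, not any provider’s API:

```python
def should_start_vm(cpu_load_percent: float,
                    current_vm_count: int,
                    tier_limit: int = 100) -> bool:
    """Decide whether to start another VM.

    Scale up when CPU load is above 90%, but only if the new VM would
    still fall inside the lower-priced tier (the first `tier_limit` VMs
    in this hypothetical pricing model).
    """
    if cpu_load_percent <= 90.0:
        return False   # no pressure, no new VM needed
    if current_vm_count + 1 > tier_limit:
        return False   # would cross into the higher-priced tier
    return True

# Heavy load but already at the tier boundary -> do not scale up.
print(should_start_vm(cpu_load_percent=94.0, current_vm_count=100))  # False
# Heavy load well inside the tier -> start another VM.
print(should_start_vm(cpu_load_percent=94.0, current_vm_count=42))   # True
```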

SLA Requirements

Metrics

Monitoring and auditing require something tangible that can be monitored as it happens and audited after the fact. The metrics of an SLA must be objectively and unambiguously defined. Cloud consumers will have an endless variety of metrics depending on the nature of their applications and data. Although listing all metrics is impossible, some of the most common are:

♦ Throughput – How quickly the service responds

♦ Reliability – How often the service is available

♦ Elasticity – The ability for a given resource to expand and contract automatically, with limits clearly stated (the maximum amount of storage or bandwidth, for example)

♦ Load balancing – When elasticity kicks in (new VMs are booted or terminated, for example)

♦ Durability – How likely the data is to be lost

♦ Linearity – How a system performs as the load increases

♦ Agility – How quickly the provider responds as the consumer’s resource load scales up and down

♦ Automation – How often requests to the provider are handled without any human interaction

♦ Customer service response times – How quickly the provider responds to a service request. This refers to the human interactions required when something goes wrong with the on-demand, self-service aspects of the cloud.
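
One way to keep such metrics objectively and unambiguously defined is to record, for each one, the unit and the measurement method both parties have agreed to. The structure below is a hypothetical sketch, not a standard schema; the measurement methods shown are assumptions:

```python
# Each metric records what is measured, in what unit, and how it is observed,
# so that provider and consumer can monitor and audit it the same way.
metrics = [
    {"name": "throughput",  "unit": "seconds per request",          "measured_by": "provider request log"},
    {"name": "reliability", "unit": "percent uptime",               "measured_by": "third-party availability probe"},
    {"name": "elasticity",  "unit": "GB of storage (stated maximum)", "measured_by": "resource allocation records"},
    {"name": "durability",  "unit": "probability of data loss per year", "measured_by": "storage audit"},
    {"name": "agility",     "unit": "seconds to provision",         "measured_by": "provisioning timestamps"},
]

for m in metrics:
    print(f'{m["name"]}: measured in {m["unit"]} via {m["measured_by"]}')
```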

Security

An SLA should be written with security in mind. A cloud consumer must understand their security requirements and what controls and federation patterns are necessary to meet those requirements. In turn, a cloud provider must understand what they must deliver to the consumer to enable the appropriate controls and federation patterns. (Security as a general requirement is discussed in significant detail in the group’s paper.)

Monitoring

If a failure to meet the terms of an SLA has financial or legal consequences, the question of who should monitor the performance of the provider (and whether the consumer meets its responsibilities as well) becomes crucial. It is in the provider’s interest to define uptime in the broadest possible terms, while consumers could be tempted to blame the provider for any system problems that occur. The best solution to this problem is a neutral third-party organization that monitors the performance of the provider. This eliminates the conflicts of interest that might occur if providers report outages at their sole discretion or if consumers are responsible for proving that an outage occurred.

A Note About Reliability

In discussions of reliability, a common metric bandied about is the number of “nines” a provider delivers. As an example, five nines reliability means the service is available 99.999% of the time, which translates to total system outages of roughly 5 minutes out of every 12 months. One problem with this metric is that it quickly loses meaning without a clear definition of what an outage is. (It loses even more meaning if the cloud provider gets to decide whether an incident constitutes an outage.)

Beyond the nebulous nature of nines, it is important to consider that many cloud offerings are built on top of other cloud offerings. The ability to combine multiple infrastructures provides a great deal of flexibility and power, but each additional provider makes the system less reliable. If a cloud provider offers a service built on a second cloud provider’s storage service and a third cloud provider’s database service, and all of those providers deliver five nines reliability, the reliability of the entire system is less than five nines. The service is unavailable when the first cloud provider’s systems go down; the service is equally unavailable when the second or third providers’ systems have problems. The more cloud providers involved, the more downtime the overall system will have.
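
A rough back-of-the-envelope calculation shows the effect. If the composed service is unavailable whenever any one of three providers is down, and each provider independently delivers five nines, the combined availability is the product of the individual availabilities (the independence assumption is itself a simplification):

```python
# Availability of a service that depends on three providers, each at five nines,
# assuming failures are independent and any single outage takes the service down.
single_provider = 0.99999            # five nines
combined = single_provider ** 3      # all three must be up at the same time

minutes_per_year = 365 * 24 * 60
downtime_single = (1 - single_provider) * minutes_per_year
downtime_combined = (1 - combined) * minutes_per_year

print(f"Combined availability:      {combined:.7f}")                   # ~0.9999700, below five nines
print(f"Downtime, one provider:     {downtime_single:.1f} min/year")   # ~5.3
print(f"Downtime, three providers:  {downtime_combined:.1f} min/year") # ~15.8
```

In other words, stacking three five-nines providers roughly triples the expected downtime, which is why the overall system no longer meets five nines.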

To sum up, any consumer who needs to evaluate the reliability of a cloud service should know as much as possible about the cloud providers that deliver that service, whether directly or indirectly.

Summary

As organizations use cloud services, the responsibilities of both the consumer and the provider must be clearly defined in a Service Level Agreement. An SLA defines how the consumer will use the services and how the provider will deliver them. It is crucial that the consumer of cloud services fully understands all the terms of the provider’s SLA, and that the consumer consider the needs of their organization before signing any agreement.

——-

The Cloud Computing Use Cases group is currently working on a new paper entitled “Moving to the Cloud.” Readers are encouraged to join the group and share their experiences, concerns and advice. Contributions that represent the breadth of the cloud computing community will make the paper better. Please visit and join http://groups.google.com/group/cloud-computing-use-cases to get involved.
 
