Service Level Agreements (SLAs) are crucial to any nontrivial use of the cloud. Once an organization understands its requirements, it needs a guarantee from its cloud provider that those requirements will be met. Cloud consumers trust cloud providers to deliver some or all of their infrastructure services, so it is vital to define what those services are, how they are delivered and how they are used.
In addition to the prose that defines the relationship between the consumer and provider, an SLA contains Service Level Objectives (SLOs) that define objectively measurable conditions for the service. The consumer must weigh the terms of the SLA and its SLOs against the goals of their business to select a cloud provider.
Simply put, an SLA is the foundation of the consumer’s trust in the provider. A well-written SLA codifies the provider’s reputation.
A Community Effort Targeted at SLAs
Since mid-2009, the open, free Cloud Computing Use Cases Group has led a conversation about cloud computing. The community has collaborated on a paper that documents the ways consumers are using the cloud today and how they hope to use it tomorrow.
The success of cloud computing will rest on the cloud being open and having effective controls in place to mitigate risk. The open discussion group, made up of individuals from business and academia from around the world, worked together to produce a white paper on Cloud Computing Use Cases in support of open cloud computing.
The first three editions of the paper covered the basic requirements for cloud computing, the needs of developers and the security controls cloud vendors must have in place. The following are some of the group’s findings.
Characteristics of SLAs
An SLA contains several things:
♦ A set of services the provider will deliver
♦ A complete, specific definition of each service
♦ The responsibilities of the provider and the consumer
♦ A set of metrics to determine whether the provider is delivering the service as promised
♦ An auditing mechanism to monitor the service
♦ The remedies available to the consumer and provider if the terms of the SLA are not met
♦ How the SLA will change over time
The marketplace features two types of SLAs: off-the-shelf agreements and negotiated agreements between a provider and consumer to meet that consumer’s specific needs. Off-the-shelf agreements are usually far less expensive, but it is unlikely that any consumer with critical data and applications will be able to use them. Therefore the consumer’s first step in approaching an SLA (and the cloud in general) is to determine how critical their data and applications are.
Business Level Objectives
Debating the terms of an SLA is meaningless if the organization has not defined its business level objectives. A consumer must select providers and services based on the goals of the organization. Defining exactly what services will be used is worthless unless the organization has defined why it will use the services in the first place.
Consumers should know why they are using cloud computing before they decide how to deploy and use the cloud.
Service Level Objectives
An SLO defines a characteristic of a service in precise, measurable terms. Here are some sample SLOs:
♦ The system should never have more than 10 pending requests.
♦ Response time for a request should be less than 3 seconds.
♦ Data streaming for a read request should begin within 2 seconds.
♦ At least five instances of a VM should be available 99.99999% of the time, with at least one instance available in each of a provider’s three data centers.
Precise, measurable terms make it possible to verify that the SLA is being met.
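To show what "precise, measurable" buys in practice, here is a minimal Python sketch that checks a set of measurements against the first three sample SLOs above. The `SLO` class, the metric keys and the sample measurements are all hypothetical names and values chosen for illustration; they do not come from any provider's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SLO:
    """One objectively measurable condition (all names are illustrative)."""
    name: str
    metric: str                      # key in the measurement dictionary
    is_met: Callable[[float], bool]  # predicate applied to the measured value


# Thresholds taken from the first three sample SLOs in the list above.
SAMPLE_SLOS: List[SLO] = [
    SLO("pending requests", "pending_requests", lambda v: v <= 10),
    SLO("request time", "request_seconds", lambda v: v < 3),
    SLO("read stream start", "stream_start_seconds", lambda v: v <= 2),
]


def violated_slos(measurements: Dict[str, float],
                  slos: List[SLO] = SAMPLE_SLOS) -> List[str]:
    """Return the names of the SLOs that the measurements fail to meet."""
    return [s.name for s in slos if not s.is_met(measurements[s.metric])]


# Hypothetical measurements: 12 pending requests breaks the first SLO.
sample = {"pending_requests": 12, "request_seconds": 2.4, "stream_start_seconds": 1.1}
print(violated_slos(sample))  # ['pending requests']
```

Because each objective reduces to a comparison against a number, the consumer, the provider or a third party can run the same check and reach the same answer.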
Service Level Management
It is impossible to know whether the terms of the SLA are being met without monitoring and measuring the performance of the service. Service Level Management is how that performance information is gathered and handled. Measurements of the service are based on the Service Level Objectives in the SLA.
A cloud provider uses Service Level Management to make decisions about its infrastructure. For example, a provider might notice that throughput for a particular service is barely meeting the consumer’s requirements. The provider could respond to that situation by reallocating bandwidth or bringing more physical hardware online. The goal of Service Level Management is to help providers make intelligent decisions based on their business objectives and technical realities.
A cloud consumer uses Service Level Management to make decisions about how it uses cloud services. For example, a consumer might notice that the CPU load on a particular VM is above 90%. In response, the consumer might start another VM. However, if the consumer’s SLA says that the first 100 VMs are priced at one rate, and more VMs are priced at a higher rate, the consumer might decide not to create another VM and incur higher charges. As with the provider, Service Level Management helps consumers make (and possibly automate) decisions about the way they use cloud services.
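The consumer-side decision described above can be automated with a simple rule. The following Python sketch is one possible way to express it, assuming a 90% CPU threshold and a pricing tier that changes after 100 VMs, as in the example. The function name, its parameters and the `accept_higher_rate` flag are hypothetical and not part of any real cloud API.

```python
def should_start_vm(cpu_load: float, running_vms: int, *,
                    load_threshold: float = 0.90,
                    base_tier_limit: int = 100,
                    accept_higher_rate: bool = False) -> bool:
    """Decide whether to start one more VM.

    cpu_load is a fraction between 0 and 1; running_vms is the number of
    VMs the consumer already runs. All defaults are illustrative.
    """
    if cpu_load <= load_threshold:
        return False            # load is acceptable, nothing to do
    if running_vms < base_tier_limit:
        return True             # the new VM is still billed at the base rate
    return accept_higher_rate   # past the base tier, only if the cost is acceptable


print(should_start_vm(0.95, 40))    # True: overloaded and inside the base tier
print(should_start_vm(0.95, 100))   # False: the next VM would cost more
```

Keeping the pricing terms of the SLA as explicit parameters means the rule can be updated when the agreement changes, without touching the decision logic.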
SLA Requirements
Metrics
Monitoring and auditing require something tangible that can be monitored as it happens and audited after the fact. The metrics of an SLA must be objectively and unambiguously defined. Cloud consumers will have an endless variety of metrics depending on the nature of their applications and data. Although listing all metrics is impossible, some of the most common are:
♦ Throughput – How much work the service completes in a given time, which is closely tied to how quickly it responds
♦ Reliability – How often the service is available
♦ Elasticity – The ability for a given resource to expand and contract automatically, with limits clearly stated (the maximum amount of storage or bandwidth, for example)
♦ Load balancing – When elasticity kicks in (new VMs are booted or terminated, for example)
♦ Durability – How likely the data is to be lost
♦ Linearity – How a system performs as the load increases
♦ Agility – How quickly the provider responds as the consumer’s resource load scales up and down
♦ Automation – How often requests to the provider are handled without any human interaction
♦ Customer service response times – How quickly the provider responds to a service request. This refers to the human interactions required when something goes wrong with the on-demand, self-service aspects of the cloud.
Security
An SLA should be written with security in mind. A cloud consumer must understand their security requirements and what controls and federation patterns are necessary to meet those requirements. In turn, a cloud provider must understand what they must deliver to the consumer to enable the appropriate controls and federation patterns. (Security as a general requirement is discussed in significant detail in the group’s paper.)
Monitoring
If a failure to meet the terms of an SLA has financial or legal consequences, the question of who should monitor the performance of the provider (and whether the consumer meets its responsibilities as well) becomes crucial. It is in the provider’s interest to define uptime in the broadest possible terms, while consumers could be tempted to blame the provider for any system problems that occur. The best solution to this problem is a neutral third-party organization that monitors the performance of the provider. This eliminates the conflicts of interest that might occur if providers report outages at their sole discretion or if consumers are responsible for proving that an outage occurred.
A Note About Reliability
In discussions of reliability, a common metric bandied about is the number of “nines” a provider delivers. As an example, five nines reliability means the service is available 99.999% of the time, which translates to total system outages of roughly 5 minutes out of every 12 months. One problem with this metric is that it quickly loses meaning without a clear definition of what an outage is. (It loses even more meaning if the cloud provider gets to decide whether an incident constitutes an outage.)
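The arithmetic behind the "nines" is easy to check. The short Python sketch below converts a number of nines into an availability fraction and into the downtime that fraction allows per year. It assumes a 365-day year and treats n nines as an availability of 1 - 10^-n; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Three nines -> 0.999, five nines -> 0.99999, and so on."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of outage per year that a given availability permits."""
    return (1 - availability) * MINUTES_PER_YEAR


for n in (3, 4, 5):
    a = availability_from_nines(n)
    minutes = downtime_minutes_per_year(a)
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

For five nines this works out to a little over 5 minutes a year, and each nine removed multiplies the allowed downtime by ten. The calculation says nothing about what counts as an outage, which is exactly the gap the SLA has to close.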
Beyond the nebulous nature of nines, it is important to consider that many cloud offerings are built on top of other cloud offerings. The ability to combine multiple infrastructures provides a great deal of flexibility and power, but each additional provider makes the system less reliable. If a cloud provider offers a service built on a second cloud provider’s storage service and a third cloud provider’s database service, and all of those providers deliver five nines reliability, the reliability of the entire system is less than five nines. The service is unavailable when the first cloud provider’s systems go down; the service is equally unavailable when the second or third providers’ systems have problems. The more cloud providers involved, the more downtime the overall system will have.
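The effect of stacking providers can be estimated with a short calculation. The Python sketch below multiplies the availabilities of the providers a service depends on. It rests on two simplifying assumptions that are not stated in the discussion above: the service needs every provider to be up at the same time, and the providers fail independently of one another.

```python
import math
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a service that needs every provider up at once.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


# Three providers, each delivering five nines (99.999%).
combined = serial_availability([0.99999, 0.99999, 0.99999])
downtime = (1 - combined) * MINUTES_PER_YEAR

print(f"combined availability: {combined:.5%}")
print(f"allowed downtime: about {downtime:.1f} minutes per year")
```

Under these assumptions the combined availability is about 99.997%, or close to 16 minutes of downtime a year instead of roughly 5. Real providers may share failure causes or add redundancy, so the figure is only a first estimate, but it shows why each extra dependency erodes the headline number.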
To sum up, any consumer who needs to evaluate the reliability of a cloud service should know as much as possible about the cloud providers that deliver that service, whether directly or indirectly.
Summary
As organizations use cloud services, the responsibilities of both the consumer and the provider must be clearly defined in a Service Level Agreement. An SLA defines how the consumer will use the services and how the provider will deliver them. It is crucial that consumers of cloud services fully understand all the terms of the provider’s SLA and consider the needs of their organization before signing any agreement.
-----
The Cloud Computing Use Cases group is currently working on a new paper entitled “Moving to the Cloud.” Readers are encouraged to join the group and share their experiences, concerns and advice. Contributions that represent the breadth of the cloud computing community will make the paper better. Please visit and join http://groups.google.com/group/cloud-computing-use-cases to get involved.