With the prospect of increasing amounts of data being collected by a proliferation of internet –connected devices and the task of organizing, storing, and accessing such data looming, we face the challenge of how to leverage the power of the cloud running in our data centers to make information accessible in a secure and privacy-preserving manner. For many scenarios, in other words, we would like to have a public cloud which we can trust with our private data, and yet we would like to have that data still be accessible to us in an organized and useful way.
One approach to this problem is to envision a world in which all data is preprocessed by a client device before being uploaded to the cloud; the preprocessing signs and encrypts the data in such a way that its functionality is preserved, allowing, for example, for the cloud to search or compute over the encrypted data and to prove its integrity to the client (without the client having to download it). We refer to this type of solution as Cryptographic Cloud Storage.
Cryptographic cloud storage is achievable with current technologies and can help bootstrap trust in public clouds. It can also form the foundation for future cryptographic cloud solutions where an increasing amount of computation on encrypted data is possible and efficient. We will explain cryptographic cloud storage and what role it might play as cloud becomes a more dominant force.
Applications of the Cryptographic Cloud
Storage services based on public clouds such as Microsoft’s Azure storage service and Amazon’s S3 provide customers with scalable and dynamic storage. By moving their data to the cloud customers can avoid the costs of building and maintaining a private storage infrastructure, opting instead to pay a service provider as a function of its needs. For most customers, this provides several benefits including availability (i.e., being able to access data from anywhere) and reliability (i.e., not having to worry about backups) at a relatively low cost. While the benefits of using a public cloud infrastructure are clear, it introduces significant security and privacy risks. In fact, it seems that the biggest hurdle to the adoption of cloud storage (and cloud computing in general) is concern over the confidentiality and integrity of data.
While, so far, consumers have been willing to trade privacy for the convenience of software services (e.g., for web-based email, calendars, pictures etc…), this is not the case for enterprises and government organizations. This reluctance can be attributed to several factors that range from a desire to protect mission-critical data to regulatory obligations to preserve the confidentiality and integrity of data. The latter can occur when the customer is responsible for keeping personally identifiable information (PII), or medical and financial records. So while cloud storage has enormous promise, unless the issues of confidentiality and integrity are addressed many potential customers will be reluctant to make the move.
In addition to simple storage, many enterprises will have a need for some associated services. These services can include any number of business processes including sharing of data among trusted partners, litigation support, monitoring and compliance, back-up, archive and audit logs. A cryptographic storage service can be endowed with some subset of these services to provide value to enterprises, for example in complying with government regulations for handling of sensitive data, geographic considerations relating to data provenance, to help mitigate the cost of security breaches, lower the cost of electronic discovery for litigation support, or alleviate the burden of complying with subpoenas.
For example, a specific type of data which is especially sensitive is personal medical data. The recent move towards electronic health records promises to reduce medical errors, save lives and decrease the cost of healthcare. Given the importance and sensitivity of health-related data, it is clear that any cloud storage platform for health records will need to provide strong confidentiality and integrity guarantees to patients and care givers, which can be enabled with cryptographic cloud storage.
Another arena where a cryptographic cloud storage system could be useful is interactive scientific publishing. As scientists continue to produce large data sets which have broad value for the scientific community, demand will increase for a storage infrastructure to make such data accessible and sharable. To incent scientists to share their data, scientific could establish a publication forum for data sets in partnership with hosted data centers. Such an interactive publication forum would need to provide strong guarantees to authors on how their data sets may be accessed and used by others, and could be built on a cryptographic cloud storage system.
Cryptographic Cloud Storage
The core properties of a cryptographic storage service are that control of the data is maintained by the customer and the security properties are derived from cryptography, as opposed to legal mechanisms, physical security, or access control. A cryptographic cloud service should guarantee confidentiality and integrity of the data while maintaining the availability, reliability, and efficient retrieval of the data and allowing for flexible policies of data sharing.
A cryptographic storage service can be built from three main components: a data processor (DP), that processes data before it is sent to the cloud; a data verifier (DV), that checks whether the data in the cloud has been tampered with; and a token generator (TG), that generates tokens which enable the cloud storage provider to retrieve segments of customer data. We describe designs for both consumer and enterprise scenarios.
A Consumer Architecture
Typical consumer scenarios include hosted email services or content storage or back-up. Consider three parties: a user Alice that stores her data in the cloud; a user Bob with whom Alice wants to share data; and a cloud storage provider that stores Alice’s data. To use the service, Alice and Bob begin by downloading a client application that consists of a data processor, a data verifier and a token generator. Upon its first execution, Alice’s application generates a cryptographic key. We will refer to this key as a master key and assume it is stored locally on Alice’s system and that it is kept secret from the cloud storage provider.
Whenever Alice wishes to upload data to the cloud, the data processor attaches some metadata (e.g., current time, size, keywords etc…) and encrypts and encodes the data and metadata with a variety of cryptographic primitives. Whenever Alice wants to verify the integrity of her data, the data verifier is invoked. The latter uses Alice’s master key to interact with the cloud storage provider and ascertain the integrity of the data. When Alice wants to retrieve data (e.g., all files tagged with keyword “urgent”) the token generator is invoked to create a token and a decryption key. The token is sent to the cloud storage provider who uses it to retrieve the appropriate (encrypted) files which it returns to Alice. Alice then uses the decryption key to decrypt the files.
Whenever Alice wishes to share data with Bob, the token generator is invoked to create a token and a decryption key which are both sent to Bob. He then sends the token to the provider who uses it to retrieve and return the appropriate encrypted documents. Bob then uses the decryption key to recover the files. This process is illustrated in Figure 1.
Figure 1: (1) Alice’s data processor prepares the data before sending it to the cloud; (2) Bob asks Alice for permission to search for a keyword; (3) Alice’s token generator sends a token for the keyword and a decryption key back to Bob; (4) Bob sends the token to the cloud; (5) the cloud uses the token to find the appropriate encrypted documents and returns them to Bob. At any point in time, Alice’s data verifier can verify the integrity of the data.
An Enterprise Architecture
In the enterprise scenario we consider an enterprise MegaCorp that stores its data in the cloud; a business partner PartnerCorp with whom MegaCorp wants to share data; and a cloud storage provider that stores MegaCorp’s data. To handle enterprise customers, we introduce an extra component: a credential generator. The credential generator implements an access control policy by issuing credentials to parties inside and outside MegaCorp.
To use the service, MegaCorp deploys dedicated machines within its network to run components which make use of a master secret key, so it is important that they be adequately protected. The dedicated machines include a data processor, a data verifier, a token generator and a credential generator. To begin, each MegaCorp and PartnerCorp employee receives a credential from the credential generator. These credentials reflect some relevant information about the employees such as their organization or team or role.
Figure 2: (1) Each MegaCorp and PartnerCorp employee receives a credential; (2) MegaCorp employees send their data to the dedicated machine; (3) the latter processes the data using the data processor before sending it to the cloud; (4) the PartnerCorp employee sends a keyword to MegaCorp’s dedicated machine ; (5) the dedicated machine returns a token; (6) the PartnerCorp employee sends the token to the cloud; (7) the cloud uses the token to find the appropriate encrypted documents and returns them to the employee. At any point in time, MegaCorp’s data verifier can verify the integrity of MegaCorp’s data.
generates data that needs to be stored in the cloud, it sends the data together with an associated decryption policy to the dedicated machine for processing. The decryption policy specifies the type of credentials necessary to decrypt the data (e.g., only members of a particular team). To retrieve data from the cloud (e.g., all files generated by a particular employee), an employee requests an appropriate token from the dedicated machine. The employee then sends the token to the cloud provider who uses it to find and return the appropriate encrypted files which the employee decrypts using his credentials.
If a PartnerCorp employee needs access to MegaCorp’s data, the employee authenticates itself to MegaCorp’s dedicated machine and sends it a keyword. The latter verifies that the particular search is allowed for this PartnerCorp employee. If so, the dedicated machine returns an appropriate token which the employee uses to recover the appropriate files from the service provider. It then uses its credentials to decrypt the file. This process is illustrated in Figure 2.
Implementing the Core Cryptographic Components
The core components of a cryptographic storage service can be implemented using a variety of techniques, some of which were developed specifically for cloud computing. When preparing data for storage in the cloud, the data processor begins by indexing it and encrypting it with a symmetric encryption scheme (for example the government approved block cipher AES) under a unique key. It then encrypts the index using a searchable encryption scheme and encrypts the unique key with an attribute-based encryption scheme under an appropriate policy. Finally, it encodes the encrypted data and index in such a way that the data verifier can later verify their integrity using a proof of storage.
In the following we provide high level descriptions of these new cryptographic primitives. While traditional techniques like encryption and digital signatures could be used to implement the core components, they would do so at considerable cost in communication and computation. To see why, consider the example of an organization that encrypts and signs its data before storing it in the cloud. While this clearly preserves confidentiality and integrity it has the following limitations.
To enable searching over the data, the customer has to either store an index locally, or download all the (encrypted) data, decrypt it and search locally. The first approach obviously negates the benefits of cloud storage (since indexes can grow large) while the second scales poorly. With respect to integrity, note that the organization would have to retrieve all the data first in order to verify the signatures. If the data is large, this verification procedure is obviously undesirable. Various solutions based on (keyed) hash functions could also be used, but all such approaches only allow a fixed number of verifications.
A searchable encryption scheme provides a way to encrypt a search index so that its contents are hidden except to a party that is given appropriate tokens. More precisely, consider a search index generated over a collection of files (this could be a full-text index or just a keyword index). Using a searchable encryption scheme, the index is encrypted in such a way that (1) given a token for a keyword one can retrieve pointers to the encrypted files that contain the keyword; and (2) without a token the contents of the index are hidden. In addition, the tokens can only be generated with knowledge of a secret key and the retrieval procedure reveals nothing about the files or the keywords except that the files contain a keyword in common.
Symmetric searchable encryption (SSE) is appropriate in any setting where the party that searches over the data is also the one who generates it. The main advantages of SSE are efficiency and security while the main disadvantage is functionality. SSE schemes are efficient both for the party doing the encryption and (in some cases) for the party performing the search. Encryption is efficient because most SSE schemes are based on symmetric primitives like block ciphers and pseudo-random functions. Search can be efficient because the typical usage scenarios for SSE allow the data to be pre-processed and stored in efficient data structures.
Another set of cryptographic techniques that has emerged recently allows the specification of a decryption policy to be associated with a ciphertext. More precisely, in a ciphertext-policy attribute-based encryption scheme each user in the system is provided with a decryption key that has a set of attributes associated with it. A user can then encrypt a message under a public key and a policy. Decryption will only work if the attributes associated with the decryption key match the policy used to encrypt the message. Attributes are qualities of a party that can be established through relevant credentials such as being an employee of a certain company or living in Washington State.
Proofs of Storage
A proof of storage is a protocol executed between a client and a server with which the server can prove to the client that it did not tamper with its data. The client begins by encoding the data before storing it in the cloud. From that point on, whenever it wants to verify the integrity of the data it runs a proof of storage protocol with the server. The main benefits of a proof of storage are that (1) they can be executed an arbitrary number of times; and (2) the amount of information exchanged between the client and the server is extremely small and independent of the size of the data.
Trends and future potential
Extensions to cryptographic cloud storage and services are possible based on current and emerging cryptographic research. This new work will bear fruit in enlarging the range of operations which can be efficiently performed on encrypted data, enriching the business scenarios which can be enabled through cryptographic cloud storage.
About the Authors
Kristin Lauter is a Principal Researcher and the head of the Cryptography Group at Microsoft Research. She directs the group’s research activities in theoretical and applied cryptography and in the related math fields of number theory and algebraic geometry. Group members publish basic research in prestigious journals and conferences and collaborate with academia through joint publications, and by helping to organize conferences and serve on program committees. The group also works closely with product groups, providing consulting services and technology transfer. The group maintains an active program of post-docs, interns, and visiting scholars. Her personal research interests include algorithmic number theory, elliptic curve cryptography, hash functions, and security protocols.
Seny Kamara is a researcher in the Crypto Group at Microsoft Research in Redmond and completed a Ph.D. in Computer Science at Johns Hopkins University under the supervision of Fabian Monrose. At Hopkins Dr. Kamara was a member of the Security and Privacy Applied Research (SPAR) Lab. Seny Kamara spent the Fall of 2006 at UCLA’s IPAM and the summer of 2003 at CMU’s CyLab. Main research interests are in cryptography and security and recent work has been in cloud cryptography, focusing on the design of new models and techniques to alleviate security and privacy concerns that arise in the context of cloud computing.