The most important advantage of cloud computing for scientific experiments is that the average scientist can access many types of resources without having to buy or configure the whole infrastructure.
This is a fundamental need for scientists and scientific applications. It is preferable that scientists be isolated from the complexity of configuring and instantiating the whole environment, focusing only on the development of the in silico experiment.
The number of published scientific and industrial papers provides evidence that cloud computing is being considered a definitive paradigm and is already being adopted by many scientific projects.
However, many issues have to be analyzed when scientists decide to migrate a scientific experiment to a cloud environment. The article “Azure Use Case Highlights Challenges for HPC Applications in the Cloud” presents several challenges related to HPC support, specifically for the Windows Azure Platform. In this article, we discuss some important topics on cloud computing support from a scientific perspective. Some of these topics were organized as a taxonomy in our chapter “Towards a Taxonomy for Cloud Computing from an e-Science Perspective” of the book “Cloud Computing: Principles, Systems and Applications” [11].
Background on e-Science and Scientific Workflows
Over the last decades, the effective use of computational scientific experiments has evolved at a fast pace, leading to what is being called e-Science. e-Science experiments are also known as in silico experiments [12]. In silico experiments are commonly found in many domains, such as bioinformatics [13] and deep water oil exploitation [14]. An in silico experiment is conducted by a scientist, who is responsible for managing the entire experiment, which comprises composing, executing and analyzing it. Most in silico experiments are composed of a set of programs chained in a coherent flow. This flow of programs aiming at a final scientific goal is commonly called a scientific workflow [12,15].
A scientific workflow may be defined as an abstraction that allows the structured, controlled composition of programs and data as a sequence of operations aiming at a desired result. Scientific workflows represent an attractive alternative to model pipelines or script-based flows of programs or services that represent solid algorithms and computational methods. Scientific Workflow Management Systems (SWfMS) are responsible for the workflow execution by coordinating the invocation of programs, either locally or in remote environments. SWfMS need to offer support throughout the whole experiment life cycle, including: (i) design the workflow through a guided interface (to follow a specific scientific method [16]); (ii) control several variations of workflow executions [15]; (iii) execute the workflow in an efficient way (often in parallel); (iv) handle failures; and (v) access, store and manage data.
Combining life cycle support with an HPC environment poses many challenges to SWfMS because of the heterogeneous execution environments of the workflow. When the HPC environment is a cloud platform, more issues arise, as discussed next.
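The idea of a workflow as programs chained in a coherent flow can be illustrated with a minimal sketch. It is not how any particular SWfMS works: the function `run_workflow`, its parameters and the three-step example are hypothetical names chosen for illustration, and plain Python callables stand in for the scientific programs. The sketch runs the activities in dependency order and retries a failed activity a fixed number of times, which loosely corresponds to items (iii) and (iv) of the life cycle.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(activities, dependencies, max_retries=1):
    """Run activities in dependency order, retrying failed ones.

    activities:   dict mapping activity name -> callable(inputs) -> output
    dependencies: dict mapping activity name -> set of upstream activity names
    (Both structures are illustrative assumptions, not a real SWfMS format.)
    """
    results = {}
    # static_order() yields every activity after all of its upstream activities
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        for attempt in range(max_retries + 1):
            try:
                results[name] = activities[name](inputs)
                break
            except Exception:
                # Give up only after the last allowed attempt
                if attempt == max_retries:
                    raise
    return results


# Hypothetical three-step experiment: load data, sort it, summarize it
activities = {
    "load": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["load"]),
    "summarize": lambda inputs: {
        "min": inputs["sort"][0],
        "max": inputs["sort"][-1],
    },
}
dependencies = {"load": set(), "sort": {"load"}, "summarize": {"sort"}}

results = run_workflow(activities, dependencies)
```

A real SWfMS would additionally dispatch independent activities in parallel, record provenance and move data between remote environments; the sketch only shows the ordering and failure-handling skeleton.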
Cloud check-list before migrating a scientific experiment
We discuss scientific workflow issues related to cloud computing in terms of architectural characteristics, business model, technology infrastructure, privacy, pricing, orientation and access, as shown in Figure 1.
Figure 1. Main issues in clouds for scientific applications
Pricing
Cost is one of the most important characteristics in both scientific and business domains. Since most public clouds adopt the pay-per-use model, it is important to estimate the final price to be paid and to determine how the financial resources available for a scientific experiment will be used. In general, the price to be paid for using clouds follows one of three main types, which scientists have to analyze: free (normally when scientists have their own cloud), pay-per-use (the scientist pays a specific value according to resource utilization, normally measured in hours) and bill broken (where scientists pay for each component used, independent of the time used). However, this evaluation is far from simple, since the costs saved by using a cloud, such as acquiring equipment and hiring support staff, are difficult to calculate.
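As a rough illustration of how such an estimate might be made before migrating an experiment, the following sketch compares a pay-per-use estimate with a per-component ("bill broken") estimate against a fixed budget. All function names, rates and prices are hypothetical, and the rounding of partial hours up to full hours is an assumption about the provider's billing policy, not a rule taken from any specific cloud.

```python
import math


def pay_per_use_cost(hours_per_instance, instances, hourly_rate):
    """Estimate a pay-per-use bill.

    Assumption: the provider bills whole hours, so a partial hour
    is rounded up to a full one.
    """
    billed_hours = math.ceil(hours_per_instance)
    return billed_hours * instances * hourly_rate


def per_component_cost(component_prices):
    """Estimate a 'bill broken' charge: a fixed price per component used,
    independent of how long each component is used."""
    return sum(component_prices.values())


def cheapest_model(estimates):
    """Return the name of the pricing model with the lowest estimate."""
    return min(estimates, key=estimates.get)


# Hypothetical experiment: 4 virtual machines for 10.5 hours each,
# versus fixed prices for a storage and a database component.
estimates = {
    "pay-per-use": pay_per_use_cost(10.5, instances=4, hourly_rate=0.10),
    "bill broken": per_component_cost({"storage": 2.0, "database": 3.0}),
}

budget = 4.5  # hypothetical funds available for the experiment
affordable = {model: cost for model, cost in estimates.items() if cost <= budget}
```

Such an estimate covers only what the provider bills. It does not capture the savings on equipment and staff, which are harder to quantify.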
Business Model
Clouds may be classified into three main categories [17]: Software as a Service (SaaS), Infrastructure as a Service (IaaS) and Platform as a Service (PaaS), creating a model named SPI [17]. The evaluation of a cloud environment must consider the business model, particularly with respect to scientific data support. In the e-Science field, the generated data is one of the most valuable resources. The SPI model does not consider services that are based on storage or databases. Thus, it is important to check models that provide Storage as a Service and Database as a Service. Storage as a Service provides access to several storage facilities that are remotely located. Database as a Service provides the operations and functions of a remotely hosted database management system. Database services are particularly important in scientific experiments for storing provenance data [18], so that it can be queried with controlled access, which is not supported by storage services.
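To make the difference between storing files and querying provenance concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database as a local stand-in for a remotely hosted database service. The table name, columns and sample records are hypothetical and do not follow any particular provenance model.

```python
import sqlite3

# In-memory database used as a stand-in for a remote Database as a Service
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activity_execution (
        workflow    TEXT,
        activity    TEXT,
        input_file  TEXT,
        output_file TEXT,
        status      TEXT,
        elapsed_s   REAL
    )
    """
)

# Hypothetical provenance records for one workflow execution
rows = [
    ("blast_run_1", "align", "seq1.fasta", "hits1.out", "FINISHED", 12.5),
    ("blast_run_1", "align", "seq2.fasta", None, "FAILED", 3.1),
    ("blast_run_1", "filter", "hits1.out", "top1.out", "FINISHED", 0.8),
]
conn.executemany("INSERT INTO activity_execution VALUES (?, ?, ?, ?, ?, ?)", rows)

# Query 1: which activity executions failed, and on which input?
failed = conn.execute(
    "SELECT activity, input_file FROM activity_execution "
    "WHERE status = ? ORDER BY activity",
    ("FAILED",),
).fetchall()

# Query 2: which activity produced a given result file, and from what input?
lineage = conn.execute(
    "SELECT activity, input_file FROM activity_execution WHERE output_file = ?",
    ("top1.out",),
).fetchall()
```

A plain storage service could hold the same records as files, but answering either question would require downloading and parsing them on the client side.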
Architectural Characteristics
When analyzing the main architectural characteristics of clouds, it is important to check the support for Virtualization, Security, Resource Sharing and Scalability. For example, clouds can occasionally relocate applications among hosts and allocate multiple applications on the same host according to resource availability. These moves and instabilities can negatively affect workflow performance because of the flow of activity executions and the data transfer between them. Ideally, the cloud scheduler should be in sync with the SWfMS so that it is aware of the flow.
Privacy
Privacy is a fundamental issue in scientific experiments. Many unpublished experiments and results have to be kept private during the course of the experiment. We may classify cloud approaches as private, public and hybrid. From the scientist's point of view and in terms of privacy, the most “secure” approach is to use private clouds. In private clouds, all the security control is defined by the scientist (or a team of computer specialists), which means that external access is more tightly controlled by the scientist. However, hybrid and public clouds usually provide advanced security mechanisms (such as security policies in Amazon EC2) that are designed to guarantee the privacy of data and applications. Scientists have to analyze whether the provided mechanisms meet their expectations.
Access
Clouds provide several types of access, such as browsers, thin clients, mobile clients and APIs (a non-exhaustive list). Analyzing the access type provided is important for scientists when choosing a cloud environment to run their experiments. Scientific experiments should be accessible in different ways, for example through web pages and mobile devices. The effective use of different technologies in scientific experiments leads to the need for different types of access. Web browsers are commonly used for accessing cloud services. Using them is intuitive, since almost every computer has at least one browser installed and can therefore reach cloud services. In addition, many Web browsers, such as Google Chrome, are focused on cloud computing. Thin clients and mobile clients are other important types of access, allowing clouds to be reached away from a desktop, from handhelds or mobile phones. Finally, an API is a fundamental way of accessing clouds from programming languages (such as Java, Python or C). Complex scientific applications usually make use of APIs to access cloud infrastructure in a native form. In this case, scientists have to analyze the access methods already used by their application and verify whether these methods can be used in, or adapted to, a cloud environment.
Cloud Orientation
The cloud orientation differs according to the business model used. In the SaaS model, applications are deployed on the cloud and can only be invoked, i.e., all execution control is in the hands of the deployed application. We consider this approach task centric: scientists need to transfer control to the application owners instead of keeping it during the course of the experiment. On the other hand, when the infrastructure is provided as a service (IaaS, where virtualized hardware is provided to be configured and controlled), the scientist has full control of the actions. The programs that will execute and the environment configurations are chosen by the scientists. We consider this approach user centric. Scientists have to analyze which approach is more suitable for their needs. If they want to execute only one application, such as the bioinformatics tool BLAST, they can choose a task centric approach. However, if they want to try several programs or change environment configurations, the user centric approach is more suitable.
Technology Infrastructure
The technological infrastructure defines how a specific cloud approach is implemented. It can be based on grids [19], Peer-to-Peer [20], PC clouds, cluster clouds or a combination of them. This evaluation may be compromised in public clouds, such as Amazon EC2 [21], because we are not able to know which kind of technology is used to implement the cloud. In private clouds, however, it is possible to obtain this information. It is quite useful because many experiments need a computational cluster or a grid to execute in parallel and produce results in a feasible time.
Conclusions
This article highlighted that, despite the high interest in cloud computing from the scientific community (especially from those who need to execute HPC scientific applications), it is still a wide open field. Choosing the best cloud support is a step forward, but there is still a need for services focused on scientific workflow execution to bridge the gap between the cloud and the SWfMS. SciCumulus [22] is an initiative in this direction. Some SWfMS, such as Swift [23] and Pegasus [24], are also incorporating cloud support.
About the Authors
Daniel de Oliveira is a Ph.D. student at the Department of Computer Science at the COPPE Institute of the Federal University of Rio de Janeiro. He received a B.Sc. degree in 2005 and an M.Sc. degree in 2008, both from the Federal University of Rio de Janeiro, Brazil. He is currently working on his Ph.D. thesis in Computer Science at the same institution. His interests include Cloud Computing, e-Science, workflow management, data mining, text mining and ontologies. He is also a member of IEEE, ACM and the Brazilian Computer Society.
Fernanda Baião has been a Professor at the Department of Applied Informatics of the Federal University of the State of Rio de Janeiro (UNIRIO) since 2004, where she leads the Distributed Databases Research Group. She received the Doctor of Science degree from the Federal University of Rio de Janeiro (UFRJ) in 2001. During the year 2000 she worked as a visiting student at the University of Wisconsin, Madison (USA). Her current research interests include distributed and parallel databases, data management in scientific workflows, conceptual data modeling and machine learning techniques. She participates in research projects in those areas, with funding from several Brazilian government agencies, including CNPq, CAPES and FAPERJ. She participates in several program committees of national and international conferences and workshops, and is a member of ACM and the Brazilian Computer Society.
Marta Mattoso has been a Professor at the Department of Computer Science at the COPPE Institute of the Federal University of Rio de Janeiro (UFRJ) since 1994, where she leads the Distributed Database Research Group. She received the Doctor of Science degree from UFRJ. Dr. Mattoso has been active in the database research community for more than ten years, and her current research interests include distributed and parallel databases and data management aspects of scientific workflows. She is the principal investigator in research projects in those areas, with funding from several Brazilian government agencies, including CNPq, CAPES, FINEP and FAPERJ. She has published over 200 refereed international journal articles and conference papers. She has served on program committees of international conferences, and is a regular reviewer for several international journals.
References
[1] N. Antonopoulos and L. Gillam, 2010, Cloud Computing: Principles, Systems and Applications. 1 ed. Springer.
[2] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, et al., 2010, A view of cloud computing, Commun. ACM, v. 53, n. 4, p. 50-58.
[3] R. Buyya, C.S. Yeo, and S. Venugopal, 2008, Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities, In: Proceedings of the 2008 10th IEEE International Conference on High Performance Computing and Communications, p. 5-13
[4] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, 2008, The cost of doing science on the cloud: the Montage example, In: SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, p. 1-12, Austin, Texas.
[5] Y. El-Khamra, H. Kim, S. Jha, and M. Parashar, 2010, Exploring the Performance Fluctuations of HPC Workloads on Clouds, In: Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, p. 383–387, Washington, DC, USA.
[6] C. Evangelinos and C. Hill, 2008, Cloud Computing for parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon’s EC2, Chicago, IL.
[7] I. Foster, Y. Zhao, I. Raicu, and S. Lu, 2008, Cloud Computing and Grid Computing 360-Degree Compared, In: Grid Computing Environments Workshop, 2008. GCE ’08, p. 1-10.
[8] T. Hey, S. Tansley, and K. Tolle, 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery. Online book, Url.: http://emotionalcompetency.com/sci/booktoc.html.
[9] A. Matsunaga, M. Tsugawa, and J. Fortes, 2008, CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications, IEEE eScience 2008, p. 222-229.
[10] C. Hoffa, G. Mehta, T. Freeman, E. Deelman, K. Keahey, B. Berriman, and J. Good, 2008, On the use of cloud computing for scientific workflows, In: IEEE Fourth International Conference on eScience (eScience 2008), Indianapolis, USA, p. 7–12
[11] D. Oliveira, F. Baião, and M. Mattoso, 2010, “Towards a Taxonomy for Cloud Computing from an e-Science Perspective”, Cloud Computing: Principles, Systems and Applications (to be published), Heidelberg: Springer-Verlag
[12] I.J. Taylor, E. Deelman, D.B. Gannon, and M. Shields (Eds.), 2007, Workflows for e-Science: Scientific Workflows for Grids. 1 ed. Springer.
[13] M. Addis, J. Ferris, M. Greenwood, P. Li, D. Marvin, T. Oinn, and A. Wipat, 2003, Experiences with e-Science workflow specification and enactment in bioinformatics, Proceedings of UK e-Science All Hands Meeting, p. 459–467.
[14] W. Martinho, E. Ogasawara, D. Oliveira, F. Chirigati, I. Santos, G. Travassos, and M. Mattoso, 2009, A Conception Process for Abstract Workflows: An Example on Deep Water Oil Exploitation Domain, In: 5th IEEE International Conference on e-Science, Oxford, UK.
[15] M. Mattoso, C. Werner, G.H. Travassos, V. Braganholo, L. Murta, E. Ogasawara, D. Oliveira, S.M.S.D. Cruz, and W. Martinho, 2010, Towards Supporting the Life Cycle of Large Scale Scientific Experiments, International Journal of Business Process Integration and Management, v. 5, n. 1, p. 79–92.
[16] R.D. Jarrard, 2001, Scientific Methods. Online book, Url.: http://emotionalcompetency.com/sci/booktoc.html.
[17] L. Youseff, M. Butrico, and D. Da Silva, 2008, Toward a Unified Ontology of Cloud Computing, In: Grid Computing Environments Workshop, 2008. GCE ’08, p. 1-10.
[18] J. Freire, D. Koop, E. Santos, and C.T. Silva, 2008, Provenance for Computational Tasks: A Survey, Computing in Science and Engineering, v.10, n. 3, p. 11-21.
[19] I. Foster and C. Kesselman, 2004, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann.
[20] E. Pacitti, P. Valduriez, and M. Mattoso, 2007, Grid Data Management: Open Problems and New Issues, Journal of Grid Computing, v. 5, n. 3, p. 273-281.
[21] Amazon EC2, 2010. Amazon Elastic Compute Cloud (Amazon EC2). Available at: http://aws.amazon.com/ec2/. Accessed: 5 Mar 2010.
[22] D. Oliveira, E. Ogasawara, F. Baião, and M. Mattoso, 2010, SciCumulus: A Lightweight Cloud Middleware to Explore Many Task Computing Paradigm in Scientific Workflows, In: Proc. 3rd IEEE International Conference on Cloud Computing, Miami, FL.
[23] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, 2007, Swift: Fast, Reliable, Loosely Coupled Parallel Computation, In: Services 2007, p. 199-206, Salt Lake City, UT, USA.
[24] E. Deelman, G. Mehta, G. Singh, M. Su, and K. Vahi, 2007, “Pegasus: Mapping Large-Scale Workflows to Distributed Resources”, Workflows for e-Science, Springer, p. 376-394.