Cloud Computing and Digital Libraries - First Perspectives on a Future Technological Alliance



Pedro Lopes

IEETA, Universidade de Aveiro, Aveiro, Portugal
pedrolopes@ua.pt

Abstract—“Cloud-computing” is emerging as a relevant computing paradigm, aiming to be the technology that marks the difference between Web2.0 and Web3.0. “Cloud-computing” architectural features are pushing data and services to the Web, and the added value of this transition may be used by a new generation of digital libraries where services and data coexist transparently “in the cloud”. The “cloud” may now be seen as a collection of networked features. Traditional digital library architectures may benefit from this new concept, leading to a new implementation model: “cloud libraries”.

Keywords—cloud-computing; software-as-a-service; virtualization; platform-as-a-service; web; architecture; web2.0; digital libraries; data preservation; web-services; information access; cloud libraries.

I. INTRODUCTION

The term “cloud computing” has recently appeared in the media to describe a new computing paradigm, the next step in the evolution of on-demand information technology services and products [1]. It is a simple, generic term that describes new types of dynamic IT infrastructures, Quality-of-Service-guaranteed computing environments and configurable software services, where the computing takes place in an infrastructure unknown to the user, usually on virtualized machines relying on a company-specific middleware layer. “Cloud-computing” is, as the name suggests, nothing more than computing in “the clouds”, where “the clouds” stands for “the web”. Despite the fact that this seems to be a new concept that will revolutionize the Internet we know, it is largely a buzzword for something that is already happening: web-based applications fulfilling all the user requirements and replacing desktop applications. Looking at the Google Apps or the Microsoft Live suites, they offer almost all the resources one may need, whether in personal life or at work: documents, spreadsheets, e-mail, presentations, calendars…

Google Apps - http://www.google.com/apps
Microsoft Live - http://www.live.com

This article was written for, and sponsored by, the University of Aveiro’s Informatics Engineering Doctoral Program.

The evolution of computer science has led to the digitization of every kind of information. Digital libraries appear as one more step towards easy access to information spread across distinct media. As the data is stored digitally, information retrieval is facilitated, allowing a new wave of services and web applications that can gain quality from the gigantic amount of data available. However, the potential and the problems of digital libraries are yet to be fully explored outside the libraries’ controlled environment. Despite some large corporations’ efforts, the amount and quality of services available to developers (or end-users) is currently very limited. Google, Microsoft and Yahoo have large-scale digitization projects in progress, but the data is not fully available and no services on this data are provided yet. Digital data preservation is another problem that digital libraries face. Consider paper: even after 2000 years it can still be read, whereas in the digital world 99% of the computers on the market no longer have a floppy disk drive, which was very popular 10 years ago. Considering these problems, all the hype surrounding “cloud-computing” and the efforts currently being made to create complete “cloud operating systems”, it is natural to connect digital libraries to “cloud-computing” in order to obtain mutual benefits and enhance both perspectives. This paper focuses on this alliance and on the key improvements that may be obtained by merging “cloud-computing” concepts with digital library design, especially in areas such as data services (and service integration) or digital data storage.

II. MOTIVATION

“Cloud-computing” is a new architecture, and each service provider implements it with a different model, according to what they wish to offer to their clients. As for digital libraries, it is hard to establish a simple definition, as it depends on the scope of information one wishes to cover. Despite this fact, all digital libraries face the common issues previously mentioned. However, most digital libraries are in early development stages and their evolution will improve many of these aspects.


A. Cloud Computing

After the Web2.0 success, the way we see the Internet has changed significantly. The current trend is to consider the Internet as a platform providing on-demand computing and software as a service to anyone, anywhere and at any time [1].

Figure 1. Large companies’ perspective on “cloud-computing”

With this in mind, the most powerful web companies are developing sustained hardware and services platforms (approaching a full operating system environment) that will provide easy access to “cloud-computing” benefits to every web and application developer. These companies are currently analyzing the market and studying business opportunities in order to decide what kind of services they wish to offer. As shown in Figure 1, there has to be a trade-off between control over the applications and how large the companies want their services to grow. For instance, company XYZ cannot have tight control over the cloud system and expect the system to scale efficiently, or vice-versa. The most important companies developing “cloud-based” solutions include:

•	Microsoft, with its Azure Services Platform, a platform for developing web applications. Currently in limited technology preview, it aims to offer all the .NET framework functionality that developers are used to in typical server scenarios.

•	Salesforce, the self-proclaimed “leader in Customer Relationship Management & Cloud Computing”, which provides “no software”, only services to improve CRM web applications.

•	Amazon, with the Elastic Compute Cloud (EC2), part of the Amazon Web Services, which aims to provide “resizable compute capacity in the cloud”, offering complete virtualized operating systems to clients.

•	Google, with App Engine, which offers, for free, the ability to “build apps on the same scalable system that powers Google applications”.

These companies’ offers include access to a virtual machine – Windows or Linux – and an agreed service level, which defines the available bandwidth, the inbound and outbound traffic, or the available disk space and memory. One of the key features of “cloud-computing” is the ability to increase the Quality of Service dynamically according to the application’s usage and requirements. This allows application owners to pay only for what they really want, increasing application scalability and flexibility.
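As an illustration of this elasticity, the sketch below shows how a provider-side controller might derive the number of virtual machine instances from observed utilization. The thresholds and the provision()/release() hooks are hypothetical and do not correspond to any specific provider’s API.

# Minimal sketch of utilization-driven scaling. The provision()/release()
# hooks and the thresholds are illustrative assumptions, not a real provider API.

def provision(n: int) -> None:
    print(f"requesting {n} extra virtual machine(s) from the provider")

def release(n: int) -> None:
    print(f"returning {n} virtual machine(s) to the provider")

def desired_instances(current, cpu_utilization, low=0.3, high=0.8,
                      min_instances=1, max_instances=20):
    """Decide how many VM instances the application should run."""
    if cpu_utilization > high:                 # traffic peak: grow the pool
        return min(current + 1, max_instances)
    if cpu_utilization < low:                  # idle resources: shrink, save costs
        return max(current - 1, min_instances)
    return current                             # stay within the agreed service level

def reconcile(current, cpu_utilization):
    """Apply the decision; the owner is billed only for what actually runs."""
    target = desired_instances(current, cpu_utilization)
    if target > current:
        provision(target - current)
    elif target < current:
        release(current - target)
    return target

# Example: a traffic peak at 92% CPU grows a 3-instance deployment to 4.
print(reconcile(3, 0.92))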

Azure Services Platform - http://www.microsoft.com/azure
Salesforce.com - http://www.salesforce.com
Amazon Elastic Compute Cloud - http://aws.amazon.com/ec2
Google App Engine - http://code.google.com/appengine

Figure 2. “Cloud-computing” overview

These new architectures and business models allow requirements to be fulfilled dynamically according to the application’s needs, and accounts to be charged dynamically according to the Service Level Agreement. Most companies only offer services which, when combined, may be considered a “cloud operating environment”, because they offer storage access, databases, programming APIs, file management or (virtual) hardware control, just like a desktop operating system. Figure 2 presents a simple “cloud-computing” overview.

However, understanding “cloud-computing” requires solid knowledge about large hardware infrastructures, GRID [2, 3] engines and virtualization [4, 5]. A large hardware infrastructure is required, as both data and services are replicated in the system, enabling the existence of a virtual machine that can use services located in the USA and data located in the UK. GRID computing technology is “cloud-computing”’s main enabler: it was the GRID, and the developments it brought – in distributed computing and large-scale networks, for instance – that sped up the current “cloud-computing” deployments. Together with GRID advances, new virtualization techniques have also matured in the last few years, enough to allow the creation of a complete layer of virtualized environments with web-based access and control. This new virtualized layer uses GRID components as the main hardware infrastructure, merging both technologies in a single architecture.

As can be seen in Figure 3, the architecture relies on a distributed hardware layer that may be spread across different geographic locations. On top of it there are virtual server containers that may use distinct hardware machines. This is the core layer of the “cloud-computing” architecture. Virtualized environments now offer almost the same efficiency as directly deployed hardware. With virtualization, the distributed resources are scaled in real time to the applications, improving their efficiency and reducing the response time to traffic peaks. Virtualized environments also let the hardware suppliers scale the offered resources in real time, allowing a better use of the available hardware.


The main benefits of “cloud-computing” are the following:

•	Pay for what you use: one only pays for what one uses, reducing the overall costs of application deployment. This also allows cost savings (especially on hardware), which leverages economies of scale;

•	Resource flexibility: resources are scaled in real time to the application’s needs, improving the infrastructure provider’s hardware efficiency and the application’s scalability and sustainability;

•	Rapid prototyping and market testing: applications may be deployed instantly on the server, increasing speed to market;

•	Improved service levels and availability: as one only pays for what one uses, the service level agreements are more customizable and produced according to one’s needs;

•	Self-service deployment: developers have full control over their virtualized machine and may deploy whatever and whenever they want without hassle.

Figure 3. “Cloud-Computing” architecture

Concluding, “cloud-computing” is being used by large companies to create a fully functional online operating system that allows a return to the roots of computer interaction, where dumb terminals accessed a central mainframe containing all the applications. Except that, in this case, the applications are located in the “cloud” instead of on a single server located near the thin clients.

B. Digital Libraries

In this context, a digital library is an architecture that supports services providing access to stored digital data. Usually, access is provided through a web interface and the digital data is in the public domain or, otherwise, there is a subscription plan for information access. On a smaller scale, a simple newspaper archive or the online version of a scientific magazine can also be considered a digital library.

Figure 4 shows a typical large-scale digital library model. Clients access the presentation layer via the Internet, using a browser. Access is mediated by a firewall protecting the server side. The latter then accesses all the library servers, organized in a distributed environment controlled by a metadata server that enables a new level of control [6]. Metadata is, basically, data about data. The metadata server component stores structured information about the digital library’s composition, stored data, available services and technical details about its operation.
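To make the metadata server’s role concrete, the sketch below shows one possible structure for such a record, covering composition, stored data, services and technical details. The field names and values are illustrative assumptions, not taken from any particular digital library implementation.

# Illustrative sketch of a record a digital library metadata server might keep.
# All field names and values are assumptions made for this example.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CollectionRecord:
    identifier: str                       # which collection this record describes
    storage_servers: List[str]            # where the digital objects are stored
    formats: List[str]                    # stored data formats
    services: List[str]                   # services exposed over this collection
    technical: Dict[str, str] = field(default_factory=dict)  # operational details

record = CollectionRecord(
    identifier="newspaper-archive-1900-1950",
    storage_servers=["storage-eu-1", "storage-us-2"],
    formats=["PDF/A", "TIFF"],
    services=["full-text-search", "download"],
    technical={"replication": "2 copies", "checksum": "SHA-256"},
)
print(record.identifier, record.services)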

Figure 4. Large-scale digital library model

By now, several similarities with the “cloud-computing” architecture may be seen. A large hardware infrastructure supporting different servers, requiring complex management, and a GRID-like control system used to enhance efficiency and scalability are common to both architectures. As they share these characteristics, their software engineers could benefit from knowledge exchange and interaction between the two areas.

C. Use case scenarios

1) Digital Libraries Services and Applications

With the enormous amount of digital data originated by several digital library projects, it is necessary to use this data in the most reasonable manner.

Currently, the services offered by digital libraries are scarce. Considering newspapers or online scientific publications, the user can only access, view or download some data to a personal computer (if the service is subscribed). Even in large projects, data access is direct, and enhanced searches, semantic [7] searches or area-specific services are not available. Without these services, information access is constrained. This problem is relevant because Web2.0 nourishes collaboration, communication and, above all, integration. For now, all that digital libraries have to offer is a limited collection of services, reducing information-sharing capabilities. Despite the fact that digital library owners usually want to restrict what they offer, the benefits of a comprehensive trade-off between completely public information and a completely closed system have to be considered.

2) Digital Data Preservation

Digital data preservation is a major challenge for digital libraries. Data is an essential part of the library and its storage is of the utmost importance. Digital data storage requires extreme durability and scalability.


However, component failures, obsolescence, human operation errors, natural disasters, attacks or management errors are common difficulties that must be carefully studied when implementing a digital library [8]. These threats may be addressed using a distributed data storage approach. It is in this area that “cloud-computing” may help, as both the storage and the services are completely distributed, whether with the GRID architecture [9, 10] or with distributed virtual servers.

III. CASE STUDIES

As mentioned before, the presented issues may be solved using “cloud-computing”. Some possible solution scenarios, using “cloud-computing” features and the generic “cloud-computing” architecture presented previously, are now discussed.

A. Easy services scenario

One of the main capabilities that the big companies are offering is a wide service framework. For instance, Microsoft is aiming to provide a complete .NET framework solution: the Azure Services Platform will contain storage services, database services and access to Microsoft’s Live framework (besides the typical cloud management services). Amazon’s EC2, with its own operating system, offers the possibility to install or create any required service.

As one can see, “cloud-computing” relies on services to help developers in the application design process. The cloud environment is, therefore, a comprehensive distributed environment, complete with a set of tools available to every developer, which can be used in any kind of application. With a middleware service layer controlled by the developers, data and service integration problems [11] may be solved. A hybrid architecture with the best of both worlds can be designed: a complete service framework enabling an increase in the scope of digital library services. This will also create an infrastructure able to support a new wave of applications, promoting interoperability and collaboration among the community of developers and users. This step could definitely increase digital library usage levels, because one could develop simple applications that fit one’s requirements and publish them to other developers. It could also enhance existing applications, as more and more developers become interested in creating services and applications using the digitally available data.

Figure 5. Digital Libraries improvements cycle

These new services could definitely improve existing digital library architectures. For instance, the University of Aveiro has developed the SInBAD platform [12], which integrates four different main components but only offers traditional search mechanisms to analyze the existing data. Adopting “cloud-computing”, new distributed services could be created to improve the existing search mechanisms and to offer new functionality to users. Some of these services may involve user-customized searches, text-mining tools and query expansion mechanisms [13]. Saving users’ search histories and applying collective intelligence algorithms to extract user information will allow better knowledge of user habits, which will, subsequently, allow search results that are more interesting to the user than traditional ones. The stored data will also enable GRID-based text-mining engines [14], a specialization of generic text-mining frameworks designed to explore all the GRID capabilities.
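As a rough illustration of such a service, the sketch below expands a user query with synonyms and with terms from the user’s own search history. The synonym table and the history are fabricated placeholders; a real service would derive them from the library’s data and from collective intelligence algorithms.

# Minimal query-expansion sketch; the synonym map and the user history are
# fabricated placeholders standing in for library-derived resources.
from typing import Dict, List

SYNONYMS: Dict[str, List[str]] = {
    "library": ["archive", "repository"],
    "cloud": ["distributed", "virtualized"],
}

def expand_query(query: str, history: List[str]) -> List[str]:
    """Return the original terms plus synonyms and recent history terms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))               # dictionary expansion
    expanded.extend(h for h in history if h not in expanded)  # personalization
    return expanded

# Example: a user who recently searched for "preservation".
print(expand_query("cloud library", history=["preservation"]))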

This create-and-share process is fundamental to the growth of digital libraries’ importance on the Internet and may lead to great developments in a new era of semantic digital libraries. It may also be a small step towards the birth of Web3.0, the next generation of web applications and Internet usage.

B. “Cloud storage” scenario

The storage architecture is one of the best cloud features. Storage is distributed and scaled according to cloud application requirements, improving overall system efficiency. To start, every single bit is replicated on servers in different geographic locations. This provides faster access, as users are directed to the nearest storage server. Another aspect is that when a server is heavily loaded, users are directed to a lighter server in order to balance server load and speed up storage responses.
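The sketch below illustrates that routing idea, choosing a replica first by geographic proximity and then by current load. The server names, regions, load values and the overload threshold are invented for the example.

# Sketch of replica selection by proximity and load; servers, regions and
# load values are invented, and the overload threshold is an assumption.
from typing import List, NamedTuple

class Replica(NamedTuple):
    name: str
    region: str
    load: float        # fraction of capacity in use, 0.0 - 1.0

def pick_replica(replicas: List[Replica], user_region: str,
                 overload: float = 0.8) -> Replica:
    """Prefer a nearby replica; fall back to the least-loaded one when the
    nearby servers are heavily loaded."""
    nearby = [r for r in replicas if r.region == user_region and r.load < overload]
    candidates = nearby or replicas
    return min(candidates, key=lambda r: r.load)

servers = [Replica("storage-uk-1", "eu", 0.85),
           Replica("storage-pt-1", "eu", 0.40),
           Replica("storage-us-1", "us", 0.10)]
print(pick_replica(servers, user_region="eu").name)   # -> storage-pt-1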

Digital data preservation is achieved when some primary objectives are accomplished. Being able to retrieve and use the data is easily understood as the first objective. Long-term preservation while maintaining the data’s properties is the other main objective. Succinctly, this means that the digital data must maintain its consistency, availability and integrity forever. The latter objective will, in some manner, involve hardware changes that have to be prepared in advance in order to avoid a future failure of the digital architecture. Most of these objectives require the existence of the mentioned metadata server in order to ease the processes needed to fulfill them. In a “cloud-computing” environment, data replication improves the architecture’s reliability. To deal with obsolescence problems, service providers use this data replication to facilitate hardware or format updates.


The update is then propagated to the services, which are also updated in order to cope with the new hardware or the new data formats. This data management process is also eased by the existence of metadata. This data about data contains mostly technical information. However, it is maintained especially to guarantee that all the stored data is consistent across the storage layer. It is this technical metadata that allows the integration of different kinds of storage systems. For instance, it is the metadata layer that contains geographical location information about the storage servers and provides it to the services, along with a storage server description, allowing the integration of different kinds of storage servers: for example, a relational database located in the USA and a triple-storing system located in the UK. Within “cloud-computing” architectures, scalability is no longer an issue, as the architecture itself provides real-time scalable access to resources, whether data or processing power. The cloud operating environment handles storage-level heterogeneity, so it is no longer a problem for digital data preservation. Storage differences are completely transparent to the service layer.
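One way to picture the consistency and integrity guarantees described above is a periodic fixity check across the replicated copies, sketched below. The digest choice and the majority-based comparison are illustrative assumptions, not a description of any existing cloud service.

# Sketch of a fixity (integrity) check across replicated copies of an object.
# Comparing each replica against the most common digest is an assumption.
import hashlib
from collections import Counter
from typing import Dict

def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def fixity_check(copies: Dict[str, bytes]) -> Dict[str, bool]:
    """Flag replicas whose digest differs from the most common digest."""
    digests = {server: digest(data) for server, data in copies.items()}
    reference, _ = Counter(digests.values()).most_common(1)[0]
    return {server: d == reference for server, d in digests.items()}

# Example: one corrupted replica is flagged and could then be re-replicated.
replicas = {"storage-us-1": b"page scan v1",
            "storage-uk-1": b"page scan v1",
            "storage-pt-1": b"page scan corrupted"}
print(fixity_check(replicas))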

Figure 6. Digital data preservation using distributed storage

At this point, it is important to mention that services are separated from storage. This separation may not be physical: the storage server may also play the role of service server, but it occurs at a logical level. The service layer, using the available metadata, dynamically controls and accesses the storage layer, as can be seen in Figure 6. With all this in mind, we may now consider that the storage is fully virtual and that it is, in some way, hidden from the applications and developers. The storage layer, which encompasses a one-way connection to the services layer and uses some “cloud-computing” architectural principles, fulfills all digital data preservation requirements [8] and may be chosen to implement any kind of digital library framework.

C. Unsolved issues

Of course, merging these two technologies is not only about benefits. Several problems may arise when designing “cloud libraries”. Whatever framework is developed, it will require a specific kind of fully automated data and service integration [15]. This requirement alone increases the system’s complexity, as the integration issue still does not have a simple solution. The problem appears when integrating heterogeneous data or heterogeneous services: there are a large number of Service Oriented Architectures and there is no standardized way to connect them all. A deeper study of this subject is out of this paper’s scope; however, it is important to retain that service integration is an unsolved issue. Another important issue is the “cloud library” control layer. Despite the fact that there is metadata containing all the relevant information, such as geographical information and technical specifications about the storage servers, controlling the GRID, the virtualization servers, the service layer, the storage layer and the geographic distribution is an extremely complex software engineering task which surely involves a huge system model. Copyright and other legal issues also have to be considered when designing these types of systems, as a great part of the data may need particular licenses to be displayed and other data has to be modified. Considering a medical imaging server, all the patient data has to be anonymized before being made public, and will only become public if the patient explicitly allows it. One final problem that can be pointed out is the novelty of these solutions. Being a state-of-the-art technology, “cloud-computing” does not yet have a sustained community of developers. This may be a major setback if one wishes to have a functional digital library in a short period of time.

IV. SUMMARY

Web3.0 is right around the corner. In spite of being the future, no one really knows exactly what it will be. What we may be sure of is that “cloud-computing” is here to stay and will be a part of it. We are currently witnessing a noticeable shift in the Internet paradigm: the Internet is no longer a place restricted to entertainment or leisure. Applications are taking over the Web, and application suites like Microsoft Live or Google Apps already offer a vast set of functionalities. A first approach to a complete web operating system has been given by applications like EyeOS, which offers a traditional desktop environment in the user’s browser. However, what is really making the difference are GRID and virtualization technologies. Merging both concepts, and the subsequent architectures, originates an architecture capable of offering complete virtualized operating systems, not just simple web applications. These “cloud operating environments” may be traditional desktop operating systems – Windows, Linux – installed on a virtual machine, or they may be the next generation of operating systems that only work in this kind of architecture. The vast features that “cloud-computing” has to offer can be specifically applied to digital libraries. From this alliance, “cloud libraries” may be born, replacing the old systems. Digital library architectures may still be improved using concepts from other systems. Enhancements can be made in the amount and quality of existing services.

EyeOS: Web Desktop, Web OS - http://eyeos.org


Further developments may also be made in the digital data storage components using “cloud-computing” ideas. One of the main concepts is distribution, the main foundation of “cloud-computing”. This distribution ensures access throughout the digital world and fosters the development of new services and applications. It also contributes to a new type of collaborative development and application sharing among the community of developers and end users. “Cloud-computing” may be considered an evolution of GRID computing. This fundamental aspect leverages the use of cloud architectures to prevent digital storage threats. Problems like format obsolescence, heterogeneity, availability, consistency or integrity are dealt with within the cloud itself. There is no need to plan a specific component to protect the architecture against each threat, because all the proposed “cloud-computing” architectures rely on hardware and software layers that can easily respond to any kind of threat.

ACKNOWLEDGMENT

Pedro Lopes wishes to thank Professor Joaquim Arnaldo Martins for the support and clear insights on digital libraries and for reviewing this paper. Pedro Lopes also thanks Professor José Luís Oliveira, who offered the opportunity to participate in the Informatics Engineering Doctoral Program.

REFERENCES

[1] M. A. Vouk, "Cloud computing - Issues, research and implementations," in Information Technology Interfaces, 2008. ITI 2008. 30th International Conference on, 2008, pp. 31-40.
[2] P. Liang, S. See, J. Yueqin, S. Jie, A. Stoelwinder, and N. Hoon Kang, "Performance evaluation in computational grid environments," in High Performance Computing and Grid in Asia Pacific Region, 2004. Proceedings. Seventh International Conference on, 2004, pp. 54-62.
[3] F. Nadeem, M. M. Yousaf, and M. Ali, "Grid Performance Prediction: Requirements, Framework, and Models," in Emerging Technologies, 2006. ICET '06. International Conference on, 2006, pp. 695-702.
[4] W. Chen, H. Lu, L. Shen, Z. Wang, N. Xiao, and D. Chen, "A Novel Hardware Assisted Full Virtualization Technique," in Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for, 2008, pp. 1292-1297.
[5] S. J. Vaughan-Nichols, "New Approach to Virtualization Is a Lightweight," Computer, vol. 39, pp. 12-14, 2006.
[6] L. Lizhen, H. Guoqiang, S. Xuling, and S. Hantao, "Metadata Extraction Based on Mutual Information in Digital Libraries," in Information Technologies and Applications in Education, 2007. ISITAE '07. First IEEE International Symposium on, 2007, pp. 209-212.
[7] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 2001, pp. 34-43.
[8] M. Cabral, "Distributed Data Storage for Digital Preservation," Lisboa: Instituto Superior Técnico, 2008.
[9] G. Antunes, "GRITO: Utilização de clusters GRID para um sistema de preservação digital," Lisboa: Instituto Superior Técnico, 2008.
[10] G. Antunes, "Data Grids for Digital Preservation," Lisboa: Instituto Superior Técnico, 2008.
[11] P. Lopes, J. Arrais, and J. L. Oliveira, "Dynamic Service Integration using Web-based Workflows," in 10th International Conference on Information Integration and Web Applications & Services, Linz, Austria, 2008, pp. 622-625.
[12] M. Fernandes, P. Almeida, J. A. Martins, and J. S. Pinto, "A Digital Library Framework for the University of Aveiro," Aveiro: University of Aveiro, 2008, pp. 111-123.

[13] J. Arrais, J. Rodrigues, and J. Oliveira, "Improving Literature Searches in Gene Expression Studies," in 2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics (IWPACBB 2008), 2009, pp. 74-82.
[14] Y. Lean, W. Shouyang, L. Kin Keung, and W. Yue, "A framework of Web-based text mining on the grid," in Next Generation Web Services Practices, 2005. NWeSP 2005. International Conference on, 2005, 6 pp.
[15] P. Lopes, "Service Integration for Knowledge Extraction," M.Sc. thesis, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, 2008.

