PhD Thesis Proposal - Service Composition in Biomedical Applications


Service Composition in Biomedical Applications
PhD Thesis Proposal

Pedro Lopes
Research Supervisor: José Luís Oliveira




Abstract

The evolution of information technologies has raised many challenges throughout the years. Supported by the exponential growth of the Internet, the current critical issues are related to the tremendous amount of information available online. This immense quantity of information increases heterogeneity and makes it overwhelmingly difficult to find information of certified quality. The major strategic approach to this problem is to develop integration applications that offer access to a multitude of distributed online resources in a single workspace. Integration tasks are often quite complex and require the manual implementation of software tools that connect distinct applications. Hence, it is necessary to concentrate efforts on the development of protocols and technologies that enable resource description and promote the design of interoperable software. These advances are materialized in the phenomenon of the semantic web. This higher level of intelligence in web applications can only be reached if developers adopt standard ontologies and describe their resources correctly. The research conducted during this doctorate aims to investigate new software integration frameworks and novel implementation strategies that enhance the development of next-generation web applications in the life sciences context.



Table of Contents

Abstract
Table of Contents
Acronym List
1 Introduction
   1.1 Objectives
   1.2 Structure
2 Background
   2.1 Problems and Requirements
      2.1.1 Heterogeneity
      2.1.2 Integration
      2.1.3 Interoperability
      2.1.4 Description
   2.2 Technologies
      2.2.1 Online resource access
      2.2.2 Web Services
      2.2.3 GRID
      2.2.4 Semantic Web
   2.3 Summary
3 Approach
   3.1 Solutions
      3.1.1 Static applications
      3.1.2 Dynamic applications
      3.1.3 Meta-applications
   3.2 Bioinformatics
      3.2.1 Databases
      3.2.2 Service Protocols
      3.2.3 Integration Applications
   3.3 Summary
4 Work Plan
   4.1 Objectives
   4.2 Calendar
   4.3 Publications
5 Implications of Research
References



Acronym List

API – Application Programming Interface
BPM – Business Process Management
CSS – Cascading Style Sheet
CSV – Comma-separated Values
DBMS – Database Management System
ESB – Enterprise Service Bus
EU-ADR – European Adverse Drug Reaction Project
FTP – File Transfer Protocol
GEN2PHEN – Genotype-to-Phenotype: A Holistic Solution Project
GUI – Graphical User Interface
HGP – Human Genome Project
HTML – Hypertext Markup Language
HTTP – Hypertext Transfer Protocol
HVP – Human Variome Project
JSON – JavaScript Object Notation
LSDB – Locus-specific Database
NAS – Network-Attached Storage
OASIS – Organization for the Advancement of Structured Information Standards
OQL – Object Query Language
OWL – Web Ontology Language
OWL-S – OWL Semantics
RDF – Resource Description Framework
REST – Representational State Transfer
RIA – Rich Internet Applications
SAN – Storage Area Network
SAWSDL – Semantic Annotations for WSDL
SOA – Service Oriented Architecture
SOAP – Simple Object Access Protocol
SPARQL – SPARQL Protocol and RDF Query Language
SQL – Structured Query Language
UDDI – Universal Description, Discovery and Integration
URI – Uniform Resource Identifier
VQL – Visual Query Language
W3C – World Wide Web Consortium
WSDL – Web Service Description Language
WSDL-S – WSDL Semantics
WWW – World Wide Web
XML – Extensible Markup Language
XMPP – Extensible Messaging and Presence Protocol
XSD – XML Schema Definition



1 Introduction

Computer science has been a constantly evolving field since the middle of the 20th century. More recently, this evolution has been supported by the massification and growing importance of the Internet. World Wide Web innovations led to the appearance of various applications and software frameworks (and their collaboration and communication toolkits) such as Google, YouTube, Facebook or Twitter. Maintaining these novel applications requires high computer science expertise and software engineering skills, as they must support millions of users, millions of database transactions and worldwide deployment in real time.

The emergence of these web applications caused a shift in the Internet paradigm. Nowadays, qualified technical staff are no longer the only ones able to publish content online: anyone with minimal experience can create a blog, publish a video or connect with friends in a social network. Along with these Web 2.0 applications, it is also important to note the appearance of several remarkable software development toolkits that eased and sped up the process of planning, executing, testing and deploying applications online. The main result of this evolution is an immense set of online resources that have to be dealt with more efficiently. Despite the increase in the quantity of online information, its general quality has decreased. Even using search engines such as Google or Bing, it is very difficult to find information on a particular topic. Focusing our study on a single topic, such as entertainment, news or art history, we rapidly find several important databases, warehouses and service providers. Hence, the need for integration applications has risen and, consequently, the need for interoperable software. Therefore, a new challenge is posed to computer science experts: describe and give context to any online available resource in order to enhance and ease the development of interoperable software. This requires that software engineers invest effort in ontologies and semantic description techniques, which are crucial for an evolution to the next level of the Internet: the semantic and intelligent web.

The advances of the Human Genome Project revolutionized the life sciences research field. This project generated a tremendous amount of genomic data that required software tools for a correct analysis and exploration of the decoded genome sequences. From this demand, bioinformatics was born. The successful efforts of the Human Genome Project originated several other projects that fostered the chaotic appearance of various online databases and services. Subsequently, this resulted in an exponential increase of heterogeneity in the bioinformatics landscape. The Human Genome Project also promoted a synergy between genomics and medicine, where a new level of challenges demanding computer science expertise has arisen. Computer science plays a key role in the evolution of life sciences research and, accordingly, the life sciences are a perfect scenario for innovation in computer science. Ongoing efforts have the main purpose of taking the computer science expertise gained in web application development and applying it to the development of next-generation bioinformatics web applications.

The research conducted in this doctorate envisages the design of a software framework, combined with various implementation strategies, that can prepare the bioinformatics field for the next step in the evolution of web-based applications. To attain this goal, it is necessary to study semantic resource description techniques and the adequacy of service composition as a strategy for the dynamic integration of interoperable software.

1.1 Objectives

The research conducted in this doctorate should, above all, lead to innovative developments in the fields of work and represent a valuable addition to the general knowledge in our areas of interest, mainly computer science and bioinformatics. The main objectives behind this research are as follows:

• Study, analyse and explore the life sciences research field in order to obtain a deep understanding of the problems, challenges, state-of-the-art applications and ongoing research.
• Perform a system and requirements analysis that provides a comprehensive description of the software complexities, required features, data models and implementation details, as well as a careful definition of software-related purposes and software evaluation criteria.
• Develop a consistent software framework that fulfils the initial system requirements and features. This framework must encompass several software tools, ranging from desktop to web applications and from databases to remote APIs.

1.2 Structure

The remainder of this thesis proposal is divided into four distinct sections. Section 2 contains a comprehensive background analysis and the contextualization of this research in the life sciences field, focused on bioinformatics and its inherent problems and requirements, as well as current technologies and protocols. Section 3 contains a detailed overview of several projects and software frameworks from both the bioinformatics and the generic computer science research fields. These solutions are presented as success cases that represent the state-of-the-art in the area. Next, Section 4 presents the work plan, composed of a calendar estimation for the four years of research and our publication goals. Finally, Section 5 presents some perspectives on the implications of our research for the computer science and bioinformatics fields.



2 Background

Bioinformatics is emerging as one of the fastest growing scientific areas of computer science. This expansion was fostered by the computational requirements raised by the Human Genome Project [1]. HGP efforts resulted in the successful decoding of the human genetic code. HGP history starts in the middle of the 20th century, with the involvement of the USA Department of Energy in the first studies analyzing the effects of nuclear radiation on human beings. However, it took the DOE about 30 years, until circa 1986, to propel and initiate the Human Genome Project. The ultimate project goals were as bold, audacious and visionary as NASA's Apollo program. HGP's main goal was to decode the "Book of Life" in its entirety. Moreover, this knowledge would be the basis of a new generation of tools able to identify and analyze a single character change in the sentences that compose this book. Although HGP was an ambitious project, results appeared sooner than expected. This was the outcome of a deep collaboration with computer scientists, which leveraged the deployment of novel software and hardware tools that aided biologists' sequence decoding tasks. This joint effort between two large research areas, life and computer sciences, gave birth to a new discipline named bioinformatics.

The Human Genome Project brought about a variety of benefits in several fields. Remarkable discoveries in sequence decoding fostered DNA forensics, genetic expression studies and drug advances, and improved several other fields such as molecular medicine, energy and environment, risk assessment or bioarchaeology/evolution studies. By the time the Human Genome Project ended, ahead of schedule, the availability of the human genome and other genome sequences had revolutionized all biomedical research fields [2]. Several projects were started, riding on HGP's success. These new projects use scientific discoveries and technological advances generated by HGP in heterogeneous scenarios to obtain new relevant information. On one hand, we have smaller projects focused on specific genomic research [3, 4]. On the other hand, we have larger projects that span several institutions and cross various physical borders. One of these projects is the Human Variome Project [5, 6]. HVP follows directly in HGP's footsteps and envisages complementing the latter's discoveries with a new level of knowledge that is both wider (covering more life sciences topics) and deeper (more detail in each topic). HVP's main goals reflect the computational advances originated in HGP: the life sciences goals are tied to software and hardware developments, with particular focus on web-based applications and distributed infrastructures. The general HVP goal is to collect and curate all human variations – changes in our genetic code – associated with human diseases that develop from specific mutations. This wide purpose is composed of smaller goals that focus on the development of software tools to aid this process and on a set of guidelines and protocols to promote active developments in this field. These dynamic developments are only possible with new communication and collaboration tools that can help break the physical barriers between work groups and the logical barriers between scientific areas such as biomedicine and computer science.

At a European scale, there are also major projects with ongoing research in the life sciences fields. Projects such as EU-ADR or GEN2PHEN have a strong involvement from the computer science community, and their final outcome is intended to be a large collection of software frameworks and applications. With these contemporary projects, we are witnessing a growing intertwining of computer science and life sciences research. From the information technologies point of view, biology and biomedicine pose several challenges that will require a modernization of software applications and the progressive development of novel application strategies. That is, bioinformatics is the perfect real-world scenario to nurture progress in various computer science research areas, triggering the resolution of problems related to the heterogeneity, integration, interoperability and description of online resources.

The challenge

Our research is directly connected to the Genotype-to-Phenotype: A Holistic Solution project (GEN2PHEN). This project is focused on the development of tools that will connect online life sciences resources containing information spanning from the genotype – the human genetic sequences – to the phenotype – the human visible traits, such as hair colour or the predisposition to a specific disease. Implicit in this purpose is the improvement of personalized medicine. This research field was born with the Human Genome Project and is sustained by areas like gene sequencing and expression, genotyping, SNP mapping and pharmacogenomics (Figure 1). Personalized medicine is focused on the selection of the most adequate treatment for a patient according to his clinical history, physiology and genetic profile and the molecular biology of the disease [7].

Figure 1 – Personalized Medicine applications aim to integrate data from various distinct life sciences research topics

In the future, personal electronic health records (EHR) may also contain the genetic information required for a fit treatment. Two research directions will generate the data that will feature in EHRs in the future: pharmacogenomics and gene expression. Pharmacogenomics studies variability in drug response, which comprises drug absorption and disposition, drug effects, drug efficacy and adverse drug reactions [8]. Gene expression profiling of diseases provides new insights into the classification and prognostic stratification of diseases, based on molecular profiling originated in microarray research [9, 10]. Both these fields will generate a tremendous amount of heterogeneous data that needs to be integrated accurately in diverse systems.

This data is made available through various types of online resources. Connecting these online resources – public databases, services or simply static files – increases the complexity of the implicit integration tasks. The issues that arise revolve around heterogeneity, integration and interoperability. Solving these problems is not trivial and, despite the fact that there are several ongoing research projects in this area, computer science researchers have not yet discovered an optimal solution. Current research trends are focused on using semantic resource descriptions to empower autonomous communication between heterogeneous software. This approach mainly adopts service composition strategies to improve integration and interoperability in existing software frameworks. Service composition (and, subsequently, service oriented architectures) has proven to be the ideal scenario to materialize the benefits obtained from semantic resource description, as it provides a solid foundation for standardized communication between distinct software elements.

Figure 2 shows a general overview of this doctorate's research work. The main practical purpose is to enhance the development process of Rich Internet Applications (RIA) that will accomplish the GEN2PHEN project's goals. GEN2PHEN objectives regarding integration and interoperability between online resources can be encompassed in a more generic computer science category: these problems are not specific to the life sciences area; they are common to several research areas that require, in some manner, involvement from the computer science community. To achieve the desired levels of integration and interoperability we will focus our research on the study and improvement of service composition scenarios. Service orchestration, service choreography and mashups (particularly workflows) will be studied in detail. As previously mentioned, the inclusion of semantic resource descriptions is crucial to the successful creation of service composition strategies and, therefore, this area will also be covered thoroughly during the conducted research.

Figure 2 – Relations between the requirements arising from the GEN2PHEN project's goals and the set of computer science concepts required to fulfil those requirements



2.1 Problems and Requirements

Research in the life sciences area poses many problems and requirements. Among these, the key set is composed of four distinct topics: heterogeneity, integration, interoperability and description. Heterogeneity relates to the marked distinctions between the access methods of the various types of online resources. Integration refers to the centralization and publication of distributed resources in a single entry point. Interoperability is the ability of a piece of software to communicate autonomously with external software. To overcome these problems we need to rely on novel techniques, mostly semantic resource description strategies and ideas. Next, these topics are explained in detail in a generic computer science context.

2.1.1 Heterogeneity

Online resource heterogeneity is one of the main research problems studied over the last few years, and its relevance keeps growing, triggered by the constant evolution of the Internet and the increasing ease of publishing content online. We can classify online resource heterogeneity into five distinct groups, varying according to the complexity of solving them and their involvement at the hardware/software levels (Figure 3).


Hardware-related issues arise when dealing with physical data storage (Figure 3 – 1). For instance, in a medical image integration application, it may be required that image backups stored on tapes are integrated in the system, as well as the images on the main facility's web server. In this scenario, the integration setup would be considerably complex: the implementation would have to encompass both tape access methods and web server access methods, which are quite distinct. Another complex scenario would involve integrating information that is available on a company's FTP server and on its Storage Area Network (SAN) or Network-Attached Storage (NAS) facility. Once again, the solution would have to encompass distinct information access methods in a single environment, increasing the overall difficulty of implementing such a system.

When dealing with file access in any storage facility, we may have logical storage problems (Figure 3 – 2). Content can be stored in a relational database, a simple text file or a binary file, among others, and these formats are accessed through entirely different interfaces. For instance, to integrate data stored in a Microsoft SQL Server 2008 database in a Java application, one would need the most recent JDBC connector. If, in addition, we required a connection to a MySQL database, we would need to add a new connection driver and implement several distinct methods. We can increase the complexity of this system even further by adding content that is stored in a binary file. This would create the need for a new set of access methods, completely different from the relational database ones, resulting in a scenario of great complexity that requires a large collection of methods.

The next level where heterogeneity can be a problem is the data format level (Figure 3 – 3). Data stored in the same physical format can be stored with a distinct syntax. Although the evolution of programming languages has improved access to distinct file formats, reading a simple text file or an HTML file are operations that require different strategies and methods. A simple scenario could be the integration of several accounting results offered in CSV format, Excel files and tabular text files: to successfully integrate these files, developers must implement distinct access methods for the three logical formats.

Moving deeper into the software layer, we reach the data model level (Figure 3 – 4). At this level, heterogeneity issues arise when files are structured differently or do not obey the same ontology. The difficulty of solving this issue was greatly reduced with the appearance of the XML standard, and most modern applications rely on it. Despite having normalized the process of reading and storing information, XML allows an infinite number of valid distinct structures, which differ from application to application. The simple scenario of describing a person's name can result in several hierarchical configurations: we can have a name element with two sub-elements, first and last, or we can just define a fullname element. If we also want to store the person's name initials or her nickname, the number of solutions for such a small piece of information grows even further. Similarly, the same concept may be stored in equally diverse ways in a relational database: although the logical storage is the same, the storage model may differ, creating the need for relation and concept mappings that must be developed and implemented by researchers. Considering that we are integrating data for a well-defined scientific topic, that topic probably has one or more ontologies that define logical structures and relations between the elements of a thesaurus. The issues that arise in this specific scenario are driven by the fact that there is not a single ontology for a specific area. Usually, there are several ontologies that define the same content in distinct and non-interoperable manners. Once again, heterogeneity has to be solved with information and relation mappings that can correctly transpose information structured following ontology A into ontology B. These mappings are quite complex and traditionally require some kind of human effort to succeed.

Finally, we address access method heterogeneity (Figure 3 – 5). Web services have evolved into the primary method for remote data access, with standard protocols and data exchange formats. Nevertheless, web services may be divided into HTTP web services (REST or SOAP) and XMPP web services (with the IO Data extension). REST web services are much simpler: they can be easily accessed through HTTP requests and may return data in a customized format (which can differ from application to application). On the other hand, SOAP web services rely on WSDL to perform data exchanges. This means that an application using this strategy must follow the applicable standards, resulting in more entangled underpinnings. XMPP web services are based on the Extensible Messaging and Presence Protocol, a message exchange protocol widely used by instant messaging applications. These three types of services are explained in detail further in this document. In addition, the remote APIs may be implemented in distinct languages, and it may be required that the platform merges content from both local and remote data sources. The resulting scenario involves the development of separate sets of methods and strategies.
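To make the data model issue concrete, the short sketch below (plain Java with the standard DOM parser; the two XML strings are illustrative, not taken from any real resource) normalizes the two person-name structures described above into a single representation – exactly the kind of mapping code that every structural variation forces an integration developer to write.

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class NameHeterogeneity {

        // Two valid XML structures for the same concept, as described above.
        static final String SOURCE_A =
            "<person><name><first>Ada</first><last>Lovelace</last></name></person>";
        static final String SOURCE_B =
            "<person><fullname>Ada Lovelace</fullname></person>";

        // Normalizes either structure into a single full-name representation.
        static String extractFullName(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            if (root.getElementsByTagName("fullname").getLength() > 0) {
                return root.getElementsByTagName("fullname").item(0).getTextContent();
            }
            // Structure A: rebuild the full name from its two sub-elements.
            String first = root.getElementsByTagName("first").item(0).getTextContent();
            String last = root.getElementsByTagName("last").item(0).getTextContent();
            return first + " " + last;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(extractFullName(SOURCE_A));   // Ada Lovelace
            System.out.println(extractFullName(SOURCE_B));   // Ada Lovelace
        }
    }

Multiplied over dozens of resources and formats, this per-structure mapping effort is precisely the heterogeneity cost discussed in this section.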



Figure 3 – Content heterogeneity organization according to hardware/software dependence and complexity

In summary, resource heterogeneity raises many difficulties in the development of novel information integration platforms. These issues can only be solved with some kind of human effort and require particular resource integration and interoperability strategies, detailed further in this document.

2.1.2 Integration

To deal with resource heterogeneity issues, or simply to centralize large amounts of distributed data in a single system, researchers have to develop state-of-the-art resource integration architectures. The main goal of any integration architecture is to support a unified set of features working over disparate and heterogeneous applications. These architectures will always require the implementation of several methods to access the integrated data sources. The heterogeneity may be located at any of the previously presented levels, which include software and hardware platforms, diversity of architectural styles and paradigms, content security issues or geographic location. In addition to these technical restrictions, there are also other hindrances such as enterprise/academic boundaries or political/ethical issues. Whether we are simply dealing with the integration of a set of XML files or with distributed instances of similar databases, the concept of resource integration will generically rely on hard-coded coordination methods to centralize the distributed information or to give the impression that the data is centralized. Several strategies for data integration can be used (Figure 4). These approaches differ mostly in the amount and kind of data that is merged in the central database. Different architectures will also have a different impact on the application's performance and efficiency.

Warehouse solutions (Figure 4 – A) consist in creating a large database that contains data gathered from several resources. The central, larger database – the warehouse – may consist of a mesh of connected repositories that the data access layer sees as a single database. The Database Management System (DBMS) is responsible for the management and maintenance of the warehouse. In terms of implementation, this model requires that a mapping is made from each data source to the central warehouse data model. Next, the content is moved entirely from its source to the new location. The final result is a new data warehouse where the content from the integrated data sources is completely replicated. This model raises several problems in terms of scalability and flexibility: the warehouse's size can grow exponentially and each database requires its own integration schema. This means that for each distinct database, developers have to create a new set of integration methods, resulting in a very rigid platform. Despite these issues, this technique is very mature and a considerable amount of work has already been done to improve warehouse architectures. Nowadays, the debate is focused on enhancing warehouse integration techniques [11] and solving old problems with state-of-the-art technologies [12, 13].

Another widespread strategy involves the development of mediators – a middleware layer – connecting the application to the data sources. This middleware layer enables a dynamic customization of the user queries performed at the centralized entry point, extending their scope to several databases previously modelled into a new, virtually larger database. Kiani and Shiri [14] describe these solutions, and a good example is DiscoveryLink [15]. Mediator-based solutions are usually constrained by data processing delays: they require real-time data gathering, which can be bottlenecked by the original data source. Additionally, the gathered content also has to be processed to fit the presentation model, further compromising the overall efficiency of the system represented in Figure 4 – B.

Finally, link-based resource integration (Figure 4 – C) consists of aggregating, in a single platform, links to several relevant resources throughout the Web. This is the most widely used integration model due to the simplicity of collecting and showing links related to a certain subject. However, inherent in this simplicity are several drawbacks, especially regarding the limitations imposed by the fact that there is no real access to the data, only to their public URLs. Most modern resources are dynamic, which means that content may be generated in real time. Also, in the scientific research area, new data emerges daily; therefore, the system requires constant maintenance in order to keep it updated with the area's novelties. The link-based integration strategy has the major drawback of restraining access to the original resources: the integration application acts as a proxy of the integrated resources and therefore hides the original resource. Without this access, the range of features that can be implemented and the scope of the features offered to users are reduced. Entrez [16] is a link-based integration application that does not show the integrated resources directly; to access the original content source, users must analyze the data and follow a hyperlink highlighted in certain identifiers. DiseaseCard's approach [17] is more direct: users can view and navigate inside the original resources' interfaces, which are part of DiseaseCard's layout.
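As an illustration of the mediator model, the sketch below (plain Java; the wrapper interface and the in-memory sources are assumptions made for this example, not part of any real system) shows a central entry point fanning a user query out to several wrapped resources and merging the answers in real time – the step where this model pays its latency cost.

    import java.util.ArrayList;
    import java.util.List;

    // Each integrated resource sits behind a wrapper answering the same abstract query.
    interface SourceWrapper {
        List<String> query(String term);
    }

    class Mediator {
        private final List<SourceWrapper> sources = new ArrayList<>();

        void register(SourceWrapper source) {
            sources.add(source);
        }

        // The centralized entry point: one user query, several underlying sources.
        List<String> query(String term) {
            List<String> merged = new ArrayList<>();
            for (SourceWrapper source : sources) {
                merged.addAll(source.query(term));   // real-time gathering: bounded by the slowest source
            }
            return merged;
        }

        public static void main(String[] args) {
            Mediator mediator = new Mediator();
            mediator.register(term -> List.of("source A record for " + term));
            mediator.register(term -> List.of("source B record for " + term));
            System.out.println(mediator.query("BRCA2"));
        }
    }
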

Figure 4 – Data integration models categorized according to their relation with the integration application and the integrated online resources

Despite the fact that these approaches cover almost all possible solutions for data integration, there are many problems that have not yet been solved. Figure 5 shows a comparison between data integration models, highlighting the main advantages and disadvantages of each. After a careful analysis of these models, we can conclude that the best option is to create a hybrid solution that is capable of coping with the main disadvantages of the three strategies while also taking advantage of their main benefits. Arrais studied this scenario and used this strategy for the integration of heterogeneous data sources in GeneBrowser [18].

Figure 5 – Comparison between the studied resource integration models highlighting the respective advantages and disadvantages

The development of hybrid approaches has gained momentum in recent years, especially with the introduction of novel data access techniques like remote APIs in the form of web services. This trend consists in making resources available as a service that can be executed by any other software system. Service oriented architectures (SOA) rely on a paradigm shift in integration application development, based on "everything-as-a-service" ideals. These ideals state that anything – whether a simple database access or a complex mathematical equation – can be requested or solved by a mere access to a predefined URI. These are the main principles that define service-oriented architectures [19, 20]. In SOA, any kind of software module can be considered a service and be integrated in any kind of external application through the definition of a standardized communication protocol.

Regardless of the strategy chosen to integrate a collection of heterogeneous resources, there are several concerns that should be taken into account [21]: application coupling, intrusiveness, technology selection, data format and remote communication.

Integration ecosystems often require that distinct applications call each other. Despite being disguised as local calls by the integration engine, these remote calls, available in the majority of programming languages, are very different because they resort to network capabilities. Traditional (and erroneous) distributed computing assumptions – zero network latency or secure and reliable communication – must be measured and shunned. Remote communication concerns are reduced with the adoption of asynchronous communication techniques and support for recovering from communication errors, thus reducing susceptibility to network failures.

One of the main integration issues is definitely the data format. Integrated applications must adopt a unified data format. Traditionally, this requirement is impossible to fulfil because some of the integrated data sources are closed or considered legacy. In these scenarios, the solution consists in creating a translator that maps the distinct data formats to a single model. In this case, issues may arise when data formats evolve or are extended.

Intrusiveness should be one of the main concerns when developing integration applications. The integration process should not impose any modifications on the constituent applications. That is, the integration strategy should operate without any interference in the existing applications, and both the integrator and the integrated application should be completely independent. Nevertheless, sometimes major changes are necessary to increase integration quality.
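The sketch below illustrates the asynchronous, failure-aware call style discussed above (plain Java; remoteLookup() is a hypothetical stand-in for any remote resource API): the integration layer does not block while the remote resource answers, and a failed call degrades to an explicit fallback value instead of propagating a network error.

    import java.util.concurrent.CompletableFuture;

    public class AsyncIntegrationCall {

        // Hypothetical stand-in for a remote resource call that may be slow or may fail.
        static String remoteLookup(String id) {
            if (Math.random() < 0.3) {
                throw new IllegalStateException("simulated network error");
            }
            return "record for " + id;
        }

        // Asynchronous invocation: the caller is not blocked while waiting, and a
        // failure degrades to an explicit fallback instead of a propagated exception.
        static CompletableFuture<String> lookup(String id) {
            return CompletableFuture.supplyAsync(() -> remoteLookup(id))
                .exceptionally(error -> "<" + id + " unavailable: " + error.getMessage() + ">");
        }

        public static void main(String[] args) {
            CompletableFuture<String> pending = lookup("BRCA2");
            // Other integration work could proceed here while the call is in flight.
            System.out.println(pending.join());
        }
    }
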



Application coupling is directly connected with intrusiveness. A good software development practice is "low coupling, high cohesion" [22], and this ideal is also applicable to integration strategies. High coupling results in strong dependencies between applications, reducing the possibility of each application evolving individually without affecting the others. The optimal result would be resource integration interfaces that are specific enough to implement the desired features and generic enough to allow changes to be implemented as needed. The implicit complexities that arise when dealing with online resource integration require large efforts and expertise to be overcome. Like any other issue, integration is firmly related to the scientific area in question and, with this in mind, the adopted strategy or model must take into account several variables present in the environment where it will be implemented.
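A minimal sketch of such an interface is shown below (plain Java; the method names and the flat-file adapter are assumptions made for illustration): the integrator depends only on this small contract, so each integrated resource can change its internal access method without affecting the platform.

    import java.util.List;
    import java.util.Map;

    // A deliberately small integration contract: specific enough to be useful,
    // generic enough to survive changes inside the integrated resources.
    interface IntegratedResource {
        String identifier();                           // stable resource identifier
        Map<String, String> metadata();                // minimal self-description
        List<Map<String, String>> search(String term); // unified record view
    }

    // Each concrete adapter hides one heterogeneous access method (JDBC driver,
    // REST call, flat-file parser, ...) behind the shared contract.
    class FlatFileResource implements IntegratedResource {
        public String identifier() { return "local-csv"; }
        public Map<String, String> metadata() { return Map.of("format", "csv"); }
        public List<Map<String, String>> search(String term) {
            return List.of(Map.of("term", term, "source", identifier()));
        }
    }
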

2.1.3 Interoperability

Along with integration comes interoperability. Integration deals with the development of a unified system that includes the features of its constituent parts. Interoperability, on the other hand, deals with single software entities that can be easily deployed in future environments. This means that interoperability is a software feature that facilitates integration and collaboration with other applications. ISO/IEC 2382-01, Information Technology Vocabulary, Fundamental Terms, defines interoperability as follows: "The capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units". Interoperable systems can access and use parts of other systems, exchange content with other systems and communicate using predefined protocols that are common to both systems. Interoperability can be achieved at several distinct levels, as pointed out in Tolk's work [23]. For our research, the essential levels are the ones that encompass syntactic and semantic interoperability.

Syntactic software interoperability can be defined as the degree to which multiple software components can interact regardless of their implementation programming language or software/hardware platform. It may be achieved with data type and specification-level interoperability. Data type interoperability consists in distributed, distinct programs supporting structured content exchanges, whether through indirect methods – writing to the same file – or direct methods – an API invoked within a computer or across a network. Specification-level interoperability encapsulates knowledge representation differences when dealing with abstract data types, thus enabling programs to communicate at higher levels of abstraction – at the web service level, for instance.

Semantics is a term that usually refers to the meaning of things. In practice, semantic metadata is used to specify the concrete description of entities. These descriptions and their relevance are detailed further in this document; in short, they intend to provide contextual details about entities: their nature, their purpose or their behaviour, among others. Hence, semantic software interoperability represents the ability of two or more distinct software applications to exchange information and understand the meaning of that information accurately, automatically and dynamically. Semantic interoperability must be prepared in advance, at design time, with the purpose of predicting the behaviour and structure of the interoperable entities. According to Tolk, there are seven distinct levels of interoperability, measured in the "Levels of Conceptual Interoperability Model" (Figure 6).



Figure 6 – Levels of conceptual interoperability model defined by Tolk

The highest level of interoperability is only attained when access to content and the usage of that content are completely automated. This is only possible when programming and messaging interfaces conform to standards with a consistent syntax and format across all entities in the ecosystem. The levels are as follows:

• Level 0 defines a stand-alone, independent system with no interoperability.
• Level 1 defines technical interoperability, characterized by the existence of a communication protocol that enables the exchange of information at the lowest digital level: bits.
• Level 2 deals with a common structure – a data format – for information exchange.
• Level 3 is achieved when a common exchange reference model exists, thus enabling meaningful data sharing.
• Level 4 can be reached when the independent systems are aware of the methods and procedures that each entity in the environment is using.
• Level 5 deals with the comprehension of state changes in the ecosystem that occur over time and the impact of these changes – at any level – on the system.
• Level 6 is achieved when the assumptions and constraints of the meaningful abstraction of reality are aligned. Conceptual models must be based on engineering methods, resulting in a "fully specified, but implementation independent model" [24].

2.1.4 Description

Having analyzed the problems regarding the integration of heterogeneous online resources, it is crucial to move our study to the solutions tested so far. The most extensively tested and applied solution to the integration and interoperability issues is resource description: the semantic web [25].



Any scientific research field deals with specific terminology associated with that particular area. For instance, researchers working with ancient history have a thesaurus of terms that is completely different from the one used in medicine: on one hand we have symbols, kings, religions and wars, and on the other hand we have diseases, symptoms or diagnostics. Therefore, it is of the utmost importance that researchers are aware of the ontology used in their research area. An ontology [26] defines the collection of terms, and the relations between terms, that are most adequate for a given topic. These relations, often designated axioms, establish connections between terms in the thesaurus that mimic the real world. For instance, in history studies, there could be an axiom relating the terms King and Prince, stating that a Prince is a son of a King. There can be an immense number of axioms relating terms, and by spreading these relations across the thesaurus we define an ontology.

Ontologies are the basis for the enhancements proposed by the semantic web. The semantic web's main goal is to enable autonomous interoperation between machines, based on the description of the content and services available on the Internet (Figure 7). Web 1.0 established the Internet as a set of Producers generating content for a large number of Consumers. The majority of online resources were created by a small group of technical staff entirely dedicated to web development and to adapting existing company strategies to the Internet era. With Web 2.0 we have witnessed a shift in the Producer-Consumer relation that dominated the Internet. Nowadays, Internet content is mostly published by end-users that previously were only Consumers. The frontier between the Consumer and Producer roles is blurred, as it is getting easier to publish content on the web thanks to Web 2.0 tools such as blogs, micro-blogs, media-sharing applications or social networks. In the future, the semantic web – the intelligent web – will include specific software that analyzes user-generated content and messages between users, searching for contextual information and improving everyday online tasks.

In order to make the semantic web possible, developers and researchers need to cooperate to achieve several goals. It is important to create and broadcast centralized ontologies for several areas of public interest and to promote the adoption of these ontologies by research groups and private companies. Nevertheless, this crucial step can only be taken if the cooperation efforts originate enhanced semantic technologies that ease the complex task of describing content. A description of these technologies is made further in this document. Whatever group of technologies we choose, the critical aspect is that developers must adapt their applications and prepare their research for semantic integration. Research groups working with state-of-the-art technologies must promote this difficult step, which will require deep changes in the developed applications. Only by promoting this usage can we foster the development of a new, more intelligent Internet.
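As a small illustration of how such an axiom can be expressed with semantic web technologies, the sketch below uses the Apache Jena RDF API (assumed to be available on the classpath; the namespace and property names are hypothetical) to state the King/Prince relation from the history example and serialize it as Turtle.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    // Sketch: one axiom of a hypothetical history ontology expressed as RDF.
    public class HistoryAxiomSketch {
        public static void main(String[] args) {
            String ns = "http://example.org/history#";   // hypothetical namespace

            Model model = ModelFactory.createDefaultModel();
            Resource king = model.createResource(ns + "King");
            Resource prince = model.createResource(ns + "Prince");
            Property isSonOf = model.createProperty(ns, "isSonOf");

            // The axiom from the text: a Prince is a son of a King.
            prince.addProperty(isSonOf, king);

            // Serialize the tiny ontology fragment in Turtle syntax.
            model.write(System.out, "TURTLE");
        }
    }
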



Figure 7 – Evolution of the Internet according to the improvements in the communication paradigms used

2.2 Technologies

We have described the main issues that arise when one wishes to develop centralized user interfaces delivering access to a wide range of online resources in a single environment. Resource heterogeneity, integration, interoperability and description often complicate developers' tasks and delay innovation in this research area. To overcome the arising issues, we can resort to several strategies that rely on distinct technologies. The evolution of these technologies has gained focus over the last few years, and their intertwining with the World Wide Web has proven to be extremely fruitful. We can organize these technologies into four distinct levels, according to their dependencies on each other (Figure 8). Online resource access services are located at the bottom, as they allow direct access to resources and empower developers to create the next level, which comprises web services, an indirect data access method. GRID technologies use web services and/or data access strategies to enhance both computing power and data access capabilities. Semantic technologies promote resource description; hence, they enable the fulfilment of integration and interoperability requirements.



[Figure 8 stack, bottom to top: Online resource access (command line: SQL, XQuery; web based: ASP, JSP, AJAX; interactive GUI) → Web Services (remote APIs such as DAS; WSDL + SOAP + UDDI; REST web services; XMPP) → GRID (computing: high processing power; data: large distributed datasets) → Semantic Web (URI + RDF + OWL + SPARQL; microformats; resource description)]

Figure 8 – State-of-the-art technologies, concepts and their respective dependencies

2.2.1 Online resource access

Online resource access services are responsible for encapsulating online resources and making them available to other systems. They allow access to local or relational databases, tools, file systems or any kind of external storage method. The encapsulation should be done using wrappers, to enhance and ease integration and interoperability. With this in mind, it is important to take into account some generic concerns. Performance is crucial, especially because the end-users of the system will want fast and responsive applications regardless of the operation they are executing. Performance can be optimized by reducing the amount of data sent across the network or by minimizing query interdependence, thus reducing latency. Usability is also essential in any modern system, and expressiveness should be enough to allow users to pose almost any query to the system. Usability and expressiveness depend on metadata, which should be carefully selected and constrained to the minimal information necessary to interpret what the wrapper is encapsulating. Researchers are constantly dealing with data located in distinct geographic locations, and most of the time they access this data from different computers. Preparing integration systems for distributed and remote access is therefore very important: access to distributed resources has to be transparent, and remote access to resources must be possible without friction.

Resource access services are presented to end-users in three distinct flavours: command-line access, web-based access and interactive visual access. Whether we are dealing with databases or an FTP server, there is always a text-based query language allowing access to resources. This language provides users with command-line access both to the resource structure and to the resource itself. There are several examples of command-line access languages: SQL for relational databases, OQL for object-oriented databases, XQuery for XML databases or shell scripting in Linux. The main problem with these languages is how difficult they are for end-users to learn. These languages were meant to be used by developers; the resource structure is therefore hidden, and the resource organization must be known before formulating a query.

Accessing online resources through web-based methods is an attempt to overcome the main problems of command-line usage. World Wide Web advances promoted and eased access to online resources and empowered the creation of novel applications. These new applications provide effective query forms and access to remote APIs that users can fill in or execute in order to access the resource. For instance, with web access, and if the original application provides the correct set of features, it is possible to access almost all the content available in a database. Nevertheless, this type of access is not perfect. On one hand, developers can hide the internal resource organization and only offer limited access to some datasets, according to the system's goals. On the other hand, this limits the amount of information users can extract, as they can only reach content that has explicit access methods. That is, end-users do not have the freedom to create their own data queries and cannot view the resource structure; they only get in contact with small views of a larger system.

The lesser-known resource access methods are interactive GUIs. These applications rely on a visual construction of queries, using a visual query language (VQL), which enables access to resources. Query by Example [27] was the first VQL proposed to query relational databases. Access queries are created by assembling distinct blocks, like LEGO pieces; these blocks represent tables, methods or constraints that are arranged in a logical visual order to mimic the access to the resource. Once again, this is a very restricted resource access method, mostly because it depends on a small set of blocks that can be arranged in a finite number of ways.
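To give a flavour of this block-assembly idea, the hypothetical builder below (plain Java; the class and method names are assumptions, not part of any real VQL tool) composes a query from a small, fixed set of blocks and derives the corresponding SQL – which also shows why only a finite family of queries can be expressed this way.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of assembling query "blocks" (table, constraints)
    // and deriving the textual query on the user's behalf.
    class QueryBlocks {
        private final String table;
        private final List<String> constraints = new ArrayList<>();

        private QueryBlocks(String table) { this.table = table; }

        static QueryBlocks from(String table) { return new QueryBlocks(table); }

        QueryBlocks where(String column, String operator, String value) {
            constraints.add(column + " " + operator + " '" + value + "'");
            return this;
        }

        // The finite set of blocks maps onto a finite family of SQL statements.
        String toSql() {
            String where = constraints.isEmpty()
                ? "" : " WHERE " + String.join(" AND ", constraints);
            return "SELECT * FROM " + table + where;
        }
    }

    // Usage: QueryBlocks.from("variant").where("gene", "=", "BRCA2").toSql()
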

2.2.2 Web Services

Web services [28] are, nowadays, the most widely used technology for the development of distributed web applications. The World Wide Web Consortium (W3C) defines a web service as a "software system designed to support interoperable machine-to-machine interaction over a network" [29]. This wide definition allows us to consider a web service as any kind of Internet-available service, as long as it enables machine-to-machine interoperability. Despite this all-embracing definition, we can divide existing web services into two main groups: HTTP-based web services – which encompass both generic web services following W3C and OASIS standards and application-specific REST web services – and XMPP-based web services.

REST web services are a minority in the web interoperability world, although they are emerging as a viable alternative to standardised web services. REST web services consist in simple web applications that respond to requests posted to an HTTP URL. Developers can configure this endpoint to respond with HTML, XML, JSON, CSV or, most simply, free text. With REST web services, the response structure and its inner format do not matter; the essential requirement is that the exchanged messages are understood by both parties in the exchange. This feature makes REST web services a lightweight and highly customizable approach for exchanges between machines [30]. Nevertheless, though this approach is more attractive to developers, it still lacks the robustness of a standards-based strategy.

Standardised web services have the main purpose of providing a unified data access interface and a constant data model for the data sources. Simple Object Access Protocol (SOAP) [31], Universal Description, Discovery and Integration (UDDI) [32] and Web Services Description Language (WSDL) [33] are the currently used standards, and they define machine-to-machine interoperability at all levels, ranging from the data transport protocol to the query languages used. Web service interoperation occurs among three different entities: the service requester, the service broker and the service provider (Figure 9). When a piece of software wants to access a web service, it contacts the service broker in order to search for the service that is most adequate to its needs. The service broker is in constant communication with the service provider and will provide the service requester with the data it needs to establish direct communication with the service provider. Communication with the service broker is done to exchange WSDL configurations. When the service requester knows which service to reach, it initiates a conversation with the service provider, exchanging the necessary messages in the SOAP format.
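In contrast with the brokered SOAP interaction shown in Figure 9, the REST style described above needs little more than a plain HTTP request. The sketch below uses the standard Java 11 HTTP client against a hypothetical URL; the response body is whatever format the provider chose, and the two parties only agree on its meaning informally.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal REST client: one GET request, no WSDL contract, no SOAP envelope.
    public class RestClientSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://example.org/genes/BRCA2?format=json")).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The provider decides the body format (XML, JSON, CSV, free text).
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }
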

Figure 9 – Web Service interaction diagram

SOAP is a protocol used on top of the traditional HTTP protocol and specifies the structure of the information exchanges used in the implementation of web services in computer networks. The message formats are defined in XML, and the protocol relies on underlying protocols for message negotiation and transmission. The SOAP standard defines a comprehensive, layered architecture where all the components required for a basic message exchange framework are defined: message format, message exchange patterns, message processing models, HTTP transport protocol bindings and protocol extensibility. Nevertheless, SOAP still requires a protocol to define its interface. This protocol is WSDL, which SOAP clients can read dynamically in order to adjust their inner message settings.

WSDL standardises the description of web service endpoints. This description enables the automation of communication processes by documenting (with accurate semantic descriptions) every element involved in the interaction, from the entities to the exchanged messages. In WSDL, a service is a collection of network endpoints capable of exchanging data. Their definition is separated into an abstract message definition and concrete network deployment data. The definition encompasses several components that are structured in order to facilitate communication with other machines and ease the readability of the web service by humans. Obviously, if we think of a complex web service, we realize that there are numerous data types that need to be described.


WSDL recognizes this need and can use XML Schema Definition (XSD) as its canonical type system. Despite this, we cannot expect this grammar to cover all possible data types and message formats in the future. To overcome this issue, WSDL is extensible, allowing the addition of novel protocols, data formats or structures to existing messages, operations or endpoints. A perfect example of WSDL extensibility is the inclusion of semantic annotations in WSDL. SAWSDL used WSDL-S as its main input [34] and is now a W3C recommendation. SAWSDL's purpose is to define the organization of the novel semantic structures that can be added to WSDL documents. These novel structures are mainly deeper descriptions of several traditional WSDL components, such as input and output messages, interfaces or operations. The relevance of describing content and services was mentioned previously; nevertheless, it is crucial to reinforce that annotating WSDL services will improve their categorization in a central registry, thus enhancing service discovery and composition tasks.

UDDI provides web service management on the web. As the name explicitly states, the purpose of UDDI is to provide an XML/SOAP-based framework for describing, discovering and managing services in the web services environment. UDDI usually offers a central registry with "publish and subscribe" features that allows the storage of service descriptions and detailed technical specifications about the web services. The storage mechanism relies once again on XML to define a metadata schema that can be easily searched by any discovery application. Standardising the web service registry has the main benefit of organizing the disordered web services world. UDDI promotes uniform patterns both for the internal organization of services and for their external presentation. Hence, it enhances the development of integration strategies and the management of access to distributed services in the web environment.

The Extensible Messaging and Presence Protocol (XMPP) is an open and decentralized XML routing technology that allows any entity to actively send XMPP messages to another [35]. XMPP works as a complete communication protocol, independent from HTTP or FTP for data transfer. An XMPP network is composed of XMPP servers, clients and services. XMPP is best known through the Jabber messaging framework; the Jabber ID uniquely identifies each XMPP entity. XMPP services are hosted on XMPP servers and offer remote features to other XMPP entities in the network, for instance XMPP clients. Being a messaging protocol, it has conventionally been used by Jabber and Google Talk. Nevertheless, a collection of XMPP Extension Protocols (XEPs) extends the initial core specification, widening the scope of XMPP into various directions, including remote computing and web services. Both HTTP and XMPP are used for content transfers; the main distinction between them is that XMPP requires an XML environment while HTTP supports any kind of unstructured information. An XEP, IO Data, was created to enable the dispatch of messages from one computer to another, providing a transport package for remote service invocation [36]. Despite being an experimental XEP, it already solves two primary issues: the unnecessary separation between the description (WSDL) and the actual SOAP service, and asynchronous service invocation.
The XMPP infrastructure can be used to discover published services [37], and being asynchronous implies that clients do not have to poll repeatedly for the status of the service execution; instead, the service sends the results back to the client upon completion.

Service Composition

Web service composition [38] defines the collection of protocols, messages and strategies that have to be applied in order to coordinate a heterogeneous set of web services and reach a given goal. However, the coordination mechanism is not complete if it does not offer a seamless and transparent integration environment to end-users. The underlying architecture of service composition scenarios requires the development of a composition engine that is able to coordinate the execution of the web service workflow, communicate with the distinct web services and organize the information flow between them. An architecture with such complexity relies on a customized semantic structure to describe the composition [30]. Traditionally, service composition is completely hard-coded: developers define a static composition to achieve the initial goals. Modern service composition scenarios, however, combine web services with semantic features. This combination enables automation in the web service composition engine: the web service workflow is established dynamically, and web service interoperability and execution are triggered automatically. In these scenarios, end-users only need to define the input and the desired output, and the system organizes itself to satisfy the original constraints.

Service composition can be applied in two distinct scenarios: service orchestration and service choreography [39]. These scenarios are very similar, and both can be applied to any kind of interoperable mesh of services. Service orchestration relies on a central web service controller to deal with the information workflow and web service interoperability; this main controller is the maestro of a service collection and organizes the services in order to solve the initial problem. Service choreography consists in the autonomous discovery of the best combination of services to attain a given goal. The benefits and drawbacks of these scenarios are those common to centralized and distributed architectures. Currently, developers tend to create service orchestration architectures: although more primitive, they are easier to implement and, in most cases, end-users need to retain some control over the system, becoming the maestros of a particular web service collection. Web service choreography scenarios are more recent, and their implementation is being eased by the latest developments in artificial intelligence and the semantic web. A minimal sketch combining this orchestration pattern with a service registry is presented at the end of this subsection.

Service Oriented Architectures

Service Oriented Architecture (SOA) is a modern architectural style for application deployment. The rationale behind these architectures is that applications connect to services rather than to other applications. Considering that every component an application requires can be regarded as an independent service, one can create an implementation and deployment strategy based on this paradigm. In traditional web application architectures (Figure 10 – A), the deployment can be decomposed into three generic layers: the presentation layer, the application layer and the data access layer. Each of these layers encompasses several programmatic components that permit stable communication with the upper and lower layers of the model. In SOA architectures (Figure 10 – B), the layers are independent: each layer component is independent from the remaining ones, and multiple applications can be composed by combining components belonging to each layer. This empowers two main concepts: reusable software, consisting of software components wrapped as services that can be used in a multitude of applications, and application composition, where applications can be built by combining a set of services like LEGO pieces.



Figure 10 – Distinguishing traditional architectures from SOA architectures

Nevertheless, having a set of interoperable services is not enough to implement a service oriented architecture. Two additional components are required: the Enterprise Service Bus (ESB) and the Registry. The ESB is a software architecture component that acts as a message centre inside the SOA; its main feature is message forwarding. In SOAs, the ESB is responsible for managing the message exchanges between the applications participating in the system. At a more abstract level, the ESB operates as a service proxy that interacts with and controls every service composing the system. The Registry is a central service repository that acts as a service broker. Its main purpose is to store service metadata that can be searched by other services, giving them the ability to find one another autonomously according to various criteria.

Service oriented architectures have gained popularity recently, following the "everything-as-a-service" trend, which has raised their importance in computer science, especially in web development. Web and distributed applications are a perfect scenario for the implementation of SOA: distributed and heterogeneous resources are common on the World Wide Web, and connecting them is a crucial task that can be significantly aided by a service oriented architecture.
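As a minimal sketch of the two SOA components just described, the following Python fragment combines an in-memory registry (the broker) with an orchestrating component that discovers services by category and chains their execution. All names, categories and service stubs are illustrative assumptions rather than parts of any existing ESB or registry product.

```python
# Sketch of two SOA building blocks: a registry storing service metadata and an
# orchestrator that locates services by category and chains their execution.
# Names, categories and the lambda stubs are illustrative assumptions.
from typing import Callable

class ServiceRegistry:
    """Central broker: services publish metadata, other components search it."""

    def __init__(self) -> None:
        self._services: dict[str, dict] = {}

    def publish(self, name: str, category: str, call: Callable) -> None:
        self._services[name] = {"category": category, "call": call}

    def find(self, category: str) -> list[Callable]:
        return [s["call"] for s in self._services.values()
                if s["category"] == category]

class Orchestrator:
    """The 'maestro': coordinates the workflow across discovered services."""

    def __init__(self, registry: ServiceRegistry) -> None:
        self.registry = registry

    def run(self, categories: list[str], data):
        for category in categories:          # fixed, centrally defined flow
            service = self.registry.find(category)[0]
            data = service(data)             # one step's output feeds the next
        return data

# Stand-ins for remote service invocations (SOAP/REST calls in practice).
registry = ServiceRegistry()
registry.publish("GeneLookup", "identification", lambda symbol: f"id:{symbol}")
registry.publish("Annotator", "annotation", lambda gid: {"gene": gid, "terms": []})

maestro = Orchestrator(registry)
print(maestro.run(["identification", "annotation"], "TP53"))
```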

2.2.3 GRID

GRID computing is one of the many breakthroughs made possible by the evolution of Internet technologies. The GRID is a combination of software and hardware infrastructures that provides pervasive, consistent and low-cost access to high-end computational capabilities [40]. Although this is the core idea of the GRID, it is rather basic and somewhat inadequate by current standards. The evolution of this concept led to a model that unites heterogeneous and distributed data sources to achieve seamless and advanced interoperable functionality. The real challenges in the GRID concept derive from resource sharing and problem solving in dynamic, multi-institutional virtual organizations, and in these scenarios resources can be seen as either software or hardware capabilities.



This ability to share resources is essential in modern science, where collaboration and multidisciplinarity are daily topics. If we consider the broader modern projects in any scientific research area, we must be aware that the developed work spans workgroups, institutions, countries and even continents. The possibility of connecting distributed data, computers, sensors and other resources in a single virtual environment therefore represents an excellent opportunity. However, this kind of architecture has to be supported by various protocols, services and software that make controlled resource sharing possible on a large scale. The foundation of a generic GRID architecture must encompass several attributes, such as distributed resource coordination and the use of standard, open, general-purpose protocols and interfaces, and it must deliver non-trivial qualities of service to end-users. In addition to these attributes, a number of features have to be implemented to sustain GRID operation: remote storage and/or replication of datasets; logical publication of datasets in a distributed catalogue; security, with a special focus on AAAC; uniform access to remote resources; composition of distributed applications; resource discovery methods; aggregation of distributed services with mapping and scheduling; monitoring and steering of job execution; and code and data exchanges between users' personal machines and distributed resources. All of these have to be delivered while taking basic quality-of-service requirements into account.

The GRID architecture may lead to several GRID implementations focused on the development of specific features. We can therefore develop and classify GRID technologies in three categories: computing grids, data grids and knowledge grids. This is an empirical classification based on the main purpose of the GRID technologies used in each category. Computing grids are focused on hardware resources, with particular emphasis on "high throughput computing". Nevertheless, it is important to distinguish "high throughput computing" from "high performance computing": the latter aims at short turnaround times on large-scale computations using parallel processing techniques [41, 42]. Although the main purpose of the computing GRID is also parallel and distributed computing, there is a remarkable difference in the relevance of network latency and robustness. This difference stems from the network latency that exists in virtual organizations spanning geographically distributed locations, in contrast to cluster computing, where machines are physically co-located and latency times are low. This high latency in computing grids must be handled in the system architecture, as lengthy execution jobs must be supported.

Data grids can be seen as large data repositories: a resourceome where data should be explicitly characterized and organized [43]. A data grid requires a unified interface that provides access to any integrated database and application. These interfaces must allow secure remote access and should contain ontology and/or metadata information to promote autonomous integration. Finally, there are knowledge grids, a lesser-known concept. As a starting point, we must accept that the knowledge we can represent on computers is only a part of the knowledge we can create and share within a community.
Despite being a controversial term, some researchers [44] define knowledge grids as environments to design and execute geographically distributed, high-performance knowledge discovery applications. The main distinction between knowledge grids and generic grids is the use of knowledge-based methodologies: knowledge engineering tools, discovery and analysis techniques such as data mining or machine learning, and artificial intelligence concepts such as software agents. The idea of knowledge on the web merges with the semantic web. With this in mind, we can say that the Semantic Web is a Knowledge GRID that emphasizes distributed knowledge representation and integration, and that a Knowledge GRID is a platform for distributed knowledge processing over the Semantic Web [45].

Modern large-scale research projects usually rely on some kind of GRID architecture to support the sharing of resources among the project peers. The problems and requirements raised in this sharing environment are the ones debated in section 2.1: resource heterogeneity, integration, interoperability and description. Therefore, a deep understanding of GRID technologies and architectures is essential for newer developments in service composition and integration in any scientific research area.
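Returning to the computing-grid notion of high-throughput computing discussed above, the following Python sketch illustrates the job-farming pattern: many independent, possibly lengthy jobs are submitted at once and their results are collected asynchronously as they complete. A local process pool stands in for real grid middleware, which is deliberately abstracted away, and the job function is a placeholder.

```python
# Sketch of the 'high throughput computing' pattern: many independent jobs are
# submitted and results are gathered in completion order, so lengthy jobs and
# high latency do not block the whole batch. A local process pool stands in for
# actual grid middleware; analyse() is a placeholder computation.
from concurrent.futures import ProcessPoolExecutor, as_completed

def analyse(dataset_id: int) -> str:
    # Placeholder for a lengthy, self-contained computation on one dataset.
    return f"dataset {dataset_id}: done"

if __name__ == "__main__":
    jobs = range(20)
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(analyse, job) for job in jobs]
        for future in as_completed(futures):   # results arrive as jobs finish
            print(future.result())
```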

2.2.4 Semantic Web

The dramatic increase in content promoted by recent web developments such as Web 2.0 and social tools, combined with the ease of publishing content online, has the major drawback of increasing the complexity of resource description tasks. Standard web technologies cannot support this exponential growth. Researchers are required to search manually through the vast amount of online content, and to interpret and process page content by reading it and interacting with the web pages. Additionally, they have to infer relations between information obtained from distinct pages, integrate resources from multiple sites and consolidate the heterogeneous information while preserving an understanding of its context. These tasks are executed daily by researchers in the most diverse scientific areas, and the web offers no simple mechanisms to execute them without relying on some kind of computer science knowledge. Despite the appearance of specific integration applications, there are no general solutions, and case-by-case applications have to be developed. Moreover, the development of these applications is strained by the lack of modern resource publishing: service providers still assume that users will navigate through the information with a traditional browser and do not offer programmatic interfaces; without these interfaces, autonomous processing is difficult and fragile.

The Semantic Web aims to enable automated processing and effective reuse of information on the web, supporting intelligent searches and improved interoperability [46, 47]. Tim Berners-Lee, inventor of the World Wide Web and director of the W3C, promoted Semantic Web developments in 2001 [25]. His initiative envisaged smoothly linking personal information management, enterprise application integration and the worldwide sharing of knowledge. Tools and protocols were therefore developed to facilitate the creation of machine-understandable resources and to publish these semantically described resources online. The long-term purpose is to make the web a place where resources can be shared and processed by both humans and machines. The W3C Semantic Web Activity group has already launched a series of protocols to promote developments in this area.

Adding semantic features to existing content involves the creation of a new level of metadata about the resource [48]. This new layer allows machines to make effective use of the described data based on the semantic information attached to it. The metadata must identify, describe and represent the original data in a universal and machine-understandable way. This is achieved by combining four web protocols: URI, RDF [49], OWL [50] and SPARQL [51]. In parallel with the W3C efforts, there are developments in microformats. Microformats are very small HTML patterns used to identify context in web pages. The idea is to use existing HTML capabilities, such as the attributes inside a tag, to provide a simple content description. For instance, if a person's name appears inside a p tag, the hCard microformat (http://www.xfront.com/microformats/hCard.html) can be used to describe the person and her personal information.

A URI is a simple and generic identifier, built as a sequence of characters, that enables the uniform identification of any resource. Promoting uniformity in resource identification enables several URI features, such as the use of distinct types of identifiers in the same context, a unified semantic interpretation of resources, the introduction of new resource types without damaging existing ones, and the reuse of identifiers in distinct situations. The term "resource" is used in a general sense: a URI can identify any kind of component, from electronic documents, services and data sources to resources that cannot be accessed via the Internet, such as humans or corporations. "Identifier" refers to the operation of unequivocally distinguishing what is being identified from any other element in the scope of identification: we must be able to distinguish one resource from all other resources regardless of the working area or the resource's purpose. Metadata is a term for concrete descriptions of things. These descriptions should provide details about the nature, intent or behaviour of the described entity, being, generically, "data about data".

RDF was designed as a protocol to enable the description of web resources in a simple fashion [52]. Its syntax-neutral data model is based on the representation of predicates and their values. A resource can be anything correctly referenced by a URI and, like the URI itself, RDF is not limited to describing web resources. In RDF we can represent concepts, relations and taxonomies of concepts. This triple-based model results in a simple and flexible system. Although simplicity and flexibility are RDF's main benefits, they are also an issue: in certain scenarios, its generality must be formally confined so that software entities are able to exchange the encoded information correctly. To query RDF files and, on a larger scale, the Semantic Web, the W3C developed SPARQL [53]. SPARQL is an SQL-like query language that acts as a friendly interface to RDF information, whether stored as RDF triples or in traditional databases (using appropriate wrappers).

Describing content with metadata is not enough: there has to be an understanding of what the described data means. This is where ontologies come into play. As mentioned previously, ontologies are used to characterize controlled vocabularies and background knowledge that can be used to build metadata. An ontology [26] consists of a collection of consensual and shared models, in an executable form, of concepts, relations and their constraints, tied to a scaffold of taxonomies [54]. In practical terms, we use ontologies to assert facts about resources described in RDF and referenced by a URI. OWL is the de facto ontology standard and extends the RDF schema with three variants: OWL-Lite, OWL-DL and OWL-Full. These variants offer different levels of expressiveness and can, in summary, define existential restrictions, cardinality constraints on properties and several property types such as inverse, transitive or symmetric. OWL's main benefit is that data represented with OWL can be reasoned over to infer new information. More recently, OWL-S was built on top of OWL, focusing on the features that describe web services.
This protocol allows developers to describe what a service provides, how the service works and how to access it [55]. Combining these technologies enables interoperability between heterogeneous data sources, making it possible for data in one source to be connected with data in a distinct source [56]. Semantic interoperability is the ability of two or more computer systems to exchange information and have the meaning of that information accurately and automatically interpreted by both systems. This design intent can only be achieved by resorting to these protocols and Semantic Web concepts from the beginning of a project's development.
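To make the RDF and SPARQL layers tangible, the following Python sketch uses the rdflib library to describe a hypothetical resource with a handful of triples and then query it with SPARQL. The example namespace and properties are illustrative assumptions, not an existing vocabulary.

```python
# Sketch of describing a resource with RDF triples and querying it with SPARQL,
# using rdflib. The example namespace and properties are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/bio#")
gene = URIRef("http://example.org/bio#TP53")

g = Graph()
g.add((gene, RDF.type, EX.Gene))                         # the resource is a gene
g.add((gene, RDFS.label, Literal("TP53")))               # human-readable label
g.add((gene, EX.associatedWith, EX.LiFraumeniSyndrome))  # assumed relation

# SPARQL query over the in-memory graph: which concepts is the gene linked to?
query = """
PREFIX ex: <http://example.org/bio#>
SELECT ?related WHERE { ?gene a ex:Gene ; ex:associatedWith ?related . }
"""
for row in g.query(query):
    print(row.related)
```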



2.3 Summary

Researchers' daily tasks are becoming more complex as traditionally simple activities, such as locating the necessary information, gathering it and working with tools to process it, get more difficult. The growing number of software tools does not help either: despite their quantity, their quality is questionable, and each tool works differently and requires distinct end-user knowledge to be usable. Along with application complexity, there is also the immense number of data formats. Even within a single scientific area, there are numerous data formats, data models and data types to consider, and these impose time-consuming tasks such as manual data transformations or the development of custom wrappers and converters, which are generally far beyond scientific researchers' scope.

With this in mind, service composition scenarios gain relevance and represent a research opportunity for software engineers and computer science specialists. The autonomous coordination of processes and tools and the shared integration of resources in scientific environments require considerable knowledge and expertise in computer science. Although this challenging task depends on a deep insight into the existing problems and state-of-the-art solutions, researchers' contributions and collaboration are essential to provide a solid working basis whose efforts will, hopefully, result in an improved set of tools and working environments for any researcher in the field.



3 Approach

Any researcher working in the bioinformatics field can rapidly get a general idea of the existing problems related to the integration of distributed and heterogeneous online resources. It is also true that the growing number of technological solutions to cope with these issues is not, in itself, a major benefit: the number of approaches that can be designed using one or more of the mentioned technologies is vast, and choosing the right path is not trivial. In this section we discuss the most widely used solutions and practical implementation scenarios for them in the bioinformatics field.

3.1 Solutions

The solutions presented here result from combining several strategies and technologies to develop new applications that rely on state-of-the-art components to achieve the initial goals and fulfil the initial requirements. This section contains a summary of application concepts that can be implemented to achieve the concrete goal of service composition.

3.1.1 Static applications

The simplest approach to integrating heterogeneous components is to simply program the entire application workflow. Obviously, this approach has the major drawback of being static and fragile. However, for some specific scenarios, designing an application that solves a small subset of problems is the fastest way to deploy a viable solution in a short amount of time. To create static applications, developers only need to be aware of the features they want to implement and of the main characteristics of the resources they wish to integrate. These applications combine a collection of methods, developed one by one, to integrate each resource. As a result, a static integration application is composed of a set of wrappers that encapsulate the access to distributed data resources. These wrappers are developed independently, and each can only exchange data with a single resource. If the main purpose of the application is to offer data obtained from web services and relational databases, there has to be a wrapper to access each service, a wrapper to access each database, and a set of methods to statically coordinate the access and data exchanges between the various wrappers (Figure 11).



Figure 11 – Static applications architecture example, focusing on the static and manually maintained integration engine: an application layer sits on top of a static integration engine that wraps each individual resource (web services WS 1…WS n, databases DB 1…DB n, files File 1…File n)

At first sight, these applications do not seem a valuable solution for the integration of resources. Nonetheless, this approach is widely used, mainly due to its simplicity of development and speed of deployment. With static applications, developers do not need to program dynamic or generic components: static wrappers are easily deployed, can be developed and tested quickly, and can be added to the application engine without difficulty. On the downside, static applications are not generic, flexible or robust. Any time one wishes to add a new resource, developers must program the access to that particular service and add it to the application. At a small scale, this solution is feasible. However, when dealing with complex environments and constantly evolving scenarios – which is the case in the majority of scientific research projects – this solution is not enough. Another major drawback of static applications is related to control, which here refers to the ability to vary application inputs, application outputs and the inner processes executed inside the system. Static solutions only accept a single input data type and only provide a single output data type. For instance, a static system can receive a character string as input and reply with a hexadecimal output. Such a system has static data entries that users cannot customize or change. In addition, the processes used to convert the character string to a hexadecimal value – a web service, for instance – cannot be controlled either: they are fixed and cannot be changed.
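The following Python sketch illustrates the static integration style discussed above: one hand-written wrapper per resource, hard-wired into a small integration engine. The resource names, the URL and the database schema are hypothetical.

```python
# Sketch of static integration: one hand-written wrapper per resource,
# hard-coded into the engine. URL, database name and schema are hypothetical.
import sqlite3
import urllib.request

class ServiceWrapper:
    """Wraps a single remote service; knows nothing about other resources."""
    def fetch(self, identifier: str) -> str:
        with urllib.request.urlopen(f"http://example.org/api/{identifier}") as resp:
            return resp.read().decode("utf-8")

class DatabaseWrapper:
    """Wraps a single local database behind a fixed, hand-written query."""
    def fetch(self, identifier: str) -> list:
        with sqlite3.connect("example.db") as conn:
            return conn.execute(
                "SELECT * FROM records WHERE id = ?", (identifier,)
            ).fetchall()

class StaticIntegrationEngine:
    """The coordination logic is fixed at development time."""
    def __init__(self) -> None:
        self.service = ServiceWrapper()
        self.database = DatabaseWrapper()

    def lookup(self, identifier: str) -> dict:
        # Adding a new resource means writing a new wrapper and editing this code.
        return {
            "remote": self.service.fetch(identifier),
            "local": self.database.fetch(identifier),
        }
```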

3.1.2 Dynamic applications

The design and development of dynamic solutions requires a higher level of computer science knowledge and a solid background in the working area to support the various iterations of the project execution. Dynamic applications are the expected evolution of static applications [57] and are distinguished by allowing changes in their inputs and outputs, as well as by supporting distinct resource combinations to reach a given goal. Dynamic applications [58] involve the development of a middleware framework that supports specific features related to resource control. In dynamic applications, control is no longer attached to the application code; it mutates according to the final result expected from the application. That is, dynamic applications can recognize distinct inputs and outputs and act accordingly: the application relies on a middleware layer to organize the set of resources used according to the inputted data and the desired output. An illustrative scenario could be an application that accepts various input data types identifying a country – for instance, a two-letter code, the full name or the phone prefix – and retrieves the wiki page for that country. The application identifies the data type of the input and communicates with a service specific to that input, retrieving the information that allows a successful reply; a small dispatch sketch of this scenario is given below. Additionally, dynamic applications allow a higher level of user control. This means that users can select which resources to use from a predefined set of methods, or add their own resources to the system, which will be recognized automatically. Offering this kind of feature increases platform complexity dramatically. Dynamic access to services and autonomous semantic composition require the development of several focused, flexible and generic protocols [59]. Designing these protocols implies resorting to many of the technologies discussed before and requires the adoption of semantic web strategies to describe the incorporated resources and to permit the integration of novel ones. Only by resorting to intelligent web mechanisms can we improve existing applications and obtain significant advances in dynamic applications, as opposed to the current semi-dynamic solutions.
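The dispatch sketch below, in Python, follows the country-lookup scenario: the input type is recognized at runtime and the request is routed to the matching resolver. The resolver functions are placeholders for calls to real remote services.

```python
# Sketch of runtime input recognition and dispatch for the country example.
# The resolvers are placeholders for real remote service calls.
import re

def resolve_by_iso_code(value: str) -> str:
    return f"wiki page resolved from ISO code {value.upper()}"

def resolve_by_phone_prefix(value: str) -> str:
    return f"wiki page resolved from phone prefix {value}"

def resolve_by_name(value: str) -> str:
    return f"wiki page resolved from country name '{value}'"

# Ordered (pattern, resolver) pairs: the first matching pattern wins.
DISPATCH_TABLE = [
    (re.compile(r"^[A-Za-z]{2}$"), resolve_by_iso_code),
    (re.compile(r"^\+\d{1,3}$"), resolve_by_phone_prefix),
    (re.compile(r"^[A-Za-z ]{3,}$"), resolve_by_name),
]

def lookup_country(user_input: str) -> str:
    for pattern, resolver in DISPATCH_TABLE:
        if pattern.match(user_input):
            return resolver(user_input)
    raise ValueError(f"Unrecognized input: {user_input!r}")

print(lookup_country("PT"))
print(lookup_country("+351"))
print(lookup_country("Portugal"))
```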

3.1.3 Meta-applications

Metadata is, generally, data about data. Applying the same idea to applications, we can conceive the paradigm of meta-applications: applications about applications. Meta-applications are state-of-the-art systems that integrate distributed applications, empowering interoperability among heterogeneous systems. Recent developments have also promoted a concept described as "software-as-a-service": any software solution should be provided as a remote service [60]. If software engineers follow this paradigm, any application can act as a service, easing integration tasks. Meta-applications are a specific class of dynamic applications directed at integrating services while offering distinct levels of integration control to end-users. This control can be manipulated: on one hand, users can have full control over which services they want to execute and which answers they want to obtain; on the other hand, users can have no control over the application, providing only the initial problem and the type of result they want, forcing autonomous interoperability between the integrated services to attain the proposed goals. Developing meta-applications is even more complex than developing dynamic applications: the problems raised by heterogeneity and distribution are very difficult to deal with, and reaching a flexible and generic solution is a cumbersome assignment.

The mashup term was initially used in the music industry to categorize combinations of several songs in a single track. The term has been ported to the WWW and characterizes hybrid web applications: applications that mesh other applications. Mashups are the main meta-application instance. Their purpose is to combine data gathered from multiple services to offer a centralized, wider service [61, 62]. Mashups allow easy and fast integration relying on remote APIs and open data sources and services [63]. Mashups and meta-applications share a common basic purpose: to offer a new level of knowledge that was not possible by accessing each service separately.

Workflows

According to the Workflow Management Coalition, a workflow is a logical organization of a series of steps that automate business processes, in whole or in part, and where data or tasks are exchanged from one element to another for action [64]. Adapting this concept to software, we can say that a workflow is a particular implementation of a mashup consisting of an ordered information flow that triggers the execution of several activities to deliver an output or achieve a goal (Figure 12) [65-67]. A crucial workflow requirement is that the inputs of each activity must match the outputs of the preceding activity; to maintain this consistency and deal with workflow execution operations, developers must implement a workflow management system [68].

Figure 12 – Workflow example with two initial inputs and one final output

A workflow management system defines, manages and executes workflows through the execution of software driven by a computer representation of the workflow logic. Describing the workflow requires a complete description of its elements: task definitions, interconnection structure, dependencies and relative order. The most common solution for describing the workflow is to use a configuration file or a database to store the required information. Existing workflow systems can support complex operations and deal with large amounts of data. However, many scientific research fields require much more than that. There are emerging requirements that must be handled by workflow management systems, such as interactive steering, event-driven activities, streaming data and collaboration between people in distinct parts of the globe. In novel scientific domains, researchers prefer to design and execute workflows in real time, then analyse the results and reorganize the workflow accordingly. This exploratory procedure requires more than current workflow enactment applications can offer and implies the development of meta-applications and unified working ecosystems that offer a wide range of heterogeneous features to researchers. To deal efficiently with such complexity, developers must overcome the previously mentioned problems related to heterogeneity. Scientific communities from every research field should promote interoperability and semantic descriptions to foster the development of applications that can accurately integrate online resources based on service composition.
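The following Python sketch shows the core of such a workflow management system in miniature: a declarative workflow description (a dictionary stands in for the configuration file or database) and an executor that enforces the input/output chaining between activities. The activity names and functions are illustrative placeholders.

```python
# Sketch of a minimal workflow core: a declarative description of activities and
# an executor that checks that each activity's input is already available.
# Activities are placeholders for real service or tool invocations.

def fetch_sequence(accession: str) -> str:
    return "ATGCGT..."          # placeholder activity

def translate(sequence: str) -> str:
    return "MR..."              # placeholder activity

WORKFLOW = {
    "inputs": ["accession"],
    "activities": [
        {"name": "fetch", "call": fetch_sequence, "needs": "accession"},
        {"name": "translate", "call": translate, "needs": "fetch"},
    ],
    "output": "translate",
}

def run_workflow(workflow: dict, **inputs):
    results = dict(inputs)
    for activity in workflow["activities"]:
        needed = activity["needs"]
        if needed not in results:            # consistency rule: every activity's
            raise RuntimeError(              # input must match an earlier output
                f"activity '{activity['name']}' is missing input '{needed}'"
            )
        results[activity["name"]] = activity["call"](results[needed])
    return results[workflow["output"]]

print(run_workflow(WORKFLOW, accession="NM_000546"))
```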



3.2 Bioinformatics

To cope with the aforementioned biology and biomedicine challenges, bioinformatics has had to come a long way. In the beginning, bioinformatics applications consisted of small desktop software tools that simplified genotype sequencing and genomic sequence analysis. Nowadays, bioinformatics applications encompass large resource networks and web applications that connect biomedical scientific communities worldwide. Web-based applications are a key factor in the rapid improvement of bioinformatics. Researchers use web-based applications on a daily basis and publish their discoveries online, spreading them faster and reaching more colleagues. However, as in many other areas, the growing number of web applications and resources has increased the level of heterogeneity and, consequently, the difficulty of finding information with certified quality. Whereas a few years ago only some of the best research groups published information online, nowadays anyone can publish anything online, and this increase in information quantity has resulted in an overall quality decrease.

An endless number of workgroups have been developing software solutions to address the problems and requirements mentioned in section 2. These efforts have resulted in remarkable developments, mostly in resource integration and interoperability, resource description and mashup/workflow applications. As well as being prosperous developments in the biomedicine field, these applications also represent innovation and state-of-the-art solutions in computer science, requiring high expertise and knowledge from the information technology community. We organize the existing bioinformatics applications into three logical groups: databases, service protocols and integration applications. Next, we present a brief description of some of the most relevant research outcomes that are valuable for resolving our initial challenge.

3.2.1 Databases

There are many databases containing biological and biomedical information. In most cases, these databases offer their data through web services or flat files that can easily be accessed or parsed. Databases do not follow a single model or notation; therefore, the same biological concept is represented in several distinct manners and with various identifiers, and converting an entity from one data type to another is often quite complex due to the multitude of existing data types and models.

The Kyoto Encyclopaedia of Genes and Genomes (KEGG) is a Japanese initiative with the main goal of collecting genomic information relevant to metabolic pathways and organism behaviours [69]. KEGG is composed of five main databases, each with a distinct focus: Pathways, Atlas, Genes, Ligand and BRITE. By meshing these databases, KEGG aims to obtain a digital representation of the biological system [70]. UniProt intends to be a universal protein resource [71]. It is maintained by a consortium composed of the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). Each of the consortium members focuses on distinct areas, and the convergence of their efforts is a huge database, one of the best regarding curated protein and functional information.

The association between the EBI and the European Molecular Biology Laboratory (EMBL) has resulted in various ongoing research projects that have already left their footprint on the bioinformatics community. ArrayExpress [72] archives public functional genomics data in two databases: the Experiments Archive stores results from experiments submitted from all over the world, and the Gene Expression Atlas is a curated and re-annotated subset of the Experiments Archive directed at gene expression studies. Ensembl [73] is another genome database; it contains information from a large number of species and is accessible through a large number of web services. InterPro [74] is another EBI database focused on proteins and the proteome and is, in some manner, a smaller-scale competitor to UniProt. EMBL-EBI has many other ongoing projects in the bioinformatics field, from which several new web applications and databases are born. Medline and the European Genome-Phenome Archive are examples of such projects that are not as popular as the main ones, although they have a growing importance in the life sciences community.

The USA's National Center for Biotechnology Information (NCBI), associated with the National Library of Medicine, is a resource for molecular biology information organized in various categories, each containing several databases. From the extensive NCBI database list we can note some major databases. dbSNP [75] stores information about Single Nucleotide Polymorphisms (SNP), particular changes in our genetic sequence that are relevant for the detection of anomalies in our genes. The Mendelian Inheritance in Man (MIM) is a library of known diseases that are mainly caused by genetic disorders; NCBI is responsible for the Online MIM [76], which allows it to act as a key point for other disease-centric and disease-related databases and applications (like DiseaseCard [17], for instance). Medical Subject Headings (MeSH) [77] are also made available online by the NCBI and are correlated with other NCBI databases such as the Medical Literature Analysis and Retrieval System (MEDLINE), a huge bibliographic database of published material related to the life sciences and biomedicine, which can be accessed through PubMed, an online search engine providing access to the entire MEDLINE library. GenBank is an open sequence database that contains information from laboratories throughout the world, covering a huge number of distinct species. Navigating the entirety of the NCBI databases and applications is not easy. To facilitate this process, NCBI created the Entrez Global Query Cross-Database Search System (Entrez) [16], offering online access to a multitude of NCBI databases through a single user interface. Entrez is also a remarkable project in online resource integration, proving that normalized data formats and coherency across databases and services are the best method to promote interoperability and to achieve dynamic integration.

Another hot topic in bioinformatics research is phenotypic information. PhenomicDB [78, 79] is a database for comparative genomics regarding various species and genotype-to-phenotype information. Information is obtained from several public databases and merged into a single database schema, improving database access performance and making several other features possible. PhenoBank started as a complex phenotype study for a single species and evolved into an intelligent solution for heterogeneous resource integration. PhenoGO [80] is a Gene Ontology-centric database that intends to support high-throughput mining of phenotypic and experimental data. In addition to these general-purpose databases, there are a large number of others focusing on specific topics.
Locus-specific databases (LSDB) contain gene-centred variation information and represent one of the first scenarios of a wide integration effort. The Leiden Open-source Variation Database (LOVD) [81] follows the "LSDB-in-a-box" approach to deliver a customizable and easily deployable LSDB application. This means that any research group can deploy its own LSDB and follow the same data model, thus promoting data integration and resource distribution. There are, however, various other locus-specific databases, such as Inserm's Bioinformatics Group Universal Mutation Database (UMD) [82], which is directed at clinicians, geneticists and research biologists, or downloadable variation viewers like VariVis [83] from the University of Melbourne's Genomic Disorders Research Centre.

3.2.2 Service Protocols

Data management in the life sciences offers constant challenges to software engineers [84]. Offering this data to end-users and researchers worldwide is an even bigger challenge. Web applications tend to be complex and cluttered with data, resulting in unusable interfaces and fragile workspaces. The possibility of offering data as a service is a valuable option that is being used more and more often. The greatest benefit of these remote services is that they allow both static programmatic integration and real-time dynamic integration; that is, developers can merge several distributed services in a single centralized application.

The Distributed Annotation System (DAS) [85] specifies a protocol for requesting and returning annotation data for genomic regions. DAS relies on distributed servers that are integrated in the client for data supply and is expanding to several life sciences areas beyond sequence annotation. The main idea behind DAS is that distributed resources can be integrated in various environments without being aware of the other participants; that is, resources can be replicated and integrated in several distinct systems rather than in a single static combination of resources. BioMart [86] is a generic framework for biological data storage and retrieval using a range of queries that allow users to group and refine data based upon many different criteria. Its main intention is to improve data mining tasks, and it can be downloaded, installed and customized easily; it therefore promotes localized and specific integration systems that can merge their data from larger databases. The European Molecular Biology Open Software Suite (EMBOSS) [87, 88] is a software analysis package that unifies a collection of tools related to molecular biology and includes external service access. Its applications are catalogued in about 30 groups spanning several areas and operations related to the life sciences. Soaplab, developed at the EBI, is a set of web services that provide remote programmatic access to several other applications [89]. Included in the framework are a dynamic web service generator and powerful command-line programs, including support for the EMBOSS software. The integration efforts conducted in Soaplab made it possible to use a single generic interface to access any Soaplab web service, regardless of the interfaces of the underlying software.

BioMOBY is a web service interoperability initiative that envisages the integration of web-based bioinformatics resources, supported by the annotation of services and tools with terms from well-known ontologies. BioMOBY was initiated at the Model Organism Bring Your own Database Interface Conference (MOBY-DIC), and the proposed integration may be achieved semantically or through web services. The BioMOBY protocol stack defines every layer in the protocol, from the ontology to the service discovery properties [90, 91]. The Web API for Biology (WABI) is an extensive set of SOAP and REST life sciences web APIs focused on data processing and conversion between multiple formats [92, 93]. WABI mainly defines a set of rules and good practices that should be followed when the outcome of a research project is a set of web services. Along with these biomedical service protocols, there are the traditional web service protocols discussed previously, which give access to several other resources. The growing number of bioinformatics-specific web service protocols is, however, another setback for application interoperability and integration in the life sciences area.
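As an illustration of how such service protocols are consumed programmatically, the following Python sketch issues a DAS-style features request over plain HTTP and walks the returned XML. The server URL, data source and genomic segment are hypothetical placeholders, and real DAS servers differ in the exact layout of the response document.

```python
# Sketch of a client-side DAS 'features' request over plain HTTP. The server
# URL, data source and segment are hypothetical; response layouts vary between
# real servers, so we only walk the XML tree generically.
import urllib.request
import xml.etree.ElementTree as ET

SERVER = "http://das.example.org/das/hg_build"   # hypothetical DAS data source
url = f"{SERVER}/features?segment=17:7565097,7590856"

with urllib.request.urlopen(url) as response:
    document = ET.parse(response)

# Annotations come back as XML; print element tags and ids, since concrete
# attributes depend on the annotation source.
for element in document.getroot().iter():
    print(element.tag, element.attrib.get("id", ""))
```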

3.2.3 Integration Applications

Goble [94] recently presented a "state of the nation" study of integration in bioinformatics, and her main conclusion was that there is still a long path to traverse, especially concerning integration efficiency. Nonetheless, there have been remarkable developments in the last few years, including novelties in data and service integration, semantic web developments and the implementation of mashups and workflows in bioinformatics. Moreover, as stated by Stein, integration is a vital element in the creation of a large bioinformatics ecosystem [95].

The data integration issue can be approached following many different strategies and, worldwide, there are many research groups solving it differently [96]. The adoption of a GRID perspective is one of these approaches [97, 98]. myGRID is a multi-institutional and multi-disciplinary consortium that intends to promote e-Science initiatives and projects; its best-known outcome is the workflow enactment application Taverna, which we discuss further in this document. More recently, the GRID has been giving way to cloud computing strategies [99], such as Wagener's work using XMPP [100], a field that still attracts little interest in the bioinformatics community, though it will gain relevance in the near future.

Regarding the resource integration models presented previously – warehouse, mediator and link – there is a huge collection of applications that implement them. DiseaseCard [17, 101] is a public information system that integrates information from distributed and heterogeneous medical and genomic databases. It provides a single entry point to access relevant medical and genetic information available on the Internet about rare human diseases. Using link discovery strategies, DiseaseCard can update its database and include novel applications. Following this approach, it is easy to design a simple integration mechanism for dispersed variome data that is available from existing LSDBs. With this system, the life sciences community can access the entire biomedical information landscape, transparently, from a single centralized point. GeneBrowser [18] adopts a hybrid data integration approach, offering a web application focused on gene expression studies that integrates data from several external databases as well as internal data. Integrated data is stored in an in-house warehouse, the Genomic Name Server (GeNS) [102]; from there it is possible to present content in multiple formats, from the replicated data sources, and to obtain data in real time from the link-based integrated resources. Biozon is "a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein-protein interactions and cellular pathways, and establishes the relations between them" [103]. Biozon is a data warehouse implementation similar to GeNS, holding data from various large online resources such as UniProt or KEGG and organized around a hierarchical ontology. Biozon's clever internal organization (graph model, document and relation hierarchy) confers a high degree of versatility to the system, allowing a correct classification of both the global structure of interrelated data and the nature of each data entity. Bioconductor is an open-source and open-development software package providing tools for the analysis and comprehension of genomic data [104]. The package is constantly evolving and can be downloaded and installed locally.
The software tools that compose the package are made available by service providers, generally in the R language, and integration is performed on the client side through enhanced and coherent access to various distributed life sciences tools and services. The large databases Ensembl [73] and Entrez [16] are also major web service providers, offering access to their entire content through a simple layer of comprehensive tools.

Despite being a novelty in bioinformatics [47], semantic developments have already found their space in several research groups [105]. The complexities inherent to the life sciences field increase the difficulty of creating ontologies and semantics to describe the immense set of biology concepts and terms. Gene Ontology [106] is the most widely accepted ontology, aiming to unify the representation of gene-related terms across all species; this is only possible by providing access to an annotated and very rich controlled vocabulary. Similar efforts are being developed in other related areas [54]. Reactome is a generic database of biology, mostly human biology, describing in detail operations that occur at a molecular level [107]. There are also ongoing efforts to map proteins and their genetic effects in diseases [108]. BioDASH [109], developed at the W3C, is a semantic web initiative envisaging the creation of a platform that enables an association, similar to the one that exists in real-world laboratories, between diseases, drugs and compounds in terms of molecular biology and pathway analysis. RDFScape [110] and Bio2RDF [111] are the best known among several efforts in bioinformatics semantics [90, 112, 113]. The main purpose of these projects is to create a platform that offers access to well-known data in the RDF triple format. Overcoming the underlying complexity of mapping current database models to a new ontology and hierarchy is a remarkable accomplishment that will be very useful for the future of semantic bioinformatics.

The integration of services is another area where innovation takes place. Service composition, mashups and workflows are among the hottest trends in bioinformatics application development. Service composition, which encompasses service orchestration and choreography, is already possible in various scenarios such as BioMOBY or Bio-jETI [90, 91, 114, 115]. Bio-jETI uses the Java Electronic Tool Integration (jETI) platform, which allows the combination of features from several tools in an interface that is intuitive and easy for new users. jETI enables the integration of heterogeneous services (REST or SOAP) from different providers or even from distinct application domains; this approach is adapted to the life sciences environment, resulting in a platform for multidisciplinary work and cross-domain integration. However, mashups and workflows are at the front line of dynamic integration in bioinformatics, with desktop applications such as Taverna [114, 116-118] or, more recently, various web applications. Taverna is the best state-of-the-art application regarding workflow enactment. It is a Java-based desktop application that enables the creation of complex workflows, allowing access to files and complex data manipulation. Additionally, Taverna automatically configures the access to BioMOBY, Soaplab, KEGG and other services. Along with these predefined services, users can also dynamically add any web service through its WSDL configuration. These interesting features increase Taverna's value significantly, overcoming its major drawback of being a heavy desktop-based application.
BioWMS is an attempt to create a Taverna-like web-based workflow enactor. Its feature set is not as complete as Taverna's, and its availability is very limited (unlike Taverna, which is freely available for the major operating systems) [119]. The Workflow Enactment Portal for Bioinformatics (BioWEP) is a simple web-based application that is able to execute workflows created in Taverna or in BioWMS [120, 121]. Currently, it does not support workflow creation, and the available workflow list is quite restricted.



The Bioinformatics Workflow Builder Interface (BioWBI) is another web-based workflow creator; it connects to a Workflow Execution Engine (WEE) through web services to offer complete web-based workflow enactment [122]. DynamicFlow is also a web-based workflow management application; it relies on JavaScript to render a Web 2.0 interface that enables the creation of custom workflows based on a predefined set of existing services and operations [123, 124]. Available services are semantically described in an XML configuration file, allowing real-time workflow validation. An interesting perspective is taken by BioFlow [125]. The main idea behind this initiative is to create a generic workflow execution language that encompasses a definition of the various elements active in a workflow. This new declarative query language permits the exploitation of various recent developments such as wrapped services and databases, the semantic web and ontologies, and data integration.

3.3 Summary

Bioinformatics is no longer an emerging field that merely requires software engineers' assistance. When the bioinformatics research field gained momentum, it used traditional computational tools and software techniques to evolve. In the last few years, however, we have been witnessing a shift in the relation between the life sciences and informatics: bioinformatics requirements are fostering computer science evolution, and not the other way around. Bioinformatics is no longer a small information technology research area; it has evolved steadily and can now promote innovation and foster the development of state-of-the-art computer science applications. With this in mind, it is crucial to follow computer science developments closely and apply them in bioinformatics. Whether we are dealing with the latest web trend or with new data integration architectures, it is essential to enhance existing bioinformatics resources and prepare them for an intelligent web where semantic descriptions are the key to dealing with heterogeneity in integration and interoperability.



4 Work Plan

It is common sense that project planning is a key factor in project success [126]. Carefully planning ahead and pursuing a concise initial vision of the matter at hand are very important to reach the initial project goals. In the computer science field, the major problem in planning ahead is the constant evolution of existing software and technologies; planning long-term goals and developments is not failure-proof. Every day, new and innovative applications appear worldwide and new techniques are published, and one cannot predict the discovery of a novel technology or application that will completely disrupt everything we have studied so far. In addition, dealing with web applications is even more complex because it is the users who define the Internet: the WWW lives on concept trends adopted by general users, and software applications in scientific fields must reflect these user interests. Next, we present the global doctorate objectives, an estimated calendar and list of activities for the four years comprising this research work, and a targeted publication list that will define several milestones in that calendar.

4.1 Objectives

Over the course of this research work, there are several strategic objectives that should be accomplished. At the development level, the main purpose is to propose, develop and validate a software framework that deals with service composition and related topics, providing added value to the scientific community, whether through web applications aimed at life sciences researchers or through software toolkits, such as remote APIs, that can reach bioinformatics developers. Along with software development, this doctorate must also produce scientific output: in addition to the final thesis, there have to be several published scientific contributions in both the computer science and bioinformatics research fields. Scientific publications have major relevance in an expanding scientific community like bioinformatics. Moreover, peer reviews and feedback from other researchers represent a valuable foundation for further improvements in any research.

4.2 Calendar

The following Gantt chart (Table 1) contains an estimation of the work to be developed during this doctorate. Each year is divided into quarters and the work is divided into three categories: software, thesis and publications. The thesis is composed of two deliveries: a thesis proposal (this document) at the end of the first year and the final document at the end of the fourth year. One also cannot plan exactly what the software outcome of the project will be; therefore, it is not reasonable to define static software milestones. Instead, we organize the software being developed into three main cycles. At the end of each cycle, a software evaluation is conducted, in which new software is presented, validated and published to the community. This evaluation will also work as a reassessment of the project objectives and of the adequacy of the developed work to the initial problem and requirements. Scientific publications are a special case for planning: one cannot predict when the most adequate conference will be organized or when a paper will be accepted by a journal. With this in mind, the following Gantt chart treats publications as a general term. It is desirable to obtain high impact factor publications – scientific journals or books – and several other medium impact factor publications – conference proceedings or workshops.

Table 1 – Gantt chart calendar comprising the activities being developed during this doctorate

[Gantt chart spanning four years, each divided into quarters (Q1-Q4). Thesis activities: State of the Art, Domain Analysis, Proposal Main Corpus, Delivery, Thesis Writing. Software activities: Preliminary Research, System Analysis, Modelling, Active Development, Deliveries. Publications: High IF, Medium IF. Legend: initial analysis and preparation for the task in hand; active development (implementing software or writing); finalization (final software versions, rewrites and deliveries).]

4.3 Publications

As previously mentioned, publication organization cannot be planned in advance. There are numerous constraints related to conference dates, open publication calendars and, more importantly, acceptance dates. These constraints limit the planning phase, as one cannot know when or where a scientific article will be accepted. Additionally, conference calendars change throughout the years and new international events are constantly appearing. Therefore, we can only analyse several scientific publications and choose the ones we find most adequate for the work we will develop.

The journal impact factor [127] is a calculated measure used to evaluate the relative importance of a specific journal within its field of research. Knowing a publication's impact factor, we can estimate the visibility our work will gain if published in that journal. As a result of multiple scientific advances, we wish to publish three articles in high impact factor venues. These include Science (http://www.sciencemag.org), Hindawi (http://www.hindawi.com), BMC Bioinformatics (http://www.biomedcentral.com/bmcbioinformatics) and the various Oxford Journals (http://www.oxfordjournals.org), such as Bioinformatics (http://bioinformatics.oxfordjournals.org), Database or Nucleic Acids Research. Journal and magazine publications are the most valuable means of publication in the scientific community. Nevertheless, publishing work in conference proceedings is also a relevant way to publicize developments to our peers. In this scenario, publication magnitude is a direct consequence of the scientific group that indexes the proceedings (if any). IEEE (http://www.ieee.org), ACM (http://www.acm.org), dblp (http://www.informatik.uni-trier.de/~ley/db/) and Springer (http://www.springer.com) are the most relevant indexing groups and participate in the indexing of numerous conferences, on a diversity of topics, worldwide. The following table lists the works published so far; the first two [123, 124] are full papers and the others are poster presentations.

Table 2 – List of works already published in this doctorate

Title | Indexing | Publishing Date
Dynamic Service Integration using Web-based Workflows [123] | ACM | November 2008
DynamicFlow: A Client-side Workflow Management System [124] | Springer | June 2009
Arabella: A Directed Web Crawler | dblp | October 2009
Link Integrator: A Link-based Data Integration Architecture | dblp | October 2009

Table 3 lists interesting conferences that took place in the last couple of years and are likely to be held again during this doctorate. These conferences focus on relevant computer science and bioinformatics topics, especially online resource integration.

Table 3 – Interesting conference list

Conference | Indexing | Topic
International Conference on Information Integration and Web-based Applications & Services | ACM | Computer Science
International Conference on Bioinformatics and Bioengineering | IEEE | Bioinformatics
International Conference on Bioinformatics and Biomedical Engineering | IEEE | Bioinformatics
International Conference on Bioinformatics and Biomedicine | IEEE | Bioinformatics
Data Integration in the Life Sciences | Springer | Bioinformatics
International Conference on Web Search and Data Mining | ACM | Computer Science
Symposium on Applied Computing: Web Technologies | ACM | Computer Science
Conference on Web Application Development | | Computer Science
International Conference on Enterprise Information Systems | dblp | Computer Science
International Conference on Web Information Systems and Technologies | dblp | Computer Science
International Workshop on Services Integration in Pervasive Environments | ACM | Computer Science
International Workshop on Lightweight Integration on the Web | Springer | Computer Science
International Workshop on Information Integration on the Web | | Computer Science

5 Implications of Research

Any researcher working in computer science must be constantly aware of external innovations and improvements in the areas related to the matter at hand. This everyday evolution is driven by the endeavours of numerous workgroups. Nowadays, web application innovation revolves around ideas like Web 2.0, Web 3.0 and the Intelligent/Semantic Web. These WWW innovations must be taken into account when developing new applications in any scientific field. Bioinformatics has to bet on innovation, and the best way to win this bet is to adopt well-known concepts and trends from the Internet and apply them to the life sciences field. However, while in areas like entertainment or journalism we have easy access to a myriad of resources, this is not the case in the life sciences. The life sciences research field is so vast that the number of topics a single application can cover is very small. This leads to the appearance of an immense number of applications and, consequently, an even larger number of resources to integrate and an overwhelming heterogeneity. In order to transform the web into the main bioinformatics application platform, it is necessary to design and develop new architectures that promote integration and interoperability.

The work to be executed during this doctorate aims to create an innovative and comprehensive software framework that enhances the development of novel web-based applications in the bioinformatics field. We believe that this is an essential step that will improve life sciences research at many levels. New web-based tools will provide easier and faster access to resources (data, services or applications) through a set of software tools that foster integration and interoperability; improve the application development cycle by reducing integration complexity and accelerating deployment; ease the everyday tasks of biologists and clinicians by empowering pervasive bioinformatics; open the path to the intelligent web by leveraging resource description; and, most importantly, promote communication and cooperation within the scientific community, which will ultimately result in bold and ambitious scientific discoveries.

References

[1] J. D. Watson, "The human genome project: past, present, and future," Science, vol. 248, pp. 44-49, April 6, 1990.
[2] R. Tupler, G. Perini, and M. R. Green, "Expressing the human genome," Nature, vol. 409, pp. 832-833, 2001.
[3] D. Primorac, "Human Genome Project-based applications in forensic science, anthropology, and individualized medicine," Croat Med J, vol. 50, pp. 205-6, Jun 2009.
[4] L. Biesecker, J. C. Mullikin, F. Facio, C. Turner, P. Cherukuri, R. Blakesley, G. Bouffard, P. Chines, P. Cruz, N. Hansen, J. Teer, B. Maskeri, A. Young, NISC Comparative Sequencing Program, T. Manolio, A. Wilson, T. Finkel, P. Hwang, A. Arai, A. Remaley, V. Sachdev, R. Shamburek, R. Cannon, and E. D. Green, "The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine," Genome Res, Jul 14 2009.
[5] R. G. H. Cotton, "Recommendations of the 2006 Human Variome Project meeting," Nature Genetics, vol. 39, pp. 433-436, 2007.
[6] H. Z. Ring, P.-Y. Kwok, and R. G. Cotton, "Human Variome Project: an international collaboration to catalogue human genetic variation," Pharmacogenomics, vol. 7, pp. 969-972, 2006.
[7] M. G. Aspinall and R. G. Hamermesh, "Realizing the promise of personalized medicine," Harv Bus Rev, vol. 85, pp. 108-17, 165, Oct 2007.
[8] D. M. Roden, R. B. Altman, N. L. Benowitz, D. A. Flockhart, K. M. Giacomini, J. A. Johnson, R. M. Krauss, H. L. McLeod, M. J. Ratain, M. V. Relling, H. Z. Ring, A. R. Shuldiner, R. M. Weinshilboum, S. T. Weiss, and the Pharmacogenetics Research Network, "Pharmacogenomics: Challenges and Opportunities," Ann Intern Med, vol. 145, pp. 749-757, 2006.
[9] E. P. Bottinger, "Foundations, promises and uncertainties of personalized medicine," Mount Sinai Journal of Medicine: A Journal of Translational and Personalized Medicine, vol. 74, pp. 15-21, 2007.
[10] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nat Rev Genet, vol. 6, pp. 95-108, Feb 2005.
[11] S. S. S. Reddy, L. S. S. Reddy, V. Khanaa, and A. Lavanya, "Advanced Techniques for Scientific Data Warehouses," in International Conference on Advanced Computer Control (ICACC), 2009, pp. 576-580.
[12] N. Polyzotis, S. Skiadopoulos, P. Vassiliadis, A. Simitsis, and N. Frantzell, "Meshing Streaming Updates with Persistent Data in an Active Data Warehouse," IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 976-991, 2008.
[13] Y. Zhu, L. An, and S. Liu, "Data Updating and Query in Real-Time Data Warehouse System," in 2008 International Conference on Computer Science and Software Engineering, 2008, pp. 1295-1297.
[14] A. Kiani and N. Shiri, "A Generalized Model for Mediator Based Information Integration," in 11th International Database Engineering and Applications Symposium, pp. 268-272.
[15] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope, "DiscoveryLink: A system for integrated access to life sciences data sources," IBM Systems Journal, vol. 40, pp. 489-511, 2001.
[16] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova, "Entrez Gene: gene-centered information at NCBI," Nucleic Acids Research, vol. 35, 2007.
[17] J. L. Oliveira, G. M. S. Dias, I. F. C. Oliveira, P. D. N. S. d. Rocha, I. Hermosilla, J. Vicente, I. Spiteri, F. Martin-Sánchez, and A. M. M. d. S. Pereira, "DiseaseCard: A Web-based Tool for the Collaborative Integration of Genetic and Medical Information," in 5th International Symposium on Biological and Medical Data Analysis (ISBMDA 2004), 2004, pp. 409-417.
[18] J. Arrais, B. Santos, J. Fernandes, L. Carreto, M. Santos, A. S., and J. L. Oliveira, "GeneBrowser: an approach for integration and functional classification of genomic data," 2007.
[19] I. Jerstad, S. Dustdar, and D. V. Thanh, "A service oriented architecture framework for collaborative services," in 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, 2005, pp. 121-125.
[20] C. Papagianni, G. Karagiannis, N. D. Tselikas, E. Sfakianakis, I. P. Chochliouros, D. Kabilafkas, T. Cinkler, L. Westberg, P. Sjodin, M. Hidell, S. H. de Groot, T. Kontos, C. Katsigiannis, C. Pappas, A. Antonakopoulou, and I. S. Venieris, "Supporting End-to-End Resource Virtualization for Web 2.0 Applications Using Service Oriented Architecture," in IEEE GLOBECOM Workshops, 2008, pp. 1-7.
[21] G. Hohpe and B. Woolf, Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley, 2004.
[22] R. Kazman, G. Abowd, L. Bass, and P. Clements, "Scenario-based analysis of software architecture," IEEE Software, vol. 13, pp. 47-55, 1996.
[23] A. Tolk and J. A. Muguira, "Levels of Conceptual Interoperability Model," in Fall Simulation Interoperability Workshop, Orlando, Florida, USA, 2003, pp. 14-19.
[24] P. K. Davis and R. H. Anderson, "Improving the Composability of DoD Models and Simulations," The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology, vol. 1, pp. 5-17, 2004.
[25] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Sci Am, vol. 284, pp. 34-43, 2001.
[26] M. Uschold and M. Gruninger, "Ontologies: Principles, Methods and Applications," Knowledge Engineering Review, vol. 11, pp. 93-155, 1996.
[27] M. Zloof, "Query by example," in Proceedings of the May 19-22, 1975, National Computer Conference and Exposition, Anaheim, California: ACM, 1975.
[28] S. Staab, "Web Services: Been there, Done That?," IEEE Intelligent Systems, pp. 72-85, 2003.
[29] W3C, "Web Services," World Wide Web Consortium, 2002.
[30] F. Rosenberg, F. Curbera, M. J. Duftler, and R. Khalaf, "Composing RESTful Services and Collaborative Workflows: A Lightweight Approach," IEEE Internet Computing, vol. 12, pp. 24-31, 2008.
[31] W3C, "Simple Object Access Protocol," World Wide Web Consortium, 2007.
[32] OASIS, "Universal Description, Discovery and Integration," OASIS, 2005.
[33] W3C, "Web Service Description Language," World Wide Web Consortium, 2001.
[34] J. Kopecky, T. Vitvar, C. Bournez, and J. Farrell, "SAWSDL: Semantic Annotations for WSDL and XML Schema," IEEE Internet Computing, vol. 11, pp. 60-67, 2007.
[35] XMPP Standards Foundation, "Extensible Messaging and Presence Protocol," http://xmpp.org/: IETF, Internet Engineering Task Force, 1999.
[36] XMPP Standards Foundation, "XEP-0244: IO Data," http://xmpp.org/extensions/xep-0244.html, 2008.
[37] XMPP Standards Foundation, "XEP-0030: Service Discovery," http://xmpp.org/extensions/xep-0030.html, 1999.
[38] N. Milanovic and M. Malek, "Current solutions for Web service composition," IEEE Internet Computing, vol. 8, pp. 51-59, 2004.
[39] C. Peltz, "Web services orchestration and choreography," Computer, vol. 36, pp. 46-52, 2003.
[40] I. Foster and C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., 2003.
[41] M. Taiji, T. Narumi, Y. Ohno, N. Futatsugi, A. Suenaga, N. Takada, and A. Konagaya, "Protein Explorer: A Petaflops Special-Purpose Computer System for Molecular Dynamics Simulations," in Proceedings of the 2003 ACM/IEEE Conference on Supercomputing: IEEE Computer Society, 2003.
[42] S. Masuno, T. Maruyama, Y. Yamaguchi, and A. Konagaya, "Multidimensional Dynamic Programming for Homology Search on Distributed Systems," in Euro-Par 2006 Parallel Processing, 2006, pp. 1127-1137.
[43] N. Cannata, E. Merelli, and R. B. Altman, "Time to Organize the Bioinformatics Resourceome," PLoS Comput Biol, vol. 1, p. e76, 2005.
[44] M. Cannataro and D. Talia, "Semantics and knowledge grids: building the next-generation grid," IEEE Intelligent Systems, vol. 19, pp. 56-63, 2004.
[45] C. Goble, R. Stevens, and S. Bechhofer, "The Semantic Web and Knowledge Grids," Drug Discovery Today: Technologies, vol. 2, pp. 225-233, 2005.
[46] J. Hendler, "Science and the Semantic Web," Science, vol. 299, pp. 520-521, 2003.
[47] E. Neumann, "A Life Science Semantic Web: Are We There Yet?," Sci. STKE, vol. 2005, p. pe22, 2005.
[48] M. Stollberg and A. Haller, "Semantic Web services tutorial," in 2005 IEEE International Conference on Services Computing, 2005, p. xv, vol. 2.
[49] W3C, "Resource Description Framework," World Wide Web Consortium, 2004.
[50] W3C, "Web Ontology Language," World Wide Web Consortium, 2007.
[51] A. Ruttenberg, T. Clark, W. Bug, M. Samwald, O. Bodenreider, H. Chen, D. Doherty, K. Forsberg, Y. Gao, V. Kashyap, J. Kinoshita, J. Luciano, M. S. Marshall, C. Ogbuji, J. Rees, S. Stephens, G. T. Wong, E. Wu, D. Zaccagnini, T. Hongsermeier, E. Neumann, I. Herman, and K. H. Cheung, "Advancing translational research with the Semantic Web," BMC Bioinformatics, vol. 8, p. S2, 2007.
[52] E. J. Miller, "An Introduction to the Resource Description Framework," Journal of Library Administration, vol. 34, pp. 245-255, 2001.
[53] S. Harris and N. Shadbolt, "SPARQL Query Processing with Conventional Relational Database Systems," in Web Information Systems Engineering – WISE 2005 Workshops, 2005, pp. 235-244.
[54] R. Stevens, C. A. Goble, and S. Bechhofer, "Ontology-based knowledge representation for bioinformatics," Brief Bioinform, vol. 1, pp. 398-414, 2000.
[55] D. Martin, M. Paolucci, S. McIlraith, M. Burstein, D. McDermott, D. McGuinness, B. Parsia, T. Payne, M. Sabou, M. Solanki, N. Srinivasan, and K. Sycara, "Bringing Semantics to Web Services: The OWL-S Approach," in Semantic Web Services and Web Process Composition, 2005, pp. 26-42.
[56] M. Hepp, "Semantic Web and semantic Web services: father and son or indivisible twins?," IEEE Internet Computing, vol. 10, pp. 85-88, 2006.
[57] H. M. Sneed, "Software evolution. A road map," in IEEE International Conference on Software Maintenance, 2001, p. 7.
[58] Z. Zou, Z. Duan, and J. Wang, "A Comprehensive Framework for Dynamic Web Services Integration," in European Conference on Web Services (ECOWS'06), 2006.
[59] G. O. H. Chong Minsk, L. E. E. Siew Poh, H. E. Wei, and T. A. N. Puay Siew, "Web 2.0 Concepts and Technologies for Dynamic B2B Integration," IEEE, pp. 315-321, 2007.
[60] M. Turner, D. Budgen, and P. Brereton, "Turning software into a service," Computer, vol. 36, pp. 38-44, 2003.
[61] L. Xuanzhe, H. Yi, S. Wei, and L. Haiqi, "Towards Service Composition Based on Mashup," in 2007 IEEE Congress on Services, 2007, pp. 332-339.
[62] N. Yan, "Build Your Mashup with Web Services," in IEEE International Conference on Web Services (ICWS 2007), 2007, pp. xli-xli.
[63] Q. Zhao, G. Huang, J. Huang, X. Liu, and H. Mei, "A Web-Based Mashup Environment for On-the-Fly Service Composition," in IEEE International Symposium on Service-Oriented System Engineering (SOSE '08), 2008, pp. 32-37.
[64] D. Hollingsworth, The Workflow Reference Model, 1995.
[65] G. Preuner and M. Schrefl, "Integration of Web Services into Workflows through a Multi-Level Schema Architecture," in 4th IEEE Int'l Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS 2002), 2002.
[66] J. Cardoso and A. Sheth, "Semantic E-Workflow Composition," Journal of Intelligent Information Systems, 2003.
[67] P. C. K. Hung and D. K. W. Chiu, "Developing Workflow-based Information Integration (WII) with Exception Support in a Web Services Environment," in 37th Hawaii International Conference on System Sciences, 2004.
[68] S. Petkov, E. Oren, and A. Haller, "Aspects in Workflow Management," 2005.
[69] M. Kanehisa and S. Goto, "KEGG: Kyoto Encyclopedia of Genes and Genomes," Nucl. Acids Res., vol. 28, pp. 27-30, 2000.
[70] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, "From genomics to chemical genomics: new developments in KEGG," Nucl. Acids Res., vol. 34, pp. D354-357, 2006.
[71] A. Bairoch, R. Apweiler, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L.-S. L. Yeh, "The Universal Protein Resource (UniProt)," Nucl. Acids Res., vol. 33, pp. D154-159, 2005.
[72] H. Parkinson, U. Sarkans, M. Shojatalab, N. Abeygunawardena, S. Contrino, R. Coulson, A. Farne, G. Garcia Lara, E. Holloway, M. Kapushesky, P. Lilja, G. Mukherjee, A. Oezcimen, T. Rayner, P. Rocca-Serra, A. Sharma, S. Sansone, and A. Brazma, "ArrayExpress - a public repository for microarray gene expression data at the EBI," Nucl. Acids Res., vol. 33, pp. D553-555, 2005.
[73] T. Margaria, M. G. Hinchey, H. Raelt, J. Rash, C. A. Rou, and B. Steffen, "Ensembl Database: Completing and Adapting Models of Biological Processes," in Proceedings of the Conference on Biologically Inspired Cooperative Computing (BiCC IFIP), Santiago, Chile, 2006, pp. 43-54.
[74] N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork, V. Buillard, L. Cerutti, R. Copley, E. Courcelle, U. Das, L. Daugherty, M. Dibley, R. Finn, W. Fleischmann, J. Gough, D. Haft, N. Hulo, S. Hunter, D. Kahn, A. Kanapin, A. Kejariwal, A. Labarga, P. S. Langendijk-Genevaux, D. Lonsdale, R. Lopez, I. Letunic, M. Madera, J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, A. N. Nikolskaya, S. Orchard, C. Orengo, R. Petryszak, J. D. Selengut, C. J. A. Sigrist, P. D. Thomas, F. Valentin, D. Wilson, C. H. Wu, and C. Yeats, "New developments in the InterPro database," Nucl. Acids Res., vol. 35, pp. D224-228, 2007.
[75] S. T. Sherry, M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin, "dbSNP: the NCBI database of genetic variation," Nucl. Acids Res., vol. 29, pp. 308-311, 2001.
[76] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick, "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders," Nucl. Acids Res., vol. 33, pp. D514-517, 2005.
[77] C. E. Lipscomb, "Medical Subject Headings (MeSH)," Bull Med Libr Assoc, vol. 88, pp. 265-6, Jul 2000.
[78] A. Kahraman, A. Avramov, L. G. Nashev, D. Popov, R. Ternes, H.-D. Pohlenz, and B. Weiss, "PhenomicDB: a multi-species genotype/phenotype database for comparative phenomics," Bioinformatics, vol. 21, pp. 418-420, 2005.
[79] P. Groth, N. Pavlova, I. Kalev, S. Tonov, G. Georgiev, H.-D. Pohlenz, and B. Weiss, "PhenomicDB: a new cross-species genotype/phenotype resource," Nucl. Acids Res., vol. 35, pp. D696-699, 2007.
[80] L. Sam, E. Mendonca, J. Li, J. Blake, C. Friedman, and Y. Lussier, "PhenoGO: an integrated resource for the multiscale mining of clinical and biological data," BMC Bioinformatics, vol. 10, p. S8, 2009.
[81] I. F. A. C. Fokkema, J. T. den Dunnen, and P. E. M. Taschner, "LOVD: Easy creation of a locus-specific sequence variation database using an "LSDB-in-a-box" approach," Human Mutation, vol. 26, pp. 63-68, 2005.
[82] C. Béroud, G. Collod-Béroud, C. Boileau, T. Soussi, and C. Junien, "UMD (Universal Mutation Database): A generic software to build and analyze locus-specific databases," Human Mutation, vol. 15, pp. 86-94, 2000.
[83] T. Smith and R. Cotton, "VariVis: a visualisation toolkit for variation databases," BMC Bioinformatics, vol. 9, p. 206, 2008.
[84] S. Haider, B. Ballester, D. Smedley, J. Zhang, P. Rice, and A. Kasprzyk, "BioMart Central Portal - unified access to biological data," Nucl. Acids Res., vol. 37, pp. W23-27, 2009.
[85] A. Jenkinson, M. Albrecht, E. Birney, H. Blankenburg, T. Down, R. Finn, H. Hermjakob, T. Hubbard, R. Jimenez, P. Jones, A. Kahari, E. Kulesha, J. Macias, G. Reeves, and A. Prlic, "Integrating biological data - the Distributed Annotation System," BMC Bioinformatics, vol. 9, p. S3, 2008.
[86] D. Smedley, S. Haider, B. Ballester, R. Holland, D. London, G. Thorisson, and A. Kasprzyk, "BioMart - biological queries made easy," BMC Genomics, vol. 10, p. 22, 2009.
[87] M. Sarachu and M. Colet, "wEMBOSS: a web interface for EMBOSS," Bioinformatics, vol. 21, pp. 540-1, Feb 15 2005.
[88] P. Rice, I. Longden, and A. Bleasby, "EMBOSS: the European Molecular Biology Open Software Suite," Trends Genet, vol. 16, pp. 276-7, Jun 2000.
[89] S. Pillai, V. Silventoinen, K. Kallio, M. Senger, S. Sobhany, J. Tate, S. Velankar, A. Golovin, K. Henrick, P. Rice, P. Stoehr, and R. Lopez, "SOAP-based services provided by the European Bioinformatics Institute," Nucl. Acids Res., vol. 33, pp. W25-28, 2005.
[90] M. DiBernardo, R. Pottinger, and M. Wilkinson, "Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework," Journal of Biomedical Informatics, vol. 41, pp. 837-847, 2008.
[91] M. Wilkinson and M. Links, "BioMoby: An open source biological web services proposal," Brief Bioinform, vol. 3, pp. 331-341, 2002.
[92] Y. Kwon, Y. Shigemoto, Y. Kuwana, and H. Sugawara, "Web API for biology with a workflow navigation system," Nucl. Acids Res., vol. 37, pp. W11-16, 2009.
[93] H. Sugawara and S. Miyazaki, "Biological SOAP servers and web services provided by the public sequence data bank," Nucl. Acids Res., vol. 31, pp. 3836-3839, 2003.
[94] C. Goble and R. Stevens, "State of the nation in data integration for bioinformatics," Journal of Biomedical Informatics, vol. 41, pp. 687-693, 2008.
[95] L. Stein, "Creating a bioinformatics nation," Nature, vol. 417, pp. 119-120, 2002.
[96] L. D. Stein, "Integrating biological databases," Nature Genetics, vol. 4, pp. 337-345, 2003.
[97] R. Stevens, A. Robinson, and C. Goble, "myGrid: personalized bioinformatics on the information grid," Bioinformatics, vol. 19, pp. i302-i304, 2003.
[98] V. Bashyam, W. Hsu, E. Watt, A. A. T. Bui, H. Kangarloo, and R. K. Taira, "Informatics in Radiology: Problem-centric Organization and Visualization of Patient Imaging and Clinical Data," Radiographics, p. 292085098, 2009.
[99] M. A. Vouk, "Cloud computing - Issues, research and implementations," in 30th International Conference on Information Technology Interfaces (ITI 2008), 2008, pp. 31-40.
[100] J. Wagener, O. Spjuth, E. Willighagen, and J. Wikberg, "XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous Web services," BMC Bioinformatics, vol. 10, p. 279, 2009.
[101] G. Dias, F.-J. Vicente, J. L. Oliveira, and F. Martin-Sánchez, "Integrating Medical and Genomic Data: a Successful Example For Rare Diseases," in MIE 2006: The 20th International Congress of the European Federation for Medical Informatics, 2006, pp. 125-130.
[102] J. Arrais, J. Pereira, and J. L. Oliveira, "GeNS: A biological data integration platform," in ICBB 2009, International Conference on Bioinformatics and Biomedicine, Venice, 2009.
[103] A. Birkland and G. Yona, "BIOZON: a system for unification, management and analysis of heterogeneous biological data," BMC Bioinformatics, vol. 7, 2006.
[104] T. Margaria, R. Nagel, and B. Steffen, "Bioconductor-jETI: A Tool for Remote Tool Integration," in Proceedings of the 11th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), Edinburgh, U.K., 2005, pp. 557-562.
[105] N. Cannata, M. Schroder, R. Marangoni, and P. Romano, "A Semantic Web for bioinformatics: goals, tools, systems, applications," BMC Bioinformatics, vol. 9, p. S1, 2008.
[106] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, "Gene Ontology: tool for the unification of biology," Nat Genet, vol. 25, pp. 25-29, 2000.
[107] I. Vastrik, P. D'Eustachio, E. Schmidt, G. Joshi-Tope, G. Gopinath, D. Croft, B. de Bono, M. Gillespie, B. Jassal, S. Lewis, L. Matthews, G. Wu, E. Birney, and L. Stein, "Reactome: a knowledge base of biologic pathways and processes," Genome Biology, vol. 8, p. R39, 2007.
[108] A. Mottaz, Y. Yip, P. Ruch, and A. Veuthey, "Mapping proteins to disease terminologies: from UniProt to MeSH," BMC Bioinformatics, vol. 9, Suppl 5, p. S3, 2008.
[109] E. K. Neumann and D. Quan, "BioDASH: a Semantic Web dashboard for drug development," Pac Symp Biocomput, pp. 176-187, 2006.
[110] A. Splendiani, "RDFScape: Semantic Web meets Systems Biology," BMC Bioinformatics, vol. 9, p. S6, 2008.
[111] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and J. Morissette, "Bio2RDF: Towards a mashup to build bioinformatics knowledge systems," Journal of Biomedical Informatics, vol. 41, pp. 706-716, 2008.
[112] M. Schroeder, A. Burger, P. Kostkova, R. Stevens, B. Habermann, and R. Dieng-Kuntz, "From a Services-based eScience Infrastructure to a Semantic Web for the Life Sciences: The Sealife Project," in Proceedings of the Sixth International Workshop NETTAB 2006 on "Distributed Applications, Web Services, Tools and GRID Infrastructures for Bioinformatics", 2006.
[113] K.-H. Cheung, V. Kashyap, J. S. Luciano, H. Chen, Y. Wang, and S. Stephens, "Semantic mashup of biomedical data," Journal of Biomedical Informatics, vol. 41, pp. 683-686, 2008.
[114] R. de Knikker, Y. Guo, J.-l. Li, A. Kwan, K. Yip, D. Cheung, and K.-H. Cheung, "A web services choreography scenario for interoperating bioinformatics applications," BMC Bioinformatics, vol. 5, p. 25, 2004.
[115] T. Margaria, C. Kubczak, and B. Steffen, "Bio-jETI: a service integration, design, and provisioning platform for orchestrated bioinformatics processes," BMC Bioinformatics, vol. 9, p. S12, 2008.


[116] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li, "Taverna: a tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, pp. 3045-3054, 2004.
[117] B. Ludascher, I. Altintas, C. Berkley, D. Higgings, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific Workflow Management and the Kepler System," Concurrency and Computation: Practice & Experience, vol. 18, pp. 1039-1065, 2006.
[118] C. A. Goble and D. C. De Roure, "myExperiment: social networking for workflow-using e-scientists," in Proceedings of the 2nd Workshop on Workflows in Support of Large-Scale Science, Monterey, California, USA: ACM, 2007.
[119] E. Bartocci, F. Corradini, E. Merelli, and L. Scortichini, "BioWMS: a web-based Workflow Management System for bioinformatics," BMC Bioinformatics, vol. 8, p. 14, 2007.
[120] P. Romano, E. Bartocci, G. Bertolini, F. De Paoli, D. Marra, G. Mauri, E. Merelli, and L. Milanesi, "Biowep: a workflow enactment portal for bioinformatic applications," BMC Bioinformatics, vol. 8, 2007.
[121] P. Romano, D. Marra, and L. Milanesi, "Web services and workflow management for biological resources," BMC Bioinformatics, vol. 6, p. S24, 2005.
[122] T. Life Sciences Practice, "BioWBI and WEE: Tools for Bioinformatics Analysis Workflows," 2004.
[123] P. Lopes, J. Arrais, and J. L. Oliveira, "Dynamic Service Integration using Web-based Workflows," in 10th International Conference on Information Integration and Web Applications & Services, Linz, Austria, 2008, pp. 622-625.
[124] P. Lopes, J. Arrais, and J. Oliveira, "DynamicFlow: A Client-Side Workflow Management System," in Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living, 2009, pp. 1101-1108.
[125] H. Jamil and B. El-Hajj-Diab, "BioFlow: A Web-Based Declarative Workflow Language for Life Sciences," in Proceedings of the 2008 IEEE Congress on Services - Part I: IEEE Computer Society, 2008.
[126] D. Dvir, T. Raz, and A. J. Shenhar, "An empirical analysis of the relationship between project planning and project success," International Journal of Project Management, vol. 21, pp. 89-95, 2003.
[127] E. Garfield, "The History and Meaning of the Journal Impact Factor," JAMA, vol. 295, pp. 90-93, 2006.
