From Genotype to Phenotype - Future Perspectives on Data and Services Integration by Pedro Lopes

FROM GENOTYPE TO PHENOTYPE: FUTURE PERSPECTIVES ON DATA AND SERVICE INTEGRATION

ADVANCED TOPICS IN INFORMATICS ENGINEERING: BIOINFORMATICS
Doctoral Programme in Informatics Engineering, 2008/2009
Pedro Lopes | pedrolopes@ua.pt

TABLE OF CONTENTS

Introduction – The GEN2PHEN Project
Integration Scenarios and Related Work
    Semantic Web
    Social Environments
    Integration
    Summary
Our Ongoing Developments
    DynamicFlow
    DiseaseCard
    Summary
Future Perspectives
    Cloud‐computing
    Information Integration
    Data Visualization
    Summary
Conclusion
References

INTRODUCTION – THE GEN2PHEN PROJECT

Bioinformatics is emerging as one of the fastest‐growing scientific areas of computer science. Recent hardware and software developments show an evolution faster than Moore’s Law predicts. This development began with the Human Genome Project¹, which succeeded in decoding the complete human genetic code. The project generated a tremendous amount of readily available information, and the scientific community rapidly started designing applications around it, increasing the amount of resources needed in this area. Following the Human Genome Project came the Human Variome Project², which aims to collect information about genome variations and their influence on human health. Alongside the latter, the European Community is sponsoring a bioinformatics project in its Seventh Framework Programme: Genotype to Phenotype Databases: a Holistic Solution (GEN2PHEN)³.

The GEN2PHEN Project is a collaborative project with 19 partners, most of them European institutions with relevant work in the bioinformatics scientific area. GEN2PHEN is an ambitious project that aims to unify human and model organism genetic variation databases, allowing the creation of a central genome browser able to blend GEN2PHEN data with medical data. The overall goal is to create a complete biomedical knowledge environment. The strategy and objectives of this project may be divided into several research areas:

• Analyze the genotype‐to‐phenotype field and investigate current needs and practices in order to obtain complete knowledge of other ongoing projects with similar objectives. The active biology community must be consulted in order to develop an accurate state‐of‐the‐art document that describes the general process in the field and identifies what this particular area is lacking and which models and technologies are being used effectively.

1. Human Genome Project: http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
2. Human Variome Project: http://www.humanvariomeproject.org
3. GEN2PHEN: http://www.gen2phen.org

• Develop standards for the genotype‐to‐phenotype field of research in order to speed up the standardization process with new data models, nomenclature and technology standards.

• Create generic database components, services and integration infrastructures for the genotype‐to‐phenotype domain. These solutions will mostly be web applications that apply new interface usability standards and are customized to their end users. Solutions for genetic and genomic databases will be developed. This objective aims to create both a central GEN2PHEN database crossing all the research areas and a simpler application that can be deployed by any research group.

• Create data search and presentation solutions for genotype‐to‐phenotype knowledge. The applications designed to fulfil the previous objective won’t be complete without proper search mechanisms that encompass information distributed throughout different applications and architecture layers. The applications must also have an effective interface layer designed to respect the community’s requests.

• Facilitate the population of research and diagnostic genotype‐to‐phenotype databases by developing new tools and promoting them in the scientific community. The newly developed applications will also support more efficient methods for data insertion, allowing anyone to collaborate in this project.

• Build a major genotype‐to‐phenotype Internet portal, a GEN2PHEN knowledge centre. This portal will contain all GEN2PHEN‐related information, ranging from calendars to databases, from publications to discussion forums.

• Deploy the developed solutions to the community in order to increase researchers’ interest and participation. Several resources will be devoted to advertising, explaining and training researchers in using the developed solutions.

The project’s main focus is on developing and promoting a new generation of applications that will aid different types of researchers in their scientific work and, at the same time, gather and integrate information from different sources to be shared with the community. GEN2PHEN applications have to be state‐of‐the‐art web applications. It is important to research and study the most popular Web2.0 (and upcoming Web3.0) applications in order to improve developers’ knowledge of what captivates users, thereby increasing the interest of the general biomedical community. This research should focus mainly on user interaction issues such as usability, interfaces, “quality of service” and overall user satisfaction. This new wave of applications has to address issues like semantic data integration, user collaboration, information sharing and improvements to search engine algorithms.

Fig. 1 GEN2PHEN strategy

Developing a simple Rich Internet Application is, by now, a somewhat trivial process that does not require great software engineering and programming knowledge. However, bioinformatics and biomedicine don’t depend only on good‐looking interfaces. What matters, and this is the difficult part, is what’s under the hood. Going deeper into the application composition, issues such as data integration, service integration, service orchestration, workflow composition, distributed processing, query expansion and object ontologies arise.

This report gives an overview of the GEN2PHEN project with special emphasis on the problems of these next‐generation web applications. It describes some solutions under development elsewhere, as well as systems under development in our workgroup, and discusses how both can help in assessing GEN2PHEN application design.

INTEGRATION SCENARIOS AND RELATED WORK

First of all, it is necessary to understand to whom these new application paradigms will be important and why these generic GEN2PHEN goals are so significant. The biological and biomedical scientific community is witnessing an exponential increase in the information available. This growth leads, in turn, to a growing number of applications (web or desktop) that solve the same specific problems. Along with these new applications come new data sources and new services, and the heterogeneity among them is huge. The main issue one faces when doing scientific research is where to find information. A few years ago this was a problem because of the lack of applications and databases. Now it is a problem because of the excessive amount of information available in every corner of the web.

Fig. 2 Web2.0 integration

From the users’ perspective, we believe they are looking for a central, unifying portal, customized to their personal profile, where they can easily find all the information they need. This is the added value GEN2PHEN solutions may have. Currently, there are numerous ongoing works focusing on this problem. However, there isn’t a universal solution for all the heterogeneity problems raised by data and service integration. Nor do the problems boil down to this; there are also the novel functionalities made possible by the semantic web [1] and the major developments in information mining. Following the work of Goble and Stevens [2], one can conclude that not all is well in the kingdom of data integration in bioinformatics and that data integration still has a long way to go to completely satisfy its initial goals.

The applications that should be studied may be divided into three main areas that are closely connected and foster integration. There are developments in the semantic web and its application to biology, in particular how generic ontologies can be bridged with biological ones. Other groups are working on collaboration tools for the community, which offer better information sharing and productivity tools. The largest group is integration, which encompasses data integration, service integration, service orchestration, workflow composition and mashup applications.

SEMANTIC WEB

Semantic web developments have the main purpose of describing, with a pre‐defined ontology, all the information that exists on the web. The key components of the semantic web are RDF⁴, OWL⁵ and SPARQL⁶. RDF stands for Resource Description Framework and is a generic metadata model for describing online information and content. OWL is the Web Ontology Language, the ontology‐authoring language usually associated with RDF Schema. SPARQL is a recursive acronym for SPARQL Protocol and RDF Query Language; it is a query language, with a syntax similar to SQL, for retrieving information stored in the RDF format.
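As a minimal sketch of the RDF data model and of what a SPARQL‐style query does, the following Python fragment stores statements as subject, predicate, object triples and matches them against patterns containing variables. The resource names (ex:geneA, ex:diseaseX and so on) and the unify and query helpers are hypothetical and serve only as illustration; a real deployment would rely on an RDF store and a SPARQL engine rather than on this in‐memory list.

```python
# Illustrative only: a tiny in-memory triple store, not a real RDF/SPARQL engine.
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Each statement is a (subject, predicate, object) triple; all names are made up.
TRIPLES: List[Triple] = [
    ("ex:geneA", "rdf:type", "ex:Gene"),
    ("ex:geneB", "rdf:type", "ex:Gene"),
    ("ex:geneA", "ex:associatedWith", "ex:diseaseX"),
    ("ex:geneB", "ex:associatedWith", "ex:diseaseY"),
    ("ex:diseaseX", "rdf:type", "ex:RareDisease"),
]


def unify(pattern: Triple, triple: Triple, binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):            # a variable, written as in SPARQL
            if result.setdefault(term, value) != value:
                return None                 # variable already bound to something else
        elif term != value:                 # a constant must match exactly
            return None
    return result


def query(patterns: List[Triple], triples: List[Triple]) -> List[Dict[str, str]]:
    """Return every variable binding that satisfies all patterns (a basic join)."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := unify(pattern, triple, binding)) is not None
        ]
    return bindings


# Roughly: SELECT ?gene WHERE { ?gene ex:associatedWith ?d . ?d rdf:type ex:RareDisease }
rare_disease_genes = query(
    [("?gene", "ex:associatedWith", "?d"), ("?d", "rdf:type", "ex:RareDisease")],
    TRIPLES,
)
```

The second pattern reuses the variable ?d, so only genes whose associated disease is also typed as a rare disease survive the join. This is the kind of cross‐resource question that a shared metadata model makes possible.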
Implementing semantic web architectures is not a trivial task [3] for any kind of data. However, it is important to introduce these metadata structures and algorithms in bioinformatics, as they will become part of Web3.0.

By applying semantic web concepts and technologies in bioinformatics, one can access, in a unified manner, biological documents described with RDF. These concepts also enable process automation and improved machine‐to‐machine data exchange. Belleau et al. propose Bio2RDF [4], a preliminary approach to creating an engine that provides RDF access to biological data distributed across several databases such as KEGG or NCBI. Bio2RDF⁷ makes all the data available on its website, using only the URL to locate the resources. Splendiani [5] also has a proposal to bring the semantic web to biology, but the implementation isn’t as advanced as Bio2RDF. These are the most recent implementations, but biology and medicine are very difficult scientific areas because of the complexity of defining a proper ontology that covers all life sciences concepts and terms.

4. Resource Description Framework: http://www.w3.org/RDF
5. Web Ontology Language: http://www.w3.org/2004/OWL
6. SPARQL Query Language for RDF: http://www.w3.org/TR/rdf-sparql-query
7. Bio2RDF: http://www.bio2rdf.org

SOCIAL ENVIRONMENTS

Social networks and collaboration environments are some of the most popular Web2.0 applications. These applications connect users and allow them to share personal information, music, videos or any other type of data. Additionally, many small applications are developed to integrate information about different users or entertainment areas. For instance, a movies application would allow every user to describe his personal movie tastes; when used in a large‐scale environment, it would give the developers important information about cinema, which could be used to improve the advertisements shown to the user: a user who likes horror movies would have a greater probability of seeing horror movie ads than one who likes comedies. Facebook⁸ is one of the most widely used social web applications, with over 120 million users. Using personal connections, personal preferences and other specific applications, Facebook’s owners hold valuable market information. MySpace⁹ and Google’s Orkut¹⁰ provide almost the same functionalities to users. Experiencing sustained growth is myExperiment¹¹, by Goble et al. [6], the first bioinformatics social network application, where one can connect with others, share files (with a focus on Taverna workflows, detailed later in this report) and create scientific communities. Despite the focus on Taverna, myExperiment provides a rich scientific ecosystem, offering the community a wide range of tools essential in any social collaborative environment. myExperiment also offers access to its services through RESTful programming interfaces; thus, it is possible to build new applications on the framework or use myExperiment data and tools to improve existing ones.

8. Facebook: http://www.facebook.com
9. MySpace: http://www.myspace.com
10. Orkut: http://www.orkut.com
11. myExperiment: http://www.myexperiment.org

INTEGRATION

Integration is one of the areas of bioinformatics that attracts the most groups and the most ongoing work. It is a research area that includes the semantic web and social networking tools mentioned above, besides other fields such as mashups and workflows. A workflow is a sequence of logical steps or activities, each executed independently of the others [7]. Applying this generic concept to bioinformatics, one may regard a workflow as an organized information flow that connects distinct services and/or data sources in order to solve a problem in a modular manner. The most widely used solution for workflow building and execution is Taverna [8, 9]. Taverna is a Java‐based desktop application offering a simple interface for workflow composition and execution. It can access several types of services, such as BioMoby [10] or generic WSDL web services. The major drawback is that, to integrate services, one must define an XML integration component to assist in piping information from the output of service A to the input of service B. Taverna can also be used from within other applications, allowing access to the results of previously saved workflows or the execution of workflows in real time. One of myExperiment’s functionalities is workflow sharing: one may access a large workflow storage system and find solutions developed by others, or share one’s own workflows and important development information. Currently, Taverna’s greatest flaw is being desktop‐based, as we are witnessing a shift in the computational paradigm: web applications are coming to dominate over desktop ones.

Alongside workflows there are mashups. Mashups began in the music industry: they were simple mixes of several songs into a single song. With Web2.0, this idea crossed over to web applications. Mashups are web applications that combine information from a predefined collection of data sources or services in a single interface. We can consider a mashup to be a meta application: it creates a new application by using functionalities provided by other applications. Online, there are several workflow/mashup building frameworks. It is important to mention Yahoo! Pipes¹² and Microsoft Popfly¹³ because they have remarkable interfaces and pre‐built components to access the World Wide Web’s most popular websites. Bioinformaticians can use these tools with data from different data sources to develop new applications. Cheung et al. [11] pursued this approach to create a biomedical mashup application. However, the mentioned tools weren’t specifically designed for use in the life sciences. Therefore, several researchers are working on service integration frameworks: de Knikker et al. [12] present a basic web service choreography scenario; Bio‐jETI, from Margaria et al. [13], is a similar solution using the same principles. These tools share a common integration problem: the heterogeneity of information sources doesn’t allow a fully automated integration solution. Each service stores and offers data in its own model, increasing the difficulty of concept mapping and information exchange. There isn’t yet an automated tool that offers a simple integration interface and allows the use of components from any arbitrary service. BioMoby [10] is an initiative to create an ontology and a central repository of bioinformatics resources. With this semantic framework, one can share or use online services created by others in an almost automated fashion [14]. The BioMoby¹⁴ central repository faces typical resource discovery problems such as validation and duplication. Anyone can add services, and the description provided or the service functionality may not be scientifically valid and may mislead users. Duplication of services is also a problem: there can be any number of services doing the same task, so it is difficult to choose which one best fits the desired requirements.

12. Yahoo! Pipes: http://pipes.yahoo.com/pipes
13. Microsoft Popfly: http://www.popfly.com
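The workflow notion discussed in this section, in which the output of one service is piped into the input of the next through an explicit integration component, can be sketched as follows. This is only an illustration in Python: the two services, the adapter and the run_workflow helper are hypothetical stand‐ins and do not reproduce Taverna’s XML components or any real BioMoby or WSDL service.

```python
# Illustrative sketch: local functions stand in for remote web services.
from typing import Any, Callable, List

Step = Callable[[Any], Any]


def find_gene_ids(keyword: str) -> dict:
    """Hypothetical 'service A': returns matches in its own data model."""
    return {"query": keyword, "hits": [{"id": "G1"}, {"id": "G2"}]}


def hits_to_id_list(response: dict) -> List[str]:
    """Adapter: maps service A's output model onto service B's input model."""
    return [hit["id"] for hit in response["hits"]]


def annotate_genes(gene_ids: List[str]) -> dict:
    """Hypothetical 'service B': expects a plain list of identifiers."""
    return {gene_id: f"annotation for {gene_id}" for gene_id in gene_ids}


def run_workflow(steps: List[Step], initial_input: Any) -> Any:
    """Execute the steps in order, piping each output into the next step."""
    data = initial_input
    for step in steps:
        data = step(data)
    return data


result = run_workflow([find_gene_ids, hits_to_id_list, annotate_genes], "keyword")
```

Without hits_to_id_list the second service would receive a structure it cannot interpret, which is the heterogeneity problem described above in miniature: every new pair of services needs its own hand‐written mapping.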

Fig. 3 – Existing developments categories

14. BioMoby: http://www.biomoby.org

SUMMARY

Fully automated and dynamic integration is the panacea that developers haven’t yet reached. Workflow and mashup solutions are the most popular ways to integrate services and data sources. However, both imply hard‐coding several functionalities, which increases the dependency on developers to add new ones. Applying a semantic web approach to bioinformatics will empower developers to create more independent applications. Describing services and information semantically will allow automated communication between heterogeneous applications. This will enhance existing workflow and mashup applications: it will be easier for users to add new services to existing applications, so that they become developers of new meta applications adjusted to their needs.

OUR ONGOING DEVELOPMENTS

Our bioinformatics group is, like others, developing software solutions to solve problems associated with this specific area. Our earlier work didn’t focus on integration or the semantic web; it was mostly focused on aiding microarray laboratory research. ANACONDA [15] is a tool to study gene primary structure. The Microarray Information Database (MIND) [16] is a web application that helps researchers analyze microarray experiment results. More abstract than MIND is GeneBrowser [17], a tool for gene expression studies based on the gene lists produced by microarray experiments. However, web trends and the association with projects like GEN2PHEN or ALERT¹⁵ brought the need to expand our group’s application range. DynamicFlow [18, 19] is a web‐based workflow management application, providing Web2.0 semi‐autonomous service integration. DiseaseCard [20] is an older application; however, it already implements basic collaboration and integration functionalities that later became famous with Web2.0. Further developments are being studied to implement semantic web engines, mashup applications and novel information visualization techniques.

DYNAMICFLOW

DynamicFlow is a framework for dynamic integration of heterogeneous information sources. The main goal when developing this framework was to create a novel and agile interface for service integration. The application should have a usable, easy and intuitive interface for solving problems using a “divide and conquer” strategy: the main problem is divided into smaller tasks, each of which can be solved with a certain web service; the tasks are then combined, using the workflow metaphor, creating an information flow from task to task until the final solution is reached. This modular approach could be useful for researchers because it resembles the plan they follow when solving problems in the wet lab: structuring the problem and then solving it iteratively, using simple tasks in a web application running in their browser.

15. ALERT Project: http://www.alert-project.org

Fig. 4 DynamicFlow framework model

One of DynamicFlow’s key elements is its innovative model. The three‐layered model (Fig. 4) divides the application into: access, the bottom layer, containing the databases and the external services; design, the top layer, where user interactions such as workflow building occur, using AJAX technology and drag‐and‐drop metaphors; and core, the processing layer, which encompasses server‐side processing on the application’s web server and client‐side processing in the client’s browser. This division of the processing layer into two separate components is one of the framework’s main features. The web server processes client requests and connects to the authentication server and the framework’s DBMS, but service–application communication and data piping between tasks are processed on the client side, reducing the server load and speeding up application execution, with gains in efficiency and response time. This semi‐autonomous process of maintaining a valid information flow from one service to the next is possible thanks to a previously defined service definition standard. The standard follows a simple ontology and provides an easy way of editing the available services. Using it, the application can validate workflow consistency, execute the workflow and display intermediate results, all using the browser’s resources. It’s a primitive version of semantics in an information integration application.

The work conducted resulted in a web application prototype that is available for testing and open to new developments. These new developments will address five main topics: perfecting the service definition standard, including semantic web technologies (RDF), improving the interface, adding new user interactions and widening the service range.
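To make the idea of a service definition standard more concrete, the sketch below describes each service by the data types it consumes and produces and uses those descriptions to check that every task in a workflow can feed the next one. It is written in Python for brevity, although the text describes this processing as happening in the browser. The ServiceDefinition fields, the type names and the service names are assumptions made for illustration, not DynamicFlow’s actual standard.

```python
# Illustrative sketch; field and service names are hypothetical, not DynamicFlow's.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceDefinition:
    name: str
    input_type: str    # concept consumed, e.g. "GeneSymbol"
    output_type: str   # concept produced, e.g. "ProteinId"


def validate_workflow(tasks: List[ServiceDefinition]) -> List[str]:
    """Return a list of consistency errors; an empty list means the flow is valid."""
    errors = []
    # Compare each task with its successor: the output must match the next input.
    for producer, consumer in zip(tasks, tasks[1:]):
        if producer.output_type != consumer.input_type:
            errors.append(
                f"{producer.name} outputs {producer.output_type}, "
                f"but {consumer.name} expects {consumer.input_type}"
            )
    return errors


gene_to_protein = ServiceDefinition("gene2protein", "GeneSymbol", "ProteinId")
protein_to_pathway = ServiceDefinition("protein2pathway", "ProteinId", "PathwayId")

valid_errors = validate_workflow([gene_to_protein, protein_to_pathway])
invalid_errors = validate_workflow([protein_to_pathway, gene_to_protein])
```

Because the check only needs the declared types, it can run before any service is called, which is what allows a workflow editor to reject an inconsistent connection at design time.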

DISEASECARD

The DiseaseCard¹⁶ project began in 2003 with the objective of creating a rare disease link aggregator, integrating information from distributed and heterogeneous medical and genomic databases. The links were gathered by a web crawling engine and grouped into nodes representing concepts (Fig. 5). For instance, for the Peters anomaly¹⁷ disease, the node References contains all the reference sections of the NCBI OMIM¹⁸ database that refer to this disease, and the node Pathology contains Orphanet¹⁹ information about it. Along with the external information, each disease also has a forum entry, where any registered user can share his personal experience. A tree, similar to the one in Windows Explorer, shows all the nodes and their collections of links, displaying, in a unified interface, information from the genotype to the phenotype. As we want to gather as much information as possible, rare diseases are the main target because of the strong association between their genotype and phenotype. It is important to mention that no database information is replicated: DiseaseCard only saves link information for shared data. Modern concepts like integration (heterogeneous link gathering) and collaboration (public disease forums) were already considered when the system was developed.

16. DiseaseCard: http://www.diseasecard.org
17. Peters Anomaly disease card: http://diseasecard.org/evaluateCard.do?diseaseid=604229
18. OMIM Home: http://www.ncbi.nlm.nih.gov/omim
19. Orphanet: http://www.orpha.net/consor/cgi-bin/index.php
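The link‐aggregation model described for DiseaseCard, in which concept nodes hold only links and never copies of the remote data, can be sketched as a small data structure. This is a hedged illustration in Python: the DiseaseCard class, its methods and the "Drugs" node are hypothetical, and the URLs are simply the database home pages listed in the footnotes, used here as placeholders for the disease‐specific links a crawler would collect.

```python
# Illustrative sketch; class, method and extra node names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiseaseCard:
    disease: str
    # Maps a concept node (e.g. "References") to the links grouped under it.
    nodes: Dict[str, List[str]] = field(default_factory=dict)

    def add_link(self, node: str, url: str) -> None:
        """Store only the link; the remote content is never replicated."""
        links = self.nodes.setdefault(node, [])
        if url not in links:          # avoid duplicate links within a node
            links.append(url)

    def empty_nodes(self) -> List[str]:
        """Nodes with no links, e.g. after a source changed its URLs."""
        return [node for node, links in self.nodes.items() if not links]

    def render_tree(self) -> str:
        """Render the nodes and their links as an indented text tree."""
        lines = [self.disease]
        for node, links in self.nodes.items():
            lines.append(f"  {node}")
            lines.extend(f"    {url}" for url in links)
        return "\n".join(lines)


card = DiseaseCard("Peters anomaly")
card.add_link("References", "http://www.ncbi.nlm.nih.gov/omim")
card.add_link("Pathology", "http://www.orpha.net/consor/cgi-bin/index.php")
card.nodes.setdefault("Drugs", [])   # hypothetical node the crawler has not filled
tree = card.render_tree()
```

A helper such as empty_nodes gives a cheap way to notice the situation in which a crawler stops finding links for a concept, since a node with no links is immediately visible.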

Fig. 5 DiseaseCard concept map

As the application got older, it lost quality: the web crawling engine didn’t adapt automatically to link changes, so for several concepts the resulting nodes were empty. In a preliminary analysis of the GEN2PHEN goals and how they could be achieved, we concluded that DiseaseCard was the most adequate solution and should be brought back under development. After a careful analysis and the definition of an action plan, its operability was restored, the crawler was corrected, the interface got a new look and DiseaseCard is back on track.

As far as GEN2PHEN is concerned, DiseaseCard will be a simple way to achieve some of the initially proposed goals. In the future, adding GEN2PHEN‐related databases and web portals is a priority in order to complete the application. The inclusion of semantics in DiseaseCard and in the portals it crawls will ease the crawling process and improve the precision of the results obtained. Information mining features are also being researched: even though it only stores links, DiseaseCard holds valuable information in those links, which can be useful in new types of queries.

SUMMARY

Both DynamicFlow and DiseaseCard are ongoing projects that will be developed within the GEN2PHEN perspective. The next section details new functionalities, interfaces and user interactions that can be implemented in either of these applications in order to improve their quality.

FUTURE PERSPECTIVES

Web2.0 changed the Internet forever. Developers no longer care only about what the application does, but also about what the users want it to do. Users are now the most important part of the Internet. They produce content, they have their own web footprint, and they are part of a new online community. If Web2.0 is the social web, Web3.0 may be the intelligent web. Although it may sound like science fiction, Web3.0 is nearer than one may think. Different platforms can communicate with each other automatically; “cloud‐computing” is taking over the web; the web is getting intelligent with new semantics; distributed applications are being integrated. These facts, which were mere dreams a few years ago, are empowering the Internet with new solutions and establishing it as the platform for everything: productivity, entertainment, research, leisure…

CLOUD‐COMPUTING

New computing paradigms are changing the Internet at the architecture level. GRID [21] architectures are the new solution for distributed computing. Virtualization improvements [22] make virtual machines almost as powerful as real ones. “Cloud‐computing” [23] uses the best of both to offer an online development environment. Microsoft with the Azure Services Platform²⁰, Amazon with the Elastic Compute Cloud²¹ and Google with its App Engine²² offer access to virtual machines where anyone can deploy applications, which will use distributed resources to guarantee real‐time scalability, flexibility and availability.

Following the same trend, new web applications and web application suites are replacing traditional desktop apps. For instance, Microsoft’s Live²³ suite offers almost all the Office suite tools online, and Google²⁴ also has the essential productivity tools online, in the “cloud”.

20. Azure Services Platform: http://www.microsoft.com/azure/default.mspx
21. Amazon Elastic Compute Cloud: http://aws.amazon.com/ec2
22. Google App Engine: http://code.google.com/appengine
23. Microsoft Live: http://www.live.com
24. Google Apps: http://www.google.com/apps

INFORMATION INTEGRATION

Among information integration tools, one can explore mashups and web desktops. Popular mashup applications are personal and customizable web portals, made of gadgets that access almost any web application. Netvibes²⁵ is arguably the most complete personal portal on the Web. However, the most famous is Google’s iGoogle²⁶. Both offer, in a simple interface, the ability to customize a page with any gadgets we want. Available gadgets include e‐mail access, calendars, to‐do lists, newsreaders and almost any other tool worth including in a single portal.
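A gadget‐based portal of this kind can be thought of as a repository of small, independent components plus a per‐user selection of them. The sketch below illustrates that idea in Python; the gadget functions, the REGISTRY dictionary and the build_page function are hypothetical and are not taken from Netvibes or iGoogle.

```python
# Illustrative sketch; not the API of any real portal.
from typing import Callable, Dict, List

Gadget = Callable[[], str]   # a gadget renders itself to a text fragment


def calendar_gadget() -> str:
    return "[calendar] no events today"


def todo_gadget() -> str:
    return "[to-do] 2 open items"


def news_gadget() -> str:
    return "[news] 5 unread headlines"


# Repository of available gadgets, keyed by name.
REGISTRY: Dict[str, Gadget] = {
    "calendar": calendar_gadget,
    "todo": todo_gadget,
    "news": news_gadget,
}


def build_page(selected: List[str]) -> List[str]:
    """Render the gadgets a user selected, in order, skipping unknown names."""
    return [REGISTRY[name]() for name in selected if name in REGISTRY]


# "weather" is not in the repository, so it is silently skipped.
page = build_page(["news", "calendar", "weather"])
```

The portal itself knows nothing about what each gadget does, so adding a tool only means adding an entry to the repository.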

Fig. 6 iGoogle gadget interface stub

Web desktops are web applications that simulate the traditional desktop environment: there is wallpaper, icons to access applications, a trash bin, a task bar and application menus. eyeOS²⁷ is a cloud computing operating system that allows any user to work online with a vast set of applications. It is also an open source development platform: users can create their own applications and install them on their web desktop.

²⁵ Netvibes: http://www.netvibes.com

²⁶ iGoogle: http://www.google.com/ig

²⁷ eyeOS: http://eyeos.org

DATA VISUALIZATION

Another interesting area is data visualization. Traditionally, search results are listed with a simple description. However, new search engines like Viewzi²⁸ or Searchme²⁹ present results through several, much more visually appealing interfaces. Screenshots are taken from the pages and shown in grids or lists. Results are ordered by date to form a chronological sequence. Information is gathered from distinct search engines in order to better rank the results. Context relations are established among results to create a visual relational tree. Offering distinct visualizations of the same results is important, as each one can provide a different insight into the same data. Aiming at improved user interaction and greater user satisfaction, these tools rely on AJAX, Flash or Silverlight to create captivating and usable interfaces.

Fig. 7 Viewzi result grid for gen2phen search

²⁸ Viewzi: http://www.viewzi.com

²⁹ Searchme: http://www.searchme.com
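The idea of gathering results from distinct search engines to produce a better ranking can be illustrated with a minimal sketch. How Viewzi or Searchme actually combine results is not described here, so the example swaps in a plain Borda count as one possible merging technique. The function name `merge_rankings`, the engine lists and the `example.org` URLs are all hypothetical.

```python
from collections import defaultdict


def merge_rankings(rankings):
    """Merge several ranked result lists with a Borda count.

    In each list the first result earns len(list) points, the second one
    point less, and so on. A result missing from a list earns nothing
    from that list.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        for position, url in enumerate(ranking):
            scores[url] += len(ranking) - position
    # Highest score first; ties are broken alphabetically so the
    # output is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists returned by three engines for one query.
engine_a = ["example.org/gen2phen", "example.org/lsdb", "example.org/news"]
engine_b = ["example.org/lsdb", "example.org/gen2phen", "example.org/wiki"]
engine_c = ["example.org/gen2phen", "example.org/wiki", "example.org/lsdb"]

merged = merge_rankings([engine_a, engine_b, engine_c])
for rank, url in enumerate(merged, start=1):
    print(rank, url)
```

A result that several engines place near the top rises in the merged list, while a result returned by a single engine sinks. The merged order could then feed any of the visualizations described above, such as a grid of page screenshots.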

SUMMARY

All the presented applications and interfaces are new solutions that are being considered in several thematic fields. They represent the first step towards the next generation of web applications and open the door to a new level of user interaction.

This new wave of web applications will have repercussions on bioinformatics. Hypothetical applications such as iBioinformatics and BioDesktop, or new result visualization tools, could leave their mark on the bioinformatics world.

Following the iGoogle and Netvibes example, one could develop a similar portal that integrates gadgets and applications in a single interface. iBioinformatics or BioVibes would represent a leap forward in integration and personalization. If a large range of services were made available in the gadget repository, any researcher could customize the application according to his needs, thus creating his own personal meta-application.

BioDesktop or BiOS could be an eyeOS-based bioinformatics and biomedical web desktop. Following the desktop metaphor, one could create a web desktop implementation containing applications and tools useful for researchers. Any user could then have his own personal desktop online, customized according to his own needs and taste.

Integration plays a large role in the future of bioinformatics, but data visualization is also important. Web screenshots are useful to preview a page returned by a search. This idea could be applied to bioinformatics search results by showing pathway previews or protein structure previews. By arranging the results in grids or lists and using technologies like AJAX, Flash or Silverlight to create new interfaces, one could develop interesting and useful applications.
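The personal meta-application idea can be made more concrete with a small sketch of a gadget repository and a per-user portal. This is only an illustration under stated assumptions: the class names (`Gadget`, `GadgetRepository`, `PersonalPortal`) and the three gadgets are hypothetical, and each gadget returns placeholder text instead of calling a real bioinformatics service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Gadget:
    """A small tool that turns a query (e.g. a gene symbol) into display text."""
    name: str
    render: Callable[[str], str]


class GadgetRepository:
    """Shared catalogue from which researchers pick gadgets."""

    def __init__(self) -> None:
        self._gadgets: Dict[str, Gadget] = {}

    def register(self, gadget: Gadget) -> None:
        self._gadgets[gadget.name] = gadget

    def get(self, name: str) -> Gadget:
        return self._gadgets[name]  # raises KeyError for unknown gadgets


@dataclass
class PersonalPortal:
    """One researcher's page: an ordered selection of repository gadgets."""
    owner: str
    repository: GadgetRepository
    layout: List[str] = field(default_factory=list)

    def add(self, name: str) -> None:
        self.repository.get(name)  # validate the name before adding it
        self.layout.append(name)

    def render(self, query: str) -> List[str]:
        return [self.repository.get(name).render(query) for name in self.layout]


# Hypothetical gadgets; real ones would wrap remote services.
repo = GadgetRepository()
repo.register(Gadget("pathway_preview", lambda q: f"[pathway preview for {q}]"))
repo.register(Gadget("literature_feed", lambda q: f"[recent papers on {q}]"))
repo.register(Gadget("structure_preview", lambda q: f"[protein structure for {q}]"))

portal = PersonalPortal(owner="researcher", repository=repo)
portal.add("literature_feed")
portal.add("pathway_preview")

page = portal.render("BRCA2")
for panel in page:
    print(panel)
```

Keeping the repository separate from the portal mirrors the iGoogle and Netvibes model: gadgets are published once, and each user only stores the list of gadget names that makes up his own page.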

CONCLUSION

Bioinformatics applications are evolving. Evolution is not a simple process, and choosing the right path is not a trivial task. This evolution is usually sustained by large projects, like the Human Genome Project a few years ago or the European GEN2PHEN project now.

As bioinformatics evolves, so do other software applications. The trend is to move software to the web and to make it freely available to the entire world. This process may be complex but, in the end, the benefits outweigh the tradeoffs that have to be made.

For bioinformatics, keeping pace with state-of-the-art web technologies is a tremendous task. The life sciences are definitely one of the areas with the largest amounts of data, and one where the differences between applications and services are most noticeable. This leads to enormous complexity when integrating heterogeneous information sources.

Despite these difficulties, several groups are working to solve integration problems, following several approaches. Semantic web concepts for better machine-to-machine exchanges and “proprietary” integration frameworks using hard-coded concept mapping are solutions currently under development. However, there is no perfect solution for these problems. Fully automatic and dynamic information integration has not yet been achieved and is still science fiction.

Hopefully, the perspectives presented here, together with further concepts from success cases in other areas like entertainment or CRM, will enhance current bioinformatics web applications and empower developers with tools to design new ones.

REFERENCES

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Sci Am 284 (2001) 34-43

2. Goble, C., Stevens, R.: State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics 41 (2008) 687-693

3. Fielding, R.: Semantic Web Services Challenge: Architectural Styles and the Design of Network-based Software Architectures. Semantic Web Services Challenge: Challenge on Automating Web Services Mediation, Choreography and Discovery: 2006; Stanford University, USA (2000)

4. Belleau, F., Nolin, M.-A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics 41 (2008) 706-716

5. Splendiani, A.: RDFScape: Semantic Web meets Systems Biology. BMC Bioinformatics 9 (2008) S6

6. Goble, C.A., De Roure, D.C.: myExperiment: social networking for workflow-using e-scientists. Proceedings of the 2nd workshop on Workflows in support of large-scale science. ACM, Monterey, California, USA (2007)

7. Cardoso, J., Sheth, A.: Semantic E-Workflow Composition. Journal of Intelligent Information Systems (2003)

8. Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., Zhao, Y.: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience 18 (2006) 1039-1065

9. Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A., Li, P.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20 (2004) 3045-3054

10. Wilkinson, M., Links, M.: BioMoby: An open source biological web services proposal. Brief Bioinform 3 (2002) 331-341

11. Cheung, K.-H., Yip, K.Y., Townsend, J.P., Scotch, M.: HCLS 2.0/3.0: Health care and life sciences data mashup using Web 2.0/3.0. Journal of Biomedical Informatics 41 (2008) 694-705

12. de Knikker, R., Guo, Y., Li, J.-l., Kwan, A., Yip, K., Cheung, D., Cheung, K.-H.: A web services choreography scenario for interoperating bioinformatics applications. BMC Bioinformatics 5 (2004) 25

13. Margaria, T., Kubczak, C., Steffen, B.: Bio-jETI: a service integration, design, and provisioning platform for orchestrated bioinformatics processes. BMC Bioinformatics 9 (2008) S12

14. DiBernardo, M., Pottinger, R., Wilkinson, M.: Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework. Journal of Biomedical Informatics 41 (2008) 837-847

15. Pinheiro, M., Afreixo, V., Moura, G., Freitas, A., Santos, M.A.S., Oliveira, J.L.: Statistical, computational and visualization methodologies to unveil gene primary structure features. Vol. 45, n.º 2 (2006) 163-168

16. using MAGE-ML. MGED 9: The meeting of the Microarray Gene Expression Data Society

17. Arrais, J., Santos, B., Fernandes, J., Carreto, L., Santos, M.A.S., Oliveira, J.L.: GeneBrowser: an approach for integration and functional classification of genomic data. Vol. 4, n.º 3 (2007)

18. Lopes, P.: Service Integration for Knowledge Extraction. Electronics, Telecommunications and Informatics Department, Vol. Master of Science. University of Aveiro, Aveiro (2008)

19. Lopes, P., Arrais, J., Oliveira, J.L.: Dynamic Service Integration using Web-based Workflows. In: Society, A.C. (ed.): 10th International Conference on Information Integration and Web Applications & Services. Association for Computing Machinery, Linz, Austria (2008) 622-625

20. Oliveira, J.L., Dias, G.M.S., Oliveira, I.F.C., Rocha, P.D.N.S.d., Hermosilla, I., Vicente, J., Spiteri, I., Martin-Sánchez, F., Pereira, A.M.M.d.S.: DISEASECARD: A Web-based Tool for the Collaborative Integration of Genetic and Medical Information. 5th International Symposium, ISBMDA 2004: Biological and Medical Data Analysis (2004) 409-417

21. Nadeem, F., Yousaf, M.M., Ali, M.: Grid Performance Prediction: Requirements, Framework, and Models. Emerging Technologies, 2006. ICET '06. International Conference on (2006) 695-702

22. Chen, W., Lu, H., Shen, L., Wang, Z., Xiao, N., Chen, D.: A Novel Hardware Assisted Full Virtualization Technique. Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for (2008) 1292-1297

23. Vouk, M.A.: Cloud computing - Issues, research and implementations. Information Technology Interfaces, 2008. ITI 2008. 30th International Conference on (2008) 31-40