Mar 2016 VOL 2 ISSUE 3
“Science is a beautiful gift to humanity; we should not distort it.”
– APJ Abdul Kalam
Predictive Metagenomics Profiling: Why, what, and how?
WebFEATURE: Tool to identify and visualize Functional sites in Macromolecules
Public Service Ad sponsored by IQLBioinformatics
Contents
March 2016
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Topics

Editorial .... 03
Tools: WebFEATURE: Tool to identify and visualize Functional sites in Macromolecules .... 07
Cloud Computing: Big Data in Bioinformatics .... 09
Sequence Analysis: A practical guide to selection analyses of coding sequence datasets and its intricacies .... 12
Genomics: What is PRSice? .... 14
Genomics: Predictive metagenomics profiling: why, what, and how? .... 16
Meta-Analysis: How to develop a search strategy for meta-analysis and systematic reviews? .... 19
Systems Biology: Network Biology: Get-together of Macromolecules .... 21
Bioinformatics & Legal Aspects: Biological Databases & Intellectual Property Rights – An Introduction .... 23
EDITOR: Dr. Prashant Pant
FOUNDER: Tariq Abdullah
EXECUTIVE EDITOR: Fozail Ahmad
FOUNDING EDITOR: Muniba Faiza
SECTION EDITORS: Altaf Abdul Kalam, Manish Kumar Mishra, Sanjay Kumar, Prakash Jha, Nabajit Das

REPRINTS AND PERMISSIONS: You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUES: Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT: Phone: +91 991 1942-428 / 852 7572-667. Mail: Editorial, 101 FF Main Road Zakir Nagar, Okhla, New Delhi, IN 110025.

STAFF ADDRESS: To contact any member of the Bioinformatics Review staff, simply format the address as firstname@bioinformaticsreview.com.

PUBLICATION INFORMATION: Volume 1, Number 1. Bioinformatics Review™ is published quarterly for one year (4 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act, 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA Trust. Published in India.
EDITORIAL: Welcoming BiR in its 2nd year
EDITORIAL
Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a large volume of quality research on a variety of topics such as genome analysis, algorithms, and sequence analysis, this work rarely gets any notice in the popular press.
Dr. Prashant Pant
Editor
One reason behind this rather disturbing trend is that very few people can successfully read a research paper and turn it into a news story. Moreover, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics. Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another. Bioinformatics Review has been conceptualized to address all these concerns. We will provide insight into bioinformatics, both as an industry and as a research discipline. We will post new developments in bioinformatics and the latest research. We will also accept entries from our audience and, where possible, reward them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors, and industry. We will also provide a free job-listing service for anyone who can benefit from it.
Letters and responses: info@bioinformaticsreview.com
TOOLS
WebFEATURE: Tool to identify and visualize Functional Sites in Macromolecules Image Credit: Google Images
“WebFEATURE is a web-based analysis tool for the identification of functional sites in macromolecules. Users can easily identify the functional sites in their query structures.”
The structure-based identification and assignment of function to unknown macromolecules has been observed to be faster and more reliable than sequence-based methods. This may be because structure-based methods can identify molecules beyond their residue identity with the help of 3D space.

WebFEATURE is a web-based analysis tool for the identification of functional sites in macromolecules. Users can easily identify the functional sites in their query structures. It scans query structures beyond residue identity and also considers the biophysical and biochemical properties of the functional sites in 3D space.
Fig.1 Web interface of WebFEATURE.
The FEATURE system uses a supervised learning algorithm to find the conserved properties of similar structures and then builds statistical models. These models represent the statistical distribution of physico-chemical properties of the functional sites at increasing distances from the site of interest, which better explains the chemical patterns behind the residues. The supervised learning algorithm automatically discovers the physico-chemical properties of the macromolecules.
Fig 2. The output of a WebFEATURE scan for an ATP binding site in Casein Kinase-1 (PDB ID: 1csn) shows the hits, above cutoff, superimposed on the structure and crystallographically bound ATP. Hit score statistics are plotted in a histogram to the right of the Chime viewer. By entering a new cutoff in the Cutoff text field, or by clicking on the histogram, the user can change the displayed hits by score.
Bioinformatics Review | 7
Buttons are provided to change the representation of the molecule and the hits. Details of the statistical model are also provided. The user can also specify sites and non-sites in the query structure: sites are locations with functional or structural roles, and non-sites are those where that function does not occur. The training algorithm generates an output model that differentiates sites from non-sites. This model is then used as part of the input to FEATURE's scanning algorithm, which analyzes grid points over a query structure for similar sites within a significance cut-off. A log-odds scoring function of the physico-chemical properties around each grid point is calculated; this score provides the likelihood that a grid point is a site of interest. The higher the score, the more likely the point is a site of interest.
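The log-odds scoring idea described above can be caricatured in a few lines of Python. This is only a toy sketch of the general principle, not FEATURE's actual implementation; the property tables, their names, and the probabilities below are invented for illustration:

```python
import math

def log_odds_score(site_model, background_model, observed, floor=1e-6):
    """Score one grid point: sum the log-odds of each observed property
    value under the site model versus the background (non-site) model."""
    score = 0.0
    for prop, value in observed.items():
        p_site = site_model.get(prop, {}).get(value, floor)
        p_background = background_model.get(prop, {}).get(value, floor)
        score += math.log(p_site / p_background)
    return score

# Invented single-property models: "charge" is more often "negative" at true sites.
site = {"charge": {"negative": 0.8, "positive": 0.2}}
background = {"charge": {"negative": 0.2, "positive": 0.8}}

print(log_odds_score(site, background, {"charge": "negative"}))  # log(0.8/0.2), positive score
```

A high positive total suggests the grid point resembles known sites; a negative total suggests background.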
An exhaustive list of references for this article is available from the author on request; for more details, write to muniba@bioinformaticsreview.com.
WebFEATURE is a user-friendly, web-based tool that also allows offline analysis of its outputs. Results can be downloaded and visualized in Chimera, PyMOL, etc. WebFEATURE provides its results in real time. For further details, click here.
CLOUD COMPUTING
Big Data in Bioinformatics Image Credit: Google Images
“Big data describes a large volume of data; in bioinformatics and computational biology, it represents a new paradigm that transforms studies into large-scale research.”

With the ever-increasing amount of biological data generated by advanced tools and techniques, a number of suitable methods have been developed in parallel to handle this vast amount of data, in order to make it presentable, accessible, and arranged in a logical order that increases workability. Because the data are voluminous, Big Data management methods have shown their ability to manage biological data effectively, in terms of both accessibility and cost.
The high-throughput experiments in bioinformatics, together with the increasing trend toward personalized medicine, are increasing the need to produce, store, and analyze massive datasets in a manageable time. The role of big data in bioinformatics is to provide data repositories, better computing facilities, and data manipulation tools to analyze the data.
In the field of bioinformatics, big data technologies and tools have been categorized into four areas:
1. Data storage and retrieval:

Parallel computing is one of the fundamental infrastructures for managing big data tasks [1]. It allows algorithms to be executed simultaneously on a cluster of machines or supercomputers. Google proposed the novel MapReduce parallel computing model as a new big data infrastructure [2]. Similarly, Hadoop, an open-source MapReduce package, was introduced by Apache for distributed data management and has been successfully applied in bioinformatics [3]. Hadoop also provides cloud computing facilities for centralized data storage with remote access.

The sequencing data obtained need to be mapped to specific reference genomes for further analysis. For this purpose, CloudBurst, a parallel computing model, is used [4]. It facilitates genome mapping by parallelizing the short-read mapping process to improve scalability for large sequencing datasets. Its developers have also created new tools such as Contrail, for assembling large genomes, and Crossbow, for identifying SNPs from sequence datasets. Similarly, various tools have been developed, such as DistMap (a toolkit for distributed short-read mapping on a Hadoop cluster) [5], SeqWare (to access large-scale whole-genome datasets) [6], the DDBJ Read Annotation Pipeline (a cloud-based pipeline to analyze NGS data) [7], and Hydra (for processing large peptide and spectra databases) [8].

2. Error identification:

It is necessary to identify errors in sequence datasets, and many cloud-based software packages have been developed for this purpose. For example, SAMQA [9] identifies errors and ensures that large-scale genomic data meet minimum quality standards, ART [10] simulates data for three major sequencing platforms (454, Illumina, and SOLiD), and CloudRS [11] corrects errors in high-throughput sequencing data.
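As a toy illustration of the MapReduce pattern that Hadoop and CloudBurst build on, here is a serial Python sketch of a map step and a reduce step counting k-mers in short reads. On a real cluster the map calls would run in parallel across machines; the function names and data here are our own:

```python
from collections import Counter
from itertools import chain

def map_kmers(read, k=3):
    # Map step: emit (k-mer, 1) pairs for one read.
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_counts(pairs):
    # Reduce step: sum the counts for each k-mer key.
    totals = Counter()
    for kmer, count in pairs:
        totals[kmer] += count
    return totals

reads = ["ATGCGT", "GCGTAA"]
# On Hadoop, the map step would run in parallel across the cluster;
# here we simply chain the per-read outputs together before reducing.
kmer_counts = reduce_counts(chain.from_iterable(map_kmers(r) for r in reads))
print(kmer_counts.most_common(2))  # [('GCG', 2), ('CGT', 2)]
```

The same split into independent map tasks and a merging reduce task is what lets these tools scale short-read processing across a cluster.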
3. Data Analysis:

This feature of big data allows researchers to analyze the data obtained from experiments. For example, GATK (Genome Analysis Toolkit) is a MapReduce-based programming framework used for large-scale DNA sequence analysis [12]; it supports many data formats (SAM, BAM, and others). The ArrayExpress Archive of Functional Genomics data repository is an international collaboration for integrating high-throughput genomics data [13], BlueSNP is used to analyze genome-wide association studies [14], and there are many more.

4. Platform Integration and Deployment:

Since not everyone has a good grasp of computing and networking, novel methods are needed to integrate big data technologies into user-friendly operations. For this purpose, several software packages have been introduced: SeqPig reduces the technological skill required to use MapReduce by reading large formatted files to feed analysis applications [15], CloVR is a sequencing analysis package distributed through a virtual machine [16], CloudBioLinux [17], and so on.

For further details click here.

References
1. Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8, 1.
2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
3. Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(12), S1.
4. Schatz, M. C. (2009). CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11), 1363-1369.
5. Pandey, R. V., & Schlötterer, C. (2013). DistMap: a toolkit for distributed short read mapping on a Hadoop cluster. PLoS One, 8(8), e72614.
6. O'Connor, B. D., Merriman, B., & Nelson, S. F. (2010). SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics, 11(12), S2.
7. Nagasaki, H., Mochizuki, T., Kodama, Y., Saruhashi, S., Morizaki, S., Sugawara, H., ... & Kaminuma, E. (2013). DDBJ Read Annotation Pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. DNA Research, dst017.
8. Wulf, W. A., Levin, R., & Harbison, S. P. (1981). HYDRA/C.mmp, an experimental computer system. McGraw-Hill Companies.
9. Robinson, T., Killcoyne, S., Bressler, R., & Boyle, J. (2011). SAMQA: error classification and validation of high-throughput sequenced read data. BMC Genomics, 12(1), 419.
10. Huang, W., Li, L., Myers, J. R., & Marth, G. T. (2012). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593-594.
11. Chen, C. C., Chang, Y. J., Chung, W. C., Lee, D. T., & Ho, J. M. (2013, October). CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework. In Big Data, 2013 IEEE International Conference on (pp. 717-722). IEEE.
12. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
13. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... & Oezcimen, A. (2003). ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1), 68-71.
14. Huang, H., Tata, S., & Prill, R. J. (2013). BlueSNP: R package for highly scalable genome-wide association studies using Hadoop clusters. Bioinformatics, 29(1), 135-136.
15. Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., & Heljanko, K. (2014). SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics, 30(1), 119-120.
16. Angiuoli, S. V., Matalka, M., Gussman, A., Galens, K., Vangala, M., Riley, D. R., ... & Fricke, W. F. (2011). CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics, 12(1), 356.
17. Krampis, K., Booth, T., Chapman, B., Tiwari, B., Bicak, M., Field, D., & Nelson, K. E. (2012). Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics, 13(1), 42.
SEQUENCE ANALYSIS
A practical guide to selection analyses of coding sequence datasets and its intricacies Image Credit: Google Images
“This is the second article in the popular series “Do you HYPHY…”, recently published online in BiR. In this issue, I would like to take substitution/selection analyses and their intricacies to the next level.”

As previously mentioned, HyPhy offers a collection of programs such as GARD (Genetic Algorithm for Recombination Detection), SLAC (Single Likelihood Ancestor Counting), REL (Random Effects Likelihood), and FEL (Fixed Effects Likelihood). All these packages contribute to a holistic selection analysis, depicting whether there are signatures of positive selection and, if so, which sites are under positive selection. To start with, one needs a coding sequence dataset, and it is important to point out that an intron-free coding sequence dataset is the minimum basic condition for these packages, without which the programs simply will not work. For this reason, HyPhy cannot be deployed for selection analyses of non-coding sequences such as rRNA, repetitive DNA sequences, or molecular-marker-based sequences (Sequence Tagged Sites, STS), which contain composites of non-coding and coding regions.

Prior to selection analyses, the first and foremost thing one should look for is potential chimeras in the sequences. Chimeric sequences are generated by artifacts of amplification and sequencing. They are highly undesirable and can lead to drastic anomalies in the selection analyses, skewing the data toward extreme (unreal) positive selection. To reveal potential recombinants, HyPhy offers GARD, which uses statistical probability analyses to depict the sites (essentially bases) at which breakpoints have occurred, creating chimeras or putative recombinants. Apart from GARD, one can also refer to RDP (Recombination Detection Program) by Darren Martin (2000). The two programs differ somewhat in how they deliver results but
essentially do the same thing: detecting recombinants. The advantage of using GARD on the Datamonkey server is that it allows partitioning of sequence datasets into clean non-recombinant partitions, ignoring possible recombinant/chimeric regions and thus preventing overestimation. Once the recombinants have been identified and excluded from the dataset, it is all set for the analyses.

For the identification of signatures of positive selection, three methods are employed: SLAC, REL, and FEL. The probable reason for employing three methods as opposed to one is to obtain a more rigorous estimate of positive selection and to account for false positives (overestimations). SLAC is basically a "counting" method that employs either a single most likely ancestral reconstruction, weighting across all possible ancestral reconstructions, or sampling from ancestral reconstructions. REL models variation in non-synonymous (nucleotide changes that alter the encoded amino acid) and synonymous (nucleotide changes that leave the amino acid unchanged) rates across sites according to a predefined distribution model. The distribution model also uses an a priori calculated selection pressure at each individual site; the selection pressure, in turn, is derived from empirical Bayesian approaches. The major demerit of REL is that it suffers from high false-positive rates. FEL is the most robust method of all: estimation proceeds site by site in terms of the rate of non-synonymous substitution (given by 'dN', also referred to as β) over the rate of synonymous substitution (given by 'dS', also referred to as α). In this manner, a site-by-site dN/dS analysis identifies which codon sites are under positive selection (α < β). Essentially speaking, neutral evolution is said to be occurring at a site when dN/dS = 1, a value greater than 1 depicts positive selection, and a value less than 1 indicates negative selection. FEL is considered the most accurate and precise of the three methods, as it estimates dN and dS independently at each codon site using the modified Suzuki and Gojobori method and makes no a priori assumption about the rate distribution, making the estimation all the more accurate. Errors in dN or dS, and in local α or β estimates, are corrected by probability values (p-values) acting as a level of significance for every site. In this manner, a combined approach to studying sites under evolution and/or selection is applied to a coding sequence dataset. There are, however, many other ways to perform selection analyses, and they will be taken up in future issues. For a further practical application of these methods in a field study, readers can refer to the author's previously published article in Archives of Virology (Singh-Pant et al., 2012; DOI: 10.1007/s00705-012-1287-x). The author has a comprehensive list of references, available upon request. For more information contact prashant@bioinformaticsreview.com
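The dN/dS decision rule described above is simple enough to state in code. Here is a minimal sketch of that rule as a small helper of our own, not a part of HyPhy:

```python
def classify_selection(dn, ds):
    """Classify a site from its non-synonymous (dN, beta) and synonymous
    (dS, alpha) substitution rates: dN/dS > 1 means positive selection,
    dN/dS < 1 means negative (purifying) selection, dN/dS = 1 is neutral."""
    if ds == 0:
        raise ValueError("dS must be non-zero to form the dN/dS ratio")
    ratio = dn / ds
    if ratio > 1:
        return "positive"
    if ratio < 1:
        return "negative"
    return "neutral"

print(classify_selection(0.9, 0.3))  # dN/dS = 3 > 1, so "positive"
```

Real pipelines attach a p-value to each site's estimate, as noted above, rather than trusting the raw ratio alone.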
GENOMICS
What is PRSice?
Image Credit: Google Images
“PRS is calculated using estimated published GWAS results. This technique was first applied by the International Schizophrenia Consortium (2009), demonstrating that genetic risk for schizophrenia (SCZ) is a predictor of bipolar disorder. It has also proved a reliable indicator of genetic liability.”

Etiology is the study of the origination or causation of an event or phenomenon. Genetic etiology is the study of the genes responsible for particular traits, along with other contributing genes, in an organism. Identifying the genetic etiology has become a standard step when studying the genotypes and/or phenotypes of individuals. For this, the PRS, or Polygenic Risk Score, is calculated.

A PRS is the summation of trait-associated alleles across various genetic loci, weighted by their effect sizes on a trait of interest. The effect sizes are estimated by genome-wide association studies (GWAS), which have revealed that the genetic basis of most complex traits comprises the small effects of hundreds or thousands of variants. These polygenic effects can be considered the genetic liability to disease risk associated with these genes. The PRS has been found to be accurate in most applications. A PRS only considers variants passing a P-value threshold, PT.

PRS is calculated using estimated published GWAS results. This technique was first applied by the International Schizophrenia Consortium (2009), demonstrating that genetic risk for schizophrenia (SCZ) is a predictor of bipolar disorder. It has also proved a reliable indicator of genetic liability. Euesden et al. (2014) introduced a software package, PRSice (pronounced 'precise'), to easily calculate the PRS. They examined the genetic relationship between schizophrenia (SCZ) and major depressive disorder (MDD) and significantly showed that the PRS for SCZ predicts MDD status.
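The PRS definition above (a sum of risk-allele dosages weighted by GWAS effect sizes, restricted to variants passing the P-value threshold PT) can be sketched in a few lines of Python. This is a toy illustration of the formula, not PRSice's implementation; the numbers are invented:

```python
def polygenic_risk_score(dosages, betas, p_values, p_threshold=0.05):
    """Sum of risk-allele dosages (0, 1, or 2 per SNP) weighted by GWAS
    effect sizes, including only SNPs with p <= p_threshold (PT)."""
    return sum(beta * dosage
               for dosage, beta, p in zip(dosages, betas, p_values)
               if p <= p_threshold)

# Three SNPs for one individual; the third fails the threshold and is ignored.
dosages  = [2, 1, 1]
betas    = [0.10, -0.05, 0.30]
p_values = [0.001, 0.04, 0.20]
print(polygenic_risk_score(dosages, betas, p_values))  # 2*0.10 + 1*(-0.05), about 0.15
```

In practice the dosages come from the target sample's genotypes and the betas and p-values from a published base GWAS.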
Fig. 1 Bar chart of PRSice at broad P-value thresholds for SCZ predicting MDD status [Euesden et al. (2014)].

PRSice returns the best-fit PRS according to the polygenic risk associated with the alleles responsible for a trait. The main feature of PRSice is that it can easily and automatically calculate the PRS at any P-value threshold, PT, and then identify the most precise threshold. It only requires the GWAS results of individuals on a base phenotype and
genotype data on a target phenotype to calculate the PRS for each individual, and it plots a PRS model depicting the best-fit range of PT values. PRSice can also account for SNPs in linkage disequilibrium (i.e., the occurrence of combinations of alleles in non-random proportions in a population), though it is up to the user whether to use this option. PRSice is a command-line program that calculates PRS for individuals under variously specified parameters and reduces computation time.

For further reading, click here. For any query, write to muniba@bioinformaticsreview.com

Reference
Euesden, J., Lewis, C. M., & O'Reilly, P. F. PRSice: Polygenic Risk Score software.
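PRSice's central idea, computing the PRS at many PT values and keeping the threshold that best fits the target phenotype, can be caricatured as follows. The association measure here is a plain squared Pearson correlation with a toy phenotype; PRSice itself fits regression models, and all scores below are invented:

```python
def pearson_r(xs, ys):
    # Plain Pearson correlation, no external libraries.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Pretend PRS values for four individuals at three candidate PT thresholds.
scores = {0.001: [0.1, 0.4, 0.2, 0.5],
          0.05:  [0.0, 0.9, 0.1, 1.0],
          0.5:   [0.3, 0.3, 0.2, 0.4]}
phenotype = [0, 1, 0, 1]  # case/control status

# Keep the PT whose scores explain the phenotype best (largest r squared).
best = max(scores, key=lambda t: pearson_r(scores[t], phenotype) ** 2)
print(best)  # 0.05
```

The "high-resolution" part of PRSice is simply doing this over a fine grid of thresholds.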
SEQUENCE ANALYSIS
Predictive metagenomics profiling: why, what and how? Image Credit: Google Images
“This article reviews the latest approaches for functional and ecological profiling of 16S rRNA metagenomic datasets from environmental samples. It is intended for researchers working in molecular microbial ecology, metagenomics, and the functional aspects of metagenomics data, particularly with reference to 16S rRNA.”

What is predictive metagenomics profiling?

Recently, predictive metagenomics profiling (PMP) has been added to the microbial ecologist's arsenal of strategies for probing microbial communities. Currently, there are two platforms available for PMP: PICRUSt (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States; http://picrust.github.io/picrust/) and tax4fun (available at tax4fun.gobics.de/) [1,2]. The aim of PMP is to predict the abundance of functional gene families present in microbial communities (i.e. the community's functional potential)
using amplicon-based sequencing data, such as 16S rRNA data. If PMP reveals that the predicted functional potential of a community changes between treatments, then researchers will have a strong incentive to invest the time, money and computational power to investigate community function further via techniques including shotgun metagenomics or functional microarrays such as GeoChip [3]. Additionally, the types of gene families predicted to vary by PMP could give rise to new hypotheses that will direct future experimental design.

How does it work?
PICRUSt and tax4fun are described in detail in their respective publications but briefly: Both programs generate a database of organisms with known gene content. For tax4fun, this is achieved using KEGG organisms (Kyoto Encyclopedia of Genes and Genomes; www.genome.jp/kegg/) with sequenced genomes. For PICRUSt, a reference phylogenetic tree with GreenGenes identifiers (available at greengenes.lbl.gov/) is created. The gene content for organisms in the PICRUSt reference phylogenetic tree is either a) compiled directly from databases containing the well annotated, sequenced genomes or, b) inferred using an ancestral state reconstruction algorithm. The latter
process is used when organisms in the reference phylogenetic tree have not been sequenced. The underlying assumption is that taxonomically similar organisms will have functionally similar capabilities. Indeed, the authors of PICRUSt highlight that phylogeny and biomolecular function are highly correlated [1] (Langille et al., 2013). Gene family predictions for experimental data are then made by associating the taxonomic IDs of OTUs (Operational Taxonomic Units) in the experimental data with organisms in the reference database with precomputed gene content. For PICRUSt, GreenGenes taxonomic identifiers are used to create the phylogenetic reference tree; thus, to map experimental 16S rRNA data to this tree, taxonomic assignment of OTUs must also be performed using GreenGenes. Conversely, tax4fun requires SILVA (www.arb-silva.de) assigned taxonomies, which are then mapped to the reference database of KEGG organisms. Both programs use 16S rRNA copy number information to normalize the abundance of identified OTUs.

Why use predictive metagenomics profiling?

Unlike metagenomics, which requires massive amounts of sequencing in order to achieve adequate read-coverage of rare genomes – which is particularly challenging for highly diverse communities such as soils – PMP only requires sufficient sequencing depth to cover a single target gene – i.e. the 16S rRNA gene [4]. This significantly reduces the amount of data required to take a first look at the community functional profile. The output of a PMP pipeline is a table of predicted gene family counts (in the form of identifiers such as KEGG Orthologies) which can be clustered into pathway-level categories (i.e. KO modules) if so desired. Additionally, for each OTU, 16S rRNA copy number is used to scale the contribution of predicted gene families to the community's functional potential. A natural limitation of PMP lies in the diversity of available reference organisms with fully sequenced genomes: for an OTU to have its gene content predicted, it must be mappable to a reference organism, and if it is not similar enough to one, its gene content cannot be predicted. Additionally, even though PICRUSt can predict the gene content of unsequenced organisms, these predictions will only be accurate if a number of closely related, sequenced genomes are available to predict from. Nevertheless, PMP is a simple and elegant strategy
for adding value to 16S rRNA metagenomics data, such as that generated on an Illumina MiSeq.

Using predictive metagenomics profiling
The input for PMP is a normalized OTU table in which taxonomic assignments have been made using the database appropriate to the PMP strategy (i.e. SILVA or GreenGenes). As with any sequencing data, appropriate QC and data-trimming strategies need to be observed. Data QC, taxonomic assignment, generation of an OTU table, and normalisation (i.e. rarefied counts or model-based normalisation) can be performed using existing pipelines such as QIIME (Quantitative Insights Into Microbial Ecology; www.qiime.org) and USEARCH 16S rRNA pipelines [5,6,7]. Both PICRUSt and tax4fun are compatible with QIIME outputs, making it easy for users to add PMP to existing workflows. Through the QIIME-PICRUSt help pages, there is a range of recommended scripts that can be used to assess changes in functional profiles between communities [5]. Additionally, KO data can be exported and used in conjunction with traditional ecological tools, such as ordinations (NMDS, PCO, etc.) or species indicator tests (available using
the multipatt function in the indicspecies package in R) to identify gene families or functional modules that are indicative of different communities [8].

For a more comprehensive list of references, readers are directed to mail us at info@bioinformaticsreview.com

References

1. Langille, M.G.I., Zaneveld, J., Caporaso, J.G., McDonald, D., Knights, D., Reyes, J.A., Clemente, J.C., Burkepile, D.E., Vega Thurber, R.L., Knight, R., Beiko, R.G., Huttenhower, C., 2013. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnology, 31, 814-821.
2. Aßhauer, K.P., Wemheuer, B., Daniel, R., Meinicke, P., 2015. Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics, 31, 2882-2884.
3. He, Z., Deng, Y., Van Nostrand, J.D., Tu, Q., Xu, M., Hemme, C.L., Li, X., Wu, L., Gentry, T.J., Yin, Y., Liebich, J., Hazen, T.C., Zhou, J., 2010. GeoChip 3.0 as a high-throughput tool for analyzing microbial community composition, structure and functional activity. ISME Journal, 4, 1167-1179.
4. Howe, A.C., Jansson, J.K., Malfatti, S.A., Tringe, S.G., Tiedje, J.M., Brown, C.T., 2014. Tackling soil diversity with the assembly of large, complex metagenomes. Proceedings of the National Academy of Sciences of the United States of America, 111, 4904-4909.
5. Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I., Huttley, G.A., Kelley, S.T., Knights, D., Koenig, J.E., Ley, R.E., Lozupone, C.A., McDonald, D., Muegge, B.D., Pirrung, M., Reeder, J., Sevinsky, J.R., Turnbaugh, P.J., Walters, W.A., Widmann, J., Yatsunenko, T., Zaneveld, J., Knight, R., 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335-336.
6. Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460-2461.
7. McMurdie, P.J., Holmes, S., 2014. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10.
8. De Cáceres, M., Legendre, P., Moretti, M., 2010. Improving indicator species analysis by combining groups of sites. Oikos, 119, 1674-1684.
META-ANALYSIS
How to develop a search strategy for meta-analysis and systematic reviews? Image Credit: Google Images
“The article tries to help budding biological data analysts develop the right methodology for querying literature databases, so they can quickly get started with their analyses.”

It all begins with finding a topic to work on! Find yourself a topic of current relevance, with publications of clinical trials, case-control, or response/no-response studies in recent years. Randomized controlled trials are the gold standard for such analyses and are also hard to find. The procedure emphasizes recent studies because that keeps the data for analysis updated to the latest methods, techniques, and error-minimization procedures. The recentness of the included studies also strengthens the analyst's view of the topic's relevance in current times.
To construct a systematic review and conduct a meta-analysis of all literature published on a topic, one must know how to precisely extract metadata from the various databases at one's disposal. One may build a corpus of experimental studies from any single database, say PubMed, by submitting a query containing words/phrases covering the name of a disorder, prescribed or on-trial drugs, and the type of study expected, all connected by Boolean operators. PubMed will usually return results in five figures, which is far too large a dataset to sort manually. We must keep in mind that PubMed is not the ultimate collection of all biological literature ever published; it is a sufficiently well-informed database, but it must not be regarded as complete. There are other, topic-specific databases, which can be expected to contain literature more concentrated on the theme they nurture. Moving on to the actual development of the query, the analyst must ascertain that the search string is complete in all respects, i.e., that it contains all the MeSH terms for the items queried. Even though PubMed automatically generates synonyms for a query from its database of MeSH terms, in our experience providing the MeSH terms explicitly produced a steep fall in the number of results returned; the drop was so significant that the previously five-figure corpus was reduced to a measly three figures and became manually sortable. It is noteworthy that, where possible, one must not filter out studies published in languages other than English, as doing so may bias the study by eliminating regional studies and data. The analyst may always try to contact the authors for an English version of a study, thus reducing bias and making the analysis more robust. NOTE: For any query write to manish@bioinformaticsreview.com
SYSTEMS BIOLOGY
Network Biology: Get-together of Macromolecules
“An ecosystem may be modeled as a network of interacting biotic and abiotic components; a typical molecular mechanism can be modeled as a network of regulatory genes or proteins; even a protein can be modeled as a network of interacting amino acids.” A network is a group of two or more interacting components. Complex biological systems can be represented as computable networks, which provide a unique way of analyzing the complex underlying mechanisms.
For example, an ecosystem may be modeled as a network of interacting biotic and abiotic components, a typical molecular mechanism can be modeled as a network of regulatory genes or proteins, and even a protein can be modeled as a network of interacting amino acids. Further, a set of genes can be broken into subsets of small nucleic acids/genes that interact to form a large regulatory network. Even the small components (atoms or molecules) of a protein, such as carbon, oxygen, nitrogen, and sulfur, form an interacting network. In every network, the connected components are considered nodes and the interactions between them are called edges. In other words, a node represents a unit and an edge represents an interaction between units. From a biological point of view, a node can depict a wide array of biological units: anything from a tissue, cell, organ, cellular component, neuron, or macromolecule down to an individual atom. The most connected node, i.e., the node with the highest degree, is called a hub. A hub is considered the most vital node in a network, as it has the maximum number of links and so regulates many other nodes; the overall functionality of the network is affected if the hub is disturbed.
Basically, a network is characterized by two of its properties, viz. degree and betweenness centrality. The degree of a node is the number of interactions/edges it shares with other nodes, while betweenness is a measure of how central a node is, quantifying the number of times the node acts as a bridge along the shortest path between two other nodes in the network.
Fig. 1 The degree of each node is shown: the central node has degree 5, while the others have degree 1, as they are linked only to the central node.
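The degree computation behind Fig. 1 can be reproduced in a few lines of Python. The star graph below is a hypothetical reconstruction of the figure, with hub "A" connected to five leaves; in practice a tool such as Cytoscape or Gephi would report the same values.

```python
# Hypothetical star graph mirroring Fig. 1: hub "A" connected to five leaves.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F")]

# Build an undirected adjacency list; the degree of a node is its neighbour count.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

degree = {node: len(neighbours) for node, neighbours in adj.items()}
hub = max(degree, key=degree.get)  # the most connected node

print(degree)  # {'A': 5, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1}
print(hub)     # 'A'
```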
Fig. 2 The highly central node (shown in yellow) lies between two subgraphs. Node A can exchange information with both sides and can control the overall functionality of the network.
The highly central node serves as a bridge among various portions of the network: an edge must pass through it to reach the other portion of the network. The importance of betweenness centrality varies across networks, whether a social network, a protein-protein interaction network in biology, a gene-protein interaction network, or any regulatory network. The central node is significant for passing information from one sub-network, or one portion of a giant biological interaction network, to another. The networks we analyze are static, whereas biological processes are highly dynamic; mathematical modeling is the approach used to infer useful information from such static graphs (refer to "Introduction to Mathematical Modeling" in the previous issue). Apart from degree and betweenness centrality, other network properties include shortest path length, mean (average) path length, diameter, and density. These are referred to as the topological properties of a network. Each of them is significant in finding the relationship between two nodes and in assessing the flow/exchange of information, robustness, and functionality of a network, and together they tell us about the overall behavior of a biological system; almost all biological networks show positive values for these topological properties. Therefore, we can roughly estimate the function of a macromolecule (protein or gene) from its associations with other connected molecules. Computational approaches have made it easy to calculate statistics on these topological properties: software such as Cytoscape and Gephi is widely used for constructing and visualizing biological networks.
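The topological properties listed above can be computed from an adjacency list with plain breadth-first search. The sketch below uses only the Python standard library and a hypothetical four-node path graph; libraries such as NetworkX provide the same quantities ready-made.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path lengths from `source` by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical path graph A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

all_dist = {u: bfs_distances(adj, u) for u in adj}
diameter = max(max(d.values()) for d in all_dist.values())   # longest shortest path
mean_path = (sum(sum(d.values()) for d in all_dist.values())
             / (len(adj) * (len(adj) - 1)))                  # average over ordered pairs
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
density = 2 * n_edges / (len(adj) * (len(adj) - 1))          # fraction of possible edges

print(diameter)  # 3
print(density)   # 0.5
```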
BIOINFORMATICS & LEGAL ASPECTS
Biological Databases & Intellectual Property Rights – An Introduction
“Intellectual Property Rights (IPRs) are legal rights that protect creations and/or inventions resulting from intellectual activity in the industrial, scientific, literary or artistic fields. The most common IPRs include patents, copyrights, trademarks and trade secrets. This article introduces the concept of the creation of biological databases and the laws/rights which govern and regulate them and other intellectual property. This article is published as a "Popular Article" and as such is intended for anyone interested, regardless of educational background, in the hope that people from varied backgrounds will equally enjoy reading it.” Bioinformatics is an emerging discipline of science and technology. The National Center for Biotechnology Information (NCBI, 2001) defines bioinformatics as "a field of science in which Biology, Computer Science, and Information Technology merge into a single discipline".
Within the framework of bioinformatics, there are three sub-disciplines: the first deals with the development of new algorithms and statistics to help assess relationships among members of large data sets; the second deals with the analysis and interpretation of the various types of data, viz. nucleotide (DNA/RNA) and amino acid sequences, protein domains, and protein structures, that constitute biological databases; and the third deals with the development and implementation of tools that enable efficient access to and management of different types of information. The origin of this discipline, in a crude form, can be traced to the time when the father of genetics, Gregor Johann Mendel, selected certain traits in pea plants, such as height, flower color, and seed shape and color, and began experimenting with these traits while maintaining records of the experimental data (a primitive biological database). He then converted the phenomena observed into mathematical relations, tried to understand how these traits move from generation to generation, and arrived at the founding principles of classical genetics. After Mendel, biological record keeping and the understanding of biological phenomena through mathematical relations have come a long way. The recent history of biological databases starts with the first protein sequence, reported by Frederick Sanger for bovine insulin in 1956 and consisting of 51 residues. For this pioneering work, he was awarded his first Nobel Prize in Chemistry in 1958. Later, the first nucleic acid sequence, that of yeast alanine tRNA with 77 bases, was reported in 1965 by a group led by Robert
Holley at Cornell University. In 1977, Sanger and colleagues introduced the "dideoxy" chain-termination method, or Sanger method, for sequencing DNA molecules, which was eventually used to sequence the entire human genome as well as many other model systems in biology. Margaret Dayhoff, a pioneer in the application of mathematics and computational methods to biochemistry, gathered all the available sequence data to create the first biological protein database, published as the book "Atlas of Protein Sequence and Structure". She developed many of the tools used today in database design and utilization. In 1980, Dr. Dayhoff developed an online database system that could be accessed by telephone line, the first sequence database available for interrogation by remote computers. The rapid advancement of tools and technology in genetics and molecular biology led to an era of "-omics", such as genomics, proteomics, transcriptomics, and metabolomics. These, in turn, have generated a massive amount of data from high-throughput molecular biology approaches dealing with the genome, proteome, etc. This huge amount of data needed to be made available as well as accessible, and this was a challenge in itself. The challenge was successfully tackled using information technology resources, also known as Information and Communication Technology (ICT), which gave a flexible, smart, and rapid way of storing, managing, querying, and retrieving large and complex biological data (viz. sequences and structures). ICT has also helped in the handling, management, and maintenance of data generated from various biological systems. To maintain the growing biological data in digital form, biological databases were created; initially they acted as storehouses of biological data with limited user access, serving students, researchers, and the pharmaceutical industry. Later, with the arrival of the internet and improvements in technology, biological databases came to include many other types of databases related to biological material, such as gene expression databases, metabolic pathway databases, disease databases, etc. Since then, the accessibility, sharing, and spread of biological databases have increased across the globe. Today, biological databases such as NCBI, EMBL, and DDBJ are governed under the aegis of the International Nucleotide Sequence Database Collaboration (INSDC), which makes policies for access and use and issues advisories to submitters and user communities. Subsequently, 579 human genes were mapped by in-situ hybridization in 1981. In 1988, the Human Genome Organisation (HUGO), an international organization of scientists, was founded. The first complete genome map was published for the bacterium Haemophilus influenzae in 1995. Since then, many prokaryotic and eukaryotic genomes have been sequenced, such as Mycoplasma genitalium (1995), Escherichia coli (1997), Caenorhabditis elegans and brewer's yeast (Saccharomyces cerevisiae) (1998), Arabidopsis thaliana (2000), and the human genome (2001). These projects, along with independent work, generated huge sequence and structural databases and gave rise to database institutions such as NCBI, EMBL, DDBJ, and PIR. As stated earlier, the rapid growth in biological data generation has brought technical challenges. Some of these are the huge investment of taxpayers' money, the skilled human resources needed to develop biological databases for storage and management, the software/tools needed to analyze the biological data, and the development of advanced tools and technologies to meet the future
needs and challenges. Such creations fall under the domain of intellectual property and therefore require protection from misappropriation and piracy. Protection of software and databases can be ensured by applying the appropriate national intellectual property laws. Intellectual Property Rights (IPRs) are legal rights that protect creations and/or inventions resulting from intellectual activity in the industrial, scientific, literary, or artistic fields. The most common IPRs include patents, copyrights, trademarks, and trade secrets. Such protection not only safeguards the investment but also motivates the investor and the creator of new software and databases. Ownership of the intellectual property in a biological database, and the associated rights, has a significant effect on those who maintain databases. Some of the important intellectual property rights given to, or associated with, biological or bioinformatics databases are described here: 1. Copyright: Generally considered to be the exclusive legal right granted by national law to the author of a work to disclose it as his own creation, to reproduce it, and to distribute or communicate it to the public in any manner or by any means, and also to authorize others to use the work in specified ways. Most copyright laws distinguish between economic and moral rights, which together constitute copyright. There are usually certain limitations in the law as to the kinds of works eligible for protection and as to the exercise of the rights of authors comprised in the copyright. For example, copyright is given to databases and software in India. 2. Patent: A patent is an exclusive right granted for an invention, which is a product or a process that provides, in general, a new way of doing something, or offers a new technical solution to a problem. To get a patent, technical information about the invention must be disclosed to the public in a patent application. For example, patents are granted to bioinformatics software in the USA. 3. Trademark: A trademark is a sign capable of distinguishing the goods or services of one enterprise from those of other enterprises. Trademarks are protected by intellectual property rights. Examples include GenBank® and BLAST®; BLAST is a registered trademark of the National Library of Medicine. 4. Trade Secret: Any confidential business information which provides an enterprise a competitive edge may be considered a trade secret. Trade secrets encompass manufacturing or industrial secrets and commercial secrets; for example, the algorithms of bioinformatics software. The unauthorized use of such information by persons other than the holder is regarded as an unfair practice and a violation of the trade secret. (Source: WIPO) Biological Databases, Types & Intellectual Property (IP) Protection A biological database is a large, organized, stable entity associated with information technology and software to store, update, query, and retrieve components of the data stored within the system. Biological databases have been classified as primary, secondary, composite, and integrated databases. Primary databases are those which contain raw nucleic acid (DNA and RNA) sequences, protein sequences, and biochemical reactions; examples are the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ), and the Protein Data Bank (PDB). Primary databases play a great role in bioinformatics research. They are updated regularly and contain a massive amount of experimentally
obtained data. These databases are mostly maintained by public funding, and many are freely accessible to anyone, i.e., based on open access. The reasons for keeping them out of IP protection are, first, that the data are mostly the outcome of work carried out by government-funded or socially funded research bodies and are therefore made freely available to the public, and second, that open access does not stifle further development in the field of biology. Secondary databases are biological databases derived from the information available in primary databases. They are well-analyzed, upgraded, and annotated versions of the primary nucleic acid and protein sequence databases; examples include Swiss-Prot, CATH, KEGG, OMIM, SCOP, and PROSITE. Such upgrading makes a secondary database more useful for understanding the structure and function of biomolecules. The improvement and upgrading require an input of labor, skill, technology, and capital that usually comes from individuals or enterprises, so some secondary databases are not freely available in the public domain and are given protection under IP laws if they satisfy the criteria of originality and creativity. Protection ensures the investor's return and keeps alive the motivation to continue working in this area. Composite databases are those which contain information from a variety of primary databases; NCBI is an example. These databases fall under the norms of primary databases. Integrated databases contain biological data obtained from different related organisms. Integrated data help in comparative analyses and provide a greater understanding of the evolutionary relationships and synteny between the genomes of different organisms. For example, ATIDB (Arabidopsis thaliana Integrated Database) provides a database derived from the genome and transcriptome sequences of the model organism Arabidopsis and related Brassica species, and aids comparative studies. At present, a common standard of IP protection is followed worldwide under the TRIPS Agreement, accepted by parties to the World Trade Organization (WTO), except in the EU, where the European Commission (EC) has issued the Database Directive and established its own sui generis regime for biological database protection.
TRIPS contains provisions for the protection of databases as well as computer software. Part II of the TRIPS Agreement protects computer programs and compilations of data under copyright. Article 10.1 of the TRIPS Agreement clarifies that computer programs are open to copyright protection as works of literature, irrespective of the manner in which they are presented. Correspondingly, Article 10.2 provides protection for data compilations regardless of whether the individual elements of the compilation are open to copyright protection in their own right. After the TRIPS Agreement entered into force, the World Intellectual Property Organization (WIPO) explicitly regulated the protection of computer programs and data compilations in an auxiliary agreement to the Berne Convention (Arts. 4 and 5 of the WIPO Copyright Treaty). Different protection strategies for biological databases exist worldwide; a comparative analysis of the strategies used in the USA, the EU, and India is given below. (A) Biological Database Protection in the USA The USA gives copyright protection to biological databases. Before
1991, copyright laws were used to protect databases in the USA; thereafter, US courts added an "originality" or "creativity" requirement for copyright, which excluded some databases from protection because they did not meet the originality criterion. The US Supreme Court, interpreting the position on the copyright protection accorded to compilations, held that facts are not copyrightable but a compilation of facts is, provided there is a sufficient degree of originality in the compilation in terms of the indices employed, etc. (Feist Publications, Inc. v. Rural Telephone Service Company, Inc. (1991)). Thus, biological databases are accepted as compilations of facts and granted copyright protection. (B) Biological Database Protection in the European Union (EU) The European Union passed the EU Database Directive (Directive 96/9/EC, adopted in March 1996), which provides two tiers of legal protection for databases, including biological databases: 1. the copyright system; 2. the sui generis or quasi-copyright system. The EU directive on copyright protection of databases provides for
the protection of the content of the database, coupled with protection for the database itself where there is originality in the selection or arrangement of the material. Such protection rests on the justification that a person who has made a substantial investment in obtaining, verifying, or presenting the database must have exclusive rights over it. The sui generis (Latin: "of its own kind") system can be used to protect the database maker's investment in special but non-original databases that involve huge capital, human resources, and material resources. It was initially meant to protect databases by a special right alone; later, a balance was struck by adopting the dual mechanism. Copyright protection was available to countries and parties that were members of the Berne Convention or the TRIPS Agreement, while the sui generis right was available only to makers who are nationals of EU member states. (C) Biological Database Protection in India The Copyright Act, 1957 of India grants protection to original expression. Computer programs are considered "literary work" under the Act. Section 2(o) of the Act defines 'literary work' and includes
computer programs, tables, and compilations, including computer databases (which also covers biological databases). This provision was inserted in 1999 and came into force in 2000. In Burlington Home Shopping Private Limited v. Rajnish Chibber & Another (1995 PTC 278), a matter came before the Delhi High Court concerning the protection of a database, and the court held that a compilation of addresses developed by anyone by devoting time, money, labour, and skill amounts to a "literary work", even though the sources might be commonly available. Similarly, in Bharat Matrimony Com. P. Ltd., Chennai v. People Interactive (I) Pvt. Ltd., Chennai (2009 AIHC (NOC) 433 (Mad)), it was held that "literary work" includes a computer program and compilations, including computer databases. Though cases on biological database protection are so far absent in India, similar protection can be obtained for compilations of biological (bioinformatics) databases derived from the sequencing of nucleic acids and proteins. Section 13 of the Copyright Act, 1957 provides the categories of works in which copyright subsists, which include original literary works. The author of a work is the first owner of
the copyright in the work. However, in an employer-employee relationship, if a work is made in the course of employment under a contract of service or apprenticeship, the employer shall be the first owner of the copyright, in the absence of any contract to the contrary. Computer software is granted protection as copyright in India unless it produces a technical effect and is not a computer program per se. The Copyright Act protects the author's economic and moral rights in the copyrighted work, as stated in sections 14 and 57 respectively. Even though the TRIPS Agreement does not specifically protect moral rights, they are protected under the Copyright Act, 1957. Section 51 of the Copyright Act, 1957 defines infringement of copyright and states that a person infringes the copyright of another if he/she, although not authorized to do so, commits any act which only the copyright holder has the exclusive right to do. Civil remedies for copyright infringement, namely injunctions and damages, are provided in Chapter XII of the Act, and criminal liability provisions are provided in Chapter XIII, wherein abetment of infringement is also unlawful and punishable with imprisonment of up to three years and a fine of up to Rs. 2 lakhs.
The difference between the USA and EU strategies of biological database protection is that the EU places much emphasis on the economic aspect of database creation, while the USA gives primacy to "originality and creativity" in the material. The similarity lies in the fact that both provide copyright protection to databases. Indian copyright law has tried to strike a balance by requiring originality while also recognizing the economic interest of the author.

(a) Patent and Biological Database Protection Biological databases are compilations of biological sequences and other data types, and if biological sequences are unpatentable, then biological databases are also unpatentable; to be patentable, they have to be related to a statutory subject matter. Although databases are not themselves patentable, patent protection may be given for database-related inventions. Patent protection in India is given to inventions that satisfy the criteria of novelty, inventiveness, and industrial application; a biological database is not granted patent protection because it does not fulfill these criteria.

Patent protection for DNA, RNA, and protein sequences extends only to the biological and physical compositions, not to the abstract biological sequence information that describes the composition. Therefore, a patentee can only prevent others from using the composition itself, not the information within the molecule. The Supreme Court of the United States held in Diamond v. Diehr (1981) that, to qualify as patentable subject matter, a biological sequence has to be categorized as a process, machine, or apparatus; an idea itself is not patentable, nor is a principle in the abstract.
In State Street Bank & Trust Co. v. Signature Financial Group, Inc. (1998), the US Court of Appeals for the Federal Circuit held that even if information per se is not patentable as a tangible product, a process for producing the information may be patentable. Secondly, patent protection would extend only to the process for creating the database, not to the database itself. This limits the value of the patent, because a competitor can simply produce a rival database in a non-infringing way.
(b) Trademark/Trade Secret Laws and Biological Database Protection No trade secret or trademark protection is given to primary databases. For secondary databases, protection under trade secret law is available. This is because the creator of a secondary database studies and analyzes the primary data and adds many important features, which requires huge capital and skilled human resources; thus, he/she may not make all of it public, or may charge money for access to the database. It should be noted, however, that trademark law protects only the name and branding of the database, not its contents. The database developer can obtain protection from copyright law and can prevent third parties from copying or accessing the contents by using contract law. For a non-original database, contract law is the only mode of protection by which the database developer can prevent a breach of faith and infringement. Presently, two types of contracts are available for the protection of databases, namely shrink-wrap and click-wrap/web-wrap/browse-wrap contracts. A shrink-wrap license is used for databases on compact discs (CDs), with the license put down in writing on the packaging; once the user uses the product, he/she agrees to all its terms and conditions. Click-wrap/web-wrap/browse-wrap licenses are meant for internet users: when buyers want to access the content of the database, they must click "agree" online, which means they have agreed to the contract. Biological databases play a great role in the present era of globalization. These databases are linked through information and communication technology across the globe and can be accessed, shared, and used for the development and promotion of science. In this regard, appropriate protection of databases, and awareness among the persons associated with this domain, is very pertinent. In the next issue under this section, other aspects of bioinformatics and IPR will be taken up and discussed in detail. A comprehensive list of references and supporting documents is available from the author upon request. For further details, kindly mail us at info@bioinformaticsreview.com
Subscribe to the Bioinformatics Review newsletter to get the latest posts in your mailbox and never miss out on any of your favorite topics. Log on to https://www.bioinformaticsreview.com