June 2016 VOL 2 ISSUE 6
“It is strange that only extraordinary men make the discoveries, which later appear so easy and simple.” -
Georg C. Lichtenberg
DrugQuest: Tool for Drug-associated Queries
The First International Quiz
Public Service Ad sponsored by IQLBioinformatics
Contents
June 2016
Topics
Editorial .... 05
03 Bioinformatics Quiz - First International Quiz .... 07
04 Tools - DrugQuest: Tool for Drug-associated Queries .... 08
05 Phylogenetics - EVOBLAST: Evolutionary Fingerprinting Analysis Module .... 10
06 Transcriptomics - Assembly of high-throughput mRNA-Seq data: A review .... 12
EDITOR
Dr. PRASHANT PANT

FOUNDER
TARIQ ABDULLAH

EDITORIAL
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUES
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road, Zakir Nagar, Okhla, New Delhi, IN 110025

STAFF ADDRESS
To contact any of the Bioinformatics Review staff members, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published quarterly for one year (4 issues) by Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act, 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA Trust. Published in India.
EDITORIAL
The future and learning from your elders: experimental ecology and ‘the omics’ for emerging Bioinformaticians
Jennifer Wood
Honorary Editor
The excitement generated by the ‘omics revolution’ has opened the door for new and innovative research. Whilst bioinformatics is advancing in many fields, such as pathology, microbial ecology, agriculture, and medicine, there is room for researchers new to bioinformatics to take some important lessons from a much older field: ecology. For a biologist meeting bioinformatics for the first time, the massive amounts of data (also called ‘Big Data’) handled during bioinformatics projects can be overwhelming. For many, finding resources to develop a clear understanding of the type of Big Data being generated and how it can be analyzed is difficult, and the resources themselves can seem impenetrable. Additionally, there is a temptation to forget that it is the robustness of the experimental design and the quality of the data interpretation, not the amount of data generated, that make the best science. Without a developed understanding of how the data will be analyzed, it is impossible to generate a robust experimental design. We have at our fingertips a resource for exploring our world that is incredibly powerful but requires respect and forethought. The continued sharing of new bioinformatics developments, through journals such as Bioinformatics Review (BiR), will be paramount to the advancement of our field. However, I would urge those arriving in bioinformatics to draw on lessons from established fields, such as experimental and fundamental ecology. Ecology is one of the few biological disciplines that has been dealing with Big Data for decades, and many issues that arise in omics-based projects are discussed extensively in the ecology literature. These issues include: the need for robust experimental design and replication to discern patterns amongst large environmental heterogeneity; how to deal with uneven
sampling depth; strategies for determining which species in a multivariate dataset respond to experimental treatments; the line between subsampling and true replication; and sample sizes that are too small for adequate power in tests of significance. As such, experimental ecology presents a relatively untapped resource for a) discussions around considerations for designing experiments that will yield Big Data, b) strategies for analyzing Big Data, and c) theories that can enrich the interpretation of Big Data. Ecological theories such as Grime's CSR theory of plant strategies have recently been reinterpreted for microbial ecology. Furthermore, ecological discussions of Big Data are often presented in a manner more familiar to those with biological backgrounds and can thus provide an understandable introduction to the basic concepts and challenges surrounding the use of Big Data. It is clear that the generation and treatment of Big Data are important elements of bioinformatics-based research that need to be fully considered before an experiment commences in order to adequately answer meaningful scientific questions. The role of journals like Bioinformatics Review (BiR) in demystifying not only the available technologies and analysis techniques but also fields such as ecology, which has worked through issues paralleled in bioinformatics, will be paramount in equipping researchers with the best information to design robust experiments. Please do share your comments, feedback, and suggestions at info@bioinformaticsreview.com. With Best Wishes
BIOINFORMATICS QUIZ
First International Quiz Image Credit: Stock Photos
“The Bioinformatics Review 1st International Quiz”
We are pleased to announce the Bioinformatics Review 1st International Quiz. This is an extensive bioinformatics quiz that involves the submission of a detailed answer on "Application of Bioinformatics in ________ (subject area of your choice)" and answering questions to be put up on our Facebook page during the quiz duration (4th June to 19th June). Participants have to send their answer in a descriptive form, with references, by the 19th of June. The best entry will be published and awarded a cash prize of Rs 500 and a certificate from the board of editors.
Interested? Click here for terms and conditions and to know how to submit your entries.
TOOLS
DrugQuest: Tool for Drug-associated Queries
Image Credit: Google images
“DrugQuest [9] is a web application which applies knowledge discovery and text mining techniques to parse the DrugBank repository and group drugs according to the textual information.”
In this rising era of personalized medicine and drug discovery, with its data refining, chemical compound databases, drug information, and identification of symptoms through various levels of biomarkers, bioinformatics has been playing a pivotal role in bringing out the best deliverables in a cost- and time-effective manner. There are various chemical databases available that provide information about drugs, their pharmacokinetics, pharmacodynamics, drug interactions, etc., which in turn become a major resource for computational approaches to drug discovery. Some examples of such data repositories are PubChem [1, 2] (a database composed of PubChem Substance, PubChem Compound, and PubChem BioAssay), DrugBank [3], ChEBI [4], SIDER [5] (Side Effect Resource), ChemSpider [6], ChemExper [7], and TTD [8] (Therapeutic Target Database). DrugQuest [9] is a web application which applies knowledge discovery and text mining techniques to parse the DrugBank repository and group drugs according to their textual information. DrugQuest is a text mining tool which extracts useful information from the available data: it extracts data from DrugBank only, clustering all the records according to the textual information provided, such as symptoms, diseases, pathways, toxicity, etc. [9] (Fig. 1).
Fig. 1: Homepage of DrugQuest [http://bioinformatics.med.uoc.gr/cgi-bin/drugquest/drugQuest.cgi]

Workflow of DrugQuest:
1. A query is entered by the user, which is searched using Boolean operations, i.e., all words (AND) or any word (OR). The query may be any word related to the drug or disease to be searched, such as symptoms, headache, pain, aspirin, etc.
2. DrugBank records are searched based upon the query.
3. Textual entries present in DrugBank are matched with the query and drug records are retrieved.
4. The TextQuest algorithm is applied to identify non-tagged significant terms.
5. A TF-IDF score (the product of term frequency and inverse document frequency) is calculated for each non-tagged term in the records to estimate its importance (i.e., whether it should be included in the results or not).
6. English words with a low TF-IDF value are removed on the basis of the British National Corpus (BNC, http://www.natcorp.ox.ac.uk/).
7. All common English words such as articles, prepositions, etc., are removed; the words remaining after steps 4-7 are considered "significant terms".
8. Each DrugBank record is represented as a binary vector, i.e., in the form of binary codes 0 and 1, where 0 denotes the absence and 1 the presence of significant terms and tagged terms, as shown in Fig. 2.
9. DrugBank records are clustered using various clustering algorithms.
10. The results are represented and visualized in two forms: "Tag Clouds" and "Clustered Drugs".

Fig. 2: Workflow of DrugQuest [10]

The TextQuest algorithm is used to identify the significant words: it first calculates the TF-IDF score for each word in the database to determine its significance, then removes the words with a low TF-IDF score along with the common English words [10]. The Tag Clouds view displays the result in the form of a cloud highlighting the significant terms; the font size is proportional to the frequency of the query terms present in the records. The Clustered Drugs view organizes the DrugBank records into different categories, with a link to the respective DrugBank record. DrugQuest has a limit of 5,000 textual records per analysis [10].
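To make the scoring and vectorization steps more concrete, the following is a minimal Python sketch of the general idea behind steps 5-9: computing TF-IDF scores over a handful of toy records, keeping high-scoring words as "significant terms", and encoding each record as a binary presence/absence vector of the kind shown in Fig. 2. The toy records, the threshold value, and all names below are illustrative assumptions; none of it is taken from the DrugQuest implementation.

```python
import math
from collections import Counter

# Toy stand-ins for DrugBank textual entries (illustrative only, not real records).
records = {
    "DB001": "aspirin pain headache inflammation",
    "DB002": "ibuprofen pain inflammation fever",
    "DB003": "metformin diabetes glucose",
}

docs = {rid: text.split() for rid, text in records.items()}
n_docs = len(docs)

# Document frequency: number of records containing each term.
df = Counter(term for words in docs.values() for term in set(words))

def tf_idf(term, words):
    """TF-IDF = term frequency within the record * inverse document frequency."""
    tf = words.count(term) / len(words)
    idf = math.log(n_docs / df[term])
    return tf * idf

# Keep terms whose TF-IDF score in at least one record exceeds an assumed threshold;
# DrugQuest additionally filters terms against the British National Corpus and
# removes common English words.
THRESHOLD = 0.05
significant = sorted({
    term
    for words in docs.values()
    for term in set(words)
    if tf_idf(term, words) > THRESHOLD
})

# Binary vector per record: 1 = significant term present, 0 = absent (cf. Fig. 2).
vectors = {
    rid: [1 if term in words else 0 for term in significant]
    for rid, words in docs.items()
}

for rid, vec in vectors.items():
    print(rid, vec)
```

The resulting binary vectors could then be passed to any standard clustering routine (for example, k-means or hierarchical clustering) to obtain groupings analogous to the "Clustered Drugs" view.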
References:
1. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37(Web Server issue):W623-33.
2. Li Q, Cheng T, Wang Y, Bryant SH. PubChem as a public resource for drug discovery. Drug Discov Today. 2010;15(23-24):1052-7.
3. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34(Database issue):D668-72.
4. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2008;36(Database issue):D344-50.
5. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol. 2010;6:343.
6. http://www.chemspider.com (accessed 17 June 2016)
7. http://www.chemexper.com (accessed 17 June 2016)
8. Chen X, Ji ZL, Chen YZ. TTD: Therapeutic Target Database. Nucleic Acids Res. 2002;30(1):412-5.
9. http://bioinformatics.med.uoc.gr/cgi-bin/drugquest/drugQuest.cgi
10. Papanikolaou et al. BMC Bioinformatics 2016, 17(Suppl 5):182.

How to cite this article: Faiza, M., 2016. DrugQuest: Tool for Drug-associated Queries. Bioinformatics Review, 2(6): 9-13. The article is available at http://bioinformaticsreview.com/20160617/drugquest/
PHYLOGENETICS
EVOBLAST: Evolutionary Fingerprinting Analysis Module Image Credit: Google Images
“It is a remarkable tool which identifies a. whether a multiple sequence alignment dataset (in partitions or globally) is under different rates of evolution, i.e. different dN/dS rates over different regions/partitions, signifying positive or negative selection (for details, read the previous articles in this series), b. and, if the regions are found to be evolving differentially, it also depicts in a graphical form which regions in the multiple sequence alignment are under positive, negative or neutral selection.”
In continuation of the series “Do you Hyphy...”, this article covers another very important analytical tool provided by the Datamonkey web server [1], called EVOBLAST [1, 2]. EVOBLAST stands for Evolutionary Fingerprinting Analysis Results [3]. It is a remarkable tool which identifies a. whether a multiple sequence alignment dataset (in partitions or globally) is under different rates of evolution, i.e. different dN/dS rates over different regions/partitions, signifying positive or negative selection (for details, read the previous articles in this series), b. and, if the regions are found to be evolving differentially, it also depicts in a graphical form which regions in the multiple sequence alignment are under positive, negative or neutral selection.
Fig. 1: A sample graph obtained from EVOBLAST
To run this module, a gene coding sequence dataset is required to be uploaded on the data analysis page. Select a suitable substitution model (one can use the model selection module for this, also available on the Datamonkey web server) and select the Evolutionary Fingerprinting option, based on 1000 distribution samples, from the drop-down menu. Depending on the size of the query dataset, the remote computer cluster will take a few to several minutes and will return the Maximum Likelihood Estimates (MLE) as well as a graph (such as the one shown in Figure 1) depicting the estimated distribution of synonymous and non-synonymous
rates inferred from the alignments. The pink pixels, and the intensity of the pink shade (light pink to dark pink) of each pixel, show the maximum likelihood estimate/probability of the selection pressure at a site/region undergoing negative, positive or neutral selection. The diagonal line represents neutral evolution (non-synonymous and synonymous rates approximately equal). Thus, regions falling on the diagonal line are under neutral evolution, regions above the diagonal line are under positive (diversifying) selection, and regions with pink pixels below the diagonal line are under negative (purifying) selection. Although the depiction and the MLEs are highly accurate and the graph output so obtained is of publication quality, it is still highly recommended that the findings of this module be corroborated with the output of other, similar modules/applications to strengthen the interpretation.
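For readers who prefer code to prose, the sketch below shows how the diagonal rule described above can be applied to per-site synonymous (dS) and non-synonymous (dN) rate estimates, such as those listed in an MLE table. The rate values and the tolerance used to call a site "neutral" are assumptions made purely for illustration; Datamonkey reports its own statistics.

```python
def classify_selection(dn, ds, tol=0.1):
    """Classify a site/region from non-synonymous (dN) and synonymous (dS) rate
    estimates, mirroring how the diagonal of the fingerprint plot is read."""
    if ds == 0:
        return "positive" if dn > 0 else "undetermined"
    omega = dn / ds
    if abs(omega - 1.0) <= tol:   # close to the diagonal: dN is approximately dS
        return "neutral"
    return "positive" if omega > 1.0 else "negative"

# Hypothetical per-site (dN, dS) estimates, for illustration only.
sites = [(0.02, 0.10), (0.35, 0.12), (0.09, 0.10)]
for i, (dn, ds) in enumerate(sites, start=1):
    print(f"site {i}: dN={dn}, dS={ds} -> {classify_selection(dn, ds)} selection")
```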
References:
1. Delport, W., Poon, A.F., Frost, S.D.W., Kosakovsky Pond, S.L., 2010. Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 26(19): 2455-2457.
2. Kosakovsky Pond, S.L., Frost, S.D.W., 2005. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 21(10): 2531-2533.
3. Kosakovsky Pond, S.L., et al., 2010. Evolutionary fingerprinting of genes. Molecular Biology and Evolution 27: 520-536.
TRANSCRIPTOMICS
Assembly of high-throughput mRNA-Seq data: A review Image Credit: Google Images
“Interrogation of gene expression in plant and animal species is a prerequisite to engineering strategies for diverse applications. mRNA-Seq is a high-throughput method to investigate gene expression at a genome-wide level. This next-generation sequencing method is principally based on sequencing millions of transcripts in parallel, concurrently recording the digital expression in control and test samples. In this review, we explore different strategies employed for the computational reconstruction of the transcriptome using mRNA-Seq data.”
Transcriptome represents the complete set of all expressed transcripts (RNA molecules) present in a cell or tissue at a given point of time. The transcriptome is dynamic in nature and keeps changing with time, driven by the external and internal environment. We know that, among the total transcribed RNA transcripts, only a small fraction is translated into proteins. The fraction which is translated into proteins is referred to as the coding transcriptome, while the fraction which is not translated is referred to as the non-coding transcriptome. In other words, the coding transcriptome is a collection of all the messenger RNA (mRNA) molecules, while the non-coding transcriptome is a repertoire of transfer RNA (tRNA), ribosomal RNA (rRNA), small nuclear and small nucleolar RNA (sn/snoRNA), and small RNAs (siRNA, miRNA, lsiRNA).

In the past, a number of technologies have been developed to investigate the expression of genes. One way of doing this is to capture the expressed mRNA transcripts and evaluate their relative abundance across different conditions. mRNA-Seq is a recently developed high-throughput method which works at the genome-wide level. In this method, total RNA is first extracted from control and test tissues or cells and is subsequently used to fish out the mRNA population using oligo-dT probes that bind the poly(A) tail attached to mRNA transcripts. This mRNA population is fragmented, size selected (according to the needs of the sequencing approach - single read or paired end) and reverse transcribed to obtain a cDNA library. Sequencing adapters are then ligated and the adapter-ligated cDNA library is again size selected. Before sequencing, a few cycles of PCR are used to enrich the cDNA library concentration.
The mRNA-Seq libraries can be sequenced using a single-read or a paired-end approach. In the single-read format, a DNA fragment is sequenced for a particular length from one end only. In the paired-end format, on the other hand, a DNA fragment is sequenced from both ends for a defined length. Paired-end sequencing is the preferred method, as it yields more sequencing data and therefore greater coverage. After cleaning of the data (removal of low-quality reads), the reads (paired or unpaired) can be assembled computationally.

Next-generation sequencing of mRNA-Seq libraries results in millions of reads, and it is a computationally challenging task to stitch these reads into an actual transcriptome. Presently, various assemblers such as Velvet-Oases, Trinity, SOAPdenovo-Trans, Trans-ABySS, ALLPATHS-LG, etc., are available to suit different needs of assembly [1, 12, 10, 2, 5]. However, it is still very difficult to create a high-confidence, error-free assembly. In the following sections, popular strategies used for the computational reconstruction of sequencing reads into a functional transcriptome are discussed.

Reference-based assembly: In this method, the cleaned sequencing reads are first mapped to the genome of the same or a related species; that is, the computational assembly is built upon a reference genomic platform and is hence referred to as reference-based assembly [3]. Following the mapping step, the sequencing reads which map to the same locus are clustered independently and thereafter traversed to identify the different genes and their isoforms. There are several advantages of using reference-based assembly: even scarcely abundant expression (of only a few reads) can be detected; the overall assembly has fewer contaminants/artifacts and errors, and higher confidence; gaps in the assembled draft can be easily filled using the information of a reference genome or transcriptome; and it is also possible to predict transcription start and stop sites. Transcriptomes of polyploid crop plants are more prone to error due to the high sequence similarity among different genes, which often results in mis-assembly of entirely different transcripts as one transcript [9, 11, 7].
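As a minimal sketch of the locus-clustering step described above, the code below groups hypothetical read alignments, given as (chromosome, start, end) coordinates, into overlapping loci that would then be traversed to call genes and isoforms. The coordinates and the helper function are illustrative assumptions, not code from any particular reference-based assembler.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical read alignments as (chromosome, start, end) tuples, e.g. parsed
# from an alignment file. Coordinates are invented for illustration.
alignments = [
    ("chr1", 100, 180), ("chr1", 150, 230), ("chr1", 500, 580),
    ("chr2", 40, 120),  ("chr2", 90, 170),
]

def cluster_by_locus(alignments):
    """Group reads that overlap on the same chromosome into loci."""
    loci = []
    ordered = sorted(alignments, key=itemgetter(0, 1))
    for chrom, group in groupby(ordered, key=itemgetter(0)):
        current = None
        for _, start, end in group:
            if current is not None and start <= current[2]:
                # Read overlaps the open locus: extend it.
                current[2] = max(current[2], end)
                current[3] += 1
            else:
                # No overlap: open a new locus with one supporting read.
                current = [chrom, start, end, 1]
                loci.append(current)
    return [tuple(locus) for locus in loci]

for chrom, start, end, n_reads in cluster_by_locus(alignments):
    print(f"{chrom}:{start}-{end}  supporting reads: {n_reads}")
```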
De-novo assembly: For most organisms, genome information is still lacking. In such cases, where a reference is not available, a de-novo method is utilized to develop an assembly. Here, multiple overlapping sequencing reads are clustered into contigs, which are further reconstructed into the entire transcriptome. This approach is practiced most of the time, as genome drafts are available for only a few organisms. However, performing a de-novo assembly requires a very high coverage of the transcriptome, which in turn requires several sequencing runs to generate the required sequencing depth. On the other hand, in reference-based assembly, even a low coverage (10X) can be used to produce a high-confidence assembly.

It is also advisable to try different k-mers to identify the optimal k-mer length to be used to generate an assembly. Many researchers use multiple k-mers to develop multiple assemblies, which are then merged into a single assembly [6]; the sketch below illustrates why the choice of k matters. On a different note, assemblies can also be generated using different assemblers; these assemblies are then searched for common transcripts, which are thereafter stitched into a single assembly [7, 8].
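The sketch below illustrates the k-mer decomposition that underlies de Bruijn graph assemblers such as Velvet or Trinity: with these toy reads, which share the motif ATGCA in two different contexts, the smaller k leaves more branching (ambiguous) nodes than the larger k. The reads and the values of k are assumptions for illustration only; real assemblers add error correction, coverage cutoffs, and extensive graph simplification on top of this basic step.

```python
from collections import defaultdict

def build_debruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it extends to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy reads sharing the motif "ATGCA" in two different contexts; a repeat like
# this is what makes short k-mers ambiguous during de-novo assembly.
reads = ["TTATGCAGG", "CCATGCATT", "ATGCAGGTT"]

for k in (4, 6):
    graph = build_debruijn(reads, k)
    branching = sum(1 for succ in graph.values() if len(succ) > 1)
    print(f"k={k}: {len(graph)} nodes, {branching} branching (ambiguous) node(s)")
```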
References:
1. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB. 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Research 18: 810-820.
2. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644-652.
3. Guttman M, Garber M, Levin JZ, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol MJ, Gnirke A, Nusbaum C. 2010. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology 28: 503-510.
4. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1: 18.
5. Martin J, Bruno VM, Fang Z, Meng X, Blow M, Zhang T, Sherlock G, Snyder M, Wang Z. 2010. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics 11: 663.
6. Martin JA, Wang Z. 2011. Next-generation transcriptome assembly. Nature Reviews Genetics 12: 671-682.
7. Nakasugi K, Crowhurst R, Bally J, Waterhouse P. 2014. Combining transcriptome assemblies from multiple de novo assemblers in the allo-tetraploid plant Nicotiana benthamiana. PLoS ONE 9: e91776.
9. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ. 2010. De novo assembly and analysis of RNA-seq data. Nature Methods 7: 909-912.
10. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. 2009. ABySS: a parallel assembler for short read sequence data. Genome Research 19: 1117-1123.
11. Surget-Groba Y, Montoya-Burgos JI. 2010. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Research 20: 1432-1440.
12. Zerbino DR, Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18: 821-829.
Subscribe to the Bioinformatics Review newsletter to get the latest posts in your mailbox and never miss out on any of your favorite topics. Log on to https://www.bioinformaticsreview.com