SEPTEMBER 2018 VOL 4 ISSUE 9
“Science has not yet taught us if madness is or is not the sublimity of the intelligence.” -
Edgar Allan Poe
How to get supercomputing facilities for Bioinformatics analyses?
Algorithm and workflow of miRDB
Contents
September 2018
Topics
Editorial
Tools: miRBase: Explained
Algorithms: Algorithm and workflow of miRDB
Bioinformatics News: How to get supercomputing facilities for Bioinformatics analyses?
FOUNDER: TARIQ ABDULLAH
EDITORIAL
EXECUTIVE EDITOR: TARIQ ABDULLAH
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: FOZAIL AHMAD, ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA trust. Published in India
EDITORIAL
Editorial: Need to reformulate the bioinformatics curricula at undergraduate and postgraduate level
Muniba Faiza
Founding Editor
Bioinformatics is an interdisciplinary field that draws on computer science and biological science and requires knowledge of both. It combines biophysics, statistics, mathematics, and chemistry to provide software and tools that help in understanding biological data. Computer science and biological science are broad and very different disciplines in their own right. Researchers tend to specialize either in computer science with some knowledge of biology or in biological science with some knowledge of computational biology. There are not enough scientists who are fully trained in bioinformatics, which makes it difficult to find suitable scientists, researchers, scholars, and graduates in both academia and industry. As mentioned in one of the previous articles, the field needs both bioinformaticians and bioinformaticists. The foundation for becoming either is laid at the undergraduate and postgraduate level. This brings us to a question: "What should the bioinformatics curricula be at the undergraduate and postgraduate level in order to produce well-trained bioinformatics professionals?" This is considered one of the current challenges in the field [1].

Students who start with knowledge of both biology and computers grasp the concepts of bioinformatics more easily, while those who have no background in either discipline find it harder to work in the field. In most universities, students are given an introduction to the databases and tools of bioinformatics and an introduction to computer science. Beyond these basics, courses must also cover the methods and ideas of the field, such as basic evolutionary concepts. Further, instead of just tossing in some topics from biology and computer science, the curricula must relate these topics to real-world scenarios and to the problems researchers are trying to resolve. There should also be more programming courses in the syllabi, covering data management, algorithms, and computer languages and their advantages, chosen according to the interests of the students. More training courses on specific topics are needed to help students understand the theoretical concepts, the experimentation involved, and the available software with its pros and cons. Workshops should be organized so that students can learn about the basic and the biggest challenges in bioinformatics, the resources available to address them, and the need for new techniques and methodologies that use existing in-silico resources or develop new ones.

Bioinformatics provides various resources and tools to study and analyze biological data and helps answer many important biological questions. Students trying to build their careers in this field must be provided with advanced study that focuses on the main objectives of the field. This is the time to broaden the horizon with well-developed concepts that relate to the real world and show how they are applied to solve existing problems.

Write to us at info@bioinformaticsreview.com
TOOLS
miRBase: Explained
Micro RNAs (miRNAs) are short endogenous non-coding RNAs (~22 nucleotides) [1], produced in single-celled eukaryotes, viruses, plants, and animals [2]. miRNAs are capable of controlling homeostasis [2] and play significant roles in various biological processes through degradation of mRNA and translational repression, both mediated by complementary base pairing with the target [3].
There are several miRNA databases which provide detailed information about the miRNA sequences, annotations, functions, and their predicted targets, among which miRBase is a primary online database for miRNA mature sequences and annotations [4-6]. This article explains the detailed structure and algorithm of miRBase.
miRBase is an online database available at www.mirbase.org [4-6]. The data can be downloaded from an FTP site (ftp://mirbase.org/pub/mirbase/CURRENT/) in different formats, including FASTA and MySQL relational database dumps [7]. It provides a user-friendly interface to miRNA sequence data, predicted gene targets, and annotations [6]. miRBase release 10.1 consists of 5071 miRNA loci from 58 species, which express 5922 mature miRNA sequences [7]. miRBase has three main functions:

1. miRBase Registry: It is a confidential service for assigning independent names to novel miRNA genes even before their publication in a peer-reviewed journal [7]. This service has been used by over 70 publications. Under this scheme, miRNAs are assigned sequential numerical identifiers, prefixed with a three- or four-letter abbreviation that designates the species, for example hsa-miR-101 (in Homo sapiens) [8].

2. miRBase Sequences: It acts as a primary online repository for miRNA sequences and annotation. It provides the miRNA information, annotation, references, and links to other resources for all published and validated miRNAs [5,7]. This database consists of over 5000 sequences from 58 species.

3. miRBase Targets: It is the database of predicted miRNA target genes. It predicts the targets for all published animal miRNAs [5,7]. Version 5 of this database predicts targets in over 500,000 mRNAs for all miRNAs in 24 different species. All individual miRNA-target binding sites, multiple conserved sites in the species, and multiple binding sites in UTRs are assigned a P-value [9], which helps the user judge the confidence of the predicted results.

miRBase has a nomenclature scheme for miRNA names; its primary features are as follows [7]:
A miRNA name is composed of a three- or four-letter species prefix and a numeric suffix. For example, hsa-mir-212.
When identical mature miRNA sequences are expressed from more than one hairpin precursor locus, the loci are distinguished by further numeric suffixes. For example, dme-mir-6-1 and dme-mir-6-2.
Related mature miRNA sequences expressed from related hairpin loci are distinguished by letter suffixes. For example, mmu-mir-181a and mmu-mir-181b.
Plant miRNA genes are named in the form ath-MIR166a, where the letter suffixes denote the distinct loci expressing all related mature miRNAs, and numeric suffixes are not used.
Viral miRNA names include the locus from which they are derived. For example, ebv-miR-BART1 from the Epstein-Barr virus BART locus.

miRBase Data

The latest release of miRBase (release 20) has updated the database to 24,521 hairpin sequences from 206 species and 30,424 mature sequences [10]. In many cases, both the 5’ and 3’ arms of the hairpin precursor express mature miRNAs, suggesting either that both may be functional or that there is not enough data to determine the predominant product [7]. Such miRNAs are named in the form hsa-miR-140-5p and hsa-miR-140-3p. The ‘Evidence’ field provides information about the origin of each sequence in the database. The miRBase Targets database predicts targets in the UTRs of 37 different animal genomes from Ensembl [5,7]. miRBase provides a list of mRNAs overlapping each miRNA, giving the type of overlap (intron, UTR, or exon) and the sense (forward or reverse) [7]. miRNAs are often clustered within a genome; miRBase therefore provides a list of such clustered miRNAs, which can be retrieved for any organism. miRBase also displays the distribution of genomic features around miRNAs, showing CpG islands, poly-A sites, ESTs, cDNAs, TSSs, and ditags. TSSs are predicted using the Eponine-TSS software.
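The naming conventions described above can be illustrated with a small parser. This is only a sketch under stated assumptions: the regular expression, the `MirnaName` type, and the `parse_mirna_name` function are hypothetical names introduced here, not part of miRBase. It also relies on the widely used miRBase convention, not spelled out in this article, that "miR" denotes a mature sequence while "mir" or "MIR" denotes a hairpin or gene. Viral names that embed a locus name are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class MirnaName(NamedTuple):
    """Parsed parts of a miRNA name (hypothetical helper type)."""
    species: str           # three- or four-letter species prefix, e.g. "hsa"
    mature: bool           # True for "miR"; False for "mir" / "MIR"
    number: int            # sequential numerical identifier
    letter: Optional[str]  # letter suffix for related sequences, e.g. "a"
    locus: Optional[int]   # numeric suffix for distinct hairpin loci
    arm: Optional[str]     # "5p" or "3p" when the hairpin arm is given


# Illustrative pattern only; it covers the animal and plant examples in the text.
_NAME = re.compile(
    r"^(?P<species>[a-z]{3,4})-"
    r"(?P<kind>miR|mir|MIR)-?"
    r"(?P<number>\d+)"
    r"(?P<letter>[a-z])?"
    r"(?:-(?P<locus>\d+))?"
    r"(?:-(?P<arm>5p|3p))?$"
)


def parse_mirna_name(name: str) -> Optional[MirnaName]:
    """Return the parsed name, or None if it does not fit the sketch pattern."""
    m = _NAME.match(name)
    if m is None:
        return None
    locus = m.group("locus")
    return MirnaName(
        species=m.group("species"),
        mature=m.group("kind") == "miR",
        number=int(m.group("number")),
        letter=m.group("letter"),
        locus=int(locus) if locus is not None else None,
        arm=m.group("arm"),
    )


if __name__ == "__main__":
    for example in ("hsa-miR-101", "dme-mir-6-2", "mmu-mir-181a",
                    "ath-MIR166a", "hsa-miR-140-5p"):
        print(example, parse_mirna_name(example))
```

Names that do not fit the pattern, such as viral names carrying a locus label, return `None` rather than a guess.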
How does it work?

miRBase uses the miRanda algorithm [7,11] to scan all available miRNA sequences for a given genome against the 3’-UTRs of that genome obtained from Ensembl [12]. The algorithm is based on dynamic programming and searches for maximum local complementarity alignments. Each G:C and A:T pair is assigned a score of +5, each G:U wobble pair +2, and each mismatch pair −3; the gap-opening and gap-elongation parameters are set to −8.0 and −2.0, respectively [11,12]. The optimal alignment score at positions i, j is calculated by building an alignment scoring matrix. The gap-elongation parameter is used only if the extension to i, j of a given stretch of gaps ending at positions i–1, j or j–1, i (but not of stretches of gaps ending at i–k, j or j, i–k for k > 1) results in a higher score than the addition of a nucleotide-nucleotide match at positions i, j. Complementarity scores at the first eleven residues of the miRNA 5’-end are multiplied by a scaling factor of 2.0 to reflect the experimentally observed 5’-3’ asymmetry, so that, for example, G:C and A:T base pairs contribute +10 to the match score at these positions. The value of the scaling factor is adjustable.

Target prediction follows a few rules. The threshold for a candidate target site is S > 90 and ΔG < −17 kcal/mol, where S is the sum of the single-residue-pair match scores over the alignment trace and ΔG is the free energy of duplex formation from a completely dissociated state [11,12]. The algorithm finds the optimal local matches above this threshold between a particular miRNA and a set of 3’ UTRs in each genome. It then checks whether the sequence of this miRNA and the position of the target site are conserved in orthologous genes, that is, between human and mouse or rat, or between fugu and zebrafish [12]. The alignment between the target sites is transitive in nature (UTR to miRNA to UTR) through a homologous miRNA. The positions of a pair of target sites must fall within ±10 residues of each other in the aligned UTRs. When this criterion is fulfilled, the conserved target sites with 90% or more sequence identity (human versus mouse or rat) or 70% or more (fugu versus zebrafish) are selected as the candidate miRNA target sites and stored in the database (MySQL) [11]. John et al. predicted 10,572 target sites conserved in either mouse or rat in 4,463 human transcripts, of which 2,307 transcripts of 2,273 genes contained more than one target site. Similarly, using zebrafish as a reference species, they predicted 7,057 conserved target sites (conserved in fugu) in 4,820 zebrafish transcripts [12].

The conserved target sites for each miRNA are sorted according to the alignment score, with free energy as the secondary sort criterion. When a single site of an mRNA (or sites within 25 nt of each other) is targeted by multiple miRNAs, the miRNA with the highest score and lowest energy is reported for that site [11].

Recently, miRBase has introduced high-confidence miRNAs based on the pattern of deep-sequencing data [10]. To be classified as a high-confidence miRNA, a locus must fulfill the following criteria [10]:
A minimum of 10 reads must map to each of the two possible mature miRNAs obtained from the hairpin precursor.
A minimum of half of the reads mapping to each arm of the hairpin precursor must have the same 5'-end.
A minimum of 60% of the bases in the mature sequences must pair within the predicted hairpin structure.
The predicted hairpin structure must have a folding free energy of < −0.2 kcal/mol/nt.
The most abundant reads from each arm of the hairpin precursor must pair in the mature miRNA duplex with a 0-4 nucleotide overhang at their 3'-ends.
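The five criteria above amount to a simple filter over per-locus deep-sequencing summaries. The sketch below is illustrative only: the `HairpinEvidence` fields and both function names are hypothetical, the read-count rule is read as applying to each arm separately, and the free-energy cutoff is taken as a negative value (−0.2 kcal/mol/nt), which is our reading of reference [10] rather than something this code can verify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HairpinEvidence:
    """Deep-sequencing summary for one hairpin locus (hypothetical fields)."""
    reads_5p: int              # reads mapping to the 5' arm mature miRNA
    reads_3p: int              # reads mapping to the 3' arm mature miRNA
    same_start_frac_5p: float  # fraction of 5' arm reads sharing one 5' end
    same_start_frac_3p: float  # fraction of 3' arm reads sharing one 5' end
    paired_base_frac: float    # fraction of mature bases paired in the hairpin
    energy_per_nt: float       # folding free energy, kcal/mol/nt
    overhang_5p: int           # 3' overhang (nt) of the top 5' arm read
    overhang_3p: int           # 3' overhang (nt) of the top 3' arm read


def failed_criteria(ev: HairpinEvidence) -> List[str]:
    """Return a description of every criterion the locus fails."""
    failures = []
    if min(ev.reads_5p, ev.reads_3p) < 10:
        failures.append("fewer than 10 reads on at least one arm")
    if min(ev.same_start_frac_5p, ev.same_start_frac_3p) < 0.5:
        failures.append("under half of the reads on an arm share a 5' end")
    if ev.paired_base_frac < 0.6:
        failures.append("under 60% of mature bases are paired")
    if not ev.energy_per_nt < -0.2:  # assumed cutoff, see lead-in
        failures.append("folding free energy is not below -0.2 kcal/mol/nt")
    if not all(0 <= o <= 4 for o in (ev.overhang_5p, ev.overhang_3p)):
        failures.append("3' overhang outside the 0-4 nt range")
    return failures


def is_high_confidence(ev: HairpinEvidence) -> bool:
    """A locus qualifies only when no criterion fails."""
    return not failed_criteria(ev)


if __name__ == "__main__":
    locus = HairpinEvidence(120, 35, 0.8, 0.7, 0.75, -0.41, 2, 2)
    print(is_high_confidence(locus), failed_criteria(locus))
```

Returning the list of failed criteria, rather than a bare yes or no, makes it easier to see why a given locus was excluded.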
This concludes our overview of miRBase. We will be discussing other miRNA databases in detail in the upcoming articles.

References
1. Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2), 281-297.
2. Liu, B., Li, J., & Cairns, M. J. (2012). Identifying miRNAs, targets and functions. Briefings in bioinformatics, 15(1), 1-19.
3. He, L., & Hannon, G. J. (2004). MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5(7), 522.
4. Griffiths-Jones, S. (2006). miRBase: the microRNA sequence database. In MicroRNA Protocols (pp. 129-138). Humana Press.
5. Griffiths-Jones, S., Grocock, R. J., Van Dongen, S., Bateman, A., & Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic acids research, 34(suppl_1), D140-D144.
6. Griffiths-Jones, S. (2010). miRBase: microRNA sequences and annotation. Current protocols in bioinformatics, 29(1), 12-9.
7. Griffiths-Jones, S., Saini, H. K., van Dongen, S., & Enright, A. J. (2007). miRBase: tools for microRNA genomics. Nucleic acids research, 36(suppl_1), D154-D158.
8. Griffiths-Jones, S. (2004). The microRNA registry. Nucleic acids research, 32(suppl_1), D109-D111.
9. Rehmsmeier, M., Steffen, P., Höchsmann, M., & Giegerich, R. (2004). Fast and effective prediction of microRNA/target duplexes. RNA, 10(10), 1507-1517.
10. Kozomara, A., & Griffiths-Jones, S. (2013). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic acids research, 42(D1), D68-D73.
11. Enright, A. J., John, B., Gaul, U., Tuschl, T., Sander, C., & Marks, D. S. (2003). MicroRNA targets in Drosophila. Genome biology, 5(1), R1.
12. John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., et al. (2005). Correction: Human MicroRNA Targets. doi:10.1371/journal.pbio.0030264
ALGORITHMS
Algorithm and workflow of miRDB
As mentioned in the previous article, micro RNAs (miRNAs) are short endogenous non-coding RNAs (~22 nucleotides) [1], produced in single-celled eukaryotes, viruses, plants, and animals [2]. They play significant roles in various biological processes such as degradation of mRNA [3]. Several databases store a large amount of information about miRNAs. One of them, miRBase [4], was explained in the previous article; today we explain the algorithm of miRDB [5,6], another database for miRNA target prediction.
miRDB is an online database for miRNA target prediction and functional annotation [6]. It hosts its own target prediction results and has a wiki editing interface for miRNA annotations [6]. This wiki interface allows community editing, which results in a more active and interactive database. miRDB is freely accessible at http://mirdb.org/. miRDB has two databases and related web interfaces serving two main purposes [6]:
1. the retrieval of miRNA targets which are predicted by computational methods, and
2. miRNA functional annotation using the wiki editing interface.
miRDB predicts miRNA targets for five species: human, mouse, rat, chicken, and dog. It focuses mainly on mature miRNAs because they are the functional carriers of miRNA-mediated gene regulation [6,7]. Many studies have reported that miRNAs sharing the same seed sequence target a similar set of genes [8,9]; such miRNAs are therefore described as “functionally similar” [5]. In keeping with this focus, the functional annotation pages are organized by mature miRNA.

How are the targets predicted in miRDB?

miRNA sequences and nomenclature are taken from miRBase [4]. All the database tables in miRDB are linked to one another. The mRNA sequences and annotation files are taken from the NCBI databases [10,11]. mRNA 3’-UTR sequences are imported from the GenBank files using BioPerl (http://www.bioperl.org), and MirTarget2 [12] is used for the genome-wide target prediction. Multiple mRNAs of the same gene are mapped using NCBI gene index files, and the mRNA with the highest target prediction score is displayed on the website [5]. The prediction results are also available to download as a batch file from the ‘Data Download’ page.

Pathway data is also provided using the PANTHER database [13]. Target prediction is performed for each pathway to identify the miRNAs that are significantly associated with it. A hypergeometric test is performed to find the statistical significance of the pathway-specific targets, using all the genes in the genome as the background [5]. The functional annotation page consists of miRNA sequences, genes, nomenclature, references, and experimental evidence. The expression profile results of 40 human tissues detected with RT-PCR [14] are also included in miRDB.

An updated version of miRDB also allows a custom search using a user-provided miRNA or gene target sequence, and a search for unconventional target sites in the coding region or 5'-UTR. The miRNA data is also available to download; the current version, miRDB 5.0, implements the latest version of the target prediction tool, MirTarget V3 (http://mirdb.org/download.html). Recently, miRDB has added 2.1 million predicted gene targets regulated by 6709 miRNAs [6].

We will be discussing other miRNA databases in detail in the upcoming articles.
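The hypergeometric test mentioned above for pathway-specific targets can be sketched in a few lines. This is a minimal illustration, not miRDB's actual implementation: the function name and the way the four counts are defined (genome size, pathway size, number of predicted targets, and their overlap) are assumptions made here for the example.

```python
from math import comb


def hypergeom_enrichment_p(overlap: int, genome_genes: int,
                           pathway_genes: int, target_genes: int) -> float:
    """Upper-tail hypergeometric P-value, P(X >= overlap).

    X is the number of pathway genes found among `target_genes` genes drawn
    without replacement from a genome of `genome_genes` genes, of which
    `pathway_genes` belong to the pathway. All names are illustrative.
    """
    lowest = max(overlap, 0)
    highest = min(pathway_genes, target_genes)
    total = comb(genome_genes, target_genes)
    favourable = sum(
        comb(pathway_genes, i) * comb(genome_genes - pathway_genes, target_genes - i)
        for i in range(lowest, highest + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Toy numbers: 20,000 genes, a 150-gene pathway, 400 predicted targets,
    # 12 of which fall in the pathway.
    print(hypergeom_enrichment_p(12, 20000, 150, 400))
```

A small P-value suggests that the miRNA's predicted targets fall in the pathway more often than expected by chance; in practice, a correction for testing many pathways would normally be applied as well.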
References
1. Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116(2), 281-297.
2. Liu, B., Li, J., & Cairns, M. J. (2012). Identifying miRNAs, targets and functions. Briefings in bioinformatics, 15(1), 1-19.
3. He, L., & Hannon, G. J. (2004). MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5(7), 522.
4. Sam Griffiths-Jones, Russell J. Grocock, Stijn van Dongen, Alex Bateman and Anton J. Enright. miRBase: microRNA sequences, targets, and gene nomenclature. Nucleic Acids Research, 2006, Vol. 34, Database issue, D140–D144. doi:10.1093/nar/gkj112
5. Xiaowei Wang. miRDB: A microRNA target prediction and functional annotation database with a wiki interface. RNA (2008), 14:1012–1017.
6. Wong, N., & Wang, X. (2014). miRDB: an online resource for microRNA target prediction and functional annotations. Nucleic acids research, 43(D1), D146-D152.
7. Bartel D. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004;116:281–97.
8. Lewis, B.P., Burge, C.B., and Bartel, D.P. 2005. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120: 15–20.
9. Linsley, P.S., Schelter, J., Burchard, J., Kibukawa, M., Martin, M.M., Bartz, S.R., Johnson, J.M., Cummins, J.M., Raymond, C.K., Dai, H., et al. 2007. Transcripts targeted by the microRNA-16 family cooperatively regulate cell cycle progression. Mol. Cell. Biol. 27: 2240–2252.
10. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2007. GenBank. Nucleic Acids Res. 35: D21–D25. doi: 10.1093/nar/gkl986
11. Maglott, D., Ostell, J., Pruitt, K.D., and Tatusova, T. 2007. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Res. 35: D26–D31. doi: 10.1093/nar/gkl993
12. Wang, X. and El Naqa, I.M. 2008. Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics 24: 325–332.
13. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Muruganujan, A., Doremieux, O., Campbell, M.J., et al. 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33: D284–D288. doi: 10.1093/nar/gki078
14. Liang, Y., Ridzon, D., Wong, L., and Chen, C. 2007. Characterization of microRNA expression profiles in normal human tissues. BMC Genomics 8: 166
BIOINFORMATICS NEWS
How to get supercomputing facilities for Bioinformatics analyses?
Running molecular simulations and docking for large molecules is difficult without proper supercomputing facilities, because these jobs need a lot of memory and time to finish.

The Centre for Development of Advanced Computing (C-DAC) is located in India and is well known for its supercomputing facilities. The Bioinformatics Resources & Applications Facility (BRAF) at C-DAC is funded by the Department of Information Technology (MeitY), Ministry of Communications and Information Technology, Government of India. It provides computational applications to the bioinformatics community on demand, in the manner of cloud computing. The High-Performance Computing (HPC) BRAF at C-DAC is a technology promoter that provides tools for analysis, data mining, and simulation of biological data.

Recently, BRAF at C-DAC has developed a dedicated server to provide a supercomputing facility to researchers in the field of bioinformatics. This server allows users to perform the following tasks using various software:
Molecular modeling
Molecular docking of large molecules
Sequence analysis
Genome alignment
RNA analysis algorithm
Repeats finding
Gene finding
Microarray data analysis
ab-initio methods
All registered users can remotely log in to the server and submit their jobs. Remote access is provided to national and international organizations, as well as to industry, on request and subject to terms and conditions. For further information, visit the C-DAC BRAF website.