Apr 2016 VOL 2 ISSUE 4
“Science is a way of thinking much more than it is a body of knowledge.” - Carl Sagan
MOTIF: Functional Unit of an Interaction Network
Foldalign: a tool for secondary structure alignment
Contents
April 2016

Topics
Editorial
Tools
Foldalign: a tool for secondary structure alignment of RNA
Data Mining
Bioinformatics data mining: an introduction
Systems Biology
MOTIF: Functional Unit of an Interaction Network
EDITOR Dr. PRASHANT PANT
FOUNDER TARIQ ABDULLAH
EDITORIAL
EXECUTIVE EDITOR FOZAIL AHMAD
FOUNDING EDITOR MUNIBA FAIZA
SECTION EDITORS ALTAF ABDUL KALAM MANISH KUMAR MISHRA SANJAY KUMAR PRAKASH JHA NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published quarterly for one year (4 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust. Published in India
EDITORIAL: Looking forward and beyond
EDITORIAL
It’s all warm at BiR as we inch towards our next big hit through passionate and selfless efforts. The team has been up-regulated in its responsibilities and assignments, assuring high-quality content and the finest bioinformatics news to keep you restlessly looking forward to the next issue.
Fozail Ahmad
Section Editor
Since its inception, we at BiR have tried to touch almost every aspect of bioinformatics, encompassing tools, software, databases, and some flavor from the core biological domain and general statistics. Since BiR tries its best to bring every possible domain of the biological sciences onto the computational platform, the full breadth of content is being gathered and prepared with the intention of presenting a satisfying magazine to you. At present, we cover various fields and domains of bioinformatics to reflect the multidisciplinary nature of this discipline. Bioinformatics news and updates are not limited to genomics, proteomics, and phylogenetic studies; studies from veterinary sciences, climate change, waste cleanup, comparative studies, and alternative energy sources have also been brought into the discipline. One may say that where there is bioinformatics, there is a complete world touching the above-mentioned fields. Alongside newly emerging approaches, BiR presents the core and fundamentals of the biological sciences, as they are essential to achieving the motto of this platform. Among others, Systems Biology and Structural Biology deserve special mention for the variety of new tools, modified database repositories, a newly adopted method for visualizing active sites in a protein, advanced algorithms for assessing the relationship between two regulatory proteins, the development of machine learning tools, and improved statistical methods for biological network analysis. These are the backbones that keep the discipline standing straight. In fact, we are in a position to shift our paradigm from tradition to beyond convention by providing short news stories in a separate column in the magazine itself. Fundamentals are important, but what is often missing is at least one representative example of each bioinformatics application, which would strengthen the discipline and raise its standard as a whole.
Letters and responses: info@bioinformaticsreview.com
We have a large spread of topics that are technically relevant to the bioinformatics domain, but instead of focusing on how to establish coordination between them, we are rushing towards the core of our own field. A structural biologist is in no mood to cooperate with a systems biologist and, on the other hand, a programming language expert rarely cares to learn the fundamentals of biology. The same pattern shows up in the magazine too. Fields such as chemo-informatics, and even wastewater cleanup, are not woven in, for they are treated as intangible support; yet they are part of the discipline’s reach and deserve an explicit description on an open platform. It’s always nice to post an article on general statistics on a bioinformatics site while retaining a fragment of systems biology, algorithms, languages, protein biology, tools, and phylogeny. Yet we rarely come to know the indirect relationship between what we have written and its real application. The need to provide valuable information on bioinformatics can’t be fulfilled by sentiment and blind faith; both subscribers and the team share an equal responsibility to carry forward the heritage of coordinating all approaches of the discipline towards their real ports of application. We will also accept entries from our audience and, if possible, we will also award them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors, and industries. We will also provide a free job listing service for anyone who can benefit from it. We look forward to your thoughts, comments, and feedback, which you can send to info@bioinformaticsreview.com
TOOLS
Foldalign: a tool for secondary structure alignment of RNA
Image Credit: Google Images
Secondary structure formation and conformational changes play a key role in understanding molecular evolution and its functional aspects. Among DNA, RNA, and Protein, secondary structure analyses of RNA and Proteins have attracted a lot of research and development.
RNA secondary structure has proved necessary for understanding the regulatory functions of microRNAs, for inferring the function of an RNA molecule, for understanding the ability of small RNAs to regulate gene expression, and so on. This is leading to an increasing demand for finding structured RNAs in the genomic and transcriptomic context, which is a difficult task. In this case, local pairwise RNA alignment may be helpful, and some tools are available for it. Foldalign is one such tool; it implements local pairwise RNA structural alignment based on the Sankoff algorithm (Sankoff, 1985). CMfinder also performs pairwise alignment, but it is not local. The Sankoff algorithm is computationally complex and cannot handle long sequences. Foldalign overcomes these difficulties: it can align long sequences, is less complex, and requires less processing time and memory. Sundfeld et al. (2015) developed a multithreaded parallel version, implemented in C++.

ALGORITHM:
Foldalign version 2.5 is a multithreaded implementation of local pairwise structural alignment of RNA. It uses only six nested loops and calculates the alignments of many pairs in parallel. The dynamic programming matrix is divided into two memories: long-term memory and short-term memory. Long-term memory consists of the cells that can only be part of a multi-branch loop, and short-term memory consists of the other cells. In this multithreaded version of Foldalign, several threads are created and each thread works on its own cells. The computation starts from the end of each sequence and proceeds back to the first position, i.e., i = L1, L1-1, ..., 1 (for one sequence, named S1) and k = L2, L2-1, ..., 1 (for the other sequence, S2). Many cells are calculated in parallel. Every thread calculates its cells sequentially, working through the full length of one sequence and then restarting from the beginning of the second sequence, so that cells are aligned one by one or, in other words, residues are aligned to residues, implementing the local pairwise alignment.
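The iteration order described above can be pictured with a toy sketch. This is not the Foldalign recursion and knows nothing about RNA structure: it only fills a small dynamic programming table in the order i = L1, ..., 1, sharing the k cells of each row among threads. The cell rule (length of the ungapped match starting at positions i and k) and the names `parallel_suffix_match_table` and `best_local_match` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_suffix_match_table(s1, s2, n_threads=2):
    """Toy DP table filled for i = L1..1, with each row's k cells split among threads.

    table[i][k] (1-based) is the length of the ungapped match that starts at
    s1[i] and s2[k]. This stands in for the far richer Foldalign cell update.
    """
    L1, L2 = len(s1), len(s2)
    # One extra row and column of zeros acts as the boundary past the sequence ends.
    table = [[0] * (L2 + 2) for _ in range(L1 + 2)]

    def fill_cells(i, ks):
        # A thread writes only its own cells of row i and reads only row i + 1,
        # which was completed in the previous iteration.
        for k in ks:
            if s1[i - 1] == s2[k - 1]:
                table[i][k] = table[i + 1][k + 1] + 1

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for i in range(L1, 0, -1):                  # i = L1, L1-1, ..., 1
            ks = list(range(L2, 0, -1))             # k = L2, L2-1, ..., 1
            chunks = [ks[t::n_threads] for t in range(n_threads)]
            # list(...) waits for every thread before row i - 1 is started.
            list(pool.map(fill_cells, [i] * n_threads, chunks))
    return table


def best_local_match(s1, s2, n_threads=2):
    """Return the longest ungapped stretch shared by s1 and s2 (toy local alignment)."""
    table = parallel_suffix_match_table(s1, s2, n_threads)
    length, i, _ = max(
        (table[i][k], i, k)
        for i in range(1, len(s1) + 1)
        for k in range(1, len(s2) + 1)
    )
    return s1[i - 1:i - 1 + length]


print(best_local_match("GGCAUGCC", "AACAUGAA"))
```

The point of the sketch is the dependency pattern: because row i only reads row i + 1, all cells of a row can be computed at the same time, which is the kind of independence a multithreaded implementation can exploit.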
Fig. 1 (a) Parallel design example for two sequences. Every cell corresponds to a bidimensional matrix. Red and blue cells are processed by threads t1 and t2, respectively. Dark red/blue cells have already been processed, light red/blue cells are being processed, and white or grey cells are to be processed next. The dashed area represents cells that are being read and written by one thread.

Foldalign can take longer sequences as input, completes the structural alignment in a reasonable time, saves computer memory, and is able to produce accurate predictions.

References:
Sundfeld, D., Havgaard, J. H., de Melo, A. C., & Gorodkin, J. (2015). Foldalign 2.5: multithreaded implementation for pairwise structural RNA alignment. Bioinformatics, 32(8), 1238-1240.

Note: For any query, please write to muniba@bioinformaticsreview.com
DATA MINING
Bioinformatics data mining: an introduction
Image Credit: Google Images
Bioinformaticians handle large amounts of data, in terabytes if not gigabytes, so it becomes important not only to store such massive data but also to make sense of it. In this article, I will talk about what data mining is and how bioinformaticians can benefit from it.

What is data mining?
Data mining is the process of discovering new data, patterns, information, or understandable models from a huge amount of data that already exists. It is sometimes also referred to as "Knowledge Discovery in Databases" (KDD). It has been successfully applied in bioinformatics, which is data-rich and requires essential findings in areas such as gene expression, protein modeling, drug discovery, and so on. The development of novel data mining methods provides a useful way to understand the rapidly expanding biological data.

Now let's discuss the basic concepts of data mining, and then we will move to its applications in bioinformatics. I will also discuss some data mining tools in upcoming articles. As defined earlier, data mining is a process of automatic generation of information from existing data. The major goals of data mining are "prediction" and "description". The main tasks that can be performed with it are as follows:
Classification: Learning a function that maps (classifies) an input data item into one of several predefined classes (i.e., classes known from existing data).
Estimation: Assigns a value to a given data input.
Prediction: Involves both classification and estimation, but the data are classified on the basis of some future behavior or an estimated future value.
Association rules: Also known as dependency modeling; it determines which data items are associated with each other and what the outcomes may be.
Clustering: Separating the population into subgroups or clusters.
Description & Visualization: Representing the data with the help of visualization techniques and tools.

Data learning is composed of two main categories: directed (supervised) learning and undirected (unsupervised) learning.

Classification, Estimation, and Prediction fall under supervised learning, and the remaining three tasks (Association rules, Clustering, and Description & Visualization) come under unsupervised learning. In the former category, relationships are established among the variables, whereas in the latter category patterns are identified.

Data mining has proved to be very effective and useful in bioinformatics, for example in microarray analysis, gene finding, domain identification, protein function prediction, disease identification, drug discovery, and so on.

For follow up, please write to muniba@bioinformaticsreview.com.

References:
1. K Raza. Application of Data Mining in Bioinformatics, Indian Journal of Computer Science and Engineering, Vol 1 No 2, 114-118.
2. Mohammed J Zaki, Data Mining in Bioinformatics (BIOKDD), Algorithms for Molecular Biology 2007 2:4, DOI: 10.1186/1748-7188-2-4
3. Prof. Xiaohua (Tony) Hu, Editor, International Journal of Data Mining and Bioinformatics
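As a closing illustration of the two learning categories described in this article, here is a minimal sketch in plain Python. The expression values, the labels "healthy" and "disease", and the function names are all made up for the example. A nearest-centroid classifier stands in for supervised learning (classes are given in advance) and a bare-bones k-means stands in for unsupervised learning (groups are discovered without labels). Real analyses would normally use an established library instead.

```python
from statistics import mean


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(mean(column) for column in zip(*points))


def train_nearest_centroid(samples, labels):
    """Supervised: learn one centroid per predefined class from labelled data."""
    model = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        model[label] = centroid(members)
    return model


def classify(model, sample):
    """Assign a new sample to the class with the nearest centroid."""
    return min(model, key=lambda label: sq_dist(model[label], sample))


def kmeans(samples, k, n_iter=10):
    """Unsupervised: split samples into k clusters without using any labels."""
    centroids = list(samples[:k])          # naive start: the first k samples
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda j: sq_dist(centroids[j], s))
            clusters[nearest].append(s)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters


# Hypothetical expression levels of two genes in six samples.
samples = [(1.0, 1.2), (3.0, 3.2), (0.8, 1.0), (2.9, 3.1), (1.1, 0.9), (3.2, 2.8)]
labels = ["healthy", "disease", "healthy", "disease", "healthy", "disease"]

model = train_nearest_centroid(samples, labels)
print(classify(model, (2.8, 3.0)))   # classification of a new, unlabelled sample
print(kmeans(samples, k=2))          # clustering of the same data, labels ignored
```

The classifier needs the labels to learn anything, whereas the clustering step recovers the two groups from the numbers alone; that is the practical difference between the two categories.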
SYSTEMS BIOLOGY
MOTIF: Functional Unit of an Interaction Network
Image Credit: Google Images
In a network, the integration of elements and interacting components enables the identification of conserved modules and motifs. Topological analysis, moreover, reveals much about the nature and functions of a network and provides sufficient statistics for further study. Supported by several data types, viz. interaction data, expression data, Boolean data, and raw sequence data, modules and motifs provide an easy way to understand the specific function of a gene or protein. Basically, network motifs are characteristic network patterns, comprising both transcriptional regulation and protein-protein interactions, that recur more often than in a random network.

The idea of the network motif (subgraph) was presented by Uri Alon and his group in 2002 [1], when motifs were discovered in the gene regulation network of E. coli and then in a large set of natural networks. According to their occurrence and behavior in a network, “motifs are subgraph recurring repeatedly, defined by a particular pattern of interaction between vertices that reflect a framework in which particular functions are achieved”. They are of vital importance largely because they may display functional properties and may also provide deep insight into a network’s functional abilities. Significant studies have been carried out from the perspective of biological application as well as computational theory. Biological analysis mainly endeavors to interpret the functions of network motifs associated with genetic regulation, as the first motifs were found in the transcription networks of E. coli as well as yeast and other higher organisms. Apart from those of genetic regulation, some distinct motifs were also discovered in neural networks and protein interaction networks (fig. 1).
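The working definition above (a pattern that recurs more often than in a random network) can be sketched in a few lines. The toy network, the function names, and the randomization scheme are assumptions for illustration: random networks here simply keep the same nodes and number of edges, whereas dedicated motif finders typically use stricter randomizations, for example ones that preserve each node's degree.

```python
import random
from itertools import permutations
from statistics import mean, pstdev


def count_ffl(edges):
    """Count feed-forward loops: node triples with a->b, b->c and a->c."""
    edge_set = set(edges)
    nodes = {n for edge in edges for n in edge}
    return sum(
        1
        for a, b, c in permutations(nodes, 3)
        if (a, b) in edge_set and (b, c) in edge_set and (a, c) in edge_set
    )


def random_network(nodes, n_edges, rng):
    """Random directed network with the same nodes and number of edges."""
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(possible, n_edges)


def motif_z_score(edges, n_random=200, seed=0):
    """Compare the observed FFL count with counts in randomized networks."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    observed = count_ffl(edges)
    counts = [count_ffl(random_network(nodes, len(edges), rng))
              for _ in range(n_random)]
    mu, sd = mean(counts), pstdev(counts)
    z = (observed - mu) / sd if sd > 0 else float("inf")
    return observed, mu, z


# Hypothetical regulatory network containing three feed-forward loops.
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("A", "D"), ("D", "E"), ("A", "E"),
    ("F", "G"), ("G", "H"), ("F", "H"),
]
observed, mu, z = motif_z_score(toy_edges)
print(f"observed FFLs: {observed}, random mean: {mu:.2f}, z-score: {z:.2f}")
```

A clearly positive z-score means the pattern is over-represented relative to the random ensemble, which is the sense in which a subgraph is called a motif.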
Fig. 1. Different types of motifs in biological networks. (Courtesy: Google Images).

Statistically, a motif is identified as a pattern that occurs at least five times in a network and significantly more often than in a random network. Even with only two or three nodes, the network can be randomized to assess how many patterns arise by chance; the choice depends on the analysis one has to perform. Patterns with two, three, four, and five nodes are significant, as they occur more frequently in a network than other patterns. Based on directionality, connectivity, pattern, regulation, and the number of nodes, motifs are classified into various categories as below:

1. Negative auto-regulation (NAR)
One of the simplest and most abundant network motifs is negative auto-regulation, in which a transcription factor represses its own transcription (fig. 2-a). Its general function is in response regulation, for example in the SOS DNA repair system response. NAR was observed to speed up the response to signals in a synthetic transcription network. It also increases the stability of the auto-regulated gene product concentration against stochastic noise, reducing variations in protein levels between different cells.

2. Positive auto-regulation (PAR)
It is characterized by the enhancement of transcription by the gene's own product (fig. 2-a). Comparatively, it shows a slower response than NAR. When the auto-regulation is strong, PAR can lead to a bimodal distribution of protein levels in cell populations.

3. Feed-forward loops (FFL)
This motif commonly occurs in many genetic regulatory networks and consists of three genes and three regulatory interactions (fig. 2-b). In the diagram, the target gene C is regulated by two transcription factors (TFs), A and B, and in addition TF B is also regulated by TF A. Since each of the regulatory interactions may be either positive or negative, there are eight possible types of FFL motifs. Computationally, in most cases the two inputs of an FFL are combined by an AND or an OR gate, but other input functions are also possible.

Fig. 2. Different types of loops and motifs in biological networks: a. auto-regulation, b. feed-forward motif, c. coherent and incoherent loops, d. different types of patterns (motifs) commonly occurring in biological networks. They occur in almost every biological system and represent specific regulatory functional units. (Courtesy: Google Images).
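The count of eight FFL types follows from three interactions with two possible signs each. The short sketch below enumerates them and labels each as coherent or incoherent, using the usual rule that a loop is coherent when the indirect path A -> B -> C has the same overall sign as the direct path A -> C. The +1/-1 encoding and the name `classify_ffl` are illustrative assumptions.

```python
from itertools import product


def classify_ffl(a_to_b, b_to_c, a_to_c):
    """Classify an FFL from its three signs (+1 activation, -1 repression).

    The loop is coherent when the sign of the indirect path A -> B -> C
    (the product of its two signs) equals the sign of the direct path A -> C.
    """
    indirect = a_to_b * b_to_c
    return "coherent" if indirect == a_to_c else "incoherent"


# All combinations of signs for (A->B, B->C, A->C): 2 * 2 * 2 = 8 types.
ffl_types = {signs: classify_ffl(*signs) for signs in product((+1, -1), repeat=3)}

for signs, kind in ffl_types.items():
    print(signs, kind)
```

In this encoding the all-activating loop (+1, +1, +1) is the coherent type 1 FFL, and (+1, -1, +1), where B represses C while A activates both, is the incoherent type 1 FFL.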
4. Coherent type 1 FFL (C1-FFL)
This is one of the sub-types of FFL, characterized by pulse filtration: a short pulse of signal will not generate a response, whereas a persistent signal will generate a response after a short delay. Importantly, the response shuts off quickly once the signal is switched off. Such a vital mode of signal transduction in genetic or cellular regulatory systems is observed in metabolic pathways and protein-gene interaction networks.

5. Incoherent type 1 FFL (I1-FFL)
It is known to be a pulse generator and response accelerator. Its two signal pathways act in opposite ways: one activates the target and the other represses it. Once the repression sets in, pulse dynamics are generated. Importantly, it speeds up the activation of any gene, not necessarily a gene encoding a transcription factor. Feed-forward regulation shows better regulation than negative feedback.
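The pulse-filtering behavior of the coherent type 1 FFL can be reproduced with a very small simulation. The sketch assumes AND logic at the target (C is produced only while the signal is on and B exceeds a threshold), unit production and decay rates, and simple Euler integration; all parameter values and the name `simulate_c1_ffl` are illustrative and not taken from any measured system.

```python
def simulate_c1_ffl(pulse_length, t_end=6.0, dt=0.01, threshold=0.5):
    """Simulate a C1-FFL with AND logic driven by a signal pulse.

    The signal (active A) is on for 0 <= t < pulse_length. B accumulates
    while the signal is on; C is produced only when the signal is on AND
    B is above the threshold. Returns the onset time of C production
    (None if it never starts) and the trace of C over time.
    """
    b = c = 0.0
    onset = None
    c_trace = []
    for step in range(int(round(t_end / dt))):
        t = step * dt
        signal_on = t < pulse_length
        c_on = signal_on and b > threshold
        if c_on and onset is None:
            onset = t
        # Euler step: production (0 or 1) minus first-order decay.
        b_next = b + dt * ((1.0 if signal_on else 0.0) - b)
        c = c + dt * ((1.0 if c_on else 0.0) - c)
        b = b_next
        c_trace.append(c)
    return onset, c_trace


short_onset, short_trace = simulate_c1_ffl(pulse_length=0.3)
long_onset, long_trace = simulate_c1_ffl(pulse_length=3.0)
print("short pulse: onset =", short_onset, "peak C =", max(short_trace))
print("long pulse:  onset =", long_onset, "peak C =", round(max(long_trace), 2))
```

With these assumed parameters the short pulse ends before B reaches the threshold, so C is never produced, while the persistent signal switches C on after a delay. Production of C also stops at the very moment the signal is removed, because the AND gate needs the signal itself.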
6. Multi-output FFLs
The same regulator controls multiple genes of the same system.

7. Single-input modules (SIM)
This motif occurs when a single regulator regulates a set of genes with no additional regulation. This is significant when the genes carry out a specific function together and therefore need to be activated in a synchronized manner.

A possible confirmation of motif importance is motif conservation: in evolution, conservation implies importance. The conservation of a protein in the network may be taken as an indication of the biological importance of the motif it belongs to. The conservation of a motif reflects evolutionary pressure, which can be followed to find orthologs in other organisms. Wuchty et al., 2003 [2] tested this hypothesis by examining the correlation between a protein's evolutionary rate and the structure of the motif it is embedded in. The conservation of motif components was found to be tens to thousands of times higher than expected at random, suggesting conservation of motif constituents.

Motifs representing small functional units or subgraphs in a network are found using different software such as Mfinder (http://www.weizmann.ac.il/mcb/UriAlon/groupNetworkMotifSW.htm), MODIS, FANMOD (http://theinf1.informatik.uni-jena.de/~wernicke/motifs/index.htm), MAVisto (http://mavisto.ipk-gatersleben.de), and igraph (https://cran.r-project.org/package=igraph). HOMER and Motif-X, and online tools like Amadeus, Web Motif, and the MEME suite, serve the related purpose of finding motifs in sequences.

References:
1. U Alon, Network motifs: theory and experimental approaches, Nature Reviews Genetics 8, 450-461 (June 2007), doi:10.1038.
2. Wuchty S, et al. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet 35(2):176-92.
Note: An exhaustive list of references for this article is available from the author on personal request; for more details write to muniba@bioinformaticsreview.com.
Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and never miss out on any of your favorite topics. Log on to https://www.bioinformaticsreview.com