NOVEMBER 2015 VOL 1 ISSUE 2
“A cell is regarded as the true biological atom.” - George Henry Lewes
Explained: CRISPR-ERA and CRISPR/Cas9 system
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads
Public Service Ad sponsored by IQLBioinformatics
Contents
November 2015
Topics

Editorial

Systems Biology
Cancer: From the Eyes of Mathematical and Systems Biology
Introduction to Mathematical Modelling Part-3
Introduction to Mathematical Modelling (Last Part)
Explore Tuberculosis: A Systems Biology Approach

Tools
Explained: CRISPR-ERA and CRISPR/Cas9 system
Installing Gromacs on Ubuntu for MD Simulation
Cl-Dash: Speeding up Cloud Computing in Bioinformatics

Software
IBS: Modifying the organization of biological sequences diagrammatically

Sequence Analysis
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

Genomics News
Structural Identification of Macromolecules in Solution with DARA Webserver
GenomeD3 plot: Easy visualization of genomes
EDITORIAL BOARD
CHIEF EDITOR: Dr. Prashant Pant
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91. 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS: To contact any of the Bioinformatics Review staff members, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust. Published in India.
EDITORIAL
With speculations about the future in mind, moving ahead slowly and steadily is not only an option but also wisdom. BiR, in its second month, moves ahead with a similar philosophy. This month’s highlight is BiR’s very first public showcasing and representation to the scientific community, at an international conference to be held in Prague, Czech Republic (EU), on a newly emerging field: the importance of soil microbes as drivers of various processes. This has been a research area of immense interest to me, and I would like to share a few things about it.

Soil microbial diversity was long regarded as less worthy of attention than other forms of life, until it was recently discovered that more than 95% of the microbial diversity in any environmental sample is unknown and uncultured, and holds huge biotechnological, medical and agronomical potential. This kick-started a new branch of genomics and bioinformatics, now popularly known as metagenomics, which deals with community genomes from environmental samples. Metagenomics uses the tools and techniques of genomics together with computational biology to handle the large data sets derived from multiple genomes by next generation sequencing (NGS) technologies. It is one of the sources of Big Data coming from molecular biologists. Even today, the primary concern is to sequence DNA from environmental samples and correlate the metagenomic data with its probable functions, as opposed to conventional culture-based approaches.

It was for these reasons that we chose this international platform to introduce BiR to the world’s scientific community and to showcase it as an excellent vector for propagating scientific news and developments. With these slow and small efforts, we hope for a steady metamorphosis of BiR into a known standard for scientific reports and news.
Dr. Prashant Pant
Editor-in-Chief
Letters and responses: prashant@bioinformaticsreview.com
SYSTEMS BIOLOGY
Cancer: From the Eyes of Mathematical Biology
Sanjay Kumar
Image Credit: Google Images
“A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells, molecular biologists call it a mutant variety of some biomolecules forcing a cell to commit such an uncontrolled cell division cycle.”
The month of November has just arrived with its generic glimpse of winter. We welcome this month with an evergreen and hot topic of cancer research. This time we intend to introduce you to an old research topic with a new vision.
Cancer, being an ailment with no fully reliable remedy, has been pursued as a career by a lot of researchers. A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells; molecular biologists call it a mutant variety of some biomolecules forcing a cell to commit to such an uncontrolled cell division cycle. But how does a systems biologist see such a problem? Let us try to pursue it in a different way. When proteins are not assigned a name based on their function or structure, scientists label them according to their molecular weight, e.g. p53, p200, p19 etc. Scientists have shown an abnormally high expression of p53 protein in cancerous cells/tissues. p53 drives the expression of other proteins that regulate the cell cycle and, in the normal scenario, allow the cell to divide in two; p53 also promotes the production of its own inhibitor, the Mdm2 protein. A mutation in p53 that prevents it from recognising abnormalities means that p53 does not increase in response to damage, and consequently neither do Mdm2, p21 and the other p53-regulated proteins. Thus, the division of abnormal cells continues indefinitely and causes cancer.
From a mathematical biology perspective, systems biologists write ordinary differential equations that look like mathematical formulae. These formulae are nothing other than representations of the chemical reactions, and combinations of reactions, occurring inside a cell. In our previous blogs (by Fozail Ahmad), we described how to combine chemical reactions into Ordinary Differential Equations (ODEs) and how we use zero-order chemical kinetics (the reaction rate does not depend on the concentration of any participating chemical), first-order chemical kinetics (the reaction rate depends on the concentration of only one participating chemical) and second-order chemical kinetics (the reaction rate depends on the concentrations of two participating chemicals, or on the square of one) to form the equations. In addition, I would like to mention that some reactions occur with the help of biomolecular machinery. These machines (enzymes) help the reactions to occur but are not themselves consumed, and thus affect the reaction through a different form of kinetics, as described in 1913 in the combined work of the German biochemist Leonor Michaelis and the Canadian physician-scientist Maud Menten.
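The rate laws named above can be written down directly. The following is a minimal sketch, assuming the usual textbook forms of each law; the function names and the numerical values of `vmax` and `km` are illustrative assumptions and do not come from the p53 model discussed in this article.

```python
def zero_order_rate(k):
    """Zero-order: the rate is constant and independent of any concentration."""
    return k


def first_order_rate(k, a):
    """First-order: the rate is proportional to one concentration."""
    return k * a


def second_order_rate(k, a, b):
    """Second-order: the rate is proportional to the product of two concentrations."""
    return k * a * b


def michaelis_menten_rate(vmax, km, s):
    """Enzyme-catalysed rate: roughly linear in s at low s, saturating at vmax."""
    return vmax * s / (km + s)


if __name__ == "__main__":
    vmax, km = 10.0, 2.0  # illustrative values only
    for s in (0.5, 2.0, 20.0, 200.0):
        print(s, michaelis_menten_rate(vmax, km, s))
```

At a substrate concentration equal to `km` the Michaelis-Menten rate is exactly half of `vmax`, which is one quick way to sanity-check an implementation.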
So, in a normal cell, when p53 senses danger it signals the cell by increasing p21, which combines with PCNA (Proliferating Cell Nuclear Antigen, a protein that helps in cell division) and stops the cell division. This type of cell cycle is shown in one of the accompanying diagrams; the mutated case, in which p53 cannot sense the cellular damage and the cell therefore divides as if it were normal, is shown in another.
We have also included a combined picture, which shows what the different stages of mathematical biology look like. These figures contrast cancer cells with normal cells.
Reference: Alam MJ, Kumar S, Singh V, Singh RKB (2015) Bifurcation in Cell Cycle Dynamics Regulated by p53. PLoS ONE 10(6): e0129620. doi:10.1371/journal.pone.0129620
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129620
SYSTEMS BIOLOGY
Introduction to Mathematical Modelling. (Part 3 of 3)
Fozail Ahmad
Image Credit: Stock Photos
“For modeling the systems behavior, suitable methods have been developed. Among them are two methods, commonly used in modeling of metabolic process, modeling of signaling and regulatory pathways.”
Derivation of Mathematical Equations for Understanding Systems Behaviour:
Depending upon the nature of the biological process, it is essential to understand the different modeling approaches, as a number of methods have been used for different biological systems. Functionally, most cellular processes are dynamic and change with the environment, for example the signaling or regulation of specific genes when a cell is exposed to an unusual medium. To describe such time-dependent phenomena it is necessary to choose mathematical equations that can capture these dynamic effects. In other biological systems, where cellular products/molecules do not change over time, i.e., their concentration remains the same, it is not necessary to describe the details of the underlying dynamics. For modeling the system’s behavior, suitable methods have been developed. Among them are two commonly used methods: modeling of metabolic processes, and modeling of signaling and regulatory pathways.

1. Modeling Metabolic Process
Metabolism is an essential process in all living beings that provides energy and building blocks for survival, synthesis of larger molecules and degradation of unnecessary/toxic substances in a cell. Understanding metabolic mechanisms has been a major research interest for decades, but the complete interplay of the underlying mechanisms is not yet understood.

One of the key parameters in any metabolic study is the metabolic flux, that is, the utilization (conversion) of metabolites along metabolic pathways. It is therefore important to understand and predict the metabolic flux for all patterns of metabolism, since the fluxes indicate which biochemical routes are being utilized. Here the model is fitted to a hypothetical framework, or to a known biochemical route, so as to identify any particular step in the production/degradation of a desirable molecule by a cultured cell/bacterium that limits the overall rate at which the process occurs (a metabolic bottleneck). The result of the study directs the researcher on how to genetically modify the cell or bacterium to optimize the yield of the particular end product.
Metabolic fluxes are of less concern when the biochemical processes operate in steady state and the entry of unnecessary molecules is completely blocked, which leaves the process quasi-stationary, with no external perturbation taking place.

As an example of such a quasi-stationary metabolic process, we may consider the conversion of a sugar to a sugar phosphate. In this process, the enzyme hexokinase adds a phosphate group to glucose ($\mathrm{C_6H_{12}O_6}$), yielding the compound glucose-6-phosphate. This reaction should be balanced in terms of atoms and electrical charges. In chemical notation, the reaction is written as

$$\mathrm{C_6H_{12}O_6 + ATP \rightarrow C_6H_{11}O_6PO_3^{2-} + ADP^{2-} + H^+}$$

In this reaction, both sides of the equation are in stoichiometric balance. When investigating a more complex metabolic network, each individual chemical reaction is bound by stoichiometric balance constraints, such as mass, number of molecules, concentration and charges on the reactants, that can be used to formulate mathematical equations. One should keep in mind that these reaction constraints are not independent of each other and should be solved together to develop a reliable mathematical model. The validity of the models can be tested through wet lab techniques using detectable or radioactively labeled substances. Labeled atoms can be traced across a number of key metabolites, indicating the cellular flux distribution, which helps to validate or disprove the metabolic network model.
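The atom-balance constraint described above can be checked mechanically. The following is a minimal sketch, not part of the original article: it assumes the neutral molecular formulas of glucose, ATP, glucose-6-phosphate and ADP (rather than the charged species written in the reaction above), and the helper names `parse_formula` and `atom_imbalance` are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def atom_imbalance(reactants, products):
    """Return {element: reactant atoms - product atoms} for elements that do not balance."""
    total = Counter()
    for formula in reactants:
        total.update(parse_formula(formula))
    for formula in products:
        total.subtract(parse_formula(formula))
    return {element: n for element, n in total.items() if n != 0}


# Neutral formulas (an assumption made for this sketch).
GLUCOSE = "C6H12O6"
ATP = "C10H16N5O13P3"
G6P = "C6H13O9P"
ADP = "C10H15N5O10P2"

if __name__ == "__main__":
    # An empty dict means every element balances.
    print(atom_imbalance([GLUCOSE, ATP], [G6P, ADP]))
```

In a larger network the same bookkeeping is usually collected into a stoichiometric matrix, so that all balance constraints can be solved together rather than one reaction at a time.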
TOOLS
Explained: CRISPR-ERA and CRISPR/Cas9 system
Tariq Abdullah
“When a viral DNA (bacteriophage, in this case) integrates into the bacterial genome, it produces RNA which is taken up by Cas9.”
The CRISPR/Cas9 system is a bacterial defence mechanism against bacteriophage infection. When viral DNA (bacteriophage, in this case) integrates into the bacterial genome, it produces RNA which is taken up by Cas9. Cas9 and the RNA together drift through the cell, and as soon as they encounter a sequence complementary to the RNA, the complex attaches to it and Cas9 cuts the DNA at that point. Because the viral DNA is cut, the virus is prevented from multiplying. Thus the bacterium defends itself by precisely snipping out the viral DNA using the CRISPR/Cas9 system.

The recent application of the CRISPR/Cas9 system to gene editing in human beings, animals and bacteria has led to a lot of interesting research in this area. It requires the design of an sgRNA (single guide RNA), which is a challenging process. To solve this problem, CRISPR-ERA was developed.

So what is CRISPR-ERA?
CRISPR-ERA is a new tool, available at http://crisprera.stanford.edu, developed by Honglei Liu et al. The name is an acronym for Clustered Regularly Interspaced Short Palindromic Repeat-mediated Editing, Repression, and Activation.

What does CRISPR-ERA do?
According to the authors of CRISPR-ERA, “The major goal of our designer tool is to address the discrepancy for designing sgRNAs that allow efficient and highly specific repression or activation of genes and for generating genome-wide sgRNA libraries for genetic screening in different organisms.” (Bioinformatics, 31(22), 2015, 3676–3678, doi: 10.1093/bioinformatics/btv423)

How does CRISPR-ERA work?
CRISPR-ERA looks up all targetable sites for each target gene, searching for patterns of N20NGG (N = any nucleotide). It then calculates an E-score and an S-score.
1. E-score is the efficacy score, based on sequence features such as GC content (%GC), presence of polythymidine and location information.

2. S-score is the specificity score, based on the genome-wide off-target binding sites. For each sgRNA design, Bowtie is used to find genome-wide sequences that contain an adjacent NRG (R = A or G) protospacer adjacent motif (PAM) site and have zero, one, two, or three mismatches to the sgRNA; these are regarded as off-target binding sites. The penalty score for an NAG off-target is smaller than for an NGG off-target. The sgRNAs are finally ranked by the sum of E-score and S-score.
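The site search and the GC-content feature described above can be sketched in a few lines. This is an illustration only, assuming a plain forward-strand scan; it is not CRISPR-ERA's implementation, it does not compute the E-score or S-score, and the function names are hypothetical.

```python
import re


def find_target_sites(sequence):
    """Return (start, protospacer, pam) for every N20NGG match on the given strand."""
    seq = sequence.upper()
    sites = []
    # A lookahead is used so that overlapping candidate sites are all reported.
    for match in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", seq):
        sites.append((match.start(), match.group(1), match.group(2)))
    return sites


def gc_percent(protospacer):
    """GC content of a protospacer, as a percentage."""
    gc = sum(1 for base in protospacer if base in "GC")
    return 100.0 * gc / len(protospacer)


if __name__ == "__main__":
    example = "A" * 20 + "TGG"  # toy sequence with exactly one candidate site
    for start, protospacer, pam in find_target_sites(example):
        print(start, protospacer, pam, gc_percent(protospacer))
```

A fuller search would also scan the reverse complement and then pass each candidate to an aligner for off-target counting, as the S-score description indicates.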
The result is then presented according to the E and S scores.

References & Further Reading
http://gizmodo.com/everything-you-need-to-know-about-crispr-the-new-tool-1702114381
Bioinformatics (2015) 31 (22): 3676-3678. doi:10.1093/bioinformatics/btv423
BIOINFORMATICS NEWS
Structural Identification of Macromolecules in solution with DARA web server
Muniba Faiza
Image Credit: Google Images
“DARA is a webserver which initially computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.”
To study macromolecules in homogeneous solution, a technique known as SAXS (Small Angle X-ray Scattering) is used, in which the scattering patterns obtained are used to model the structure of macromolecules such as proteins, nucleic acids and protein:nucleic acid complexes. In this experiment, a monochromatic X-ray beam illuminates the homogeneous solution, which produces a scattering pattern. From this pattern an ab initio particle shape is generated. This model is compared with the available theoretical data. Comparing the experimental scattering patterns with known scattering data is useful for structure determination. If the experimental data match one or more known scattering patterns, this may provide detailed information about the tertiary and quaternary structure.

DARA is a web server which initially “computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.” The server is very fast: it compares more than 150,000 profiles within a few seconds. It covers almost all the models available in PDB.

How DARA works?
DARA implements a new search algorithm consisting of principal component analysis and k-d trees for rapid identification of the scattering neighbours, including nucleic acids and complexes.

SAXS data: For each entry in PDB, all biological assemblies are retrieved; for NMR entries, only the first model is considered. The data are represented in the form of curves. The theoretical scattering curves are computed with the software CRYSOL 2.8, which is sufficient to cover models with a maximum intraparticle distance Dmax of up to 800 Å. For each model, CRYSOL calculates its Dmax, radius of gyration (Rg), molecular weight (MW) and excluded volume of the hydrated particle (V). For proteins, secondary structure content was computed as the percentage of alpha helices and beta sheets.

DARA computes various parameters and returns its output almost instantaneously. It reports about 100 neighbours of the query macromolecule, ranked so that those with the best-fitting curves come first. Fig 1 shows the best structures obtained by calculation and comparison with the various parameters considered. The result can also be downloaded.
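To give a feel for why a k-d tree makes the neighbour search fast, here is a minimal nearest-neighbour sketch in plain Python. It is not DARA's code: it assumes each scattering profile has already been reduced to a short feature vector (in DARA this reduction is done by principal component analysis, which is not reproduced here), and the labels and coordinates are made up for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, label) of the stored vector closest to the query."""
    if node is None:
        return best
    vector, label = node["point"]
    distance = squared_distance(vector, query)
    if best is None or distance < best[0]:
        best = (distance, label)
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best hit so far.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


if __name__ == "__main__":
    # Hypothetical reduced profiles: (feature vector, structure label).
    profiles = [((0.0, 0.0), "a"), ((1.0, 1.0), "b"), ((5.0, 5.0), "c"), ((2.0, 0.5), "d")]
    tree = build_kdtree(profiles)
    print(nearest(tree, (0.9, 1.2)))
```

The pruning step is what avoids comparing the query against every stored profile, which matters when the library holds more than a hundred thousand curves.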
DARA offers a rapid and easy way to analyze and identify macromolecules in solution, which is otherwise a difficult process. It can be accessed at http://www.embl-hamburg.de/biosaxs/dara.html

Reference: DARA: a web server for rapid search of structural neighbours using solution small angle X-ray scattering data. Alexey G. Kikhney, Alejandro Panjkovich, Anna V. Sokolova and Dmitri I. Svergun

Fig 1 Top three nearest neighbors for experimental SAXS data collected from glucose isomerase in a phosphate buffer.
SYSTEMS BIOLOGY
Introduction to Mathematical Modelling (Last Part)
Fozail Ahmad
Image Credit: Google Images
“Parameters for any equation in a model describe certain biochemical features of the components involved in reactions or pathways under study.”
In the previous section, mathematical modeling was exemplified by a metabolic process and its biochemical regulation. It can also be applied to signalling pathways and genetic regulatory processes. In every cellular phase, one observes the state of a cell changing under the effect of environmental factors. It is quite difficult to maintain cellular functions and reach a steady state. Thus, one needs to fix a range of parameters for all molecular reactions when undertaking mathematical modeling.

Identification of Model parameters:
Parameters for any equation in a model describe certain biochemical features of the components involved in the reactions or pathways under study. For example, when modeling the dynamics of a metabolic network, the mathematical equations inferred from the processes must contain parameters that represent the kinetic features of the metabolic enzymes involved, such as the number of reactions an enzyme can perform within a given period of time (i.e., the rate constant). We must know these kinetic parameters before setting up well-defined systems of differential equations. Ideally, therefore, the kinetic parameters for all the relevant reaction components are determined experimentally. In practice, however, a number of kinetic parameters, even for well-investigated biological components (enzymes, proteins and hormones), are still not known, primarily because reliable experimental data are lacking. Very often the kinetic parameters are measured only in the wet lab (i.e. in vitro), and in practice enzymes may behave differently inside a cell. This creates a hurdle, which is overcome by measuring the overall dynamics of the system being studied. Computational procedures have made this easier by providing estimation techniques that optimize parameter values by trying multiple parameter sets until they fit, or are otherwise optimized for, the available experimental dataset. This method is critically dependent on the quality of the dataset being used, and predictions made from unreliable data will lead to unreliable parameters and to a limited model of no use. The development of even simple network models for biological processes lags far behind, mainly because of the unavailability of high quality experimental data, which is still a major focus in the field of systems biology.

It is important to mention that some biological processes cannot be described using such simple models, which are based only on the concentrations of molecules and ignore the existence and importance of changing components. Where signal transmission is concerned, molecular movements have a significant impact on cellular mechanisms. Because molecules are closely packed in a cell, their thermal motion and random movement under environmental conditions may initiate a signal that propagates across the cell and stops only when it reaches its target. To account for such random effects, a random (i.e., stochastic) component must be incorporated into the equations of the model. Rare signaling molecules whose effect is small can be neglected, whereas molecules that exist in only a few copies in the cell but have an important effect must not be neglected and should be integrated into the mathematical equations.

The next issue, after optimization of the parameters, is the different time and distance scales of the components integrated into a pathway. For example, metabolism occurs within seconds or minutes, whereas genetic regulation takes longer (hours or even days) to exert its effect or to express a particular gene induced by metabolic processes at a greater distance. Signals (enzymes, proteins, hormones) may have to travel long distances, across the cell membrane and via the circulatory system of body fluids between tissues. To handle these different length and time scales, we can use a multi-scale model to avoid the complexity of the system.

Finally, it is important to remember that a developed model is only as good as the assumptions upon which it is based.
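The idea of trying parameter sets until the model fits the data can be illustrated with a very small example. The sketch below is an assumption-laden toy, not a method from the article: it fits a single rate constant of a first-order decay model to a time course by scanning candidate values and keeping the one with the smallest sum of squared errors. All function names and numbers are hypothetical.

```python
import math


def simulate_decay(k, c0, times):
    """Concentration over time for first-order decay: c(t) = c0 * exp(-k * t)."""
    return [c0 * math.exp(-k * t) for t in times]


def sum_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def estimate_rate_constant(times, observed, c0, candidates):
    """Return (best k, error) over the candidate rate constants."""
    best_k, best_error = None, float("inf")
    for k in candidates:
        error = sum_squared_error(simulate_decay(k, c0, times), observed)
        if error < best_error:
            best_k, best_error = k, error
    return best_k, best_error


if __name__ == "__main__":
    times = [0, 1, 2, 3, 4]
    # Synthetic "measurements" generated with k = 0.5 (illustrative only).
    observed = simulate_decay(0.5, 10.0, times)
    candidates = [i / 100 for i in range(1, 201)]
    print(estimate_rate_constant(times, observed, 10.0, candidates))
```

Real estimation problems involve many parameters and noisy data, so an exhaustive scan is usually replaced by an optimisation routine; the dependence on data quality noted above applies in either case.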
SOFTWARE
IBS: Modifying the organization of biological sequences diagrammatically
Muniba Faiza
Image Credit: Google Images
“ILLUSTRATOR Of BIOLOGICAL SEQUENCE (IBS) which is used for representing the organization of protein or nucleotide sequences in an easy, efficient and precise manner.”
Many a time, we need to visualize and summarize the existing information about biological sequences such as proteins or DNA. For this purpose, a new software package called ILLUSTRATOR Of BIOLOGICAL SEQUENCE (IBS) has been introduced, which represents the organization of protein or nucleotide sequences in an easy, efficient and precise manner. It visualizes various functional elements. IBS provides different features such as diagramming of domains and motifs, rescaling, coloring and many more. The standalone packages of IBS are implemented in Java and support three major operating systems: Windows, Linux and Mac OS.

Key Features:
Annotation of both protein and nucleotide sequences is supported through various drawing elements.
Better color visualization.
An ‘export module’ with which the final artwork can be exported as a publication-quality figure.
A user-friendly interface.
Various built-in textures that allow the black-and-white diagrams to be colored as per requirements.
Easy retrieval of the UniProt annotations.

IBS provides separate modes for proteins and DNA, so protein and DNA sequences can each be represented in their own mode. IBS may prove very useful in many areas of biological research; for example, with the help of IBS one can easily diagram the translocations that occur in cancer alongside a parallel view of the wild type arrangements existing in the sequence (as shown in Fig. 1).
“IBS provides an assistance in drawing publication quality diagrams of both protein and nucleotide sequences.”

Fig. 1 The main interface of IBS. (A) The standalone software showing the domain organization of E3 SUMO-protein ligase RanBP2 (Flotho and Werner, 2012). (B) The online service presenting the organization of bromodomain proteins and translocations in cancer (Muller et al., 2011).

Reference: IBS: an illustrator for the presentation and visualization of biological sequences. Wenzhong Liu, Yubin Xie, Jiyong Ma, Xiaotong Luo, Peng Nie, Zhixiang Zuo, Urs Lahrmann, Qi Zhao, Yueyuan Zheng, Yong Zhao, Yu Xue and Jian Ren
SEQUENCE ANALYSIS
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads
Prashant Pant
Image Credit: Google Images
“Chimeras are usually formed during polymerase chain reaction (PCRs) but in some rare cases they are for real. Therefore, it becomes relevant to adopt methods which can clean the sequence datasets of Chimeras.”
A TYPICAL CHIMERIC SEQUENCE OBTAINED FROM PINTAIL VERSION 1.0

Detecting chimeric (or recombinant) sequences in a sequence dataset is an important part of sequence analysis, especially for the reconstruction of deep phylogenies and for sequence similarity analyses. This article focuses on methods of chimera detection in high quality 16S rRNA sequences from Sanger sequencing with good read length (>750 bp). At such lengths, the sequences become potential candidates for chimera formation. With culture-independent approaches to the analysis of microbial diversity picking up fast thanks to high throughput sequencing methods, the number of chimeric sequences being published in the databases is also increasing exponentially. This is the era of metagenomics, or simply put community DNA analysis, in which DNA from thousands of species is pooled and then analysed. This further increases the chances of chimera formation. Chimeras are usually formed during the polymerase chain reaction (PCR), but in some rare cases they are real. It therefore becomes relevant to adopt methods which can clean sequence datasets of chimeras.

Recently, a number of chimera detection programs for 16S rRNA gene sequences have been launched, namely Pintail, Mallard and Bellerophon. The first two are available at http://www.bioinformatics-toolkit.org and the last is available at http://greengenes.lbl.gov/cgi-bin/nph-index.cgi. Pintail and Mallard detect chimeras and anomalies in 16S rRNA genes based on the extent of pair-wise percentage similarity between the query and related sequences. In chimera analysis with Pintail 1.0, each query sequence that could be a putative recombinant is compared on a one (query)-on-one (subject) basis with a list of closely related sequences identified by BLAST searches. Because Pintail is a one-on-one query-subject comparison, it is highly stringent. This is not the case with Mallard. In Mallard, one sequence from within a dataset of query sequences is randomly chosen as the subject, while the rest remain as queries. A many (query)-on-one (subject) comparison follows, which is easier and completes in less time than Pintail. Note that Mallard is of limited use if the query sequences are too diverse or truly novel in the first place.

Another program for detecting chimeras in 16S rRNA genes, Bellerophon ver 3.0 from Greengenes, is more dedicated to 16S rRNA sequences. Here, the sequences must be submitted as a NAST (Nearest Alignment Space Termination Tool) formatted file. The NAST alignment server at Greengenes holds more than one million 16S rRNA sequence records. Upon submission of the NAST formatted file, the server launches a local BLAST search for each query sequence against the 16S rRNA gene sequence library on the server. It checks for potential chimeras in each query-subject alignment, one-on-one. The outcome of the entire process is a couple of Excel sheets emailed to the user with the query sequences, their best matches, and BLAST score values. The user can set a BLAST score threshold, below which the software automatically excludes sequences from chimera detection. Finally, it reports whether or not a potential break-point was found (essentially in Yes or No format). It is user-friendly and particularly good for large datasets with a high amount of sequence diversity. The only demerit of the software is that a relatively novel sequence in the query batch receives a low score, being highly unrelated to the existing records, and thus risks being omitted. Hence, one has to be really careful while using these programs, as sequence diversity may be lost, especially if the data come from an extreme site (with more novel sequences) or from an NGS project with long reads and good coverage, as in the case of the PacBio machine. It is worth mentioning here that while Pintail and Mallard can be applied to any given DNA sequence data, Bellerophon is a dedicated program for 16S rRNA.
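The idea of comparing pair-wise percentage similarity along a query-subject alignment can be illustrated with a sliding window. The sketch below is a deliberately simplified stand-in: it is not the statistic used by Pintail, Mallard or Bellerophon, the window size and the `drop` threshold are arbitrary assumptions, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def windowed_identity(query, subject, window=50, step=25):
    """Return [(window start, percent identity)] along two aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(query) - window + 1, step):
        identity = percent_identity(query[start:start + window],
                                    subject[start:start + window])
        profile.append((start, identity))
    return profile


def looks_chimeric(profile, drop=20.0):
    """Flag a query whose identity to the subject differs sharply between windows."""
    values = [value for _, value in profile]
    return bool(values) and max(values) - min(values) >= drop


if __name__ == "__main__":
    parent_a = "ACGT" * 25
    parent_b = "TGCA" * 25
    chimera = parent_a[:50] + parent_b[50:]  # artificial break-point at position 50
    profile = windowed_identity(chimera, parent_a, window=50, step=50)
    print(profile, looks_chimeric(profile))
```

A query that matches one subject well over its first half and poorly over its second half is the kind of pattern a break-point report points to; a genuinely novel sequence, by contrast, tends to be uniformly dissimilar, which is why low-scoring queries should not be discarded blindly.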
SYSTEMS BIOLOGY
Explore Tuberculosis: A Systems Biology Approach
Fozail Ahmad
Image Credit: Google Images
“The bacterial two component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly.”
Systems biology alone is not sufficient to fulfil the requirement of molecular understanding of an organism at every level. It seeks to bring together multiple approaches and fields to resolve a particular issue arising from ongoing work. In this article you will find a combinatorial approach of systems biology, i.e. molecular, cellular and network biology, to understand how tuberculosis develops and how the pathogen succeeds in fighting the host immune system. A well developed mathematical model of the PhoP-PhoR two component system is also presented and explained to demonstrate the mode of molecular regulation by the pathogen.

The bacterial two component system (TCS) is a signal transduction system that senses environmental stimuli and responds accordingly. The system consists of two regulatory proteins, one of which functions as a histidine kinase (HK) and the other as a response regulator (RR) in the course of the signal cascade. Mycobacterium tuberculosis has eleven two component systems controlling the expression of genes that are critically involved in virulence, pathogenicity and survival. Studies have demonstrated that PhoPR-TCS is one of the eleven TCSs specifically involved in the virulence of the pathogen. PhoPR-TCS positively regulates many genes, including those for the biosynthesis of lipids such as sulphatides (SL), diacyltrehalose (DAT) and polyacyltrehalose (PAT). These lipid components contribute to the virulence of M. tuberculosis. Studies have corroborated that pks2 is responsible for the biosynthesis of SL, and msl3 for that of DAT and PAT. The expression of these lipid-synthesis genes is regulated by PhoP in association with the autokinase activity of PhoR. In the case of the mycobacterial PhoPR TCS, Mg2+ ions have not been substantially proved to be a stimulating factor for PhoR. The simulation of the model was carried out in MATLAB using the RK-4 (fourth-order Runge-Kutta) method for differential equations. As a result, the behavior of the TCS was found to be robust at all concentrations of Mg2+ ions. The finding can be applied during the development of drugs against tuberculosis, to identify which gene/protein is most sensitive to its stimuli.
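For readers unfamiliar with the fourth-order Runge-Kutta method mentioned above, here is a minimal sketch in Python. It is a generic integrator applied to a hypothetical one-variable synthesis/degradation equation; it is not the PhoP-PhoR model, and the rate constants `k_syn` and `k_deg` are made-up values for illustration.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state list y by one step of size h using classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with a fixed number of RK4 steps."""
    h = (t1 - t0) / steps
    t, y = t0, list(y0)
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def toy_protein_model(t, y, k_syn=2.0, k_deg=0.5):
    """Hypothetical model: constant synthesis and first-order degradation."""
    return [k_syn - k_deg * y[0]]


if __name__ == "__main__":
    # The level approaches the steady state k_syn / k_deg = 4.0 over time.
    print(integrate(toy_protein_model, [0.0], 0.0, 20.0, 2000))
    # Accuracy check against the exact solution of dy/dt = -y.
    print(integrate(lambda t, y: [-y[0]], [1.0], 0.0, 1.0, 100), math.exp(-1))
```

A model with several interacting proteins only changes the function passed to `integrate`, which then returns one derivative per state variable.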
Fig: General presentation of model, depicting feedback mechanism of system
The regulation of the TCS is affected by Mg2+ ions to every possible extent, as shown by fluctuations in the levels of the PhoP and PhoR proteins. The ions have both positive and negative effects on the TCS. The results showed that important genes remain activated even after the ions are removed from the surrounding medium. Targeting ion influx and efflux would therefore be of no use in developing drugs against the pathogen. The model can be tested further with more simulations at varying ion concentrations. Since the TCS regulates genes that are directly involved in the pathogenicity and survival of Mycobacterium tuberculosis, understanding the nature and behaviour of the individual proteins will provide insight into finding novel drug targets against tuberculosis. The simulation in this work represented the mechanism of gene regulation and its sensitivity towards the stimulus, and showed how to proceed when targeting a molecule or protein in any other two-component system of the pathogen.
Reference source: Fozail Ahmad & Ravins Dohare*, Assessing Effect of Mg ion on PhoP-PhoR two component systems of Mycobacterium tuberculosis through Development of Mathematical Model, Int. Journal of Science and Research, (4) 7, 2285-2289, Paper ID: SUB 156569
CLOUD COMPUTING
Cl-Dash: speeding up cloud computing in bioinformatics
Muniba Faiza
Image Credit: Google Images
“Cl-dash is a tool which facilitates research of novel bioinformatics data using Hadoop, a software framework that stores huge amounts of data and provides very easy access to that data in relatively little time.”
After a lot of work in the field of bioinformatics, the genomes of many living organisms have been sequenced, and a great deal of information has been generated at the RNA and protein levels. This has given rise to huge amounts of biological data whose storage is now an issue, because such enormous data sets cannot be stored on a personal computer or a local server. For this purpose cloud computing, the practice of storing, managing and processing data on remote servers hosted on the internet, has been introduced in bioinformatics, though the origin of cloud computing is not very clear.
Cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, a software framework that stores huge amounts of data and provides very easy access to that data in relatively little time. This tool has been developed by Paul Hodor, Amandeep Chawla, Andrew Clark and Lauren Neal from Booz Allen Hamilton, USA. Cl-dash is a starter kit that configures and deploys new Hadoop clusters in a few minutes. The clusters are provisioned on AWS (Amazon Web Services). According to a paper published in Bioinformatics (Nov, 2015), cl-dash is based on a distributed file system and the MapReduce programming pattern. Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity computers.
With the help of cl-dash, a user can create clusters (groups of nodes that store huge amounts of data) as an ‘admin’ through a set of command line tools whose names begin with ‘cl-’ (hence the name ‘cl dash’). A YAML configuration file (config.yml) is required to define a new cluster, which can then be created in minutes. Once the Hadoop cluster is formed, the user can easily access the data. Such tools are needed because biological data keeps growing, and with it the demand for large data storage. Cl-dash provides a good pathway for managing such huge data sets.
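The MapReduce pattern mentioned above can be illustrated with a small single-machine sketch in Python. This is not Hadoop code and does not use any cl-dash command; the function names and the k-mer counting task are assumptions chosen only to show the three stages (map, shuffle, reduce) that Hadoop distributes across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], 2))
```

Because each read is mapped independently and each key is reduced independently, both stages can be spread over many machines, which is what makes the pattern suitable for very large sequence collections.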
NOTE: An exhaustive list of references for this article is available with the author and is available on personal request, for more details write to muniba@bioinformaticsreview.com.
TOOLS
Installing Gromacs on Ubuntu for MD Simulation
Tariq Abdullah
Image Credit: Google Images
“For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies.”
In bioinformatics, GROMACS is one of the most popular molecular dynamics simulation packages, with a load of features built in. Installing GROMACS version 5.x.x+ can be a tedious and cumbersome process on Ubuntu, especially if you are just starting out. For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies. Also, the installation instructions for version 5+ available on the GROMACS website do not seem to work at the first attempt.
In this quick tutorial, I will teach you how to install GROMACS on Ubuntu 14.04 LTS. It is expected to work on any version of Ubuntu. Post in the comments if you face any problem. I will also explain the meanings of the different commands alongside.
To install GROMACS 5+, log into your Ubuntu system and open a terminal by pressing Ctrl+Alt+T together. You need a good internet connection, as we will have to download various dependencies during the installation process. To install GROMACS, we need the following software installed on our system:
1. A C & C++ compiler, which comes built in with Ubuntu.
2. CMake: a Linux tool used to build binaries.
3. BuildEssential: a reference for all the packages needed to compile a package.
4. FFTW library: a library used by GROMACS to compute discrete Fourier transforms.
5. Regression Tests package.
Getting Started
If you have freshly installed Ubuntu, don’t forget to update the repository information and software packages on your system. Press Ctrl+Alt+T and a terminal will open up. In the terminal, type:
    sudo apt-get update
    sudo apt-get upgrade
Installation
The first step in installing GROMACS is to get cmake. In the terminal, type:
    sudo apt-get install cmake
If asked “After this operation, 16.5 MB of additional disk space will be used. Do you want to continue?”, press y and then press Enter. When the download and installation finish, you can check the version of cmake with the following command:
    cmake --version
Next we need to install build-essential with this command:
    sudo apt-get install build-essential
Before we go any further, it is good to know the path of our current working directory. In the terminal, type:
    pwd
Note down the path it shows; it is very important and will be used during the real GROMACS installation.
Now that we have cmake in place and we know the working directory, it is time to download the Regression Tests package. It is possible to download this package automatically during installation, but most of the time it throws an error stating that the location of the file has changed, so let us do it the hard way to avoid any problem during installation. Copy and paste the following commands into your terminal (right click to paste, or Ctrl+Shift+V). They download the file and save it in your Downloads folder.
    cd Downloads/
    wget http://gerrit.gromacs.org/download/regressiontests-5.1.1.tar.gz
We now have the Regression Tests package in our Downloads folder as a compressed tar.gz archive. Let us extract it with:
    tar xvzf regressiontests-5.1.1.tar.gz
Now we need the Fourier transform library on our system. You can download it from fftw.org or install it from the repository with the following command:
    sudo apt-get install libfftw3-dev
Okay, let us now download GROMACS 5.1.1 with this command:
    wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.1.1.tar.gz
Alternatively, you can download the latest version from the GROMACS website.
Now extract the GROMACS archive:
    tar xvzf gromacs-5.1.1.tar.gz
Now move inside the GROMACS folder:
    cd gromacs-5.1.1/
Create a directory called “build” where we will keep our compiled binaries:
    mkdir build
Move inside the build directory:
    cd build
It’s time to make GROMACS. Replace “pwdpath” in the following command with the path of the working directory that you noted earlier:
    sudo cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_C_COMPILER=gcc -DREGRESSIONTEST_PATH=pwdpath/Downloads/regressiontests-5.1.1
If everything goes well, the message in your terminal will say “Generating Done. Build files written…”. If not, make sure you have replaced the pwd path in the command with the path of your home directory. If you have forgotten it, just open another terminal and type pwd.
Now let’s first check and then make the real thing:
    make check
    sudo make install
This may take some time depending on your configuration. After completion, execute:
    source /usr/local/gromacs/bin/GMXRC
After the successful installation, you may check the version of your GROMACS with the following command to make sure the installation finished as expected:
    gmx pdb2gmx --version
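The two places where this installation most often goes wrong are a missing build tool and a mistyped “pwdpath” in the cmake command. The following Python sketch is one possible way to guard against both; the list of executable names and the helper functions are assumptions made for illustration, and the cmake options are simply the ones used in the tutorial's configure step.

```python
import shutil
from pathlib import Path, PurePosixPath

# Executable names assumed to correspond to the prerequisites listed above.
REQUIRED_TOOLS = ("gcc", "g++", "cmake", "make", "wget", "tar")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def cmake_command(pwd_path, version="5.1.1"):
    """Build the configure command, with pwd_path replacing "pwdpath"."""
    tests = PurePosixPath(pwd_path) / "Downloads" / f"regressiontests-{version}"
    return [
        "sudo", "cmake", "..",
        "-DGMX_BUILD_OWN_FFTW=OFF",
        "-DREGRESSIONTEST_DOWNLOAD=OFF",
        "-DCMAKE_C_COMPILER=gcc",
        f"-DREGRESSIONTEST_PATH={tests}",
    ]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    # Path.home() plays the role of the path printed by pwd in a fresh terminal.
    print(" ".join(cmake_command(Path.home())))
```

The printed command can be copied into the terminal from inside the build directory, which avoids editing the long option string by hand.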
GENOMICS
GenomeD3 plot: Easy visualization of genomes
Muniba Faiza
Image Credit: Stock Images
“GenomeD3 Plot is a newly created visualization library written in JavaScript. It uses D3, i.e., the Data-Driven Documents library, which is used to produce dynamic, interactive data visualizations in web browsers.”
Just as sequencing genomes is important, it is equally important to visualize them. Some tools exist to visualize genomes, but they are static and standalone, and quite complex to install and use. Newer, more interactive tools are required to ease genome visualization using various new features. GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3, i.e., the Data-Driven Documents library, which is used to produce dynamic, interactive data visualizations in web browsers. GenomeD3Plot is very user-friendly: it allows the user to interact with the data, alter the view dynamically, and easily resize or reposition the visualization in the browser.
The goal of Matthew R. Laird was to create a library with minimal external dependencies that could be integrated into existing web applications just as a developer might include an image or table. GenomeD3Plot uses a JSON configuration, a standardized and well-supported data format that reduces the complexity of use and provides better visualization. The image is created in SVG format and can easily be exported to PNG format as required.
Fig. 1: GenomeD3Plot circular and linear visualization of an example genome with annotation data
With GenomeD3Plot, the genome can be viewed in different tracks, for example to view a specific base pair or a series of base pairs, to visualize GC content, and more. GenomeD3Plot provides a rich API (application programming interface, which specifies how software components should interact) to manipulate the visualization dynamically. A linear and a circular plot can also be tied together so that manipulating one causes a mirrored alteration in the other, such as zooming or changing the visible region of the genome. The view can be recentred on a specific region to bring it into focus. Many other features have been introduced in GenomeD3Plot for easy visualization and interpretation of genomes.
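A GC content track of the kind mentioned above has to be computed before it can be drawn. The Python sketch below shows one common way of doing this with a sliding window and writes the result as JSON. The record layout (start, end, gc) is a hypothetical example and should not be read as GenomeD3Plot's actual track format, which is defined by the library's own documentation.

```python
import json


def gc_content_windows(sequence, window, step):
    """Return one {"start", "end", "gc"} record per window (1-based, inclusive)."""
    seq = sequence.upper()
    records = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        records.append({"start": start + 1,
                        "end": start + window,
                        "gc": round(gc, 3)})
    return records


if __name__ == "__main__":
    # Toy sequence; a real genome would be read from a FASTA file.
    track = gc_content_windows("GGCCAATTGCGCATAT", window=4, step=4)
    print(json.dumps(track, indent=2))
```

A smaller step gives a smoother curve at the cost of more points to plot, so the window and step are usually chosen to match the size of the genome and the width of the plot.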
Note: An exhaustive list of references for this article is available with the author and is available on personal request, for more details write to muniba@bioinformaticsreview.com.