BIOINFORMATICS REVIEW - NOVEMBER 2015 ISSUE


NOVEMBER 2015 VOL 1 ISSUE 2

“A cell is regarded as the true biological atom.” - George Henry Lewes

Explained: CRISPR-ERA and CRISPR/Cas9 system

How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads




Contents

November 2015

Editorial

Systems Biology
Cancer: From the Eyes of Mathematical and Systems Biology
Introduction to Mathematical Modelling Part-3
Introduction to Mathematical Modelling (Last Part)
Explore Tuberculosis: A Systems Biology Approach

Software
IBS: Modifying the organization of biological sequences diagrammatically

Tools
Explained: CRISPR-ERA and CRISPR/Cas9 system
Installing Gromacs on Ubuntu for MD Simulation
Cl-Dash- Speeding up Cloud Computing in Bioinformatics

Sequence Analysis
How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

Genomics
GenomeD3 plot : Easy visualization of genomes

News
Structural Identification of Macromolecules in Solution with DARA Webserver


CHIEF EDITOR
Dr. Prashant Pant

EDITORIAL BOARD

EXECUTIVE EDITOR
FOZAIL AHMAD

FOUNDING EDITOR
MUNIBA FAIZA

SECTION EDITORS
ALTAF ABDUL KALAM
MANISH KUMAR MISHRA
SANJAY KUMAR
PRAKASH JHA
NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust. Published in India


EDITORIAL

With speculations about the future in mind, moving ahead slowly and steadily is not only an option but also wisdom. BiR, in its second month, moves ahead with a similar philosophy. This month’s highlight is BiR’s very first public showcasing and representation to the scientific community, at an international conference on a contemporary and newly emerging field, the importance of soil microbes as drivers of various processes, to be held in Prague, Czech Republic (EU). This has been a research area of immense interest to me, and I would like to share a few things about it.

Soil microbial diversity was long seen as less worthy of study than other forms of life, until very recently, when it was discovered that more than 95% of the microbial diversity in any environmental sample is unknown and uncultured, and has huge biotechnological, medical and agronomical potential. This kick-started a new branch of genomics and bioinformatics, now popularly known as metagenomics, which deals with community genomes from environmental samples. Metagenomics makes use of the tools and techniques of genomics, along with computational biology, to deal with the large data derived from multiple genomes using next generation sequencing (NGS) technologies. It is one of the sources of Big Data coming from molecular biologists. Even today, the primary concern is to sequence DNA from environmental samples and correlate the metagenomic data with its probable functions, as opposed to conventional culture-based approaches.

It was for these reasons that we chose this international platform to introduce BiR to the world’s scientific community and to showcase BiR as an excellent vector for propagating scientific news and developments. It is with these slow and small efforts that we hope for a steady metamorphosis of BiR into a known standard for scientific reports and news.

Dr. Prashant Pant

Editor-in-Chief

Letters and responses: prashant@bioinformaticsreview.com


SYSTEMS BIOLOGY

Cancer: From the Eyes of Mathematical Biology

Sanjay Kumar

The month of November has just arrived with its generic glimpse of winter. We welcome this month with an evergreen and hot topic: cancer research. This time we intend to introduce you to an old research topic with a new vision.

Cancer, being an ailment with no fully reliable remedy, has been pursued as a career by a lot of researchers. A cell biologist says it is an uncontrolled proliferation (increase in number by division and growth) of cells; molecular biologists call it a mutant variety of some biomolecules forcing a cell to commit to such an uncontrolled cell division cycle. But how does a systems biologist see such a problem? Let us try to pursue it in a different way.

If proteins are not assigned a name based on their function or structure, scientists mark them according to their molecular weight, e.g. p53, p200, p19 etc. Scientists have shown an abnormally high expression of p53 protein in cancerous cells/tissues. p53 is actually the protein behind those other proteins which regulate the cell cycle and make the cell divide into two in the normal scenario. p53 also helps in the manufacture of its own inhibitor, the Mdm2 protein. A mutation in p53 that leads to the failure of abnormality recognition by p53 does not lead to an increase in p53 and, consequently, in Mdm2, p21 and other p53-regulated proteins. Thus, the division of abnormal cells continues indefinitely and causes cancer.

From a Mathematical Biology perspective, systems biologists form ordinary differential equations that look like mathematical formulae. These formulae are nothing other than representations of the chemical reactions, and their combinations, occurring inside a cell. In our previous blogs (by Fozail Ahmad), we have described how to combine chemical reactions in the shape of Ordinary Differential Equations (ODEs), and how we follow zero-order chemical kinetics (the reaction rate does not depend on the concentration of any participating chemical), first-order chemical kinetics (the reaction rate depends on the concentration of only one participating chemical) and second-order chemical kinetics (the reaction rate depends on the concentrations of two participating chemicals, or on the square of one) to form the equations. In addition, I would like to mention that some reactions occur with the help of biomolecular machines. These machines (enzymes) help the reactions to occur but are not consumed in them, and thus affect the reaction through a different form of kinetics, as described in the combined work of the German biochemist Leonor Michaelis and the Canadian physician Maud Menten in 1913.
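The rate laws mentioned above can be written down in a few lines. The following is only a minimal sketch: the function names, the rate constants and the substrate concentrations are illustrative assumptions and do not come from the cited work.

```python
def zero_order_rate(k):
    """Zero order: the rate is the constant k, whatever the concentrations."""
    return k


def first_order_rate(k, a):
    """First order: the rate is proportional to one concentration."""
    return k * a


def second_order_rate(k, a, b):
    """Second order: the rate is proportional to the product of two concentrations."""
    return k * a * b


def michaelis_menten_rate(vmax, km, s):
    """Enzyme-catalysed rate: rises with substrate s and saturates at vmax."""
    return vmax * s / (km + s)


if __name__ == "__main__":
    # Hypothetical substrate concentrations, chosen only to show saturation.
    for s in (0.5, 2.0, 20.0, 200.0):
        print(s, michaelis_menten_rate(vmax=10.0, km=2.0, s=s))
```

Note that when the substrate concentration equals km, the Michaelis-Menten rate is exactly half of vmax, which is how km is usually read off an experimental curve.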

So, in a normal cell, when p53 senses danger, it signals the cell by increasing p21, which combines with PCNA (Proliferating Cell Nuclear Antigen, a protein that helps in cell division), and this stops the cell division. This type of cell cycle has been shown in one of the accompanying diagrams, while the mutated case of p53, where it cannot sense the cellular damage and the cell thus divides normally, is also shown in one of the images.

We have also included a combined picture, which gives a reference for how the different stages of Mathematical Biology look. These figures specifically contrast cancer cells and normal cells.

Reference: Alam MJ, Kumar S, Singh V, Singh RKB (2015) Bifurcation in Cell Cycle Dynamics Regulated by p53. PLoS ONE 10(6): e0129620. doi:10.1371/journal.pone.0129620

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129620


SYSTEMS BIOLOGY

Introduction to Mathematical Modelling. (Part 3 of 3)

Fozail Ahmad

Derivation of Mathematical Equations for Understanding Systems Behaviour:

Depending upon the nature of the biological process, it is essential to understand different modeling approaches, as a number of methods have been used for different biological systems. Functionally, most cellular processes are dynamic and change with the environment, such as the signaling or regulation of specific genes when a cell is exposed to an extraordinary medium. In order to describe such time-dependent phenomena, it is necessary to choose mathematical equations that can capture these dynamic effects. In other biological systems, where cellular products/molecules do not change over time, i.e., the concentration remains the same, it is not necessary to describe the details of the underlying dynamics. For modeling the behavior of systems, suitable methods have been developed. Among them are two commonly used methods: modeling of metabolic processes, and modeling of signaling and regulatory pathways.

1. Modeling Metabolic Process

Metabolism is an essential process in all living beings. It provides energy and building blocks for survival, synthesizes larger molecules and degrades unnecessary/toxic substances in a cell. Understanding metabolic mechanisms has been a major research interest for decades, but the complete interplay of the underlying mechanisms has not yet been understood.

One of the key parameters in any metabolic study is the metabolic flux, that is, the utilization (conversion) of metabolites along metabolic pathways. Thus, it is important to understand and predict the metabolic flux for all patterns of metabolism, which indicates which biochemical routes are being utilized. Here the model is fitted to a hypothetical framework, or even to a known biochemical route, so as to identify any particular step in the production/degradation of a desirable molecule by cultured cells/bacteria that in fact limits the overall rate at which the process occurs (a metabolic bottleneck). The result of the study will direct the researcher on how to genetically modify the cell or bacteria to optimize the yield of the particular end product.


In the overall process, metabolic fluxes are of little concern, as they do not change, when biochemical processes operate in steady state and the entry of unnecessary molecules is totally blocked, leaving the process quasi-stationary, without external perturbation taking place.

For such a metabolic (quasi-stationary) process, we may consider the conversion of sugar to sugar phosphate. In this process, the enzyme hexokinase adds a phosphate group to glucose (C6H12O6), yielding the compound glucose-6-phosphate. This reaction should be balanced in terms of atoms and electrical charges. In chemical notation, the balanced reaction is written as

$$\mathrm{C_6H_{12}O_6 + ATP^{4-} \rightarrow C_6H_{11}O_6PO_3^{2-} + ADP^{3-} + H^+}$$

In this reaction, both sides of the equation are in stoichiometric balance. When investigating a more complex metabolic network, each individual chemical reaction is bound by stoichiometric balance constraints, such as mass, number of molecules, concentration and charges on reactants, that can be used to formulate mathematical equations. One should keep in mind that such reaction constraints are not independent of each other and should be solved in parallel to develop a reliable mathematical model. The validity of the models can be tested through wet lab techniques using detectable or radioactively labeled substances. Labeled atoms can be traced across a number of key metabolites, indicating the cellular flux distribution, which helps validate or disprove the metabolic network model.
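The atom and charge balance of the hexokinase reaction described above can be checked mechanically. The sketch below is only an illustration: the elemental formulas assume the fully ionized forms ATP4-, ADP3- and glucose-6-phosphate2-, and the dictionary layout and function names are our own choices.

```python
from collections import Counter

# Each species maps to (atom counts, net charge).
# Formulas assume the fully ionized forms of ATP, ADP and glucose-6-phosphate.
SPECIES = {
    "glucose": (Counter(C=6, H=12, O=6), 0),
    "ATP": (Counter(C=10, H=12, N=5, O=13, P=3), -4),
    "G6P": (Counter(C=6, H=11, O=9, P=1), -2),
    "ADP": (Counter(C=10, H=12, N=5, O=10, P=2), -3),
    "H+": (Counter(H=1), 1),
}


def side_totals(names):
    """Sum the atoms and the charges over one side of a reaction."""
    atoms = Counter()
    charge = 0
    for name in names:
        species_atoms, species_charge = SPECIES[name]
        atoms += species_atoms
        charge += species_charge
    return atoms, charge


def is_balanced(reactants, products):
    """A reaction is balanced when both atoms and charge match on both sides."""
    return side_totals(reactants) == side_totals(products)


if __name__ == "__main__":
    print(is_balanced(["glucose", "ATP"], ["G6P", "ADP", "H+"]))
```

Leaving out the proton on the product side makes the check fail, which shows why charges have to be tracked together with atoms.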


TOOLS

Explained: CRISPR-ERA and CRISPR/Cas9 system

Tariq Abdullah

The CRISPR/Cas9 system is a bacterial defence mechanism against bacteriophage infection. When a fragment of viral DNA (bacteriophage, in this case) is integrated into the bacterial genome, it produces RNA which is taken up by Cas9. Cas9 and the RNA together drift through the cell, and as soon as they encounter a sequence complementary to the RNA, they attach to it. Cas9 cuts the DNA there. As the viral DNA is cut, the virus is prevented from multiplying. Thus the bacterium defends itself by precisely snipping the viral DNA using the CRISPR/Cas9 system.

The recent implementation of the CRISPR/Cas9 system in human beings, animals and bacteria for gene editing has led to a lot of interesting research in this area. It requires the design of an sgRNA (single guide RNA), which is a challenging process. To solve this problem, CRISPR-ERA was developed.

So what is CRISPR-ERA?

CRISPR-ERA is a new tool, available at http://crisprera.stanford.edu, developed by Honglei Liu et al. The name is an acronym for Clustered Regularly Interspaced Short Palindromic Repeat-mediated Editing, Repression, and Activation.

What does CRISPR-ERA do?

According to the authors of CRISPR-ERA, “The major goal of our designer tool is to address the discrepancy for designing sgRNAs that allow efficient and highly specific repression or activation of genes and for generating genome-wide sgRNA libraries for genetic screening in different organisms.” (Bioinformatics, 31(22), 2015, 3676-3678, doi: 10.1093/bioinformatics/btv423)

How does CRISPR-ERA work?

CRISPR-ERA looks up all targetable sites for each target gene, matching the pattern N20NGG (N = any nucleotide). It then calculates an E-score and an S-score.

1. The E-score is the efficacy score, based on sequence features such as GC content (%GC), presence of polythymidine and location information.

2. The S-score is the specificity score, based on the genome-wide off-target binding sites. For each sgRNA design, genome-wide sequences that contain an adjacent NRG (R = A or G) protospacer adjacent motif (PAM) site and zero, one, two, or three mismatches complementary to the sgRNA are computed using Bowtie, and are regarded as off-target binding sites. The penalty score for an NAG off-target is smaller than for an NGG off-target.

The sgRNAs are finally ranked by the sum of E-score and S-score.
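The site lookup and two of the sequence features described above can be sketched as follows. This is not the CRISPR-ERA implementation and it does not reproduce the E-score or S-score formulas; it only scans the given strand for N20NGG, reports the GC content of each protospacer and counts mismatches between two equal-length sequences. All function names are hypothetical.

```python
import re

# 20 unambiguous bases followed by an NGG PAM. The lookahead makes the match
# zero-width, so overlapping sites are all reported.
SITE_PATTERN = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_target_sites(sequence):
    """Return (start, protospacer, pam) for every N20NGG site on the given strand."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1), m.group(2))
            for m in SITE_PATTERN.finditer(sequence)]


def gc_percent(protospacer):
    """Percentage of G and C bases, one of the features behind the E-score."""
    gc = sum(1 for base in protospacer if base in "GC")
    return 100.0 * gc / len(protospacer)


def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


if __name__ == "__main__":
    # A made-up sequence used only to exercise the functions.
    demo = "ttacgtacgtacgtacgtacgttggaa"
    for start, protospacer, pam in find_target_sites(demo):
        print(start, protospacer, pam, gc_percent(protospacer))
```

A real design tool would also scan the reverse complement and search a whole genome for near matches (the article names Bowtie for that step); both are left out here to keep the sketch short.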

The results are then presented according to the E and S scores.

References & Further Reading

http://gizmodo.com/everything-you-need-to-know-about-crispr-the-new-tool-1702114381

Bioinformatics (2015) 31 (22): 3676-3678. doi:10.1093/bioinformatics/btv423


BIOINFORMATICS NEWS

Structural Identification of Macromolecules in solution with DARA web server

Muniba Faiza

To study macromolecules in homogeneous solution, a technique known as SAXS (Small Angle X-ray Scattering) is used, in which the obtained scattering patterns are used to model the structure of macromolecules such as proteins, nucleic acids and protein:nucleic acid complexes. In this experiment, a monochromatic X-ray beam illuminates the homogeneous solution, which produces a scattering pattern. From this pattern an ab initio particle shape is generated. This model is compared with the available theoretical data. Comparing the experimental scattering patterns with known scattering data is useful in determining the structure. If the experimental data match one or more scattering patterns, this may provide detailed information about the quaternary and tertiary structure.

DARA is a web server which initially “computes the scattering profiles from the available structures / models in PDB (Protein Data Bank) and compares these profiles with a given SAXS pattern.” This server is very fast: it compares more than 150,000 profiles within a few seconds. It covers almost all the models available in PDB. DARA provides good and enhanced results.

How DARA works?

DARA implements a new search algorithm consisting of principal component analysis and k-d trees for rapid identification of the scattering neighbours, including nucleic acids and complexes.

SAXS data: For each entry in PDB, all biological assemblies are retrieved; for NMR entries, only the first model is considered. The data are represented in the form of curves. The theoretical scattering curves are obtained with the software CRYSOL 2.8, which is sufficient to cover models with a maximum intraparticle distance Dmax of up to 800 Å. For each model, CRYSOL calculates its Dmax, radius of gyration (Rg), molecular weight (MW) and excluded volume of the hydrated particle (V). For proteins, secondary structure content is computed as the percentage of alpha helices and beta sheets.

DARA computes various parameters and gives an output which is instantaneous and enhanced. It reports almost 100 neighbours of the query macromolecule, ranked so that the best-fitting curves are preferred. Fig 1 shows the best structures obtained by calculation and comparison with the various parameters considered. The result can also be downloaded.
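To make two of the parameters mentioned above more concrete, here is a minimal geometric sketch of Rg and Dmax computed from a list of atomic coordinates. It is not what CRYSOL does: it assumes equally weighted atoms and ignores the hydration shell and scattering contrast, and the function names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)


def max_distance(coords):
    """Largest distance between any two points (a geometric Dmax)."""
    return max(math.dist(p, q) for p, q in combinations(coords, 2))


if __name__ == "__main__":
    # Made-up coordinates in angstroms, only to exercise the functions.
    atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    print(radius_of_gyration(atoms), max_distance(atoms))
```

The pairwise loop in max_distance grows with the square of the number of atoms, which is acceptable for a sketch but would be slow for large assemblies.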

DARA represents a quite rapid and easy way to analyze and identify macromolecules in solution, which is a difficult process. It can be accessed at http://www.embl-hamburg.de/biosaxs/dara.html

Fig 1 Top three nearest neighbors for experimental SAXS data collected from glucose isomerase in a phosphate buffer.

Reference: DARA: a web server for rapid search of structural neighbours using solution small angle X-ray scattering data. Alexey G. Kikhney, Alejandro Panjkovich, Anna V. Sokolova and Dmitri I. Svergun


SYSTEMS BIOLOGY

Introduction to Mathematical Modelling (Last Part)

Fozail Ahmad

In the previous section, mathematical modeling was exemplified by a metabolic process and its biochemical regulation. It can also be applied to signalling pathways and genetic regulatory processes. In every cellular phase, one observes the cell changing its mode under the effect of environmental factors. It is quite difficult for a cell to maintain cellular functions and reach a steady state. Thus, one needs to fix a range of parameters for all molecular reactions when undertaking mathematical modeling.

Identification of Model parameters:

Parameters for any equation in a model describe certain biochemical features of the components involved in the reactions or pathways under study. For example, when modeling the dynamics of a metabolic network, the mathematical equations inferred from the processes must contain parameters that represent the kinetic features of the involved metabolic enzymes, such as the number of reactions an enzyme can perform within a given period of time (i.e., the rate constant). We must obtain these kinetic parameters prior to setting up well-defined systems of differential equations. Therefore, kinetic parameters for all the relevant reaction components can be experimentally determined. In practice, however, a number of kinetic parameters, even for well-investigated biological components (enzymes, proteins and hormones), are still not known, primarily because reliable experimental data are lacking. Very often the kinetic parameters are measured but no experimental validation has been performed in the wet lab (i.e. in vitro). In practice, it cannot be assumed that enzymes behave in vitro exactly as they do in a cell. This creates a hurdle, which is overcome by measuring the overall dynamics of the system being studied. Computational procedures have made this easier by providing appropriate estimation techniques that optimize parameter values by trying multiple different parameter sets against the data set until they fit, or are otherwise optimized for, the available experimental dataset. This method is critically dependent on the quality of the dataset being validated; predictions made from unreliable data will definitely lead to unreliable validated parameters and to a limited model of no use. The development of even a simple network model for any biological process is lagging badly behind, mainly due to the unavailability of high quality experimental data, which is still a major focus in the field of systems biology.

It is important to mention that a few biological processes cannot be described using such simple models that are based only on changing concentrations of molecules, ignoring the existence and importance of signal transmission components. Molecular movements have a significant impact on cellular mechanisms. Due to the close packing of the molecules in a cell, their thermal induction and random movement arising from environmental conditions may cause the initiation of a signal that propagates across the cell and does not stop until it reaches its target. In order to account for such random effects, a random (i.e., stochastic) component must be incorporated into the equations of the model. Rare signaling molecules whose observed effect is small can be neglected, whereas molecules that exist in only a few copies in the cell must not be neglected and should be integrated into the mathematical equations.

The next issue after the optimization of parameters is the different time and distance scales of the components integrated into a pathway. For example, metabolism occurs within seconds or minutes, whereas genetic regulation takes longer (say hours or even days) to exert its effect or to express a particular gene induced by metabolic processes from a greater distance. It may be that signals (enzymes, proteins, hormones) have to travel a longer distance across the cell membrane, via the circulatory system of body fluids, between tissues. To overcome these different length and time scales, we can use a multi-scale model to avoid the complexity of the system.

Finally, it is important to keep in mind that a developed model is only as good as the assumptions upon which it is based.
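The parameter estimation idea described in this article, trying many parameter values until the model fits the data, can be illustrated with a very small example. The sketch below assumes a first-order decay model x(t) = x0 * exp(-k t) and a plain grid search over k; the model, the measurements and all names are hypothetical and only stand in for the more capable optimizers used in practice.

```python
import math


def model(t, x0, k):
    """First-order decay: concentration at time t for rate constant k."""
    return x0 * math.exp(-k * t)


def sum_squared_error(k, times, observed, x0):
    """Misfit between the model with rate constant k and the measurements."""
    return sum((model(t, x0, k) - y) ** 2 for t, y in zip(times, observed))


def fit_rate_constant(times, observed, x0, k_min=0.0, k_max=2.0, steps=2001):
    """Try evenly spaced candidate k values and keep the one with the lowest misfit."""
    candidates = [k_min + i * (k_max - k_min) / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda k: sum_squared_error(k, times, observed, x0))


if __name__ == "__main__":
    # Hypothetical measurements, invented for this illustration only.
    times = [0.0, 1.0, 2.0, 4.0, 8.0]
    observed = [10.0, 7.5, 5.4, 3.1, 0.9]
    print(fit_rate_constant(times, observed, x0=10.0))
```

As the article stresses, the fitted value is only as trustworthy as the data: noisy or sparse measurements will give a k that fits the numbers but says little about the real system.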


SOFTWARE

IBS: Modifying the organization of biological sequences diagrammatically

Muniba Faiza

Many a time, we need to visualize and summarize the existing information on biological sequences such as protein or DNA. For this purpose, a new software package has been introduced, called Illustrator of Biological Sequences (IBS), which is used for representing the organization of protein or nucleotide sequences in an easy, efficient and precise manner. It visualizes various functional elements. Different features are provided in IBS, such as diagramming of domains and motifs, rescaling, coloring and many more. The standalone packages of IBS were implemented in JAVA and support three major operating systems: Windows, Linux and Mac OS.

Key Features:

- the annotation of both protein and nucleotide sequences is supported by the implementation of various drawing elements.
- better color visualization.
- an ‘export module’ is provided, with the help of which the final artwork can be exported as a publication-quality figure.
- a user-friendly interface.
- various built-in textures make it possible to color the black-and-white diagrams as per requirements.
- easy retrieval of the UniProt annotations.

IBS provides individual modes for proteins and DNA, so that protein or DNA sequences can each be represented in their own mode. IBS may prove to be very useful software in many areas of biological research. For example, with the help of IBS, one can easily diagram the translocations that occur in cancer alongside a parallel view of the wild-type arrangements existing in the sequence (as shown in Fig. 1).


“IBS provides an assistance in drawing publication quality diagrams of both protein and nucleotide sequences.”

Fig. 1 The main interface of IBS. (A) The standalone software showing the domain organization of E3 SUMO-protein ligase RanBP2 (Flotho and Werner, 2012). (B) The online service presenting the organization of bromodomain proteins and translocations in cancer (Muller et al., 2011).

Reference: IBS: an illustrator for the presentation and visualization of biological sequences. Wenzhong Liu, Yubin Xie, Jiyong Ma, Xiaotong Luo, Peng Nie, Zhixiang Zuo, Urs Lahrmann, Qi Zhao, Yueyuan Zheng, Yong Zhao, Yu Xue and Jian Ren


SEQUENCE ANALYSIS

How To: Detecting Chimera in 16S rRNA Sanger Sequencing Reads

Prashant Pant

Figure: A typical chimeric sequence obtained from Pintail version 1.0

Detecting chimeric (or recombinant) sequences in a sequence dataset is an important part of sequence analysis, especially for the reconstruction of deep phylogenies as well as for sequence similarity analyses. This article focuses on methods of chimera detection in high quality 16S rRNA sequences from Sanger sequencing with good read length (>750 bp). With such a large size, they become potential candidates for chimera formation. With culture-independent approaches for the analysis of microbial diversity picking up fast thanks to high throughput sequencing methods, the number of chimeric sequences being published in the databases is also increasing exponentially. This is the era of metagenomics, or, simply put, community DNA analysis, where DNA from thousands of species is pooled and then analysed. This further increases the chances of chimera formation. Chimeras are usually formed during polymerase chain reactions (PCRs), but in some rare cases they are real. Therefore, it becomes relevant to adopt methods which can clean the sequence datasets of chimeras.

Recently, a number of chimera detection programs for 16S rRNA gene sequences have been launched, namely Pintail, Mallard and Bellerophon. The first two are available at http://www.bioinformatics-toolkit.org and the last one is available at http://greengenes.lbl.gov/cgi-bin/nph-index.cgi.

Pintail and Mallard can detect chimeras and anomalies in 16S rRNA genes based on the extent of pair-wise percentage similarity between the query and related sequences. In chimera analysis by Pintail 1.0, the query sequences which could be putative recombinants are compared on a one (query)-on-one (subject) basis with a list of closely related sequences identified by BLAST searches. As Pintail is a one-on-one query-subject comparison, it is highly stringent. This is not the case with Mallard. In Mallard, one of the sequences from within a dataset of query sequences is randomly chosen as the subject, while the rest remain as queries. A many (query)-on-one (subject) comparison follows, which is easy and completes in less time than Pintail. It should be noted that Mallard is of limited use if the query sequences are too diverse or really novel in the first place.

Another program for detecting chimeras in 16S rRNA genes, Bellerophon ver 3.0 from Greengenes, is more dedicated to 16S rRNA sequences. Here, the sequences must be submitted as a NAST (Nearest Alignment Space Termination Tool) formatted file. The NAST alignment server at Greengenes has more than one million 16S rRNA sequence records. Upon submission of the NAST formatted file, the server launches a localized BLAST search for each query sequence against the 16S rRNA gene sequence library on its server. It checks for potential chimeras in the respective query-subject alignment, one-on-one. The outcome of the entire process is a couple of Excel sheets emailed to the user with the query sequences, their best matches, and BLAST score values. The user can set a BLAST score threshold below which the software automatically removes sequences from consideration for chimera detection. Finally, it reports whether a potential break-point was found or not (essentially in Yes or No format). It is user-friendly and particularly good for large datasets with a high amount of sequence diversity. The only demerit of the software is that a relatively novel sequence in the query batch receives a low score, being highly unrelated to the existing records, and thus stands at risk of being omitted.

Hence, one has to be really careful while using these programs, as there could be a loss of sequence diversity, especially if the data come from an extreme site (with more novel sequences) or from an NGS project with nice long reads and good coverage, as in the case of a PacBio machine. It is worth mentioning here that while Pintail and Mallard can be applied to any given DNA sequence data, Bellerophon is a dedicated program for 16S rRNA.
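The idea of comparing a query with a related subject by pair-wise percentage similarity can be sketched with a simple window scan. This is not the statistic used by Pintail, Mallard or Bellerophon; it only assumes that the two sequences are already aligned to the same length and shows how a chimera may appear as a sharp change in identity along the alignment. The function names, window sizes and sequences are hypothetical.

```python
def window_identity(query, subject, window=10, step=10):
    """Percent identity between two aligned, equal-length sequences, per window."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        s = subject[start:start + window]
        matches = sum(1 for a, b in zip(q, s) if a == b)
        profile.append((start, 100.0 * matches / window))
    return profile


def identity_drop(profile):
    """Difference between the best and the worst window; large values are suspicious."""
    values = [pct for _, pct in profile]
    return max(values) - min(values)


if __name__ == "__main__":
    # Toy alignment: the second half of the query comes from a different parent.
    subject = "ACGTACGTAC" * 4
    query = "ACGTACGTAC" * 2 + "TTGCATGCAA" * 2
    profile = window_identity(query, subject)
    print(profile, identity_drop(profile))
```

A uniformly low identity points to a novel sequence rather than a chimera, which is why the article warns against discarding sequences on a score threshold alone.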


SYSTEMS BIOLOGY

Explore Tuberculosis: A Systems Biology Approach Fozail Ahmad Image Credit: Google Images “ The bac terial two c omponent s ys tem (TCS) is a s ignal trans duc tion s ys tem that s ens es environmental s timuli and res pons es ac c ordingly.�

S

ystems biology is not sufficient to full fill the requirement of molecular understanding of any organism at any level. It seeks to contribute multiple approaches and fields to resolve a particular issue arisen from ongoing work. In this article you will find a combinatorial approach of systems biology i.e. molecular, cellular and network biology to understand how tuberculosis is developed and how pathogen succeeds in fighting with host immune systems. A well developed mathematical model, on PhoP-PhoR two component system, is also presented and explained to demonstrate the mode of molecular regulation by pathogen. The bacterial two component system (TCS) is a signal transduction system that senses environmental stimuli and responses accordingly.

This system consists of two regulatory proteins, one of which functions as a histidine kinase (HK) and the other as a response regulator (RR) in the course of the signalling cascade. Mycobacterium tuberculosis has eleven two component systems controlling the expression of genes that are critically involved in virulence, pathogenicity and survival. Studies have demonstrated that PhoPR-TCS is one of the eleven TCSs particularly involved in the virulence of the pathogen. PhoPR-TCS is a positive regulator of many genes, including those encoding enzymes for the biosynthesis of lipids such as sulphatides (SL), diacyltrehalose (DAT) and polyacyltrehalose (PAT). These lipid components contribute to the virulence of M. tuberculosis. Studies have corroborated that pks2 is responsible for the biosynthesis of SL, and msl3 for that of DAT and PAT. The expression of these lipid biosynthesis genes is regulated by PhoP in association with the autokinase activity of PhoR. In the case of the mycobacterial PhoPR TCS, Mg2+ ions have not been substantially proved to be a stimulating factor for PhoR. The model was simulated in MATLAB using the RK4 (fourth-order Runge-Kutta) method for differential equations. As a result, the behaviour of the TCS was found to be robust at all concentrations of Mg2+ ions. The finding can be applied during the development of drugs against tuberculosis, to determine which gene/protein is highly sensitive to its stimulus.
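The RK4 integration mentioned above can be sketched in a few lines of Python. This is only an illustration of the numerical method, not the published MATLAB model: the two equations, the variable names (phosphorylated PhoR and PhoP as fractions of a fixed total) and every rate constant below are hypothetical placeholders chosen so the example runs.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state y (a list of floats) by one RK4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate(f, y0, t_end, h):
    """Integrate dy/dt = f(t, y) from t = 0 to t_end with fixed step h."""
    t, y = 0.0, list(y0)
    for _ in range(round(t_end / h)):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def toy_tcs(signal, r_total=1.0, p_total=1.0,
            k_auto=1.0, k_transfer=2.0, k_dephos=0.5):
    """Hypothetical two-variable TCS: all parameters are illustrative only.

    y[0] = phosphorylated kinase (PhoR~P), y[1] = phosphorylated regulator (PhoP~P).
    """
    def rhs(t, y):
        rp, pp = y
        transfer = k_transfer * rp * (p_total - pp)           # phosphotransfer HK -> RR
        d_rp = k_auto * signal * (r_total - rp) - transfer    # autokinase minus transfer
        d_pp = transfer - k_dephos * pp                       # transfer minus dephosphorylation
        return [d_rp, d_pp]
    return rhs


if __name__ == "__main__":
    for s in (0.0, 0.1, 1.0):
        rp, pp = integrate(toy_tcs(s), [0.0, 0.0], t_end=50.0, h=0.01)
        print(f"signal={s:.1f}  PhoR~P={rp:.3f}  PhoP~P={pp:.3f}")
```

Running the script prints the late-time phosphorylation levels for three signal strengths, which is the kind of comparison across ion concentrations that the article describes. Swapping in a real model would only require replacing the `rhs` function.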



Fig: General presentation of the model, depicting the feedback mechanism of the system

The regulation of the TCS is affected by Mg2+ ions to every possible extent, as shown by fluctuations in the levels of the PhoP and PhoR proteins. The ions have both positive and negative effects on the TCS. The results showed that important genes remain activated even after the ions are removed from the surrounding medium. So, targeting ion influx and efflux would be of no use for developing a drug against the pathogen. The model can be tested further with more simulations at varying ion concentrations. Since the TCS regulates genes that are directly involved in the pathogenicity and survival of Mycobacterium tuberculosis, understanding the nature and behaviour of the individual proteins will provide insight into finding novel drug targets against tuberculosis. The simulation in this work represented the mechanism of gene regulation and its sensitivity towards the stimulus, and provided an understanding of how to proceed when targeting a molecule/protein of any other two component system of the pathogen.

Reference source: Fozail Ahmad & Ravins Dohare*, Assessing Effect of Mg ion on PhoP-PhoR two component systems of Mycobacterium tuberculosis through Development of Mathematical Model, Int. Journal of Science and Research, (4) 7, 2285-2289, Paper ID: SUB 156569



CLOUD COMPUTING

Cl-Dash: speeding up cloud computing in bioinformatics
Muniba Faiza
Image Credit: Google Images

“Cl-dash is a tool which facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides very easy access to that data in relatively little time.”

After a lot of work in the field of bioinformatics, the genomes of many living organisms have been sequenced and a lot of information has been generated at the RNA and protein level. This has given rise to huge amounts of biological data whose storage is now an issue, because such enormous data cannot be stored on a personal computer or on a local server. For this purpose cloud computing, the practice of managing and processing data using remote servers hosted on the internet, has been introduced in bioinformatics, though the origin of cloud computing is not very clear.

Cl-dash is a tool that facilitates research on novel bioinformatics data using Hadoop, software that stores huge amounts of data and provides very easy access to that data in relatively little time. The tool has been developed by Paul Hodor, Amandeep Chawla, Andrew Clark and Lauren Neal from Booz Allen Hamilton, USA. cl-dash is a starter kit that configures and deploys new Hadoop clusters in a few minutes. The clusters run on AWS (Amazon Web Services). According to a paper published in Bioinformatics (Nov, 2015), cl-dash is based on a distributed file system and the MapReduce programming pattern. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of ordinary computers (commodity hardware).

With the help of cl-dash, a user can create clusters (or nodes, which store huge amounts of data) as an ‘admin’ through a set of command line tools whose names begin with ‘cl-’ (hence the name ‘cl-dash’). A YAML configuration file (config.yml) is required to make a new cluster, which can then be created in minutes. Once the Hadoop cluster is formed, the user can easily access the data. Such tools are needed because biological data keeps increasing, and with it the demand for large data storage space. cl-dash provides a good pathway for managing such huge data.
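To make the MapReduce programming pattern mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or cl-dash and does not reflect their APIs; it only mimics the three phases (map, shuffle, reduce) on a k-mer counting task, which is an example chosen here for illustration. All function names are hypothetical.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map phase: emit a (k-mer, 1) pair for every window of length k in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_kmers(reads, k):
    """Run map, shuffle and reduce in sequence over a list of reads."""
    pairs = [pair for read in reads for pair in map_kmers(read, k)]
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], 2))
```

On a real cluster the map calls would run on different nodes and the shuffle would move data across the network; the point of the pattern is that each phase only needs local information, so the work can be split freely.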

NOTE: An exhaustive list of references for this article is available from the author on personal request; for more details write to muniba@bioinformaticsreview.com.



TOOLS

Installing Gromacs on Ubuntu for MD Simulation
Tariq Abdullah
Image Credit: Google Images

“For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies.”

In bioinformatics, GROMACS is one of the most popular molecular dynamics simulation packages, with a load of built-in features. Installing GROMACS version 5.x.x+ on Ubuntu can be a tedious and cumbersome process, especially if you are just starting out. For beginners, installing and getting GROMACS to work is more challenging due to unfamiliarity with Linux commands and GROMACS dependencies. Also, the installation instructions for version 5+ available on the GROMACS website do not seem to work at the first attempt.

In this quick tutorial, I will teach you how to install GROMACS on Ubuntu 14.04 LTS. It is expected to work on any version of Ubuntu. Post in comments if you face any problem. I will also explain the meaning of the different commands alongside.

To install GROMACS 5+, log into your Ubuntu system and open a terminal by pressing Ctrl+Alt+T together.

You need a good internet connection, as we will have to download various dependencies during the installation process. To install GROMACS, we need the following software installed on our system:

1. A C & C++ compiler, which comes built in with Ubuntu.
2. CMake – a Linux tool for building binaries.
3. BuildEssential – a reference for all the packages needed to compile a package.
4. FFTW library – a library used by GROMACS to compute discrete Fourier transforms.
5. RegressionTest package.

Getting Started

If you have freshly installed Ubuntu, don’t forget to update the repository information and software packages on your system. Press Ctrl+Alt+T and a terminal will open up. In the terminal, type:

    sudo apt-get update
    sudo apt-get upgrade

Installation

The first step in installing GROMACS is to get cmake. In the terminal, type:

    sudo apt-get install cmake

If asked “After this operation, 16.5 MB of additional disk space will be used. Do you want to continue?”, press y and then press Enter. When the download and installation finish, you can check the version of cmake with the following command:

    cmake --version

Next we need to install build-essential with this command:

    sudo apt-get install build-essential

Before we go any further, it is good to know the path of our current working directory. In the terminal, type:

    pwd

Note down the path it shows; it is very important and will be used during the actual GROMACS installation.

Now we need the Fourier transform library on our system. You can download it from fftw.org or install it from the repository with the following command:

    sudo apt-get install libfftw3-dev

Now that we have cmake in place and we know the working directory, it is time to download the regression tests package. It is possible to download this package automatically during installation, but most of the time it throws an error stating that the location of the file has changed, so let us do it the hard way to avoid any problem during installation. Copy and paste the following commands into your terminal (right click to paste, or Ctrl+Shift+V). They download the file and save it in your Downloads folder.

    cd Downloads/
    wget http://gerrit.gromacs.org/download/regressiontests-5.1.1.tar.gz

We now have the regression test package in our Downloads folder as a compressed tar.gz archive. Let us extract it with:

    tar xvzf regressiontests-5.1.1.tar.gz

Okay, let us now download GROMACS 5.1.1 with this command:

    wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.1.1.tar.gz

Alternatively, you can download the latest version from the GROMACS website.

Now extract the GROMACS archive:

    tar xvzf gromacs-5.1.1.tar.gz

Now move inside the GROMACS folder:

    cd gromacs-5.1.1/

Create a directory called “build” where we will keep our compiled binaries:

    mkdir build

Move inside the build directory:

    cd build

It’s time to make GROMACS. In the following command, replace “pwdpath” with the path of the working directory that you noted earlier:

    cmake .. -DGMX_BUILD_OWN_FFTW=OFF -DREGRESSIONTEST_DOWNLOAD=OFF -DCMAKE_C_COMPILER=gcc -DREGRESSIONTEST_PATH=pwdpath/Downloads/regressiontests-5.1.1

If everything goes well, the message in your terminal will say “Generating Done. Build files written… “. If not, make sure you have replaced the pwd path in the command with the path of your home directory. If you have forgotten it, just open another terminal and type pwd.

Now let’s first check and make the real thing:

    make check
    sudo make install

This may take some time depending on your configuration. After completion, execute:

    source /usr/local/gromacs/bin/GMXRC

After the successful installation, you may check the version of your GROMACS with a command to make sure the installation finished as expected:

    gmx pdb2gmx --version

***
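Replacing “pwdpath” by hand is the step where typing mistakes are most likely. As a small convenience, the sketch below assembles the cmake argument list from a home directory so it can be printed and pasted into the terminal. It is an optional helper written for this tutorial's layout (regression tests extracted under Downloads); the function names are hypothetical, and only the cmake options already used in this tutorial are included.

```python
import posixpath
import shlex


def gromacs_cmake_args(home_dir, version="5.1.1"):
    """Build the cmake argument list used in this tutorial for a given home directory."""
    # Assumes the regression tests were extracted in <home>/Downloads, as above.
    regression_dir = posixpath.join(
        home_dir, "Downloads", f"regressiontests-{version}")
    return [
        "cmake", "..",
        "-DGMX_BUILD_OWN_FFTW=OFF",       # use the FFTW installed from the repository
        "-DREGRESSIONTEST_DOWNLOAD=OFF",  # do not fetch the tests automatically
        "-DCMAKE_C_COMPILER=gcc",
        f"-DREGRESSIONTEST_PATH={regression_dir}",
    ]


def as_shell_line(args):
    """Join the arguments into one shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Replace /home/yourname with the path printed by pwd.
    print(as_shell_line(gromacs_cmake_args("/home/yourname")))
```

The printed line is meant to be run from inside the build directory. Paths containing spaces are quoted automatically, which a hand-edited command would not guarantee.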


GENOMICS

GenomeD3Plot: Easy visualization of genomes
Muniba Faiza
Image Credit: Stock Images

“GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3 (Data-Driven Documents), a library used to produce dynamic, interactive data visualizations in web browsers.”

As important as sequencing genomes is, it is equally important to visualize them. Some tools exist to visualize genomes, but they are static and standalone, and rather complex to install and use. Newer, more interactive tools are required to ease genome visualization. GenomeD3Plot is a newly created visualization library written in JavaScript. It uses D3 (Data-Driven Documents), a library used to produce dynamic, interactive data visualizations in web browsers. GenomeD3Plot is very user-friendly: it allows the user to interact with the data, alter the view dynamically, and easily resize or reposition the visualization in the browser.

The goal of Matthew R. Laird was to create a library with minimal external dependencies that could be integrated into existing web applications just as a developer might include an image or table. GenomeD3Plot uses a JSON configuration, a standardized and well-supported data format, which reduces the complexity of use and provides better visualization. The image is created in SVG format and can easily be exported to PNG format as required.



Fig.1 GenomeD3Plot circular and linear visualization of an example genome with annotation data

With GenomeD3Plot, the genome can be viewed in different tracks, for example to view a specific base pair or a series of base pairs, to visualize GC content, and more. GenomeD3Plot provides a rich API (application program interface, which specifies how software components should interact) to dynamically manipulate the visualization. A linear and a circular plot can also be tied together so that manipulating one causes a mirrored alteration in the other, such as zooming or changing the visible region of the genome. The view can be re-centred on a specific region to bring it into focus. Many other features have been introduced in GenomeD3Plot for easy visualization and interpretation of genomes.
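As an illustration of the kind of data a GC content track needs, the following Python sketch computes GC content in fixed-size windows and serializes the result as JSON. The field names (`start`, `end`, `gc`) are hypothetical and are not GenomeD3Plot's actual configuration schema; the library's own documentation should be consulted for the format it expects.

```python
import json


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_track(seq, window):
    """Split seq into consecutive windows and report GC content for each.

    Coordinates are 1-based and inclusive; the last window may be shorter.
    """
    points = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        points.append({
            "start": start + 1,
            "end": start + len(chunk),
            "gc": round(gc_content(chunk), 3),
        })
    return points


if __name__ == "__main__":
    track = gc_track("GGCCATATGCGCATTA", window=4)
    print(json.dumps(track, indent=2))
```

A plotting library can then draw one point or bar per window; smaller windows give a more detailed but noisier track.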

Note: An exhaustive list of references for this article is available from the author on personal request; for more details write to muniba@bioinformaticsreview.com.



