OCTOBER 2015 VOL 1 ISSUE 1
“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”
ALFALFA: explained! By Muniba Faiza
Charles Wins
Computer and Drugs: What you need to know
Contents
October 2015

EDITORIAL
Bioinformatics Review – The Road Ahead

GENOMICS
The basic concepts of genome assembly

TOOLS
MUSCLE: Tool for Multiple Sequence Alignment
BioMiner & Personalized Medicine: A new perspective

PROTEOMICS
How to check new peptides accuracy in Proteogenomics

SYSTEMS BIOLOGY
Tumor progression prediction by variability based expression signatures
Basics of Mathematical Modelling - Part 1
Introduction to mathematical modelling - Part 2

SOFTWARE
ALFALFA: explained

DATA ANALYSIS
Meta-analysis of biological literature: Explained

SEQUENCE ANALYSIS
Basic Concept of Multiple Sequence Alignment

BIOINFORMATICS NEWS
DNA test for paternity: This is how you can fail!

CADD
Computer and Drugs: What you need to know
EDITORIAL
EXECUTIVE EDITOR
FOZAIL AHMAD

FOUNDING EDITOR
MUNIBA FAIZA

SECTION EDITORS
ALTAF ABDUL KALAM
MANISH KUMAR MISHRA
SANJAY KUMAR
PRAKASH JHA
NABAJIT DAS
REPRINTS AND PERMISSIONS You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUES Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT PHONE +91 991 1942-428 / 852 7572-667 MAIL Editorial: 101 FF Main Road, Zakir Nagar, Okhla, New Delhi, IN 110025

STAFF ADDRESS To contact any of the Bioinformatics Review staff members, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA Trust. Published in India.
Bioinformatics Review – The Road Ahead

Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although there are a lot of journals publishing a large volume of quality research on a variety of topics such as genome analysis, algorithms, and sequence analysis, that research rarely gets any notice in the popular press.
One reason behind this rather disturbing trend is that there are very few people who can successfully read a research paper and make news out of it. Moreover, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics. Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in some field or the other. Bioinformatics Review has been conceptualized to address all these concerns. We will provide an insight into bioinformatics as an industry and as a research discipline. We will post new developments in bioinformatics and the latest research. We will also accept entries from our audience and, if possible, we will also reward them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industries. We will also provide a free job-listing service for anyone who can benefit from it.

Tariq Abdullah
Founder
Letters and responses: info@bioinformaticsreview.com
SOFTWARE
ALFALFA: explained
Muniba Faiza

“ALFALFA is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size.”

High-throughput sequencing has revolutionized the world of bioinformatics research. Since the Human Genome Project, in which the human genome was sequenced, millions of species have been sequenced. Sequencing is a very important aspect of bioinformatics, so newer, faster and better sequencing techniques are needed. New sequencing platforms produce biological sequence fragments faster and cheaper.
An ideal read mapper should accomplish the following:

- maximal speed;
- minimal memory use;
- maximal accuracy;
- the ability to shoot at a moving target (fast-evolving technologies differ in read length distribution and sequencing errors).

Recent advances in next-generation sequencing technologies have led to increased read lengths, higher error rates, and error models showing more and longer indels (insertions and deletions). A preprocessing step that indexes the reference genome and/or the sequencing reads must guarantee fast substring matching. The overall search space is pruned to candidate genomic regions by finding matching segments (called seeds) between the reads and the reference genome. These candidate regions are then further investigated for acceptable alignments that reach a particular score.

ALFALFA is a new tool that is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size (the distance between the paired ends). The name is an acronym for "A Long Fragment Aligner/A Long Fragment Aligner"; it is repeated twice as a pun on the repetitive and overlapping fragments observed in genome sequences, which heavily distort read mapping and genome assembly.

The most fascinating feature of ALFALFA is that it uniquely uses enhanced sparse suffix arrays to index the reference genome. An index is a data structure that allows the quick location of all occurrences of patterns starting at interesting positions only. A sparse suffix array uses LCP (longest common prefix) information, which reduces the solution space and allows a suffix tree to be emulated efficiently; a chaining algorithm is then used to speed up dynamic programming extensions of the candidate regions. This data structure facilitates fast calculation of maximal and super-maximal exact matches. The speed-memory trade-off is tuned by setting the sparseness value of the index.
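To make the idea of a sparse index concrete, here is a toy sketch, not ALFALFA's actual implementation: only suffixes starting at multiples of a sparseness factor s are indexed, which shrinks memory by roughly a factor of s at the cost of extra work per query.

```python
def build_sparse_suffix_array(text, s=4):
    """Toy sparse suffix array: index only suffixes starting at
    positions 0, s, 2s, ...  Real implementations build this in
    linear time and attach LCP information; this naive sort is
    for illustration only."""
    sampled = range(0, len(text), s)
    return sorted(sampled, key=lambda i: text[i:])

# A pattern occurrence starting at an unsampled position is found by
# extending the query: prepend up to s-1 preceding characters and
# search for each variant among the sampled suffixes.
print(build_sparse_suffix_array("ACGTACGATT$", s=2))  # [10, 4, 0, 6, 2, 8]
```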
ALFALFA follows a canonical seed-and-extend workflow for mapping reads onto a reference genome, which the authors describe with a botanical metaphor:

- Root system: the reference genome, indexed by an enhanced sparse suffix array (enabling quick retrieval of variable-length seeds, the super-maximal exact matches between a read and the reference genome).
- Seed: a super-maximal exact match between the reference genome and a read.
- Flower bud: a cluster of seeds forming a candidate genomic region (seeds are grouped into non-overlapping clusters that mark candidate genomic regions for read alignment).
- Flower: gaps between seeds filled by dynamic programming (the handling of candidate regions is prioritized by the aggregate base-pair coverage of their seeds; the final extend phase samples seeds from candidate regions to form collinear chains that are bridged using dynamic programming).

Features of ALFALFA: ALFALFA exploits the technological evolution towards longer reads by using maximal exact matches (MEMs) and super-maximal exact matches (SMEMs) as seeds. Since MEMs between a read and a reference genome may overlap, super-maximal exact matches are defined as MEMs that are not contained in another MEM in the read. These seeds are extensively filtered, and the order of alignment is then decided so as to allow a more accurate prioritization of candidate regions. To reduce the number of expensive dynamic programming computations needed, ALFALFA chains seeds together to form a gapped alignment. As a result, the extension phase (aligning the matches) is limited to filling the gaps in between chains while evaluating alignment quality.

The sparseness value s of the sparse suffix array (controlled by the option -s) provides an easily tunable trade-off between performance and memory footprint. In theory, sparse suffix arrays take up 9/s + 1 bytes of memory per indexed base. A sparse suffix array with sparseness factor 12 thus indexes the entire human genome with a memory footprint of 5.8 GB. This shows that ALFALFA can map reads at maximal speed while requiring minimal memory.

ALFALFA tries to balance the number and the quality of seeds using a combination of maximal and super-maximal exact matches. The intervals [i..i+l-1] and [j..j+l-1] correspond to a maximal exact match of length l between a read and a reference genome if there is a perfect match between the two subsequences of length l starting at position i in the read and at position j in the reference genome, with mismatches occurring at the positions (i-1, j-1) and (i+l, j+l) just before and just after the matching subsequences.

A combination of neighboring seeds increases the evidence that some region in the reference genome holds potential as a mapping location. ALFALFA therefore sorts seeds by their starting position in the reference genome and bins them into non-overlapping clusters, using the locally longest seeds as anchors around which regions are built. This results in a list of candidate regions along the reference genome. To limit the number of candidate regions requiring further examination, only SMEMs and rare MEMs are used for candidate region identification. Candidate regions are then ranked by their coverage of read bases, calculated from the seeds that make up the clusters. Sequential processing of these prioritized candidate regions halts when a high number of feasible alignments has been found, a series of consecutive candidate regions fails to produce an acceptable alignment, or read coverage drops below a certain threshold.

The dimensions of a dynamic programming matrix correspond to the bounds of a candidate region, but computations are often restricted to a band around the main diagonal of the matrix, whose width depends on the minimal alignment score required. ALFALFA further reduces the dimensions of the matrix by forming a collinear chain of a subset of the seeds that make up a candidate region. Dynamic programming can then be restricted to filling the gaps between consecutive non-overlapping seeds. The chaining algorithm starts from an anchor seed and greedily adds new seeds that do not exhibit a high skew relative to the chain, where the skew is defined as the difference of the distances between two seeds on the read sequence and on the reference genome. The amount of skew allowed is decided automatically based on the gap between the seeds and the parameters that influence the feasibility of an alignment. ALFALFA allows multiple chains per candidate region, based on the available anchor seeds. Anchor selection is based on seed length, and seeds contained in chains can no longer be used as anchors in successive chain constructions.
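A minimal sketch of this greedy chaining idea, assuming seeds are (read_pos, ref_pos, length) triples; the fixed skew threshold is a simplified placeholder for ALFALFA's automatically tuned criterion:

```python
def skew(a, b):
    # Difference between the read-gap and the reference-gap of two seeds;
    # collinear seeds belonging to the same alignment have a small skew.
    read_gap = b[0] - (a[0] + a[2])
    ref_gap = b[1] - (a[1] + a[2])
    return abs(read_gap - ref_gap)

def greedy_chain(anchor, seeds, max_skew=10):
    # Start from the anchor seed and greedily add non-overlapping seeds
    # (in read order) whose skew stays within the allowed bound.
    chain = [anchor]
    for s in sorted(seeds, key=lambda s: s[0]):
        prev = chain[-1]
        if s[0] >= prev[0] + prev[2] and skew(prev, s) <= max_skew:
            chain.append(s)
    return chain

# Dynamic programming then only needs to bridge the gaps between
# consecutive seeds of the returned chain.
print(greedy_chain((0, 100, 20), [(25, 126, 15), (45, 300, 10), (60, 161, 12)]))
# [(0, 100, 20), (25, 126, 15), (60, 161, 12)] -- the stray seed is skipped
```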
The benchmark results demonstrate that ALFALFA is extremely fast at mapping long reads, while still being competitive for moderately sized reads. Together with BWA-SW and BWA-MEM, it is one of the few mappers that scale well for read lengths up to several kilobases. Overall, Bowtie 2 has the highest sensitivity, which reaches 100%; however, Bowtie 2 is also less able to distinguish between good and bad alignments. CUSHAW3, BWA-MEM and ALFALFA exhibit the best trade-off between true positives and false positives. Mapping quality is assessed with ROC (receiver operating characteristic) curves.

Reference: Vyverman M, De Baets B, Fack V, Dawyndt P. BMC Bioinformatics 2015, 16:59. doi:10.1186/s12859-015-0533-0
SYSTEMS BIOLOGY
Tumor progression prediction by variability based expression signatures
Muniba Faiza

“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”

Cancer has become a very common disease nowadays, but its main cause is still unknown. Various reasons have been given, and recent research suggests that improper sleeping patterns may also lead to cancer. Just as the cause of cancer is difficult to pin down, its progression and prognosis are also very difficult to predict. Despite many advances in cancer treatment, early detection is still very difficult. While there have been many early cancer screening techniques, they are not realistic because of the lack of cost-effectiveness or the requirement of invasive procedures. Genomic screening techniques are a promising approach in this area.

Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition. But only a few of them have been successfully utilized in clinics, and many of them have failed to perform. Since these signatures attempt to model the highly variable and unstable genomic behavior of cancer, they are unable to define robust and reproducible gene expression signatures capable of accurately distinguishing tumor samples from healthy controls, and they are unable to predict the degree of malignancy of a cancer.

Instead, the deviation in gene expression from the normal tissue, i.e., the degree of hyper-variability across cancer types, can be used as a measurement of the risk of relapse or death. This gives rise to the concept of gene expression anti-profiles. Anti-profiles are used to develop cancer genomic signatures that specifically take advantage of gene expression heterogeneity: they explicitly model increased gene expression variability in cancer (differentially variable genes = anti-profile genes) rather than relying on traditional classification approaches. After many experiments on cancer anti-profiles, the results indicated that the anti-profile approach can be used as a more robust and stable indicator of tumor malignancy.

The researchers' hypothesis is that the degree of hyper-variability (with respect to normal samples) is directly proportional to tumor progression, i.e., the degree of hyper-variability measured with respect to the normal samples would increase with tumor progression. Corrada Bravo et al. found a way to derive a colon-cancer anti-profile for screening colon tumors by measuring deviation from normal colon samples. To create an anti-profile, they used a set of normal samples and tumor samples; probesets are ranked by the quantity σj,tumor/σj,normal (where σj,tumor and σj,normal are the standard deviations among the tumor samples and the normal samples, respectively, for probeset j) in descending order, and a certain number of probesets (typically 100) with the highest values are selected. They then calculated the normal region of each probeset, and the number of probesets whose expression lies outside the normal region is counted to obtain the anti-profile score of a sample.
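A minimal sketch of this scoring scheme in Python. The ranking by standard-deviation ratio follows the description above, but the definition of the "normal region" used here (median ± k median absolute deviations of the normal samples) is an assumption for illustration and may differ from the paper's exact cutoff:

```python
import numpy as np

def anti_profile_score(normal, tumor, sample, n_top=100, k=5):
    # normal, tumor: probesets x samples expression matrices;
    # sample: expression vector of the sample to be scored.
    ratio = tumor.std(axis=1) / normal.std(axis=1)
    top = np.argsort(ratio)[::-1][:n_top]      # most hyper-variable probesets
    med = np.median(normal[top], axis=1)
    mad = np.median(np.abs(normal[top] - med[:, None]), axis=1)
    lo, hi = med - k * mad, med + k * mad      # the assumed "normal region"
    # Anti-profile score: number of probesets falling outside their region.
    return int(np.sum((sample[top] < lo) | (sample[top] > hi)))
```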
To test their hypothesis, they obtained two publicly available microarray datasets with normal, adenoma, and cancer colon samples. From these datasets, they plotted the distribution of the ratio of the variance of cancer/adenoma samples to the variance of normal samples (in log2 scale) for these probesets on the other dataset (Fig. 1A and B). Both adenoma and cancer samples show higher variability than normals (the region to the right of x = 0), while cancer samples show higher hyper-variability than adenomas. This suggests that hyper-variability is a stable marker between experimental datasets, and that the selection of hyper-variable genes across cancer types and the anti-profile method can be extended to model tumor progression. These studies showed that gene expression anti-profiles capture tumor progression.

Fig. 1 Among probes that exhibit higher variability among cancers than among normals, the degree of hyper-variability observed is related to the level of progression. (A) Distribution of the variance ratio statistic log2(σ²tumor / σ²normal) for the colon dataset of Gyorffy et al. (GSE4183) from the anti-profile computed using another colon dataset (Skrzypczak et al.; GSE20916). (B) Distribution of the variance ratio statistic for the Skrzypczak et al. colon dataset from the anti-profile computed using the Gyorffy et al. colon dataset. (C) Distribution of the variance ratio statistic for adrenocortical data (Giordano et al.; GSE10927) for universal anti-profile probesets.

DNA methylation is one of the primary epigenetic mechanisms of gene regulation, and is believed to play a particularly important role in cancer. High levels of methylation in promoters are usually associated with low transcription. Cancer shows a loss of sharply defined methylation levels, which is associated with increased hyper-variability in gene expression across multiple tumor types. They therefore applied the anti-profile scoring method to DNA methylation data from thyroid and colon samples, where for each tissue type normal, adenoma and cancer samples were available.

Figure 2 shows the adenoma and carcinoma samples against normal samples on a principal component plot, revealing the hyper-variability pattern in the methylation data: the normal samples cluster tightly, while the adenomas show some dispersion and the carcinomas show even greater dispersion. Since this behavior is present for both colon and thyroid data, it again reinforces the notion that the anti-profile approach has wide application for classification in cancer.

Fig. 2 Anti-profiles applied to methylation data: first two principal components of (A) thyroid methylation data and (B) colon methylation data.

Conclusion: The anti-profile approach is well suited to cancer prognosis. It can robustly predict tumor progression and prognosis based on the variability in gene expression. The results presented above also confirm that gene expression signatures based on hyper-variability can be highly valuable.

Reference: Wikum Dinalankara and Héctor Corrada Bravo, Center for Bioinformatics and Computational Biology, Department of Computer Science and UMIACS, University of Maryland, College Park, MD, USA.
TOOLS
BioMiner & Personalized Medicine: A new perspective
Muniba Faiza

“BioMiner is a web-based tool which provides various tools for statistical analysis and a deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept.”

Personalized medicines have become a very important part of the medical world nowadays. They are also known as 'individualized medicines'. Personalized medicines allow a doctor to prescribe more specific and efficient medicines to a particular patient. This concept has created many new opportunities in medicine. The personalized medicine concept is accomplished by obtaining high-throughput datasets from genomics, transcriptomics, proteomics and metabolomics, but more specifically it requires 'cross-omics', i.e., linkage between transcriptomics, proteomics and metabolomics.

Public data repositories such as the Gene Expression Omnibus (GEO) and ArrayExpress for microarray data, PRoteomics IDEntifications (PRIDE) for proteomics data, or the Sequence Read Archive (SRA) of NCBI are used to store the biological high-throughput datasets from next-generation sequencing. The only limitation of these repositories is that they store biological data of a single dedicated omic type and do not support cross-omics.

Currently available simple web-based tools do not allow much access to the high-throughput datasets from the omics. But a novel web-based tool, BioMiner, has been launched recently which provides access to a wide variety of high-throughput datasets. This tool was developed within the scope of an international and interdisciplinary project (SYSTHER). BioMiner provides the user with various convenient tools which help to analyze high-throughput datasets, and it provides deep insight into complex cross-omics datasets with enhanced visualization abilities. Since BioMiner was developed under the SYSTHER (System Biology Tools Development for Cell Therapy and Drug Development – www.systher.eu) project, its main focus is on cancer.

A database, namely SystherDB, has been developed in which the stored data is well presented and easily accessible, and whose data is mined and analyzed by the BioMiner tools. A public instance of BioMiner is freely available online. It currently contains 18 different studies, with almost 4,000 microarrays and more than 187 million measured values of genes, proteins, or metabolites. Since BioMiner was developed in the SYSTHER project, most of the studies are focused on glioblastoma multiforme (GBM).
Fig. 1 Workflow of BioMiner.

FEATURES:
1. BioMiner uses the Google Web Toolkit (GWT) for the graphical user interface (GUI).
2. A separate, manually curated MySQL database is used to store the experimental data from genomics, proteomics and metabolomics.
3. Data import has to be performed by a dedicated specialist to ensure data consistency.
4. Response times are within just a few seconds; for this purpose special indexing methods are implemented.
5. Metabolite data are annotated using three different identifier systems: the Golm Metabolome Database, the Human Metabolome Database (HMDB), and the Kyoto Encyclopedia of Genes and Genomes (KEGG).
6. Predefined cross-omics relationships (e.g., a mapping of metabolites onto genes or vice versa) link the biological datasets.
7. Pathway and functional information is integrated from Reactome, KEGG, and WikiPathways.
8. Gene Ontology is also supported.
9. Correlation analyses (statistical analysis of any two variables) are based on Pearson correlation coefficients.
10. Correlations are calculated for high-variance genes (by default the top 500 genes), as sketched below.
11. BioMiner complies with public data management standards such as Minimum Information About a Microarray Experiment (MIAME), Minimum Information About a Proteomics Experiment (MIAPE), and Minimum Information About a Metabolomics Experiment (MIAMET).
12. The ENSEMBL database is used for cross-mapping between genes and proteins.
13. For cross-mapping between genes and metabolites, the combined information of ConsensusPathDB and HMDB is used.

Fig. 2 Data mining with BioMiner: screenshots of different results from data mining with BioMiner, including (a) study overview, (b) detection of differentially expressed genes, (c) correlation of gene expression and survival time, (d) identification of significantly enriched pathways, (e) visual pathway inspection based on predefined layouts, and (f) a biomolecule comparison of gene and protein expression. Results are typically presented in synchronized, parallel views composed of a table and a plot.

Fig. 3 Pathway visualization: interactive pathway visualization of the cell cycle pathway from the WikiPathways repository.
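The high-variance shortcut of item 10 is easy to picture; a minimal sketch (not BioMiner's actual code) of restricting an all-against-all Pearson correlation to the top-variance genes:

```python
import numpy as np

def top_variance_correlations(expr, n_top=500):
    # expr: genes x samples matrix. Restrict the all-against-all Pearson
    # correlation to the n_top highest-variance genes, since computing
    # the full genes x genes matrix would be needlessly expensive.
    idx = np.argsort(expr.var(axis=1))[::-1][:n_top]
    return idx, np.corrcoef(expr[idx])

expr = np.random.rand(20000, 40)      # fake expression data for illustration
genes, corr = top_variance_correlations(expr, n_top=500)
print(corr.shape)                     # (500, 500)
```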
BioMiner is a web-based tool which provides various tools for statistical analysis and deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept. Results are presented in two parallel views composed of a table and a plot; both views are interactive, and user-defined selections can be synchronized. Pathway visualization is achieved by extending the PathVisio library. BioMiner also provides clinicians and physicians a platform integrating high-throughput data together with clinical parameters, thereby leading to better personalized medicines.

Reference: Chris Bauer, Karol Stec, Alexander Glintschert, Kristina Gruden, Christian Schichor, Michal Or-Guil, Joachim Selbig and Johannes Schuchhardt.
META ANALYSIS
Meta-analysis of biological literature: Explained
Manish Kumar Mishra

“Meta-analysis is an analysis of already published data by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.”

It's a fine Monday morning, and the new intern finds his way to the laboratory of biological data mining procedures. His brief interview with the concerned scientist has allowed him only a very limited understanding of the subject. Upon his arrival he is greeted with a humongous corpus of mixed articles, say some 4000, and he is required to assemble specific information out of the data set by diligently scrutinizing the components of each article.

Well, the situation could be frightening to a purely wet-lab biologist, but anyone who has had some exposure to the real power of file handling with a programming language will know how to let a simple few lines of code do his bidding.

So what is meta-analysis about? The new cool word in the biological realm, "meta-analysis", can be better understood through the meaning of the first half of the term: META, meaning data about data, thence making meta-analysis an analysis of already published data by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.

By the most rudimentary means, meta-analysis can be achieved by reading the corpus of research and review articles concerning a particular topic, which may be as wide as a whole eukaryotic genome or may be narrowed down to phyla, groups, or species, or perhaps a specific disease or even a particular gene. Where on one hand we try to narrow down to a disease or gene, one must also realize that biological systems are the most complex known to date, and present-day computer simulations fail to rival that complexity with equal efficiency; so any analysis narrowed down to a gene must also consider that the gene may very well be found in multiple organisms, and may thus return a considerably high number of results irrelevant to the study.

A rigorous manual inspection of program-sorted data is required to sort out such entries. Since meta-analysis relies heavily on statistical studies of data, researchers tend to rely on programming languages such as Stata and R to write their specific codes for analyses. R, unlike Stata, is free, produces publication-quality outputs and provides a plethora of packages, a few of which provide programs (PDF miner, PubMed miner, etc.) for accessing the PubMed database. These packages contain code to access the database and extract all the information from it through a command-based interface, for huge data sets at once, cutting down the manual effort and time taken to achieve the task.
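The same kind of bulk retrieval can also be scripted in Python; a minimal sketch using Biopython's Entrez module (the e-mail address and search term are placeholders):

```python
from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address

# Find PubMed records matching a query, then pull their abstracts in bulk.
handle = Entrez.esearch(db="pubmed", term="tumor suppressor meta-analysis",
                        retmax=50)
ids = Entrez.read(handle)["IdList"]

handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                       rettype="abstract", retmode="text")
print(handle.read())  # plain-text abstracts, ready for downstream sorting
```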
All praises sent, the method has its own fair share of drawbacks and issues. The current query system of NCBI and its sister organizations fails to acknowledge synonymous terms and treats them as individual entities, not linked to one another but only associated through the query items supplied alongside. A robust query system is needed to enhance the results and make the whole concept more efficient. The need of the hour is to engage more resources in developing well-structured and somewhat intelligent query systems which can truly acknowledge gene names and abbreviations, scientific and common English names of organisms, and also the variations in presenting the names of the techniques involved.
PROTEOGENOMICS
How to check new peptides accuracy in Proteogenomics
Muniba Faiza

“During the discovery of novel genes, there are huge chances of getting false positives, i.e., we can also get peptides which are not actually present but which the algorithm reports.”

Proteogenomics is an emerging area at the interface of proteomics and genomics. This intersection employs genomic and transcriptomic information to identify novel peptides using mass spectrometry-based techniques. The proteomic data can then be used to identify the fingerprints of genic regions in that particular genome, which may result in the modification of gene models and can also improve gene annotations. So we can say that proteogenomics has been well accepted as a tool to discover novel proteins and genes. But during the discovery of novel genes, there are huge chances of getting false positives, i.e., we can also get peptides which are not actually present but which the algorithm reports. Therefore, to avoid, or more accurately to minimize, the chances of false positives, a false discovery rate (FDR) is used. The FDR is estimated as the ratio of the number of decoy hits to the number of target hits:

FDR = (number of decoy hits) / (number of target hits)
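In a standard target-decoy search this estimate is applied at a chosen score threshold; a minimal sketch (the scores and threshold are made-up placeholders):

```python
def estimated_fdr(target_scores, decoy_scores, threshold):
    # Count decoy and target peptide-spectrum matches above the score
    # threshold; their ratio estimates the FDR at that threshold.
    decoys = sum(s >= threshold for s in decoy_scores)
    targets = sum(s >= threshold for s in target_scores)
    return decoys / targets if targets else 0.0

# Lowering the threshold admits more true hits but raises the FDR;
# proteogenomic pipelines may also track separate (subgroup) FDRs for
# annotated and novel peptides, as discussed below.
print(estimated_fdr([42, 35, 28, 55], [12, 30, 8], threshold=25))  # 0.25
```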
In most conventional proteogenomic studies, a global false discovery rate (i.e., the identifications of annotated peptides and novel peptides are subjected to FDR estimation in combination) is used to filter out false positives when identifying credible novel peptides. However, it has been found that the actual level of false positives among novel peptides is often out of control and behaves differently for different genomes. It has been observed previously that, under a fixed FDR, the inflated database generated by, e.g., six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification. Recently, Krug implied that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely the genome is annotated, the higher the chances of accurate identification of novel peptides.

In this recent paper, the authors followed the same framework as in Fu's work to quantitatively investigate the subgroup FDRs of the annotated and novel peptides identified by a 6-ORF translation search. They reveal that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides identified by 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g., 1%), many low-scoring but true peptide identifications may be excluded along with the false positives. To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information (especially from strand-specific cDNA-seq data) is available, it is apparently more favorable to search against the transcriptome as well than to search against the genome alone. If transcriptome information is unavailable, it is also helpful to reduce the 6-ORF translation database by removing sequences that are predicted to be hardly possible to be real proteins.

Reference: A note on the false discovery rate of novel peptides in proteogenomics. Kun Zhang, Yan Fu, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu and Si-Min He.
GENOMICS
The basic concepts of genome assembly
Muniba Faiza

“Genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It consists of all the heritable information and also some regions which are not even expressed.”

Genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It consists of all the heritable information and also some regions which are not even expressed. Almost 98% of the human genome has been sequenced by the Human Genome Project, yet only 1 to 2% has been understood. The human genome still has much to reveal, whether in terms of genes or proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex if understood well.

The basic strategy behind discovering the new information in a genome is explained in the following steps (a toy sketch of the joining idea in step 2 follows the list):

1. First of all, the whole genome of an organism is sequenced, which results in thousands or hundreds of different unknown fragments starting from anywhere and ending anywhere.

2. Since we don't know what the sequence is and which fragment should be kept near which one, the concept of 'contigs' is employed. Contigs are formed from overlapping reads: the broken fragments are joined to each other only by matching the overlapping regions of the sequence. This means that many consecutive fragments are joined to form a contig, and many such contigs are formed during the joining process.

3. Now, the question that arises is: how do we know that a fragment which may be a repeat has been placed correctly, given that a genome may have many repeated regions? To overcome this, paired ends are used. Paired ends are the two ends of the same sequence fragment which are linked together, so that if one end of the fragment is aligned in, let's say, contig 1, then the other end will also be aligned in the same contig, as it is the consecutive part of the sequence. There are various software tools with the help of which we can define different lengths of the paired ends.

4. After that, all the contigs are combined to form scaffolds, sometimes called metacontigs or supercontigs, which are then further processed to give the assembled genome.
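A toy sketch of the overlap-joining idea from step 2 (real assemblers such as Velvet use de Bruijn graphs rather than this naive greedy merge):

```python
def merge_if_overlap(a, b, min_overlap=4):
    # Merge read b onto contig a when a suffix of a equals a prefix of b.
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

reads = ["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]
contig = reads[0]
for read in reads[1:]:
    merged = merge_if_overlap(contig, read)
    if merged:               # otherwise a real assembler starts a new contig
        contig = merged
print(contig)                # ATTAGACCTGCCGGAATAC
```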
All of this is done by different assembly algorithms; among the most used is Velvet, and one of the latest is SPAdes. In my experience, the more efficient algorithms are those which provide us with more information in one go. Just imagine that we got a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don't forget to share this article if you liked it.
CADD
Computer and Drugs: What you need to know
Altaf Abdul Kalam

“Computers are being used to design drugs, and it is being done on a humongous level, by almost every multinational pharma company.”
Would you chance your life to a lowly piece of hardware called the computer? Would you let it fabricate and determine drugs for life-threatening diseases like hepatitis, cancers and even AIDS? Well, actually, your answer (or your opinion) doesn't seem to matter, because the world has moved over to the other side. Computers are being used to design drugs, and it is being done on a humongous level, by almost every multinational pharma company, the names of which you will undoubtedly find at the back of your prescription medicines at home. So what's with all this computer stuff? Have we parted with our perspicacity, our intuition, our ready willingness to tackle any challenge head on? We have always found solutions to mankind's biggest problems all by ourselves. As Matthew McConaughey's character in Interstellar says, "..or perhaps we've forgotten we are still pioneers?"
Well, philosophical mumbo-jumbo aside, it's not as simple as it sounds. Of course, most of you reading this already have some background in this topic and have already understood what I am talking about. But for those of you who haven't the slightest clue, don't worry, this write-up is for you. Throughout this series of articles on this particular issue, I am going to try to break it down to the basics. Let's say that by the end you would see a car not for what it is, with all its complexity and slickness, but for what made it the way it is: the nuts and bolts and rubber and... whatever, you get the point! So where do we start? Money! Yes, the thing that runs the world. Contrary to what all the losers who never made a dime say, money
simply is everything. Even Bill Gates was forced to acknowledge the fact and declare, "Money isn't everything in this world, but you gotta have a lot of it before you say such rubbish." So that settles it then. Now let's come back. The basic modus operandi of designing a drug is that you first find a suitable target which you believe will be key to challenging the disease. This is mostly a protein/enzyme that can metabolise a particular drug, or in some cases even a disease-causing gene from the pathogen itself. Finding this target is not easy, but it is not that hard either. We have documentation, intensive studies and databases dedicated to listing, characterizing and studying the drug-metabolizing genes and proteins in the body. Different classes of metabolizers act on different types of chemicals (or drugs if you like). A class of metabolizers called the CYP
enzymes metabolize over sixty percent of the known drugs and medicines that humans consume. This includes drugs (the real ones: LSD, cocaine, heroin... get it?) and even poisons and sedatives. The metabolizers, of course, don't know which is which: if it suits them they metabolize it, else it passes out of your system. Now, under the assumption that we have a drug target, the next step is finding the suitable drug candidate itself. This step is what you call finding a needle in a haystack. There are literally millions of drugs out there, and if that is not enough you can go design your own and get it synthesized. In a drug target (we will call it simply the 'protein' henceforth) there are multiple points of action where multiple drugs can act. So, for example, in a protein made of 200 amino acids, we might find 50 actionable amino acids. For these fifty amino acids we may find thousands and thousands of drug candidates, all capable of bringing about some change or the other in the protein. So how do we find the One? If you had asked that question about fifteen years back, the answer would have been to slog it out: match every drug candidate you've got against the protein you have and check the effects in vivo. Now, countless factors come into play when a drug and a protein interact: global minima, energy minimization, binding affinity, hydrogen bonding intensity and what not. We shall learn about them in more detail in upcoming articles. So to put it simply, scientists spent their whole sorry lives pitting different drug candidates against the same protein over and over again until they found something worthwhile to hang on to. Even if all the above-mentioned factors blended in wonderfully, they might sadly discover at the end that the drug actually caused more harm than good. So the candidate gets discarded and they start the process all over again! Sometimes you got lucky and found the right drug after trying just a few combinations. But mostly it took years to even zero in on a drug that could be moved further into the drug discovery pipeline, which in itself is another torturous process! So coming back to the money factor: you don't need to be a Harvard Business School graduate to see that this tiresome task costs money, a lot of money! Money in the form of manpower, reagents, biological matter like tissues and test animals and plants, instrumentation, electricity and what not. Another thing it costs is something which none of us cares about much: time. Picture designing a drug for some novel disease which is killing thousands of people each year, and picture having to do this same procedure and coming out with a drug after 10-15 years. The cost of such a life-saving drug will also be high, because the company or lab that made it would want to recover all the time and money spent on it in the first place. Not exactly feasible and effective, I would say. So here comes computer-aided drug design, which (brace yourself) can shave years off the drug discovery pipeline. It can get you into the clinical trials phase within, say, 2-3 years as opposed to the earlier average of 7-8 years. Less money spent, less time spent, faster availability of a potential cure and, who knows, even less expensive medicines. So how does it work? How does the entry of a man-made machine change everything for the better so drastically? What does a computer do that humans could not? Can you trust results obtained in silico over something that happens in vivo? Is a computer finally so evolved that it can simulate life forms inside its motherboard and processors? We will hopefully see those questions answered in the next few posts!
SEQUENCE ANALYSIS
Basic Concept of Multiple Sequence Alignment
Muniba Faiza

“The major goal of MSA pairwise alignment is to identify the alignment that maximizes protein sequence similarity.”

Multiple Sequence Alignment (MSA) is a very basic step in the phylogenetic analysis of organisms. In MSA, all the sequences under study are aligned together pairwise on the basis of similar regions within them. The major goal of MSA pairwise alignment is to identify the alignment that maximizes protein sequence similarity. This is done by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, known as the sum-of-pairs or SP score. The SP score is the basis of many alignment algorithms (a small worked example follows below).

The most widely used approach for constructing an MSA is progressive alignment, where a set of n proteins is aligned by performing n-1 pairwise alignments of pairs of proteins, or pairs of intermediate alignments, guided by a phylogenetic tree connecting the sequences. A methodology that has been successfully used as an improvement of progressive alignment based on the SP score is consistency-based scoring, where the alignment is kept consistent with previously obtained alignments. For example, with three sequences A, B and C, the pairwise alignments A-B and B-C imply an alignment of A and C, which may differ from the directly computed A-C alignment.
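A small worked example of the SP score, using toy match/mismatch/gap values in place of a real substitution matrix such as BLOSUM:

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    # Sum the pairwise scores over every pair of rows in every column.
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == '-' and b == '-':
                score += 0          # gap against gap is usually ignored
            elif a == '-' or b == '-':
                score += gap
            else:
                score += match if a == b else mismatch
    return score

print(sp_score(["MQT-LK", "MQTALK", "MGT-LR"]))  # 3 with these toy values
```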
Now the questions arise: how much can we rely on the obtained MSA, and how is an MSA validated? The validation of an MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding reference alignment, which gives an accuracy score. Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality documented alignments, designed to identify the strong and weak points of the numerous alignment programs now available. Recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE and a new extended version of BAliBASE.

Another parameter considered basic in most alignment programs is the fM score. It is used to assess the specificity of an alignment tool, and identifies the proportion of predicted matched residues that also appear in the reference alignment. Often some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to a point at which homology is no longer reliably detectable. In such a case the fM score provides, at best, a noisy assessment of alignment tool specificity, one that becomes increasingly less reliable as one considers sequences of increasing structural divergence. Even after considering the reference alignments, the accuracy of the results remains questionable, as the reference alignments themselves are of varying quality.
REFERENCES:

Edgar RC, Batzoglou S. Multiple sequence alignment.

Thompson JD, Plewniak F, Poch O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs.
SYSTEMS BIOLOGY
Basics of Mathematical Modelling - Part 1
Fozail Ahmad

“Mathematical modeling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has built a bridge between experimental and expected outcomes.”

Biochemical processes are simply complex, and their apparent features do not easily allow us to investigate what the system really means. Moreover, most biochemical processes obey nonlinear reaction kinetics: the amount of reactant (protein/RNA/DNA) is not directly proportional to its product. This leads to a further increase in the complexity of the molecular mechanism, and creates biological noise such as randomization (stochasticity) of biomolecules, perturbation in cell signaling, difficulty in quantifying cell products and even unexpected responses of the entire system. Here comes the development and utilization of a mathematical model, which takes multiple factors/parameters into consideration and provides the researcher with a visual understanding of how a complex biological system functions and responds to external (hormone/drug/cation/anion) and internal (protein/enzyme/cation/anion) signals, or to adverse environmental conditions such as deficiency of the Fe2+ ion during the formation of vitamin D. Basically, mathematical modeling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has built a bridge between experimental and expected outcomes; in case of discrepancies between the two, the parameters taken into consideration need to be refined. The general approach of modeling gives us the following benefits for knowledge generation and experimental design:

1. Discrepancies between a mathematical model and actual experimental results point to components that are still missing from the hypothetically developed model, so one can develop a more comprehensive scenario of the system's behavior. On the other hand, a well-developed model assists in designing and clarifying additional issues in an ongoing experiment.

2. With the help of a mathematical model, a researcher can modify experimental parameters (e.g., by introducing a modified protein associated with Mg2+ uptake into the cell) and run computer simulations.

3. Most importantly, mathematical models are not limited by environmental/experimental constraints. They may be quickly changed for multiple conditions/parameters, and the most suitable simulation can be assessed for developing a reliable experimental design.

4. A mathematical model may help to investigate a sub-system that regulates a special biochemical process (though not all biological reactions can be treated the same), which provides substantial information about the behavior of large systems.

By doing multiple simulations and changing parameter values, we are able to represent real biochemical/molecular phenomena which otherwise seem difficult to treat. → (To be continued…)

Fig. 1 Schematic representation of the biological modelling process.
TOOLS
MUSCLE: Tool for Multiple Sequence Alignment
Muniba Faiza

“MUSCLE is one of the tools known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB).”

In my last article I discussed Multiple Sequence Alignment and its creation. In this article, I am going to explain the workflow of one MSA tool: MUSCLE. MUSCLE is a software tool used to create an MSA of the sequences of interest. It is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB), and it is comparable to T-COFFEE and MAFFT (these tools will be explained in upcoming articles).

MUSCLE algorithm: MUSCLE uses two distance measures for a pair of sequences: a kmer distance (for an unaligned pair) and the Kimura distance (for an aligned pair). A kmer is a contiguous subsequence of length k, also known as a word or k-tuple; the word size decides how many letters of the sequences are searched and aligned together. The Kimura distance is based on the fact that multiple substitutions can occur at a single site. The workflow runs as follows (a sketch of a typical command-line invocation follows the list):

1. The kmer distance matrix is clustered with UPGMA (a method of phylogenetic tree construction based on the assumption that mutations occur at a constant rate), giving TREE1.
2. A progressive alignment guided by TREE1 produces the first alignment, MSA1.
3. For the now-aligned pairs of sequences, MUSCLE computes the pairwise percent identities from MSA1 (i.e., what percentage of the sequences is aligned/matched) and converts them into a Kimura distance matrix.
4. UPGMA is applied again to this matrix, giving TREE2, which is again followed by a progressive alignment, producing MSA2.
5. From the last obtained tree, an edge is deleted, which results in the formation of two subtrees; the profile (sub-alignment) of each subtree is computed.
6. The two subtree profiles are re-aligned, giving a new MSA, for which the SP score is calculated (explained in the previous article, "Basic Concept of MSA").
7. Only if the SP score is better is the newly obtained MSA kept (as MSA3); otherwise it is discarded.
8. Steps 5-7 are repeated until convergence, finally giving the refined MSA.
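In practice the whole workflow runs behind a single command. A sketch of driving it from Python, assuming the MUSCLE 3.x command-line flags and hypothetical file names:

```python
import subprocess

# seqs.fasta: unaligned input sequences; aln.fasta: the resulting MSA.
subprocess.run(["muscle", "-in", "seqs.fasta", "-out", "aln.fasta"],
               check=True)
```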
Fig. 1 The workflow of MUSCLE.

This is how MUSCLE works. MUSCLE alignment is also used in the MEGA6 tool for phylogenetic tree construction. Every software tool has its own benefits, depending upon the needs under consideration. There are various other MSA tools available, such as T-COFFEE and MAFFT, which also have high accuracy and speed; they will be explained in upcoming articles.

Reference: Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput.
SYSTEMS BIOLOGY
Introduction to mathematical modelling - Part 2
Fozail Ahmad

“In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out a comprehensive structural biochemical pathway.”

Gathering of dynamic/kinetic information

In the previous section you might have noticed that modelling a biochemical process requires a calibrated set of fine parameters which fit into and across the set of chemical/reactant species (gene/protein/molecule) involved in the process. The question arises: where do we collect data from? And what are the standard criteria for determining parameters? Basically, for a researcher, it is necessary to know the source of the data first, and then how to manipulate it to get the information relevant for modelling. The source of the data can be chosen depending upon the requirements of the experimental design. For modelling, data can be taken in the form of gene-gene interactions, gene expression (microarray) and gene-protein interactions.

Basically, interaction and expression data do not directly reveal the dynamic/kinetic values of the system, and therefore need to be manipulated before further use. In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structural biochemical pathway. This can be done by means of the following methods:

1. Genome (complete set of genes) analysis at the transcription level through DNA sequencing and genotyping.
2. Transcriptome (all mRNA) analysis at the translation level using microarrays.
3. Proteome (entire protein complement) analysis at the cellular level (reactions between proteins and other molecules in the cell) using mass spectrometry and 2D gel electrophoresis.
4. Metabolome (total metabolites and their intermediates) analysis at the cell level, covering interactions of metabolites and regulators, using 13C labelling and NMR techniques.
5. Interactome (all interacting molecules) analysis by yeast-2-hybrid screens and TAP techniques (TAP: tandem affinity purification; NMR: nuclear magnetic resonance).
Fig. 2 OMICS generates the data for developing the structural pathway, and the parameter values are set from the same data. A mathematical model, in the form of differential equations derived from the reaction channels, is then executed/solved using a suitable algorithm. The resulting simulation shows the dynamic behaviour of the system; it can be tuned by changing parameter values to bring the outcome close to the experimental data.

The techniques mentioned above are collectively referred to as omics. They provide us with structural and dynamic data that are used to generate mathematical formulas representing the observable reactions, followed by the development of a mathematical model and a comprehensive pathway of the biological system. These tentative models allow us (as mentioned in Part 1) to observe the effect of a stimulus on a specific signalling pathway, perturbations in cellular activities, gene expression levels, etc.

Omics are characterized by a number of features. First, they allow researchers to perform analyses on different molecular levels, such as the gene, protein and metabolite levels. These different molecular levels sometimes show asynchronous behaviour; that is, a metabolite such as glucose may be abundant in a cell while the corresponding enzymes that catalyse its reactions are scarce, or vice versa. Asynchronous behaviour is an indication of a complex regulatory mechanism; therefore, it is crucially important to evaluate the degree of synchronization of all cellular levels. Second, omics are highly parallelized: all genes/mRNAs (read-outs in a sample) can be studied simultaneously, rather than having to perform separate experiments focusing on individual genes. This parallelization also allows researchers to compare the degree of expression for the same gene and to relate it to interactions between the resulting proteins. Third, they are very standardized and therefore need highly automated computing, providing scientists with a large number of samples at a time. In the process, after collecting huge amounts of data, the most relevant information is picked out and then processed further for the final analyses. The entirety of the techniques in omics is very important in the sense that they generate the numerical data based upon which we are able to develop a structural pathway mimicking the real picture of the biological system, and then to represent it in the form of a mathematical model, as sketched below.
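As a concrete illustration of the pipeline in the Fig. 2 caption, a minimal sketch: a one-species kinetic model (constant synthesis, first-order degradation; the rates are made-up) written as a differential equation and solved numerically with SciPy:

```python
import numpy as np
from scipy.integrate import solve_ivp

k_syn, k_deg = 2.0, 0.5          # assumed synthesis and degradation rates

def dP_dt(t, P):
    # dP/dt = synthesis - degradation: the simplest pair of reaction channels.
    return k_syn - k_deg * P

sol = solve_ivp(dP_dt, (0.0, 20.0), [0.0], t_eval=np.linspace(0, 20, 101))
print(sol.y[0, -1])              # approaches the steady state k_syn/k_deg = 4
```

Changing k_syn or k_deg and re-running the simulation is exactly the parameter-tuning loop the caption describes.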
(→ Continue to Part 3)
BIOINFORMATICS NEWS
DNA test for paternity: This is how you can fail!
Tariq Abdullah

“DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs. That 0.1% is unique to the person, making it possible to identify each individual.”

A DNA test, also called DNA fingerprinting, is done to verify paternity, to establish criminal involvement, in forensic science, in archaeology and in other scientific fields. It is a well-established fact that the DNA fingerprinting test is foolproof, and it has its merits in court cases too: it is considered credible evidence of criminal involvement and paternity. The case of N.D. Tiwari also got wide media attention in recent times. BUT the chances are, you can fail a DNA test even with your real father or mother! To understand how this is possible, let us look at how DNA fingerprinting is done.

The technique of DNA fingerprinting (genetic fingerprinting), or simply the DNA test, was discovered by Dr. Alec Jeffreys in 1984. DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs, and that 0.1% is unique to the person, making it possible to identify each individual. To identify a person by differences in DNA sequence, the sequences are simply compared to each other. To speed up this comparison, rather than comparing each nucleotide, a molecular biologist compares regions of high variation in the DNA sequence, called minisatellites. The location and sequence of minisatellites in the genome vary in every individual. The chance of the same minisatellites occurring in two people is very low (1 in a billion); hence they can be treated as unique to every individual, just like a fingerprint.

To perform a DNA test, the DNA is first broken into smaller pieces by a restriction endonuclease called EcoRI, which cuts the sequence at the distinct locations where the sequence is GAATTC, or its complement. The location of this repetitive sequence varies in every individual. The fragments are then sorted according to their molecular weight (or size) by a technique called gel electrophoresis, and compared to each other. If the fragments generated by the restriction enzymes are of the same size, it is more likely that both sequences originated from the same individual.

So how can DNA fingerprinting fail?
For the DNA test to fail, we have to have two different sets of DNA (genomes) in our bodies. This is possible in the following ways (a concise list for quick reference):

1. Since the human body is a complex and dynamic system, the environmental conditions in different parts of the body may lead to changes in DNA; this is a comparatively new idea studied in epigenetics. Though changes occur this way, they are not likely to change the entire DNA and the locations of the minisatellites.

2. Transposable elements may also cause the location of some sequences to change. However, the occurrence of transposable elements is not so widespread as to change the locations of all minisatellites; hence this idea does not seem satisfactory either.

3. The occurrence of more than one kind of cell, in terms of genome, in a human, i.e., a human chimera, was recently seen in a US man who failed a paternity test with his real child.

So what is a human chimera? To be simple and precise, a human chimera is the occurrence of cells with completely different sets of genes in a single individual. It is a very rare condition and may go unnoticed. Sometimes, when one of a pair of twins dies during early pregnancy, the remnant cells may be taken up and absorbed by the surviving embryo. The surviving embryo will then have two kinds of genome in different parts of the body, depending upon the process of differentiation. Thus, if the cells that form sperm in your body have a different genome than the rest of the cells in your body, you may fail a DNA test! This is a rare condition, and the chances are low that you would get away with a crime because of it.

If you liked this article or found it worth reading, please do not forget to share. Who knows, there might be cases of human chimera around you. :)
Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and never miss out on any of your favorite topics.
Log on to www.bioinformaticsreview.com