Bioinformatics Review - October 2015


OCTOBER 2015 VOL 1 ISSUE 1

“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”

ALFALFA: explained! By Muniba Faiza

Charles Wins

Computer and Drugs: What you need to know


Public Service Ad sponsored by IQLBioinformatics


Contents

October 2015


Topics

Editorial

Genomics
- The basic concepts of genome assembly

Software
- ALFALFA: explained

Systems Biology
- Tumor progression prediction by variability based expression signatures
- Basics of Mathematical Modelling - Part 1
- Introduction to mathematical modelling - Part 2

Tools
- BioMiner & Personalized Medicine: A new perspective
- MUSCLE: Tool for Multiple Sequence Alignment

Data Analysis
- Meta-analysis of biological literature: Explained

Proteogenomics
- How to check new peptides accuracy in Proteogenomics

CADD
- Computer and Drugs: What you need to know

Sequence Analysis
- Basic Concept of Multiple Sequence Alignment

Bioinformatics News
- DNA test for paternity: This is how you can fail!


EDITORIAL

EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS: You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUES: Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT: PHONE +91 991 1942-428 / 852 7572-667. MAIL: Editorial, 101 FF Main Road Zakir Nagar, Okhla, New Delhi, IN 110025.

STAFF ADDRESS: To contact any member of the Bioinformatics Review staff, simply format the address as firstname@bioinformaticsreview.com.

PUBLICATION INFORMATION: Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by the Social and Educational Welfare Association (SEWA) Trust (registered under the Trust Act 1882). Copyright 2015 SEWA Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA Trust. Published in India.


Bioinformatics Review – The Road Ahead

Bioinformatics, being one of the most promising fields in terms of future prospects, lacks one thing: a news source. Though there are many journals publishing a large amount of quality research on a variety of topics such as genome analysis, algorithms, and sequence analysis, these papers rarely get any notice in the popular press.

Tariq Abdullah

Founder


One reason behind this rather disturbing trend is that there are very few people who can successfully read a research paper and make news out of it. Moreover, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics. Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in some field or another. Bioinformatics Review has been conceptualized to address all these concerns. We will provide insight into bioinformatics as an industry and as a research discipline. We will post new developments in bioinformatics and the latest research. We will also accept entries from our audience and, if possible, award them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industry. We will also provide a free job listing service for anyone who can benefit from it.

Letters and responses: info@bioinformaticsreview.com


SOFTWARE

ALFALFA: explained
Muniba Faiza
Image Credit: Google Images

“ALFALFA is a new tool for mapping sequencing reads. It is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size.”

High-throughput sequencing has revolutionized the world of bioinformatics research. Since the Human Genome Project, in which the human genome was sequenced, millions of species have been sequenced. Sequencing is a very important aspect of bioinformatics, so faster and better sequencing techniques are needed. New sequencing platforms produce biological sequence fragments faster and cheaper.

Ideal read mappers should accomplish the following:

- Maximal speed
- Minimal memory
- Maximal accuracy
- Shooting at a moving target (fast-evolving technologies differ in read length distribution and sequencing errors)

Recent advances in next-generation sequencing technologies have led to increased read lengths, higher error rates, and error models showing more and longer indels (insertions and deletions). A preprocessing step of indexing the reference genome and/or the sequencing reads must guarantee fast substring matching. The overall search space is pruned to candidate genomic regions by searching for matching segments (called seeds) between reads and the reference genome. These candidate regions are then further investigated for acceptable alignments that reach a particular score. ALFALFA is extremely fast and accurate at mapping long reads (>500 bp), while still being competitive for moderately sized reads (>100 bp). Both end-to-end (i.e., global) and local read alignment are supported, and several strategies for paired-end mapping can efficiently handle large variations in insert size. The name is an acronym for “A Long

Fragment Aligner/A Long Fragment Aligner”. It is repeated twice as a pun on the repetitive and overlapping fragments observed in genome sequences, which heavily distort read mapping and genome assembly. The most fascinating feature of ALFALFA is that it uniquely uses enhanced sparse suffix arrays to index the reference genome. An index is a data structure that allows quick location of all occurrences of patterns starting at interesting positions only. A sparse suffix array uses LCP (Longest Common Prefix) information, which reduces the solution space and emulates a suffix tree efficiently; on top of it, ALFALFA uses a chaining algorithm to speed up the dynamic programming extension of candidate regions. This data structure facilitates fast calculation of maximal and super-maximal exact matches. The speed-memory tradeoff is tuned by setting the sparseness value of the index.



ALFALFA follows a canonical seed-and-extend workflow for mapping reads onto a reference genome:

- Root system: the reference genome, indexed by an enhanced sparse suffix array. This enables quick retrieval of variable-length seeds, the super-maximal exact matches between a read and the reference genome.
- Seed: a super-maximal exact match between the reference genome and a read.
- Flower bud: clusters of seeds form candidate genomic regions; seeds are grouped into non-overlapping clusters that mark candidate genomic regions for read alignment.
- Flower: gaps between seeds are filled by dynamic programming. Handling of candidate regions is prioritized by the aggregate base-pair coverage of their seeds; the final extend phase samples seeds from candidate regions to form collinear chains that are bridged using dynamic programming.

Features of ALFALFA: ALFALFA exploits the technological evolution towards longer

reads by using maximal exact matches (MEMs) and super-maximal exact matches (SMEMs) as seeds. (Since MEMs between a read and a reference genome may overlap, super-maximal exact matches are defined as MEMs that are not contained in another MEM in the read.) These seeds are extensively filtered, and the order of alignment is then decided, allowing more accurate prioritization of candidate regions. To reduce the number of expensive dynamic programming computations, ALFALFA chains seeds together to form a gapped alignment. As a result, the extension phase (aligning the matches) is limited to filling the gaps between chains while evaluating alignment quality.

The sparseness value s of the sparse suffix array (controlled by the option -s) provides an easily tunable trade-off to balance performance and memory footprint. In theory, sparse suffix arrays take up 9/s + 1 bytes of memory per indexed base. A sparse suffix array with sparseness factor 12 thus indexes the entire human genome with a memory footprint of 5.8 GB. This shows that ALFALFA can perform read mapping at maximal speed while acquiring minimal memory space.

ALFALFA tries to balance the number and the quality of seeds using a combination of maximal and super-maximal exact matches. The intervals [i..i+l−1] and [j..j+l−1] correspond to a maximal exact match of length l between a read and a reference genome if there is a perfect match between the two subsequences of length l starting at position i in the read and at position j in the reference genome, with mismatches occurring at positions (i−1, j−1) and (i+l, j+l) just before and after the matching subsequence. A combination of neighboring seeds increases the evidence that some region in the reference genome holds potential to serve as a mapping location. ALFALFA therefore sorts seeds according to their starting position in the reference genome and bins them into non-overlapping clusters, using the locally longest seeds as anchors around which regions are built. This results in a list of candidate regions along the reference genome. To limit the number of candidate regions requiring further examination, only SMEMs and rare MEMs are used for candidate region identification. Candidate regions are then ranked by their coverage of read bases, calculated from the seeds that make up the clusters. Sequential processing of these prioritized candidate regions halts when a high number of feasible alignments has been found, a series of consecutive candidate regions fails to produce an acceptable alignment, or read coverage drops below a certain threshold.
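As a rough illustration of this speed-memory trade-off, here is a toy calculation (not code from ALFALFA) that evaluates the 9/s + 1 bytes-per-base formula quoted above for a human-sized genome at several sparseness values:

```python
# Toy illustration of the sparse suffix array memory model (9/s + 1 bytes
# per indexed base) described above; not part of ALFALFA itself.

GENOME_SIZE = 3.1e9  # approximate human genome length in bases (assumption)

def index_footprint_gb(sparseness: int, genome_size: float = GENOME_SIZE) -> float:
    """Estimated index size in gigabytes for a given sparseness value s."""
    bytes_per_base = 9 / sparseness + 1
    return bytes_per_base * genome_size / 1e9

for s in (1, 4, 12):
    print(f"s = {s:2d}: ~{index_footprint_gb(s):.1f} GB")
# s = 1 gives ~31 GB (a full suffix array); s = 12 gives ~5.4 GB,
# in line with the ~5.8 GB figure reported for the human genome.
```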

The dimensions of a dynamic programming matrix correspond to the bounds of a candidate region, but computations are often restricted to a band around the main diagonal of the matrix. The width of this band depends on the minimal alignment score required. ALFALFA further reduces the dimensions of the matrix by forming a collinear chain of a subset of the seeds that make up a candidate region. Dynamic programming can then be restricted to filling the gaps between consecutive non-overlapping seeds. The chaining algorithm starts from an anchor seed and greedily adds new seeds that do not exhibit a high skew relative to the chain. The skew is defined as the difference of the distances between two seeds on the read sequence and on the reference genome. The amount of skew allowed is decided automatically, based on the gap between the seeds and the parameters that influence the feasibility of an alignment. ALFALFA allows multiple chains per candidate region, based on the available anchor seeds. Anchor selection is based on seed length, and seeds contained in chains can no longer be used as anchors in successive chain construction.
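To make the chaining step concrete, here is a minimal greedy-chaining sketch in the spirit of the description above. The seed tuples and the fixed skew bound are simplifying assumptions, not ALFALFA's actual implementation:

```python
# Minimal greedy seed-chaining sketch following the description above.
# A seed is (read_pos, ref_pos, length); the skew between two seeds is the
# difference of their distances on the read and on the reference.

def skew(a, b):
    """Difference of inter-seed distances on read vs. reference."""
    read_gap = b[0] - (a[0] + a[2])
    ref_gap = b[1] - (a[1] + a[2])
    return abs(read_gap - ref_gap)

def chain_seeds(anchor, seeds, max_skew=15):
    """Greedily extend a chain from an anchor seed, skipping seeds whose
    skew relative to the chain's last seed is too high (simplified rule;
    ALFALFA derives the allowed skew from gap size and alignment
    parameters)."""
    chain = [anchor]
    for seed in sorted(seeds, key=lambda s: s[0]):  # by read position
        last = chain[-1]
        if seed[0] >= last[0] + last[2] and skew(last, seed) <= max_skew:
            chain.append(seed)
    return chain

anchor = (0, 100, 20)
seeds = [(25, 126, 15), (45, 200, 10), (60, 161, 12)]
print(chain_seeds(anchor, seeds))
# -> [(0, 100, 20), (25, 126, 15), (60, 161, 12)]; the (45, 200, 10) seed
#    is rejected because its skew implies an implausibly large gap.
```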

The benchmark results demonstrate that ALFALFA is extremely fast at mapping long reads, while still being competitive for moderately sized reads. Together with BWA-SW and BWA-MEM, it is one of the few mappers that scale well for read lengths up to several kilobases. Overall, Bowtie 2 has the highest sensitivity, which reaches 100%; however, Bowtie 2 is also less able to distinguish between good and bad alignments. CUSHAW3, BWA-MEM and ALFALFA exhibit the best trade-off between true positives and false positives. Mapping quality is assessed with ROC (receiver operating characteristic) curves.

Reference: Michaël Vyverman, Bernard De Baets, Veerle Fack and Peter Dawyndt. BMC Bioinformatics 16:59. doi:10.1186/s12859-015-0533-0



SYSTEMS BIOLOGY

Tumor progression prediction by variability based expression signatures
Muniba Faiza
Image Credit: Stock Photos

“Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition.”

Cancer has become a very common disease nowadays, but its main cause is still unknown. Various reasons have been proposed, and recent research suggests that improper sleeping patterns may also lead to cancer. Like the cause of cancer, its progression and prognosis are also very difficult to predict. Despite many advances in cancer treatment, early detection is still very difficult. There have been many early cancer screening techniques, but they are not realistic because of the lack of cost-effectiveness or the requirement of invasive procedures. Genomic screening techniques are a promising approach in this area.

Gene expression signatures are commonly used to create cancer prognosis and diagnosis methods. A gene expression signature is a group of genes in a cell whose combined expression pattern is uniquely characteristic of a biological phenotype or a medical condition. But only a few of them have been successfully utilized in clinics, and many of them have failed to perform. Since these signatures attempt to model the highly variable and unstable genomic behavior of cancer, they are unable to predict reliably. Instead, the degree of deviation in gene expression from the normal tissue, i.e., the hyper-variability across cancer types, can be used as a measurement of the risk of relapse or death. This gives rise to the concept of gene expression anti-profiles. Anti-profiles are used to develop cancer genomic signatures that specifically take advantage of gene expression heterogeneity. They explicitly model increased gene expression variability in cancer to define robust and reproducible gene expression signatures capable of accurately distinguishing tumor samples from healthy controls (differentially variable genes = anti-profile genes).

After many experiments with cancer anti-profiles, the results indicated that the anti-profile approach can be used as a more robust and stable indicator of tumor malignancy than traditional classification approaches. The researchers' hypothesis is that the degree of hyper-variability (with respect to normal samples) is directly proportional to tumor progression, i.e., the degree of hyper-variability as measured with respect to the normal samples would increase with tumor progression.


Corrada Bravo et al. found a way to derive a colon-cancer anti-profile for screening colon tumors by measuring deviation from normal colon samples. To create an anti-profile, they used a set of normal samples and a set of tumor samples. Probesets are ranked by the quantity σ_j,tumor / σ_j,normal (where σ_j,tumor and σ_j,normal are the standard deviations among the tumor samples and the normal samples, respectively, for probeset j) in descending order, and a certain number of probesets (typically 100) with the highest values are selected. They then calculated the normal region of each probeset, and the number of probesets for which a sample's expression lies outside the normal region was counted to get the anti-profile score of that sample.
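A minimal sketch of this anti-profile scoring scheme follows. It is a toy NumPy implementation under stated assumptions: the "normal region" is taken here as median ± 5 median absolute deviations of the normal samples, one common choice in the anti-profile literature, not necessarily the authors' exact definition.

```python
import numpy as np

def antiprofile_score(normal, samples, n_probes=100):
    """Toy anti-profile scorer. `normal` and `samples` are probesets x
    arrays expression matrices. Probesets are ranked by the ratio
    sigma_tumor / sigma_normal and the top `n_probes` hypervariable ones
    are kept; a sample's score counts how many of those probesets fall
    outside the 'normal region' (here median +/- 5 MAD of the normal
    samples -- an assumption, see lead-in)."""
    ratio = samples.std(axis=1) / normal.std(axis=1)
    top = np.argsort(ratio)[::-1][:n_probes]        # most hypervariable probesets
    med = np.median(normal[top], axis=1)
    mad = np.median(np.abs(normal[top] - med[:, None]), axis=1)
    lo, hi = med - 5 * mad, med + 5 * mad
    expr = samples[top]                              # probesets x samples
    outside = (expr < lo[:, None]) | (expr > hi[:, None])
    return outside.sum(axis=0)                       # one score per sample

# Example with synthetic data: 1,000 probesets, 20 normal and 15 tumor arrays.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 20))
tumor = rng.normal(0, 3, size=(1000, 15))            # hypervariable tumors
print(antiprofile_score(normal, tumor, n_probes=100))
```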

Fig. 1: Among probes that exhibit higher variability among cancers than among normals, the degree of hypervariability observed is related to the level of progression. (A) Distribution of the variance ratio statistic log2(σ²_tumor / σ²_normal) for the colon dataset of Gyorffy et al (GSE4183), from the anti-profile computed using another colon dataset (Skrzypczak et al; GSE20916). (B) Distribution of the variance ratio statistic for the Skrzypczak et al colon dataset, from the anti-profile computed using the Gyorffy et al colon dataset. (C) Distribution of the variance ratio statistic for adrenocortical data (Giordano et al; GSE10927) for universal anti-profile probe sets.

To test their hypothesis, they obtained two publicly available microarray datasets with normal, adenoma, and cancer colon samples. By studying these datasets, they plotted the distribution of the ratio of the variance of cancer/adenoma samples to the variance of normal samples (in log2 scale) for these probe sets on the other dataset (Fig. 1A and B). Both adenoma and cancer samples show higher variability than normals (the region to the right of x = 0), while cancer samples show higher hypervariability than adenomas. This suggests that hypervariability is a stable marker between experimental datasets, and that the specific selection of hypervariable genes across cancer types and the anti-profile method can be extended to model tumor progression. These studies showed that gene expression anti-profiles capture tumor progression.

DNA methylation is one of the primary epigenetic mechanisms of gene regulation, and it is believed to play a particularly important role in cancer. High levels of methylation in promoters are usually associated with low transcription. Cancer shows a loss of sharply defined methylation levels, which is associated with increased hypervariability in gene expression across multiple tumor types. They therefore applied the anti-profile scoring method to DNA methylation data from thyroid and colon samples, where for each tissue type normal, adenoma and cancer samples were available.


Figure 2 shows the distribution of the adenoma and carcinoma samples against normal samples on a principal component plot, showing the presence of the hypervariability pattern in methylation data: the normal samples cluster tightly, while the adenomas show some dispersion and the carcinomas show even greater dispersion. Since this behavior is present for both colon and thyroid data, it reinforces the notion that the anti-profile approach has wide application for classification in cancer.

Fig. 2: Anti-profiles applied to methylation data: first two principal components of (A) thyroid methylation data and (B) colon methylation data.

Conclusion: The anti-profile approach is well suited for cancer prognosis. It can robustly predict tumor progression and prognosis based on the variability in gene expression. The results presented above also confirm that gene expression signatures based on hyper-variability can be highly valuable.

Reference: Wikum Dinalankara and Héctor Corrada Bravo, Center for Bioinformatics and Computational Biology, Department of Computer Science and UMIACS, University of Maryland, College Park, MD, USA.


TOOLS

BioMiner & Personalized Medicine: A new perspective
Muniba Faiza
Image Credit: Google Images

“BioMiner is a web-based tool which provides various tools for statistical analysis and a deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept.”

Personalized medicines have become a very important part of the medical world nowadays. They are also known as ‘individualized medicines’. Personalized medicines allow a doctor to prescribe more specific and efficient medicines to a particular patient. This concept has created many more opportunities and aspects in the medical world. The personalized medicine concept is accomplished by obtaining high-throughput data sets from genomics, transcriptomics, proteomics and metabolomics, but more specifically it requires ‘cross-omics’, i.e., linkage between transcriptomics, proteomics and metabolomics.

Currently, there are simple web-based tools which do not allow much access to the high-throughput datasets from the omics. But a novel web-based tool, BioMiner, has been launched recently which provides access to a wide variety of high-throughput datasets. This tool was developed within the scope of an international and interdisciplinary project (SYSTHER). BioMiner provides the user various facilities and convenient tools which help them analyze high-throughput datasets, and it provides a deep insight into complex cross-omics datasets with enhanced visualization abilities. Since BioMiner was developed under the SYSTHER (System Biology Tools Development for Cell Therapy and Drug Development, www.systher.eu) project, its main focus is on cancer.

Public data repositories such as Gene Expression Omnibus (GEO) and ArrayExpress for microarray data, PRoteomics IDEntifications (PRIDE) for proteomics data, or the Sequence Read Archive (SRA) of NCBI are used to store biological high-throughput and next-generation sequencing datasets. The only limitation of these repositories is that they each store biological data of a dedicated single omics type and do not support cross-omics. A database, SystherDB, has been developed in which the stored data is well presented and easily accessible, and whose data is mined and analyzed by the BioMiner tools. A public instance of BioMiner is freely available online. It currently contains 18 different studies, with almost 4,000 microarrays and more than 187 million measured values of genes, proteins, or metabolites. Since BioMiner was developed in the SYSTHER project, most of the studies are focused on glioblastoma multiforme (GBM).

Fig. 1: Workflow of BioMiner.

FEATURES:

1. BioMiner uses the Google Web Toolkit (GWT) for the graphical user interface (GUI).
2. A separate, manually curated MySQL database is created and used to store the experimental data from genomics, proteomics and metabolomics.
3. Data import has to be performed by a dedicated specialist to ensure data consistency.
4. Response time is within just a few seconds; for this purpose special indexing methods are implemented.
5. Metabolite data are annotated using three different identifier systems: the Golm Metabolome Database, the Human Metabolome Database (HMDB), and the Kyoto Encyclopedia of Genes and Genomes (KEGG).
6. Predefined cross-omics relationships (e.g., a mapping of metabolites onto genes or vice versa) are provided among the biological datasets.
7. Pathway and functional information is drawn from Reactome, KEGG, and WikiPathways.
8. Gene Ontology is also supported.
9. Correlations (statistical analysis of any two variables) are based on Pearson correlation coefficients.
10. Correlations are calculated for high-variance genes (by default the top 500 genes).
11. BioMiner complies with public data management standards such as Minimum Information About a Microarray Experiment (MIAME), Minimum Information About a Proteomics Experiment (MIAPE), and Minimum Information About a Metabolomics Experiment (MIAMET).
12. The ENSEMBL database is used for cross-mapping between genes and proteins.
13. For cross-mapping between genes and metabolites, the combined information of ConsensusPathDB and HMDB is used.

Fig. 2: Data mining with BioMiner. Screenshots of different results from data mining with BioMiner, including: (a) study overview, (b) detection of differentially expressed genes, (c) correlation of gene expression and survival time, (d) identification of significantly enriched pathways, (e) visual pathway inspection based on predefined layouts, and (f) a biomolecule comparison of gene and protein expression. Results are typically presented in synchronized, parallel views composed of a table and a plot.

Fig. 3: Pathway visualization. Interactive pathway visualization of the cell cycle pathway from the WikiPathways repository.
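As an illustration of features 9 and 10 (a toy sketch of the described behavior, not BioMiner's actual code), here is how pairwise Pearson correlation can be restricted to the top high-variance genes:

```python
import numpy as np

def high_variance_correlations(expr, top_n=500):
    """Pearson correlation matrix restricted to the `top_n` genes with the
    highest expression variance; `expr` is a genes x samples matrix.
    A toy illustration of BioMiner's described behavior, not its code."""
    order = np.argsort(expr.var(axis=1))[::-1][:top_n]
    return order, np.corrcoef(expr[order])

rng = np.random.default_rng(1)
expr = rng.normal(size=(2000, 40))       # 2,000 genes, 40 samples
genes, corr = high_variance_correlations(expr, top_n=500)
print(corr.shape)                        # (500, 500)
```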

BioMiner is a web-based tool which provides various facilities for statistical analysis and a deep insight into transcriptomics, proteomics and metabolomics with a cross-omics concept. Results are presented in two parallel views composed of a table and a plot. Both views are interactive, and user-defined selections can be synchronized. Pathway visualization is achieved by extending the PathVisio library. BioMiner also provides clinicians and physicians a platform integrating high-throughput data together with clinical parameters, thereby leading to better personalized medicines.

Reference: Chris Bauer, Karol Stec, Alexander Glintschert, Kristina Gruden, Christian Schichor, Michal Or-Guil, Joachim Selbig and Johannes Schuchhardt.



META ANALYSIS

Meta-analysis of biological literature: Explained
Manish Kumar Mishra
Image Credit: Google Images

“Meta-analysis is an analysis of already published data by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.”

It's a fine Monday morning, and the new intern finds his way to the laboratory of biological data mining procedures. His brief interview with the concerned scientist has allowed him only a very limited understanding of the subject. Upon his arrival he is greeted with a humongous corpus of mixed articles, say some 4,000, and he is required to assemble specific information out of the data set by diligently scrutinizing the components of each article. Well, the situation could be frightening to a purely wet-lab biologist, but anyone who has had exposure to the real power of file handling with any programming language will know how to let a simple few lines of code do his bidding.

So what is meta-analysis about? The new cool word in the biological realm, "meta-analysis", can be better understood through the meaning of the first half of the term: META, meaning data about data, thence making meta-analysis an analysis of already published data by simply rearranging it, sorting it, and trying to find hidden patterns in the published literature.

By the most rudimentary means, meta-analysis can be achieved by reading a corpus of research and review articles concerning a particular topic, which may be as wide as a whole eukaryotic genome or may be narrowed down to phyla, groups or species, or to a specific disease or even a particular gene. Where on one hand we try to narrow down to a disease or gene, one must also realize that biological systems are the most complex known to date, and present-day computer simulations fail to rival their complexity with equal efficiency; so any analysis narrowed down to a gene must also consider that the gene may very well be found in multiple organisms, and thus may return a considerably high number of results irrelevant to the study. A rigorous manual inspection of the program-sorted data is required to sort out such entries.

Since meta-analysis relies heavily on statistical studies of data, researchers tend to rely on programming languages such as Stata and R to write their specific codes for analyses.


R, unlike Stata, is free, produces publication-quality output, and provides a plethora of packages, a few of which provide programs like PDF miner and PubMed miner used for accessing the PubMed database. These packages contain code to access the database and extract information from it through a command-based interface, handling huge data sets at once and cutting down the manual effort and time taken to achieve the task.

All praise aside, the method has its own fair share of drawbacks and issues. The current query system of NCBI and its sister organizations fails to acknowledge synonymous terms and treats them as individual entities, not linked to one another except in association with the length of the query items supplied. A robust query system is needed to enhance the results and make the whole concept more efficient. The need of the hour is to engage more resources into developing well-structured and somewhat intelligent query systems which can truly acknowledge gene names and abbreviations, scientific and common English names of organisms, and the variations in presenting the names of the techniques involved. A minimal example of this kind of programmatic literature access is sketched below.
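The sketch below uses Python with Biopython's Entrez module rather than the R packages named above; the query term and the term-counting "analysis" are illustrative assumptions only:

```python
# Minimal PubMed mining sketch using Biopython's Entrez utilities.
from collections import Counter

from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact address

# Fetch PubMed IDs matching a query.
handle = Entrez.esearch(db="pubmed", term="gene expression signature cancer",
                        retmax=50)
ids = Entrez.read(handle)["IdList"]
handle.close()

# Download the corresponding abstracts as plain text.
handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                       rettype="abstract", retmode="text")
text = handle.read()
handle.close()

# Crude "hidden pattern" mining: count how often terms of interest occur.
terms = ["prognosis", "methylation", "microarray"]
counts = Counter({t: text.lower().count(t) for t in terms})
print(counts.most_common())
```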


PROTEOGENOMICS

How to check new peptides accuracy in Proteogenomics
Muniba Faiza
Image Credit: Google Images

“During the discovery of novel genes, there are huge chances of getting false positive results, i.e., we can also get peptides which are not actually real, but the algorithm may report them.”

Proteogenomics is an emerging area at the interface of proteomics and genomics. This intersection employs genomic and transcriptomic information to identify novel peptides using mass spectrometry based techniques. The proteomic data can then be used to identify the fingerprints of genic regions in a particular genome, which may result in the modification of gene models and can also improve the gene annotations. So, we can say that proteogenomics has been well accepted as a tool to discover novel proteins and genes. But during the discovery of novel genes, there are huge chances of getting false positive results, i.e., we can also get peptides which are not actually real, but the algorithm may report them. Therefore, to avoid, or more accurately to minimize, the chances of false positives, a false discovery rate (FDR) is used. The FDR is estimated as the ratio of the number of decoy hits to the number of target hits:

FDR = (number of decoy hits) / (number of target hits)
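A minimal sketch of this target-decoy FDR estimate (a toy implementation; conventions vary in the field, e.g. some pipelines use 2 x decoys / total instead):

```python
def target_decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold from a list of peptide-spectrum
    matches given as (score, is_decoy) pairs: FDR = decoys / targets
    among matches scoring at or above the threshold."""
    accepted = [is_decoy for score, is_decoy in psms if score >= threshold]
    decoys = sum(accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else float("inf")

psms = [(91, False), (88, False), (84, True), (80, False), (77, True), (70, False)]
print(f"FDR at score 75: {target_decoy_fdr(psms, 75):.2f}")  # 2 decoys / 3 targets
```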

In most conventional proteogenomic studies, a global false discovery rate (i.e., the identifications of annotated peptides and novel peptides are subjected to FDR estimation in combination) is used to filter out false positives when identifying credible novel peptides. However, it has been found that the actual level of false positives among novel peptides is often out of control and behaves differently for different genomes. It has been observed previously that, under a fixed FDR, the inflated database generated by, e.g., six-open-reading-frame (6-ORF) translation of a whole genome significantly decreases the sensitivity of peptide identification. Recently, Krug implied that the identification accuracy of novel peptides is greatly affected by the completeness of the genome annotation, i.e., the more completely the genome is annotated, the higher the chances of accurate identification of novel peptides.


In this recent paper, they followed the same framework as in Fu's work to quantitatively investigate the subgroup FDRs of the annotated and novel peptides identified by a 6-ORF translation search. They revealed that the genome annotation completeness ratio is the dominant factor influencing the identification accuracy of novel peptides identified by a 6-ORF translation search when a global FDR is used for quality assessment. However, with stringent FDR control (e.g., 1%), many low-scoring but true peptide identifications may be excluded along with the false positives. To increase the sensitivity and specificity of novel gene discovery, one should reduce the size of the searched database as much as possible. For example, when transcriptome information (especially from strand-specific cDNA-seq data) is available, it is apparently more favorable to search against the transcriptome as well than to search against the genome alone. If the transcriptome information is unavailable, it is also helpful to reduce the 6-ORF translation database by removing sequences that are predicted to be hardly possible to be real proteins.

Reference: A note on the false discovery rate of novel peptides in proteogenomics. Kun Zhang, Yan Fu, Wen-Feng Zeng, Kun He, Hao Chi, Chao Liu, Yan-Chang Li, Yuan Gao, Ping Xu and Si-Min He.


GENOMICS

The basic concepts of genome assembly
Muniba Faiza
Image Credit: Google Images

“Genome, as we all know, is a complete set of DNA in an organism including all of its genes. It consists of all the heritable information and also some regions which are not even expressed.”

Genome, as we all know, is the complete set of DNA in an organism, including all of its genes. It consists of all the heritable information and also some regions which are not even expressed. Although almost 98% of the human genome has been sequenced by the Human Genome Project, only 1 to 2% has been understood. The human genome still has to be explored further, whether in terms of genes or proteins. Many sequencing strategies and algorithms have been proposed for genome assembly. Here I want to discuss the basic strategy involved in genome assembly, which sounds quite difficult but is not really complex if understood well.

The basic strategy behind discovering the new information of a genome is explained in the following steps:

1. First of all, the whole genome of an organism is sequenced, which results in thousands or hundreds of different unknown fragments starting from anywhere and ending anywhere.

2. Now, since we don't know what the sequence is and which fragment should be kept near to which one, the concept of 'contigs' is employed. Contigs are the repeated overlapping reads which are formed when the broken fragments come close to each other. It means that many consecutive fragments are joined to form a contig, and many such contigs are formed during the joining process. (A toy illustration of this overlap-joining idea appears at the end of this article.)

3. Now, the question that arises is: how do we know that a fragment which may be a repeat has been kept in its right place, given that a genome may have many repeated regions? To overcome this, paired ends are used. Paired ends are the two ends of the same sequence fragment which are linked together, so that if one end of the fragment is aligned in, let's say, contig1, then the other end, which is part of the former, will also be aligned in the same contig, as it is the consecutive part of the sequence. There are various software packages with the help of which we can define different overlapping regions of the sequence, only by matching the lengths of the paired ends.

4. After that, all the contigs combine to form a scaffold, sometimes called a metacontig or supercontig, which is then further processed and the genome is assembled.

All of this is done by different assembly algorithms; among the most used is Velvet, and the latest is SPAdes. In my experience, the more efficient algorithms are those which provide us a large amount of information in one go. Just imagine that we got a thread of sequence with unknown base pairs: what would we do with that thread, and how would we identify and extract the useful information from it?

Thank you for reading. Don't forget to share this article if you liked it.
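As promised in step 2, here is a toy sketch of the overlap-joining idea behind contigs: a naive greedy merge, nothing like a production assembler such as Velvet or SPAdes, which must handle sequencing errors, repeats and paired-end constraints.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`
    (at least `min_len`), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap,
    building contigs out of consecutive overlapping fragments."""
    reads = list(reads)
    while len(reads) > 1:
        n, a, b = max(((overlap(a, b, min_len), a, b)
                       for a in reads for b in reads if a is not b),
                      key=lambda t: t[0])
        if n == 0:          # no remaining overlaps: reads stay separate contigs
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])
    return reads

print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
# -> ['ATTAGACCTGCCGGAA']: the three overlapping fragments join into one contig.
```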



CADD

Computer and Drugs: What you need to know
Altaf Abdul Kalam
Image Credit: Google Images

“Computers are being used to design drugs and it is being done on a humongous level, by almost every multinational pharma company.”

Would you chance your life to a lowly piece of hardware called the computer? Would you let it fabricate and determine drugs for life-threatening diseases like hepatitis, cancers and even AIDS? Well, actually, your answer (or your opinion) doesn't seem to matter, because the world has moved over to the other side. Computers are being used to design drugs, and it is being done on a humongous level by almost every multinational pharma company, the names of which you will undoubtedly find at the back of your prescription medicines at home. So what's with all this computer stuff? Have we parted with our perspicacity, our intuition, our ready willingness to tackle any challenge head-on? We have always found solutions to mankind's biggest problems all by ourselves. As Matthew McConaughey's character in Interstellar says, "..or perhaps we've forgotten we are still pioneers?"

Well, philosophical mumbo-jumbo aside, it's not as simple as it sounds. Of course, most of you reading this already have some background in this topic and have already understood what I am talking about. But for those of you who haven't the slightest clue, don't worry: this write-up is for you. Throughout this series of articles on this particular issue, I am going to try and break it down to the basics. Let's say that by the end you would see a car not for what it is – with all its complexity and slickness – but for what made it the way it is – the nuts and bolts and rubber and.. whatever, you get the point! So where do we start? Money! Yes, the thing that runs the world. Contrary to what all the losers who never made a dime say, money

simply is everything. Even Bill Gates was forced to acknowledge the fact and declare, "Money isn't everything in this world, but you gotta have a lot of it before you say such rubbish." So that settles it then. Now let's come back. The basic modus operandi of designing a drug is that you first find a suitable target which you believe will be key to challenging the disease. This mostly is a protein/enzyme that can metabolise a particular drug, or in some cases even the disease-causing genes from the pathogen itself. Finding this target is not easy, but it is not that hard either. We have documentation, intensive studies and databases dedicated to listing, characterizing and studying the drug-metabolizing genes and proteins in the body. Different classes of metabolizers act on different types of chemicals (or drugs, if you like). A class of metabolizers called the CYP



enzymes metabolize over sixty percent of the known drugs and medicines that humans consume. This includes drugs (the real ones – LSD, cocaine, heroin.. get it?) and even poisons and sedatives. The metabolizers of course don't know which is which: if it suits them they metabolize it, else it passes out of your system. Now, under the assumption that we have a drug target, the next step is finding the suitable drug candidate itself. This step is what you call finding a needle in a haystack. There are literally millions of drugs out there, and if that is not enough, you can go design your own and get it synthesized. In a drug target (we will call it simply the 'protein' henceforth) there are multiple points of action where multiple drugs can act. So, for example, in a protein made of 200 amino acids, we might find 50 actionable amino acids. For these fifty amino acids we may find thousands and thousands of drug candidates, all capable of bringing about some change or other in the protein. So how do we find the One? If you had asked that question about fifteen years back, the answer would have been to slog it out: match every drug candidate you have against the protein and check the effects in vivo. Now, countless

factors come into play when a drug and a protein interact – global minima, energy minimization, binding affinity, hydrogen bonding intensity and what not. We shall learn about them in more detail in upcoming articles. So, to put it simply, scientists spent their whole sorry lives pitting different drug candidates against the same protein over and over again until they found something worthwhile to hang on to. Even if all the above-mentioned factors blended in wonderfully, they might sadly discover at the end that the drug actually caused more harm than good. So the candidate gets discarded and they start the process all over again! Sometimes you got lucky and found the right drug after performing a few combinations, but mostly it took years to even zero in on a drug that could be moved further into the drug discovery pipeline, which in itself is another torturous process! So, coming back to the money factor: you don't need to be a Harvard Business School graduate to know that this tiresome task costs money, a lot of money. Money in the form of manpower, reagents, biological matter like tissues, test animals and plants, instrumentation, electricity and what not. Another thing it costs is something

none of us cares about much – time. Picture designing a drug for some novel disease which is killing thousands of people each year, and picture having to do this same procedure and coming out with a drug after 10-15 years. The cost of such a life-saving drug will also be high, because the company or lab that made it would want to recover all the time and money spent on it in the first place. Not exactly feasible and effective, I would say. So here comes computer-aided drug design, which – brace yourself – can shave years off the drug discovery pipeline. It can get you into the clinical trials phase within, say, 2-3 years, as opposed to the earlier average of 7-8 years. Less money spent, less time spent, faster availability of a potential cure and, who knows, even less expensive medicines. So how does it work? How does the entry of a man-made machine change everything for the better so drastically? What does a computer do that humans could not? Can you trust results obtained in silico over something that happens in vivo? Is a computer finally so evolved that it can simulate life forms inside its motherboard and processors? We will hopefully see those questions answered in the next few posts!



SEQUENCE ANALYSIS

Basic Concept of Multiple Sequence Alignment
Muniba Faiza
Image Credit: Google Images

“The major goal of MSA pairwise alignment is to identify the alignment that maximizes the protein sequence similarity.”

Multiple Sequence Alignment (MSA) is a very basic step in the phylogenetic analysis of organisms. In MSA, all the sequences under study are aligned together pairwise on the basis of the similar regions within them. The major goal of MSA pairwise alignment is to identify the alignment that maximizes the protein sequence similarity. This is done by seeking an alignment that maximizes the sum of similarities for all pairs of sequences, known as the sum-of-pairs (SP) score. The SP score is the basis of many alignment algorithms.
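A minimal sketch of the SP score for a set of already-aligned sequences follows. The scoring values (+1 match, −1 mismatch, −2 gap against a residue) are toy assumptions; real programs use substitution matrices and affine gap penalties:

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: sum the pairwise column scores over all
    sequence pairs in the alignment (toy parameters, see lead-in)."""
    total = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue                      # gap-gap columns are ignored
            total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total

print(sp_score(["AC-GT", "ACGGT", "A--GT"]))  # -> 2
```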

The most widely used approach for constructing an MSA is progressive alignment, where a set of n proteins is aligned by performing n−1 pairwise alignments of pairs of proteins, or pairs of intermediate alignments, guided by a phylogeny tree connecting the sequences. A methodology that has been successfully used as an improvement on progressive alignment based on the SP score is consistency-based scoring, where an alignment is kept consistent with previously obtained alignments. For example, given three sequences A, B and C, the pairwise alignments A-B and B-C imply an alignment of A and C, which may be different from the directly computed A-C alignment. Now the question arises: how much can we rely on the obtained MSA, and how is an MSA validated? The validation of an MSA program typically uses a benchmark data set of reference alignments. An MSA produced by the program is compared with the corresponding

reference alignment, which gives an accuracy score. Before 2004, the standard benchmark was BAliBASE (Benchmark Alignment dataBASE), a database of manually refined MSAs consisting of high-quality documented alignments, designed to identify the strong and weak points of the numerous alignment programs now available. Recently, several new benchmarks have been made available, namely OXBENCH, PREFAB, SABmark, IRMBASE and a new extended version of BAliBASE. Another parameter considered basic in most alignment programs is the fM score.



The fM score is used to assess the specificity of an alignment tool: it identifies the proportion of predicted matched residues that also appear in the reference alignment. It is often found that some regions of the sequences are alignable and some are not; however, there are usually also intermediate cases, where sequence and structure have diverged to a point at which homology is no longer reliably detectable. In such cases the fM score provides, at best, a noisy assessment of alignment tool specificity, and it becomes increasingly less reliable as one considers sequences of increasing structural divergence. Even after considering the reference alignments, the accuracy of the results remains questionable, as the reference alignments themselves are of varying quality.

specificity, that becomes increasingly less reliable as one considers sequences of increasing structural divergence. However, after considering the reference alignments, the accuracy of results is still questionable as the reference alignments generated are of varying quality.

REFERENCES:

Multiple sequence alignment. Robert C. Edgar and Serafim Batzoglou.

BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Julie D. Thompson, Frédéric Plewniak and Olivier Poch.



SYSTEMS BIOLOGY

Basics of Mathematical Modelling - Part 1
Fozail Ahmad
Image Credit: Google Images

“Mathematical modeling covers a broad domain of cellular processes such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has made a bridge between experimental and expected outcome.”

Biochemical processes are simply complex, and their apparent features do not easily allow us to investigate what exactly the system means. Moreover, most biochemical processes obey nonlinear reaction kinetics: the amount of reactant (protein/RNA/DNA) is not directly proportional to its product. This further increases the complexity of the molecular mechanism and creates biological noise, such as randomization (stochasticity) of biomolecules, perturbation in cell signaling, difficulty in quantifying cell products, and even unexpected responses of the entire system. Here comes the development and utilization of a mathematical model, which takes multiple factors/parameters into consideration and provides the researcher with a visual understanding of how a complex biological system functions and responds to external signals (hormone/drug/cation/anion), internal signals (protein/enzyme/cation/anion), or adverse environmental conditions such as a deficiency of Fe2+ ions during the formation of Vitamin D. Basically, mathematical modeling covers a broad domain of cellular processes, such as metabolic regulation, gene-gene interaction, and gene-protein interaction. This has made a bridge between experimental and expected outcomes; in case of discrepancies between the two, the parameters taken into consideration need to be refined. The general approach of modeling gives us the following benefits:

1. Discrepancies between a mathematical model and actual experimental results point to components that are still missing from the hypothetically developed model, and therefore one can develop a more comprehensive scenario of the system's behavior. On the other hand, a well-developed model assists in designing and clarifying additional issues in an ongoing experiment.

2. With the help of a mathematical model, a researcher can modify an experimental parameter (e.g., by introducing a modified protein associated with Mg2+ uptake into the cell) and run computer simulations.

3. Most importantly, mathematical models are not limited by environmental/experimental constraints. They may be quickly changed for multiple conditions/parameters, and the most suitable simulation can be assessed for developing a reliable experimental design.

4. A mathematical model may help to investigate a sub-system that regulates a special biochemical process (though all biological reactions cannot be treated the same), which necessarily provides substantial information about the large system's behaviour.

By doing multiple simulations and changing parameter values, we are able to represent real biochemical/molecular phenomena which otherwise seem difficult to treat. (To be continued...)

Fig. 1: Schematic representation of the biological modelling process, knowledge generation and experimental design.
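To make the preceding discussion concrete, here is a minimal sketch of the kind of kinetic model being described: a hypothetical one-protein system with nonlinear (Michaelis-Menten-style) production, integrated with SciPy. The model and all parameter values are illustrative assumptions, not from the article:

```python
import numpy as np
from scipy.integrate import odeint

def model(y, t, v_max, K, k_deg):
    """dP/dt for a toy protein P: saturating (nonlinear) production
    minus first-order degradation. Illustrative parameters only."""
    P = y[0]
    dPdt = v_max * P / (K + P) - k_deg * P
    return [dPdt]

t = np.linspace(0, 50, 200)
traj = odeint(model, y0=[0.1], t=t, args=(2.0, 5.0, 0.2))
print(f"steady state ~ {traj[-1, 0]:.2f}")  # compare with experiment, then refine parameters
```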



TOOLS

MUSCLE: Tool for Multiple Sequence Alignment
Muniba Faiza
Image Credit: Google Images

“MUSCLE is one of the software packages known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB).”

In my last article I discussed Multiple Sequence Alignment and its creation. Now, in this article, I am going to explain the workflow of one of the MSA tools, i.e., MUSCLE. MUSCLE is a software package used to create an MSA of the sequences of interest. It is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB), and it is comparable to T-COFFEE and MAFFT (these tools will be explained in upcoming articles).

MUSCLE algorithm: Two distance measures are used by MUSCLE for a pair of sequences: the kmer distance (for an unaligned pair) and the Kimura distance (for an aligned pair). A kmer is a contiguous subsequence of length k, also known as a word or k-tuple; i.e., it decides how many letters of the sequences will be searched and aligned together. The Kimura distance is a measure based on the fact that multiple substitutions can occur at a single site. MUSCLE proceeds roughly as follows (see the sketch after this list):

1. It computes the kmer distances for all pairs of input sequences.
2. The distance matrix is then clustered using the UPGMA method (a method of phylogeny tree construction based on the assumption that mutations occur at a constant rate), which gives TREE1; this is followed by a progressive alignment, forming MSA1.
3. For the aligned pairs in MSA1, MUSCLE computes the pairwise percent identities, i.e., what percentage of the sequences are aligned/matched, and converts them to distances by applying the Kimura correction, constructing a Kimura distance matrix.
4. It again applies the UPGMA method to this matrix and forms TREE2, again followed by a progressive alignment, which forms MSA2.
5. From the last obtained tree, it deletes an edge, which results in the formation of two subtrees.
6. It computes the subtree profiles (aligns the subtrees) and re-aligns the two profiles, giving a new MSA for which the SP score is calculated (explained in the previous article, "Basic Concept of MSA").
7. Only if the SP score is better does it save the last obtained MSA as MSA3; otherwise it discards that MSA.
8. Steps 5-7 are repeated, finally giving a refined MSA.
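A toy sketch of the kmer distance idea follows (fraction of shared k-mers). MUSCLE's actual definition uses compressed alphabets and a specific normalization, so treat this as illustrative only:

```python
def kmers(seq, k=3):
    """Overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_distance(a, b, k=3):
    """1 minus the fraction of shared k-mers (toy normalization;
    MUSCLE's real kmer distance differs in detail)."""
    ka, kb = set(kmers(a, k)), set(kmers(b, k))
    shared = len(ka & kb)
    return 1.0 - shared / min(len(ka), len(kb))

print(kmer_distance("MKVLITGAGG", "MKVLVTGAGG"))  # small distance: similar sequences
```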



Fig. 1: The workflow of MUSCLE.

This is how MUSCLE works. MUSCLE alignment is also used in the MEGA6 tool, which is used for phylogeny tree construction. Every software tool has its own benefits depending upon the needs under consideration. There are various other tools available for MSA, such as T-COFFEE and MAFFT, which also offer high accuracy and speed; they will be explained in upcoming articles.

Reference: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Robert C. Edgar.



SYSTEMS BIOLOGY

Introduction to mathematical modelling - Part 2
Fozail Ahmad
Image Credit: Google Images

“In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we simply investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structural biochemical pathway.”

Gathering of dynamic/kinetic information

In the previous section you might have noticed that modelling a biochemical process requires a calibrated set of fine parameters which fit into and across the set of chemical/reactant species (gene/protein/molecule) involved in the process. The question arises: where do we collect the data from? And what are the standard criteria for determining parameters? Basically, for a researcher it is necessary to know the source of the data first, and then how to manipulate it to get relevant information for modelling. The source of the data can be chosen depending upon the requirements of the experimental design. For modelling, data can be taken in the form of gene-gene interaction, gene expression (microarray), and gene-protein interaction. Interaction and expression data do not directly reveal the dynamic/kinetic values of the system and therefore need to be manipulated for further use. In order to obtain the parameter values for analysing the kinetic behaviour of biochemical processes, we investigate expression data (gene/protein) at different transcriptional and translational levels, which enables us to frame out the comprehensive structural biochemical pathway. This can be done by the following methods:

1. Genome (complete set of genes) analysis at the transcription level through DNA sequencing and genotyping.
2. Transcriptome (all mRNA) analysis at the translation level using microarrays.
3. Proteome (entire protein complement) analysis at the cellular level (reactions between proteins and other molecules in the cell) using mass spectrometry and 2D gel electrophoresis.
4. Metabolome (total metabolites and their intermediates) analysis at the cell level, covering interactions between metabolites and regulators, using 13C labelling and NMR techniques.
5. Interactome (all interacting molecules) analysis by yeast-2-hybrid screens and TAP techniques. (Legend: TAP - tandem affinity purification; NMR - nuclear magnetic resonance.)



Fig. 2: Omics generate the data for developing the structural pathway, and the parameter values are set from the same. A mathematical model, in the form of differential equations derived from the reaction channels, is then executed/solved using a suitable algorithm. The resulting simulation shows the dynamic behaviour of the system, which can be tuned by changing parameter values to approach the experimental data.

The above-mentioned techniques are collectively referred to as omics. They provide us with structural and dynamic data that are used to generate mathematical formulas representing observable reactions, followed by the development of a mathematical model and a comprehensive pathway of the biological system. These tentative models allow us (as mentioned in Part 1) to observe the effect of a stimulus on a specific signalling pathway, perturbations in cellular activities, gene expression levels, etc.

Omics approaches are characterized by a number of features. First, they allow the researcher to analyse different molecular levels, such as the gene, protein and metabolite levels. These different molecular levels sometimes show asynchronous behaviour; that is, some metabolite such as glucose may be abundant in a cell while the corresponding enzymes that catalyse its reactions are scarce, or vice versa. Asynchronous behaviour is an indication of a complex regulatory mechanism. Therefore, it is crucially important to evaluate the degree of synchronization across all cellular levels. Second, omics are highly parallelized. This means all genes/mRNAs (the read-outs in a sample) can be studied simultaneously, rather than having to perform separate experiments focusing on individual genes. This parallelization also allows the researcher to compare the expression results for the same gene and to relate them to interactions between the resultant proteins. Third, they are very standardized and therefore need highly automated computing, providing the scientist with a large number of samples at a time. In the process, after collecting a huge amount of data, the most relevant information is picked out and then processed further for the final analyses. The techniques within an omics field are very important in the sense that they generate the numerical data based upon which we are able to develop a structural pathway mimicking the real picture of the biological system, and then to represent it in the form of a mathematical model. (→ Continue to part 3)
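As a toy illustration of the refine-parameters-against-data loop described above (a hypothetical first-order decay model fitted to synthetic "measurements"; nothing here is from the article):

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, p0, k):
    """Hypothetical first-order decay model P(t) = p0 * exp(-k t)."""
    return p0 * np.exp(-k * t)

# Synthetic 'experimental' time course with noise.
t = np.linspace(0, 10, 25)
rng = np.random.default_rng(2)
measured = decay(t, 4.0, 0.35) + rng.normal(0, 0.05, t.size)

# Refine parameter estimates so the model reproduces the data.
(p0_fit, k_fit), _ = curve_fit(decay, t, measured, p0=(1.0, 0.1))
print(f"fitted p0 ~ {p0_fit:.2f}, k ~ {k_fit:.2f}")  # close to 4.0 and 0.35
```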





BIOINFORMATICS NEWS

DNA test for paternity: This is how you can fail!
Tariq Abdullah
Image Credit: Google Images

“DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs. That 0.1% is unique to each person, thus making it possible to identify every individual.”

DNA testing, also called DNA fingerprinting, is done to verify paternity, to establish criminal involvement, and in forensic science, archaeology and other scientific fields. It is a well-established fact that the DNA fingerprinting test is considered foolproof; it has its merits in court cases too, where it is treated as credible evidence of criminal involvement and paternity. The case of N.D. Tiwari also got wide media attention in recent times. BUT, chances are, you can fail a DNA test even with your real father or mother!

To understand how this is possible, let us look at how DNA fingerprinting is done. The technique of DNA fingerprinting (genetic fingerprinting), or simply the DNA test, was discovered by Dr. Alec Jeffreys in 1984. DNA contains the genetic information of an individual. The whole set of genes in an organism is known as the genome. 99.9% of the genome is exactly the same in every human individual, but 0.1% differs; that 0.1% is unique to each person, thus making it possible to identify every individual. To identify a person by differences in DNA sequence, the sequences are simply compared to each other. To speed up this comparison, rather than comparing each nucleotide, a molecular biologist compares regions of high variation in the DNA sequence, called minisatellites. The location and sequence of minisatellites in the genome vary in every individual, and the chance of two individuals having the same minisatellites is very low (1 in a billion). Hence they can be treated as unique to every individual, just like a fingerprint.

To perform a DNA test, the DNA is first broken into smaller pieces by a restriction endonuclease called EcoRI, which cuts the sequence at the distinct locations where the sequence is GAATTC (or its complement). The locations of this recognition sequence vary in every individual. The fragments are then sorted according to their molecular weight (size) by a technique called gel electrophoresis and compared to each other. If the fragments generated by the restriction enzyme are of the same sizes, it is more likely that both sequences originated from the same individual.
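A toy sketch of this digest-and-compare idea follows (simple string splitting at the EcoRI motif GAATTC; real RFLP analysis compares physical fragment mobilities on a gel, not strings):

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Split a sequence at each occurrence of the recognition site,
    cutting after `cut_offset` bases (EcoRI cuts G^AATTC)."""
    fragments, start = [], 0
    pos = dna.find(site)
    while pos != -1:
        fragments.append(dna[start:pos + cut_offset])
        start = pos + cut_offset
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return fragments

sample_a = "ATTCGGAATTCAGGCTAGAATTCCA"
sample_b = "ATTCGGAATTCAGGCTAGAATTCCA"
# The same fragment-length pattern suggests the same individual.
print(sorted(map(len, digest(sample_a))) == sorted(map(len, digest(sample_b))))  # True
```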



So how can DNA fingerprinting fail? For the DNA test to fail, we would have to have two different sets of DNA (genomes) in our bodies. This is possible in the following ways (a concise list for quick reference):

1. Since the human body is a complex and dynamic system, the environmental conditions in different parts of the body may lead to changes in DNA; this is a comparatively new idea studied in epigenetics. Though changes do occur this way, they are not likely to change the entire DNA or the locations of minisatellites.

2. Transposable elements may also cause the location of some sequences to change. However, the occurrence of transposable elements is not so widespread as to change the locations of all minisatellites, so this idea does not seem satisfactory either.

3. The occurrence of more than one kind of cell, in terms of genome, in a single human, i.e., a human chimera, was recently seen in a US man who failed a paternity test with his real child.

So what is a human chimera? To be simple and precise, a human chimera is the occurrence of cells with completely different sets of genes in a single individual. It is a very rare condition and may go unnoticed. Sometimes, when one of a pair of twins dies during early pregnancy, the remnant cells may be taken up and absorbed by the surviving embryo. The surviving embryo will then carry two kinds of genome in different parts of the body, depending upon the process of differentiation. Thus, if the cells that form sperm in your body have a different genome than the rest of the cells in your body, you may fail a DNA test! This is a rare condition, and chances are low that you would get away with a crime. If you liked this article or found it worth reading, please do not forget to share. Who knows, there might be cases of human chimera around you. :)




Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and never miss out on any of your favorite topics.

Log on to www.bioinformaticsreview.com


