FEBRUARY 2016 VOL 2 ISSUE 2
“Science never solves a problem without creating ten more.” - George Bernard Shaw
What is Numerical Taxonomy? How is it useful?
Molecular Evolutionary Genetic Analysis
Contents
February 2016
Topics
Editorial
Phylogenetics: Molecular Evolutionary Genetic Analysis
CADD: Disease Modeling: Better way to understand Pathogenic Pathways
Bioinformatics Programming: What is Numerical Taxonomy? How is it useful?
EDITOR-IN-CHIEF: DR. PRASHANT PANT
FOUNDER: TARIQ ABDULLAH
EDITORIAL
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send E-mail requests to info@bioinformaticsreview.com. Please include contact detail in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issue in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91. 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any of the Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust. Published in India
EDITORIAL
Editorial
Team BiR strives to give a daily dose of Bioinformatics news to motivate young researchers to think out of the box; this is what research is all about. It is an initiative in which we discuss current affairs in this newly emerging discipline to create awareness among people and the new generation of scientists and researchers. Bioinformatics is not just about simulating interaction-based studies, decoding biological systems into mathematical models, or converting sequence or marker data into taxonomic interrelationships. There are several other prospects and issues to be dealt with using Bioinformatics, such as cloud computing, biological data management, gene prediction and many others. The discipline is still at a nascent stage: the potential of several of its aspects is not yet fully understood and is limited only by our imagination. Limited technological advancement makes it even more challenging for a subject that is itself a union of biology and informatics to become popular. Another reason could be lack of awareness. To address this concern, BiR plans to keep the scientific community updated with a small dose of Bioinformatics, keeping beginners excited and advanced users up to date.

With a zeal to keep taking BiR to the next level, a new section on the legal and IPR aspects of Bioinformatics will be introduced next month. In this section, we plan to introduce you to IPR and other legal topics such as copyright and patents, which have so far been untouched by any Bioinformatics magazine. We hope that our readers will find it exciting and informative, and that with this step BiR will expand its readership and subscriber base.
Muniba Faiza
Founding Editor
Letters and responses: info@bioinformaticsreview.com
PHYLOGENETICS
Molecular Evolutionary Genetic Analysis
Image Credit: Google Images
“MEGA (Molecular Evolutionary Genetics Analysis) performs both sequence analysis and phylogenetic analysis in a very sophisticated manner.”
MEGA: Molecular Evolutionary Genetics Analysis

It is important to understand the basic molecular relationships between living organisms before undertaking comparative studies of their evolution. Several tools and software packages have been introduced for this kind of analysis, each using different algorithms and methods to infer molecular phylogenies. Examples include ClustalW, Dendroscope, HyPhy, PAUP and PHYLIP. Among the most widely used is MEGA (Molecular Evolutionary Genetics Analysis), which performs both sequence analysis and phylogenetic analysis in a very sophisticated manner. MEGA’s functionality includes the creation and exploration of sequence alignments, the estimation of
sequence divergence, the construction and visualization of phylogenetic trees, and the testing of molecular evolutionary hypotheses. Earlier versions of MEGA integrated Web-based sequence data acquisition and alignment capabilities with evolutionary analyses, making comparative analyses much easier to conduct in a single computing environment. Over time, the tool has also enriched classroom learning as its use by educators, researchers and students in different disciplines has expanded.

MEGA offers three distinct functionalities, along with some other features, which is why a large number of researchers and professionals use it to build high-quality phylogenies, as outlined below. First, the Caption Expert software module generates descriptions for
every result obtained by MEGA4. The caption informs the user about all of the options used in the analysis, including the data subset, the selected option for handling sites with gaps and missing data, the evolutionary model of substitution (e.g., the nucleotide substitution pattern, the uniformity of evolutionary rates among sites, and the assumption of homogeneity or heterogeneity among lineages), and the algorithms applied for estimating pairwise distances and for inferring and testing phylogenetic trees. The caption also includes specific citations for any algorithm, method and software used in the analysis. These descriptions are meant to promote a better understanding of the assumptions used in analyses and of the results produced. This is needed because MEGA’s intuitive graphical interface makes it easy for both new and expert users to perform a variety of computational and statistical analyses, and users sometimes do not realize the underlying assumptions and data-handling options involved in each analysis. Even expert population and molecular geneticists may not immediately recognize all of the assumptions. More generally, a description of the methods and results is useful for researchers and beginners when preparing tables and figures for presentation and publication.
[Figure: Multiple sequence alignment]

Second, the Maximum Composite Likelihood (MCL) method is included for estimating evolutionary distances between nucleotide sequences, which users frequently need for estimating divergence times, inferring phylogenetic trees, and computing average sequence divergences between and within groups of sequences. In this approach, the composite likelihood is the sum of the log-likelihoods for all sequence pairs in an alignment, and it is maximized by fitting parameters of the nucleotide substitution pattern that are common to every sequence pair. This method was previously referred to as the “Simultaneous Estimation” (SE) method, because all distances are estimated simultaneously. It differs from the conventional approach to evolutionary distance estimation, in which each distance is estimated independently of the others, either by analytical formulas or by likelihood methods.

The MCL method has several advantages over this Independent Estimation (IE) approach. Estimating the evolutionary distance for each pair of sequences independently often causes rather large errors unless very long sequences are used. MCL, on the other hand, reduces these errors because a single set of parameters is applied to every distance estimate. Phylogenetic trees inferred by distance-based methods are more accurate when the distances are estimated with lower error. This is in fact the case for the Neighbor-Joining method: the use of MCL distances leads to higher accuracy, and bootstrap values tend to be higher even when the same tree topology is obtained. In addition, the IE method is not always applicable to pairwise distance calculation, because the arguments of the logarithms in the analytical formulas may become negative by chance.
[Figure: Distance-based method]

Such cases become more common as the number of sequences increases, evolutionary distances grow larger, and the substitution models become more complex. The MCL method overcomes these problems effectively and allows sophisticated models to be used for inferring phylogenies from a larger number of diverse sequences. MEGA implements the MCL method for estimating pairwise distances, overall averages, and average distances between and within groups, with their variances calculated by a bootstrap approach. The implementation allows for substitution rate variation from site to site, using an approximation of the gamma distribution of rates, and for heterogeneity of nucleotide base composition among the sequences of different species. Users also have the flexibility to estimate the numbers of different kinds of substitutions per site separately (for example, transitional and transversional substitutions). Overall, using the MCL method to infer phylogenetic trees by distance-based methods, together with bootstrap tests, proves worthwhile.
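The failure mode described above for independent estimation, where an analytical distance formula cannot be evaluated for a given pair of sequences, can be illustrated with a small Python sketch. It uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which is chosen here only because it is the simplest analytical formula; the article does not say which model MEGA applies, and this is not MEGA's code. The function names are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance for one pair, estimated independently.

    Returns None when the argument of the logarithm is not positive,
    i.e. when the formula is inapplicable for this pair.
    """
    p = p_distance(seq_a, seq_b)
    log_argument = 1.0 - 4.0 * p / 3.0
    if log_argument <= 0.0:
        return None  # the analytical formula cannot be evaluated
    return -0.75 * math.log(log_argument)

# One difference in eight sites: the formula works.
print(jc69_distance("ACGTACGT", "ACGTACGA"))  # about 0.1367
# Every site differs: the log argument is negative, so no estimate.
print(jc69_distance("AAAA", "CCCC"))          # None
```

With short or highly diverged sequences the second situation can arise by chance, which is the motivation the article gives for estimating all distances with a shared set of parameters.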
CADD
Disease Modeling: Better way to understand Pathogenic Pathways
Image Credit: Google Images
“Studying disease models aids understanding of how the disease develops and testing potential treatment approaches. It requires the use of disease-specific pluripotent stem cells as the starting materials for generating surrogate models of human disease. ”
A disease model is an animal or cells displaying all or some of the pathological processes that are observed in the actual human or animal disease.
Studying disease models aids understanding of how a disease develops and allows potential treatment approaches to be tested. It requires disease-specific pluripotent stem cells as the starting material for generating surrogate models of human disease. Pluripotent stem cells are cells capable of differentiating into any cell type; they are also known as ‘true stem cells’. It is therefore possible to model a disease in a dish, which gives pathophysiological insight into that disease and points to better ways to treat it. Bioinformatics plays an important role in disease modeling: it makes it easier to trace the pathway of pathogens inside the organism and helps in studying the genes or proteins responsible for the disease, among other uses.

Eiges et al. (2007) were the first to derive novel early developmental information from disease-specific pluripotent stem cells. Before induced pluripotent stem cells (iPSCs), human disease-specific pluripotent cells could be made only by genetic modification of existing human embryonic stem cells (hESCs) or by generating new hESCs from embryos carrying detectable monogenic diseases. The advent of iPSCs has transformed the prospects for disease modeling. iPSCs are a kind of pluripotent stem cell that can be generated directly from adult cells. They can be made with cells taken from babies or adults of all ages who have full medical records and suffer from virtually any genetic disease, whether simple or complex.
METHODOLOGY: Cells are taken from a living patient or a frozen tissue bank, and these tissues are used to generate iPSCs. The chosen pluripotent cell type is then differentiated into the target cell type(s) of choice. If possible, such cells should be compared with similar cells taken from a patient (ideally the somatic cell donor), even if the main target of a particular study is a much earlier-appearing progenitor. This is the “reality check”, which validates the fidelity of in vitro differentiation. Pluripotent cells carrying a disease-linked mutation can also be obtained by isolating embryonic stem cells (ESCs) from preimplantation-diagnosed embryos, or by genetic modification of existing hESCs, hiPSCs, or somatic progenitors. Differentiating these pluripotent cells makes it possible to study the emergence of the disease phenotype in vitro, or to use the cells for drug screening or development. On the basis of the results obtained from differentiation of the PSCs, modifications are made to obtain a desired function in the cell or to develop a desired drug or personalized medicine.
Fig. 1. The Use of hiPSCs/hESCs in Disease Modeling and Drug Development. Pluripotent cells carrying a disease-linked mutation can be derived via iPSC from patient tissue samples, by isolation of ESCs from preimplantation diagnosed embryos, or by genetic modification of existing hESCs, hiPSCs, or somatic progenitors. Differentiation of these pluripotent cells can be used to study the emergence of the disease phenotype in vitro or be used for drug screening or development.

Disease modeling acts as a synthetic system for studying disease pathways and uncovering the unknown causes of various diseases. For example, spinal muscular atrophy (SMA) was studied by Ebert et al. (2009). SMA, an autosomal-recessive disorder, is caused by mutations in the SMN1 gene that lead to reduced SMN protein levels. This reduction leads to selective degeneration of alpha motor neurons at a very early age (around 6 months), with death following at around 2 years of age. Ebert et al. made iPSCs from the fibroblasts of an SMA patient and his unaffected mother. Less SMN protein was seen in the iPSC(SMA) line and, after motor neuron differentiation of the two iPSC cultures, smaller numbers of mutant motor neurons accumulated on prolonged culture. SMN levels in iPSC(SMA) and the derived motor neurons could be increased by adding the drugs valproic acid and tobramycin.

Note: An exhaustive list of references for this article is available with the author and is available on personal request, for more details write to muniba@bioinformaticsreview.com.
BIOINFORMATICS PROGRAMMING
What is Numerical Taxonomy? How is it useful?
Image Credit: Google Images
“Numerical taxonomy is a system of grouping of species by numerical methods based on their character states.”
Classification of biological species is an important concern when studying taxonomic and/or evolutionary relationships among species. Classification is either “monothetic”, based on one or a few characters, or “polythetic”, based on multiple characters. It is much more difficult to classify organisms on the basis of many characters than on a few, and the traditional approaches of taxonomists are tedious. The arrival of computer techniques in biology has made the task easier. Numerical taxonomy is a system of grouping species by numerical methods based on their character states. It was first initiated by Peter H. A. Sneath et al.
Before going further, I would like to clarify the difference between two common terms, “Classification” and “Identification”. Grouping organisms on the basis of shared properties is called classification; allocating additional unidentified objects to the groups once the classification exists is known as identification. The purpose of taxonomy is to group the objects to be classified into natural taxa. Conventional taxonomists equate taxonomic relationships with evolutionary relationships, but numerical taxonomists distinguish three kinds of relationship:

Phenetic: based on overall similarity.
Cladistic: based on common lines of descent.
Chronistic: temporal relation among various evolutionary branches.

How is classification done by Numerical Taxonomy?

The objects to be classified are known as Operational Taxonomic Units (OTUs). They may be species, genera, families, higher-ranking taxonomic groups, etc. The characters are recorded numerically, either as appropriate numbers or coded in such a way that the differences among them are proportional to their dissimilarity. For example, a character called ‘hairiness of leaf’ may be recorded as:

hairless = 0
sparsely haired = 1
regularly haired = 2
densely haired = 3
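The coding scheme above can be written down directly in Python. This is only a minimal sketch: the dictionary name HAIRINESS_CODES and the helper state_dissimilarity are hypothetical names introduced for illustration, not part of any taxonomy package.

```python
# Ordinal coding of the 'hairiness of leaf' character, as listed above.
HAIRINESS_CODES = {
    "hairless": 0,
    "sparsely haired": 1,
    "regularly haired": 2,
    "densely haired": 3,
}

def state_dissimilarity(state_a, state_b, codes=HAIRINESS_CODES):
    """Absolute difference between the numeric codes of two character states."""
    return abs(codes[state_a] - codes[state_b])

print(state_dissimilarity("densely haired", "hairless"))   # 3
print(state_dissimilarity("sparsely haired", "hairless"))  # 1
```

Because the codes are evenly spaced integers, the difference between two codes can be used directly as a measure of how far apart the two states are.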
Fig. 1. OTUs (black dots) represented in a multidimensional space.

A numerical coding of this kind implies that the dissimilarity between densely haired and hairless is three times that between sparsely haired and hairless. An alternative way of implementing numerical taxonomy is to represent every character by only two states: 0 for the absence and 1 for the presence of a particular character. This method is commonly used in microbiology.

All the characters and taxonomic units are then arranged in a data matrix, and the similarity between all possible pairs of OTUs is computed from the characters considered. The similarity (more precisely, the dissimilarity) is expressed as the distance between OTUs in a multidimensional space in which the characters serve as the coordinates. Objects that are very similar are plotted close to each other, and those that are dissimilar are plotted farther apart. The straight-line distances between the OTUs are then computed. The similarities among the OTUs are summarized in a ‘similarity matrix’, often drawn with a shading scheme in which dark-shaded cells mark highly similar pairs. This matrix is then rearranged to bring together clusters of similar OTUs. The results of numerical taxonomy are generally represented as phenograms.
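The workflow just described (data matrix, pairwise distances, clustering) can be sketched in plain Python. The OTU names and character values below are invented for illustration, and average-linkage (UPGMA) clustering is assumed here as one common way of building a phenogram; the article does not specify a clustering method.

```python
import math
from itertools import combinations

# Hypothetical data matrix: rows are OTUs, columns are coded characters.
DATA = {
    "OTU_A": [0, 1, 3],
    "OTU_B": [0, 1, 2],
    "OTU_C": [3, 0, 0],
    "OTU_D": [2, 0, 1],
}

def euclidean(u, v):
    """Straight-line distance between two OTUs in character space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(data):
    """Distances for all possible pairs of OTUs, keyed by the pair of names."""
    return {frozenset((a, b)): euclidean(data[a], data[b])
            for a, b in combinations(data, 2)}

def average_distance(members_a, members_b, dist):
    """Mean distance between the members of two clusters (average linkage)."""
    total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
    return total / (len(members_a) * len(members_b))

def upgma(data):
    """Repeatedly merge the two closest clusters.

    Returns a list of (cluster_a, cluster_b, distance) tuples in merge order,
    which is the information a phenogram displays.
    """
    dist = distance_matrix(data)
    clusters = [(name,) for name in data]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], dist))
        merges.append((a, b, average_distance(a, b, dist)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

for cluster_a, cluster_b, d in upgma(DATA):
    print(cluster_a, "+", cluster_b, "at distance", round(d, 3))
```

With these made-up values, OTU_A and OTU_B merge first, then OTU_C and OTU_D, and the two pairs join last. The same functions work unchanged when every character is coded as 0 (absent) or 1 (present).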
Note: An exhaustive list of references for this article is available with the author and is available on personal request, for more details write to muniba@bioinformaticsreview.com.