Jan 2017 VOL 3 ISSUE 1
“I seem to have been only like a boy playing on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.”
- Isaac Newton
Alignment-free approaches for Sequence Analysis
Contents
January 2017
Topics
Editorial
Sequence Analysis: Alignment-free approaches for Sequence Analysis
EDITOR
Dr. PRASHANT PANT

FOUNDER
TARIQ ABDULLAH

EDITORIAL
EXECUTIVE EDITOR
TARIQ ABDULLAH
FOUNDING EDITOR
MUNIBA FAIZA
SECTION EDITORS
FOZAIL AHMAD
ALTAF ABDUL KALAM
MANISH KUMAR MISHRA
SANJAY KUMAR
PRAKASH JHA
NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com
PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published quarterly for one year (4 issues) by Social and Educational Welfare Association (SEWA) Trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA Trust. Published in India.
EDITORIAL
Bioinformatics Review (BiR): Bridging Between The Two Worlds

Informatics and biology are two sciences that are about as different from each other as possible. One runs on the core concept of variation, the other on strict reasoning. Still, the two have combined in a most natural way under the realm of “Bioinformatics”. For a biologist today, it is difficult to imagine a world without biological databases, or without a discipline to decipher the huge enigma that their data bring. The Bioinformatics Review (BiR) journal is a platform for discovering the latest happenings in this melting pot of two varied fields.
Dr. Roopam Sharma
Honorary Editor
The era of “omics” kick-started with the completion of the Human Genome Project (HGP) in 2003. Since then, a number of technological advancements, especially next-generation sequencing (NGS), have been generating mind-boggling amounts of data for the knowledge banks. Recent developments such as single-cell transcriptomics and metagenomics of the most unusual habitats show how technological advancement translates directly into breakthroughs in the biological sciences. Pathology is among the areas of biology that have benefited from these advancements. In fact, deciphering the molecular and genetic basis of human diseases was the guiding force behind the Human Genome Project. Bioinformatics has led to an impressive increase in the recognition of possible pathogenic factors in varied systems, so much so that new techniques are being devised to speed up the testing of these factors in the wet lab. Computationally, the smaller but ever-changing genomes and transcriptomes of pathogens make them well-suited candidates for testing many hypotheses in bioinformatics studies. Effector bioinformatics involves building custom pipelines for distinct species based on the characteristics of the effectors and the size of the genome involved. These pipelines can be based on homology, feature extraction, or both; for example, the discovery of RXLR motifs in oomycete effectors allowed many more effectors to be identified. This collaboration of two sciences for plant pathology has led to the development of many general-use platforms such as the Broad Fungal Genome Initiative, EuPathDB, PhytoPath and so on, but there is still a great need to develop specialized resources like PHI-base for specific areas like effector biology. The use of machine-learning techniques such as artificial neural networks (which are themselves modeled on biological neural networks) shows how the two branches are distinct yet intertwined. All in all, it is a brave new world where artificial communication is not only stimulating but also helping us understand the communication (between host and pathogen) going on within the realm of life.

Letters and responses: info@bioinformaticsreview.com
In this issue, BiR focuses on reviews of some of the very basic techniques used in computational biology and their applications in various biological studies. We look forward to continued support from our readers and contributors. For suggestions and feedback, do write to us at info@bioinformaticsreview.com.
SEQUENCE ANALYSIS
Alignment-free approaches for Sequence Analysis
Image Credit: Stock Photos
Multiple Sequence Alignment (MSA) is a fundamental technique in bioinformatics, used to identify species, infer function and phylogeny, study novel genes and proteins, and so on. Many MSA tools are available with different specifications; they are based on heuristic algorithms that favor speed over accuracy.

Since MSA is a basic requirement for sequence analysis, much research depends on it. Unfortunately, it has long-recognized limitations that can affect the results.

These shortcomings may arise from the different ways in which algorithms and tools recognize indel events, introduce gaps, apply gap penalties, and so on, and they may affect the results generated by different tools. It has been reported that different phylogeny programs construct different trees [1]. The variety of benchmarks available for MSA, such as OXBench and SABmark, may also affect the assessment of alignment accuracy and reliability [2-4]. These limitations of MSA programs have led to the development of alignment-free sequence analysis [5].

A handful of approaches drawn from Information Theory, Chaos Theory, Linear Algebra, and Statistical Theory have been proposed for multiple sequence comparison [5,6]. Among these, Information Theory has been found to be more promising for multiple sequence analysis [5]; however, efforts are still being made to implement it. Another current approach, Biogeography-based Optimization (BBO), attempts to solve the MSA problem [7]. It is based on the concept of the emigration and immigration of species from one habitat to another. It has since been improved and proposed in the form of IBBOMSA (An Improved Biogeography-based Approach for Multiple Sequence Alignment) [8]. Its algorithm implements a mutation operator that calculates the probability of mutation in the given species, and according to the authors' comparison tests, IBBOMSA was the most accurate of the tools considered [8].

An alignment-free approach to the analysis of DNA sequences was proposed by Zhou et al. (2016), based on the characterization of complex networks [9]. It builds on the code of three cis nucleotides in a gene that could code for an amino acid [9]. Graphical representation of DNA has also been proposed for sequence comparison [10], and was later extended to 2D [11-16], 3D [17-21], 4D [22], 5D [23], and 6D [24] representations of DNA sequences in the form of matrices.

Sequence comparison without aligning the sequences may be a good alternative to alignment programs, but it requires a lot of work by the scientific community to be fully usable.

References
1. Wong, K. M., Suchard, M. A., & Huelsenbeck, J. P. (2008). Alignment uncertainty and genomic analysis. Science, 319, 473–476.
2. Thompson, J. D., Plewniak, F., & Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research, 27(13), 2682–2690. http://doi.org/10.1093/nar/27.13.2682
3. Raghava, G. P. S., Searle, S. M. J., Audley, P. C., Barber, J. D., & Barton, G. J. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
4. Van Walle, I., Lasters, I., & Wyns, L. (2005). SABmark - A benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21(7), 1267–1268.
5. Vinga, S., & Almeida, J. (2003). Alignment-free sequence comparison - a review. Bioinformatics, 19(4), 513–523.
6. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.
7. Simon, D. (2008). Biogeography-based optimization. IEEE Trans Evol Comput, 12, 702–13.
8. Yadav, R. K., & Banka, H. (2016). IBBOMSA: An Improved Biogeography-based Approach for Multiple Sequence Alignment. Evolutionary Bioinformatics Online, 12, 237.
9. Zhou, J., Zhong, P., & Zhang, T. (2016). A Novel Method for Alignment-free DNA Sequence Similarity Analysis Based on the Characterization of Complex Networks. Evolutionary Bioinformatics Online, 12, 229.
10. Qi, X., Wu, Q., Zhang, Y., Fuller, E., & Zhang, C.-Q. (2011). A novel model for DNA sequence similarity analysis based on graph theory. Evol Bioinform Online, 7, 149–58.
11. Guo, X., Randić, M., & Basak, S. C. (2001). A novel 2-D graphical representation of DNA sequences of low degeneracy. Chem Phys Lett, 350, 106–12.
12. Randić, M., Vraćko, M., Lerś, N., & Plavśić, D. (2003). Analysis of similarity/dissimilarity of DNA sequence based on novel 2-D graphical representation. J Chem Inform Comput Sci, 371, 202–7.
13. Randić, M., Vraćko, M., Zupan, J., & Novic, M. (2003). Compact 2-D graphical representation of DNA. Chem Phys Lett, 373, 558–62.
14. Randić, M. (2004). Graphical representations of DNA as 2-D map. Chem Phys Lett, 386, 468–71.
15. Liu, X., Dai, Q., Xiu, Z., & Wang, T. (2006). PNN-curve: a new 2D graphical representation of DNA sequences and its application. J Theor Biol, 243, 555–61.
16. Huang, G., Liao, B., Li, Y., & Liu, Z. (2008). H curves: a novel 2D graphical representation for DNA sequences. Chem Phys Lett, 462, 129–32.
17. Liao, B., & Wang, T. (2004). 3-D graphical representation of DNA sequences and their numerical characterization. J Mol Struct (Theochem), 681, 209–12.
18. Qi, X., Wen, J., & Qi, Z. (2007). New 3D graphical representation of DNA sequence based on dual nucleotides. J Theor Biol, 249, 681–90.
19. Qi, Z., & Fan, T. (2007). PN-curve: a 3D graphical representation of DNA sequences and their numerical characterization. Chem Phys Lett, 442, 434–40.
20. Cao, Z., Liao, B., & Li, R. (2008). A group of 3D graphical representation of DNA sequences based on dual nucleotides. Int J Quantum Chem, 108, 1485–90.
21. Yu, J., Sun, X., & Wang, J. (2009). TN curve: a novel 3D graphical representation of DNA sequence based on trinucleotides and its applications. J Theor Biol, 261, 459–68.
22. Chi, R., & Ding, K. (2005). Novel 4D numerical representation of DNA sequences. Chem Phys Lett, 407, 63–7.
23. Liao, B., Li, R., & Zhu, W. (2007). On the similarity of DNA primary sequences based on 5-D representation. J Math Chem, 42, 47–57.
24. Liao, B., & Wang, T. (2004). Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping trinucleotides of nucleotide bases. J Chem Inform Comput Sci, 44, 1666–70.
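
As a rough illustration of the alignment-free comparison discussed in this article, the sketch below compares two DNA sequences through their k-mer (word) frequency profiles, without aligning them. It is a minimal example in the spirit of the word-frequency and information-theoretic measures reviewed in [5], not the method of any specific tool or paper cited above. The function names, the default word length k = 3, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import log2, sqrt


def kmer_profile(seq, k=3):
    """Return the relative frequency of each overlapping k-mer in seq."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = no shared k-mers)."""
    words = set(p) | set(q)
    jsd = 0.0
    for w in words:
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = (pw + qw) / 2  # mixture distribution; positive for every w in words
        if pw > 0:
            jsd += 0.5 * pw * log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * log2(qw / mw)
    return jsd


if __name__ == "__main__":
    # Toy sequences, invented purely for demonstration.
    a = kmer_profile("ATGCGATACGCTTGAGGCTA")
    b = kmer_profile("ATGCGATACGCTTGACGCTA")
    c = kmer_profile("TTTTTTAAAAAACCCCCCGG")
    print("a vs b:", euclidean_distance(a, b), jensen_shannon_divergence(a, b))
    print("a vs c:", euclidean_distance(a, c), jensen_shannon_divergence(a, c))
```

Because each sequence is reduced to a profile of word frequencies, the comparison involves no gaps, gap penalties, or indel decisions. The trade-off is that the result depends on the choice of k and discards positional information.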