BIOINFORMATICS REVIEW- FEBRUARY 2019


FEBRUARY 2019 VOL 5 ISSUE 2

“If we knew what it was we were doing, it would not be called research, would it?” - Albert Einstein

Interview with Professor G.P.S. Raghava - discussing Bioinformatics, Research, & Science in India

India ranks 4th among the Top 20 Bioinformatics Database Contributors in the world




Contents

Topics

Editorial .... 05

03 Bioinformatics News: India ranks 4th among the Top 20 Bioinformatics Database Contributors in the world .... 06

05 Tools: How to create a pangenome of isolated genome sequences using Roary and Prokka? .... 16

04 Interview: Interview with Professor G.P.S. Raghava discussing Bioinformatics, Research, & Science in India .... 07


FOUNDER: TARIQ ABDULLAH

EDITORIAL

EXECUTIVE EDITOR: TARIQ ABDULLAH

FOUNDING EDITOR: MUNIBA FAIZA

SECTION EDITORS: FOZAIL AHMAD, ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS

You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include your contact details in your message.

BACK ISSUE

Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT

PHONE: +91. 991 1942-428 / 852 7572-667

MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS

To contact any of the Bioinformatics Review staff members, simply format the address as firstname@bioinformaticsreview.com


PUBLICATION INFORMATION

Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA trust. Published in India.


BiR-Taking it to the next level: Editorial

EDITORIAL

As we dive into a new year, BiR wishes to take a step forward towards new developments and achievements. BiR has gained a great deal since its inception, above all our readers, who have been a wonderful motivation for us.

Muniba Faiza

Founding Editor

Bioinformatics has become a broad field, covering important aspects of our lives such as drug design. In the last three to four years, BiR has not only tried but succeeded in covering almost every domain of bioinformatics, including sequence analysis, structural bioinformatics, docking, phylogeny, evolution, tools, software, and so on. This coming year, BiR will focus on several other facets of bioinformatics covering a wide range of domains, including cheminformatics, more articles on bioinformatics programming, big data, and more tutorials on new software and tools.

We have received several suggestions and much appreciation from our readers all over the world, including some interesting topics for further articles. We are currently working on the suggested topics, and the resulting articles will soon be made accessible to all. Besides, BiR would like to welcome new authors who are interested in bioinformatics and in sharing their knowledge worldwide.

Last year, BiR commenced the annual listing of top Indian bioinformaticians, acknowledging our respected scientists and researchers working in the field. This year, BiR will try to arrange talks and conferences with bioinformaticians. Besides, BiR hopes to introduce new projects and internships for young researchers working in the field.

We have many miles to go, which is not possible without the support of our readers, subscribers, and contributors. We are wholeheartedly thankful to all of you and wish you a very prosperous and happy new year with great achievements ahead. Keep sharing and spreading knowledge. Please share your thoughts and suggestions at info@bioinformaticsreview.com.

With best wishes!

Bioinformatics Review (BiR)

Letters and responses: info@bioinformaticsreview.com


BIOINFORMATICS NEWS

India ranks 4th among the Top 20 Bioinformatics Database Contributors in the world

Image Credit: Database Commons

“India holds the fourth rank among these top 20 countries with 318 total number of databases.”

Database Commons (a catalog of biological databases) [1] has revealed new statistics based on database rankings, showing the top 20 countries/regions with the highest number of databases managed by them (Fig. 1). India holds the fourth rank among these top 20 countries with a total of 318 databases.

Fig. 1 Top 20 countries/regions showing database numbers.

This ranking is based on the Z-index. The United States holds the first position with 1,196 databases, followed by China with 547, the UK with 357, and India with 318. The Exome Aggregation Consortium (ExAC) database from the USA is ranked first worldwide with a Z-index of 706. The Human Protein Reference Database (HPRD) [2,3] scored the highest Z-index (160.75) among the top 20 databases in India, as shown in the table below.

[table id=4 /]

*The ranking is based on the Z-index calculated by Database Commons [1].

References

1. http://bigd.big.ac.cn/databasecommons/stat#

2. Peri, S. et al. (2003). Development of Human Protein Reference Database as an initial platform for approaching systems biology in humans. Genome Research. 13, 2363-2371.

3. Prasad, T. S. K. et al. (2009). Human Protein Reference Database - 2009 Update. Nucleic Acids Research. 37, D767-72.

Bioinformatics Review | 6


INTERVIEW

Interview with Professor G.P.S. Raghava - discussing Bioinformatics, Research, & Science in India

Image Credit: Stock Photos

“I believe in a few years, they will believe more in the bioinformaticians and they will utilize the knowledge for their own experimental work.”

This is the first episode of an ongoing series of interviews with famous bioinformaticians around the world, and today we are sitting with Prof. G.P.S. Raghava from Indraprastha Institute of Information Technology Delhi (IIITD). We will be talking about bioinformatics and a lot of other things today. Prof. Raghava is a very famous bioinformatician in India and the world. He has developed many software packages, more than 200 web servers, and over 30 databases, and his research group has published more than 200 research papers in many reputed journals.

Hello, Prof. Raghava, and thank you for giving us your time.

Question: So, more than 200 web servers and 30 databases, and more than 200 research papers in reputed journals. How do you come up with a team and diverse ideas to implement those?

Answer: That's an interesting question. I never thought that I would publish 200 papers or 200 web servers. After completing my M. Tech from IIT Delhi, I took up a job in 1986 at the Institute of Microbial Technology, Chandigarh. I was really happy to get a Class-1 officer position in the central government at the age of 23. I was the odd one out in the institute, as most of the scientists were biologists. I was hired by the institute to look after the computer center, so I was called the computer scientist. Initially, I wondered why they gave me the designation of computer scientist when my duty was mainly to operate and maintain the computers in the institute. From 1986 to 1990, I developed different types of general software, like payslip and inventory management software. I realized that this type of general software has a limited impact (service at the institute level) and a limited half-life. I was not happy with the work, as it was not utilizing my full potential, so I was searching for more challenging problems.

Meanwhile, in 1990, a research scholar, Mr. Anish Joshi, came to me and requested that I implement software in immunology. The software was simple: "Calculation of antibody-antigen concentration from ELISA data". I studied the algorithm of the software and realized that I could develop a better algorithm than the existing software used. Thus, we developed more efficient software and demonstrated experimentally that it was better than any existing software. We sent a paper based on our software to the Journal of Immunological Methods (impact factor ~3), and it got published without any revision. It was basically a wow moment: a person who had been hired as the computer scientist had published a paper in a high-impact-factor journal. After that, I thought that whether or not I am there, my contributions and papers will remain in the literature. This way, I could provide better service to the whole scientific community rather than limited services to the users of my institute.

So, that was the starting point. After that, we developed software for different types of biological problems. We talk to the scientific community and read the literature to identify problems faced by the community. We develop software for these problems and publish software-based papers. In order to provide service to the community, we distribute these software packages, including source code, to the public. We talk to researchers and students to understand their problems, particularly in the field of biology, so that we may provide a solution to them. In simple words, we are not working for ourselves; instead, we are working for the community. This is the reason we have contributed to diverse fields over the years.

Question: Sir, bioinformatics is really an interesting field. How did you get interested in this particular area? Was your background in biology, computer science, or both?

Answer: As I told you, I was hired by the institute to provide computer services, so my background is computer science. During my M. Tech I did a lot of programming, which is why my programming is strong. At the time when I started my career, though, I had nearly no knowledge of biology, only up to high school like everyone else. I applied my computer science skills in the field of biological science; the major challenge was to understand the problems faced by biologists. Computing in biology has always been in demand, though in India it has been popular only for the last 20 years. In the 1970s, Prof. G.N. Ramachandran computed the possible dihedral angles in a protein structure; this is known as the "Ramachandran plot". I consider bioinformatics a challenging field, as it integrates two diverse fields, computers and biology. In addition, it has huge potential because it allows mining useful information from experimental data that is growing at an exponential rate. This is the reason it is my favorite field, and I have been working in this area for the last 30 years.
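The dihedral angles behind the Ramachandran plot mentioned in the answer above can be computed from the positions of four consecutive backbone atoms. The following is a minimal sketch in Python, assuming plain 3D coordinate tuples; the function name `dihedral`, its helpers, and the toy coordinates are illustrative assumptions and are not taken from any software discussed in this interview.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four 3D points.

    The angle is measured about the p1 -> p2 bond, between the plane
    through (p0, p1, p2) and the plane through (p1, p2, p3).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the first plane
    n2 = _cross(b2, b3)  # normal of the second plane
    b2_len = math.sqrt(_dot(b2, b2))
    # atan2 keeps the sign and covers the full -180..180 degree range.
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical, not from a real structure).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, quarter_turn)
```

For a protein backbone, phi for a residue would use the atoms C (previous residue), N, CA, C, and psi would use N, CA, C, N (next residue); plotting psi against phi for every residue gives the familiar scatter plot.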

Question: Sir, you have worked with international organizations like EBI, Oxford, and UAMS. Based on your experience, what suggestions do you have for people working in India to become at par with those levels?

Answer: There are two to three points I would like to mention. One of the major challenges in India is to form a team to solve complex problems. As individuals, we Indians are as good as anybody in the world, so there is no problem in performing at an individual level. The major challenge in India is to solve a problem as a team, due to internal fighting. It has been commonly observed that team members blame each other; if you talk to the junior students, they will blame their seniors or guides. Similarly, if you talk to the guides, they will blame their students. This is the major problem in India; otherwise, as individuals, we are as good as anybody in the world. In interdisciplinary areas like bioinformatics, a team is a must, as it is nearly impossible for an individual to become an expert in multiple fields. The second thing is infrastructure, because infrastructure is important for performing world-class research in India. For example, in 1996, Oxford University had all the facilities, like fast internet and an intranet over fiber optics; at that time we had limited facilities, and the internet was slow and not available at most organizations in India. Another point is that people there know that a hire-and-fire policy is in place: if you work, you get all the advantages, and if you don't work, you get fired. In India, this is not the situation; particularly in the government sector, the hire-and-fire policy is not effective. Once you join, whether you are working or not, you get the same salary, so people adopt that kind of behavior. In India, those who have a strong desire to contribute to society are performing as per international standards.

Question: Sir, you have been a significant contributor to the open source community. You have developed many web servers and databases, and you have kept everything free for the users, although you could have made a fortune by commercializing the code. What inspires you to do that?

Answer: I will go back again to 1986-1990, when I joined the institute where I was appointed as the computer scientist. At that time, there was no internet, and email was the only electronic medium of communication. Fortunately, in our institute, we established an email facility in 1991, which was not there in most of the institutes in our country. At that time, we accessed and retrieved biological data from databases for our biologists. There was an email server at EMBL (EBI was not there yet), which maintained a repository of free software. One could send an email with the appropriate commands to EMBL to download the desired software with source code. Even then, you could not download big software in a single email, but we could get it in pieces; we compiled them and provided service to the users. Organizations in Europe and the USA maintain email servers and provide all resources free to the public. These facilities and resources are heavily used by the Indian scientific community.

I have not seen a similar trend in India; most of us do not share our resources with the community. In India, we are so possessive, we think that we will live forever, and we are not sharing information with anybody, so that's why we are not growing. We have done a lot of work in the past that could be claimed, but it is not well documented and not shared with the public. As a result, the person who generates the data will die and the data will die too; this is the common problem. Instead of following the Indian system, we followed an international standard where any discovery is documented and shared with the public, in order to utilize the full potential of the discovery. The USA has already made a number of public repositories like PubMed, and Europe also maintains resources at EBI, so why can't India do it?

Our group has contributed open source software and resources over the years; sometimes we also faced objections about why we are not charging. Our logic for providing free resources was that Indians are also using resources developed by other countries. In developing countries like India, where it is difficult to afford costly commercial software, freeware provides an alternative to commercial software. Thus, researchers in India can use state-of-the-art free software without spending huge sums on commercial software. It saves a lot of money in the country, so indirectly I am helping to avoid unnecessary spending. In one calculation based on the usage of our software and resources, we showed that they saved India more than Rs 800 crores. It's a big amount; it's not visible, but the heavy usage of our software by the scientific community indicates it clearly.

Question: Sir, what's up with your lab nowadays? What are you working on?

Answer: I have around 8-10 Ph.D. students working in my lab. My research group is different from others, as I do not follow a traditional path. In the traditional system, most of the time the guide assigns a research project to a student depending on the needs of the guide's project. Most PIs work in a focused area, so the student has to work in that area only. This is not the situation in my group, because I am not interested in my own career; I am interested in solving biological problems as well as in training Ph.D. students. When a student joins my group, I ask a few questions such as "what are your interests?", "what are your skills?" and "what important problem do you wish to solve?". Based on the expertise and set of skills the student has, we assign a problem to the student. This is the reason that if I have 10 students in my group, they are working on different problems.

One of the students is working on "Probiotics and Prebiotics". Though this is a new field to me, the student is interested; she wants to understand what probiotics and prebiotics are, so she is working on that. Recently, a girl joined my group as an M. Tech student, and she wants to work on rare genetic diseases. I have never worked on rare diseases before, but I allowed her to work on them. In this process, I will also learn a new topic with my student. So, currently, we are working on a rare disease caused by lysosomal enzymes. Other students are working on different problems such as protein-ligand interactions and immunosuppressive peptides, because vaccine development is a major part of our lab; we have been working in the field of immunology for the last 20 years.

Recently, we completed a very interesting project, "Pfeature". In this project, we integrated algorithms for computing features of a protein that have been discovered by different researchers in the last 30 years. Generation of protein features is an essential part of any software developed for classification or annotation of a protein. To predict a portion of a protein, you have to compute the features of the protein, like the amino acid composition, and all these features have been discovered over the years by my group as well as by other groups. The problem is that if a new student joins and has to learn these tools, it will take a long time. So, in the last month, my whole group worked together, and now all the possible features of a protein can be calculated by a single piece of software, which will be available in the form of a web server as well as source code. Anybody working in the field of protein annotation or protein structure prediction can easily use our tools and won't have to reinvent the wheel.

Question: Sir, what do you think are the most interesting areas in bioinformatics?

Answer: That's an interesting question. In my opinion, all fields are interesting. We need to understand the difference between biologists, experimentalists, and bioinformaticians. Bioinformaticians are not generating their own data, so we cannot discover new things entirely on our own. In order to work in bioinformatics, one should have sufficient data. For example, we are getting a lot of data, particularly in the fields of genomics and proteomics, so genomics-based biomarkers and proteomics-based biomarkers have a lot of scope. As a lot of data is available, we can discover different types of biomarkers.

If you are going to work in bioinformatics, first you have to see what your interest is and what major problems you wish to address. Once you figure out the problem, the next step is to assess your own potential, whether you can do it or not. It is important to judge your own capabilities to solve this problem. A large number of researchers come into the bioinformatics field without judging whether they are capable of developing tools or not. So, that's the important part: you have to judge your strength. After knowing your strength and understanding the problem, the next question is whether sufficient data is available. Even if you are highly skilled and you have figured out the problem, if sufficient data is not available, you can't do anything with it. In bioinformatics everything is based on data, so unavailability of data would be a problem. Overall, I would say it more or less depends on the person: whether a particular problem excites you or not, and whether you have the capability of solving it or not.

Question: As new technologies are evolving, where do you see a bioinformatician working in 50 years? Does he have a future?

Answer: Yes, and the reason is simple. In the biological field, despite all the development and progress over the years, such as microarray data, ChIP-seq data, or RNA-Seq data, we are still unable to understand even 1% of living organisms. We have limited knowledge, and we are working on pieces; that's why, although a lot of data has already been generated, we still do not fully understand the function of a cell. Bioinformatics has huge scope in the future, because biologists are generating data at an exponential rate; even maintaining this data is a challenge. Mining important information from big biological data is a major challenge for bioinformaticians.

The most important question is whether an individual has the ability to solve these bioinformatics-associated problems. That's the big problem: anybody who jumps into bioinformatics by considering only its scope will not be successful. You have to check what your capability is, where you can fit in, and what you can do. It's not about whether genomics is important or proteomics is important or protein structure prediction is important; there are a lot of areas in bioinformatics, and you have to see where you fit in. If you are good in computers, you have to take a particular type of problem; if you are good in biology, then you have to take another type of problem; if you are good in chemistry, then you have to take other problems. So, a lot of scope is there without any doubt.

Question: What method have you found most helpful in training your research staff/team in the use of databases? Which technique have you found quite helpful?

Answer: Regarding the learning of the group, frankly speaking, I do not support the traditional approach, where one person teaches and all the others learn. In my group, we support learning from each other, like a network, one-to-many and many-to-one. I am there, but they work in a healthy environment, talking to and learning from each other. If they are not able to understand something, they come to me; similarly, if I do not understand something, I talk to and learn from my students. Training and human resource development is a major challenge for me (not only for database development, but for any bioinformatics work), because if we do not train the next generations, we will lose that information, and eventually the data will be lost. I want to make sure the field grows; therefore, I have different concepts for learning.

First, it has been my record over the years, including at my previous institute, that if a student who is not getting paid or is facing problems in the Ph.D. comes into my group, I try my best to shape their career. So, the trust between me and the students is quite strong. We maintain a healthy environment in the group, where we talk to each other and learn. Currently, we are developing software; I gave you the example of Pfeature, where we have computed all the features of a protein and made it a package. If anyone starts working in that area from scratch, it will take a long time, but if they use my software, they can learn in a few days. In the same way, in my group the infrastructure is there and everybody has access, even to each other's workspace. Let's say one web server is developed by student X; student Y also has access, so that they can learn from each other, because you can learn faster by examples than by big lectures. So, openness is helping in training my own students, because they are not fighting or competing with each other; they are happily working together.

We have organized a large number of workshops, seminars, training programs, and conferences in the last 25 years to share knowledge with the new generation. On average, we have organized at least two programs every year. In these programs, we provide training on the latest trends, including databases. For example, earlier people were using the MySQL database to store data, as the size of the data was small and the data was structured. Due to the growth of data, particularly the growth of unstructured data, we need to use NoSQL technology. Thus, we are teaching modern DBMS like Hadoop and MongoDB. It is important for students to learn the next-generation database management systems.

Question: What is your all-time favorite piece of bioinformatics software and why would you prefer that?

Answer: If you are asking about software developed by other groups, I would say BLAST, specifically PSI-BLAST. It not only does the searching but also helps you to generate evolutionary profiles (in the form of PSSM profiles). Evolutionary information is important for predicting the structure or function of a protein.

Question: Sir, you have developed a lot of software and tools. Which computer language do you use for developing them?

Answer: In the last 30 years, I have used many programming languages for developing software; I enjoy learning a new programming language. My first scientific software was in GW-BASIC, the next was in Pascal, another was in FORTRAN, so I enjoy writing. If you ask for a figure, I would say that I know at least 20 programming languages. So, the important question is which language you are using and for what kind of work. Earlier, during the initial phase, I frequently used C, as it has many advantages. In 1996, I started to use Perl, which I learned during my stay with the Oxford group. For most of my work, such as structure prediction, Perl can do it, and probably fast. I am not saying that Perl is a faster language than other programming languages, but if you are doing small jobs, Perl is one of the best choices. So, for a number of software packages, I have used Perl. Nowadays, I am switching more towards Python. The reason is that Python has a lot of libraries, and even if you are not coding much yourself, you can use these libraries to implement machine learning or data mining techniques. So, nowadays, we are focusing more on Python.

Question: So, coming to Python and Perl, big data, AI, and the blockchain: as a bioinformatician, which one are you focusing on, for example big data? Do you see any role of blockchain in the coming future for bioinformaticians?

Answer: Regarding AI and big data, that's an interesting question; it is like old wine in a new bottle. For example, in deep learning, we are using a neural network with a large number of hidden layers and units, as well as combinations of neural networks. This concept was developed a long time back; the only limitation was that at that time we had limited data and resources. Our group used a combination of neural networks, called hybrid and cascade networks, in methods developed for predicting the structure and function of proteins. One should be careful in the implementation of AI techniques. In the last two decades, we used the support vector machine (SVM) heavily for developing methods. SVM is specifically useful when the training/testing dataset is small, as it is least prone to over-optimization. The neural network is fast and gives excellent performance on a large training/testing dataset. One should be careful in selecting the AI technique for mining data, based on the type and size of the data.

In the case of big data, one should use NoSQL technology for managing the data and systems like Hadoop to process data efficiently. Fortunately, the implementation of new technology is easy, as a number of free software packages are available to implement these technologies. I advise youngsters to learn all these new technologies; it is not difficult, but it is important for your growth. Regarding blockchain, it is an important technology which can be used to protect and secure our personal data, particularly genomics and proteomics data. Due to the advancement in technology, person-specific medical data will grow at an exponential rate in the future, and it needs to be protected using encryption technologies.
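Predictors such as the SVM and neural-network methods discussed above take fixed-length numeric features as input, and the amino acid composition mentioned earlier in the interview is one of the simplest such features. The sketch below shows one way it might be computed in Python. It is an illustrative assumption only, not the implementation used in Pfeature or any other tool from the group; the names `AMINO_ACIDS` and `amino_acid_composition` and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed order so that
# every protein maps to a vector with the same layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard amino acid in a sequence.

    Characters outside the 20 standard residues (for example 'X' or
    gap symbols) are ignored in this sketch.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the call.
features = amino_acid_composition("MKTAYIAKQR")
for aa, pct in zip(AMINO_ACIDS, features):
    if pct > 0:
        print(aa, round(pct, 1))
```

Each protein, whatever its length, becomes a vector of 20 numbers, which is the kind of input a classifier can be trained on.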

Question: Currently, one of the biggest concerns in bioinformatics is the data deluge. A few weeks ago, I read an article published in Nature, and those people were actually confused about which data to archive and which data to discard, because from our point of view, everything is important. So, what measures do you think we should take? And recently, some researchers have been trying to reuse the keywords which are already present in the datasets. What measures should we take regarding this?

Answer: It is a big challenge to maintain the important data generated by our experimental researchers. It is difficult to answer, as nobody knows; we are processing a lot of data, and most of the data is of no use. Even if you look at TCGA data, it's huge, and unfortunately the data we need is not there. That's the problem here: we use a lot of cancer genomics data, but only limited samples are there. So, the requirement is too high, and the existing techniques or storage capacity are not up to the level. We should take care, and I think maybe somebody will come and do better mining than the previous ones. It happened in the case of microarray data: earlier, researchers were submitting finished information (the final results only). Later, the academic community forced data producers to make the raw data available too, because maybe a new individual is smarter than the previous one in data mining. So, limitations are there, and I cannot comment on whether we should discard it; ideally, we should maintain it.

Question: One concern in bioinformatics is that, unlike the software you developed, which is freely available, there is some software which is not available for free and is charged for, or sometimes overcharged for. That also impacts research for the people who cannot afford it. What are your views about it?

Answer: This is one big challenge. I have sat on most of the committees, and most bioinformatics researchers request commercial software. These software packages are costly because the vendor looks at your pocket rather than the actual cost. So, I am a strong opponent of commercial software, and I simply say, "no, you have to use academic software", because I am not seeing any commercial software which is better than the academic software from the algorithmic point of view. Academic software is as good as commercial software, and it is free; you can implement it in your own software. Unfortunately, most researchers are not economical; in fact, we take pride in bringing in grants rather than in achieving maximum output from minimum input. We are not computing the cost per research paper; instead, we show performance by the amount of grant. Our young researchers should learn how to minimize the cost per paper, which is possible if we use free software. I want to give an important message here: if you want to become a good researcher in your life, you have to think about why you are doing this. You are doing research because you want to serve the community, but if you are consuming a lot of money from the country unnecessarily, then you are not serving the country. So, we should think about it, use the minimum budget, and get the maximum output using freeware tools and data.

Question: As a Ph.D. student, I have faced a problem. When I am working on a particular project and I need some software, I will search and I will definitely find one. But as Ph.D. students, we should be aware of all these pieces of software, because they are very important for bioinformatics. Research scholars especially are not aware of all kinds of software, because it is difficult to read each and every issue of the many journals. So, what do you think we should do in this regard, and how important is it?

Answer: I learn a lot, particularly new topics, from the Internet, which is easily available. When I wish to learn about a topic, first I search Wikipedia, which provides summarized information about the topic. Secondly, I go to Google and search for tutorials and documents (mainly ppt, pdf, and doc files), which provide most of the information on a given topic. Thirdly, I search for videos on the topic, particularly on YouTube; most of the time I get excellent material on the topic. In order to read more, I search for the topic in PubMed and Google Scholar, which provide the scientific papers published on it. This way, I get most of the information on a topic. I learn most new topics and the latest information using the above process; students may adopt a similar strategy. If you are in research, go to PubMed and type the keywords; if you are entering the right keywords, you will get the published articles related to your topic. In case a reprint is not freely available, you may write or email to the authors for reprints; in most cases, authors send a reprint.

The only challenge for our students is to change their process of learning; most of our students learn in class, which is not practically possible in research. Most of us have a habit of spoon-feeding; for everything we go to the teacher and say, "please Sir, tell me this". That habit should be removed at the level of Ph.D. Self-learning is most important for a Ph.D. student or a researcher; it is nearly impossible to keep up with the times if we do not have self-learning capabilities. Students are fortunate that nowadays they are in this era of the internet, where you can get all the information you want; just put in the right keywords to search for information.

Question: A question that every bioinformatician in India and the world wants answered is how someone can join your lab. What are the criteria, or the things that you look for, in an ideal Ph.D. candidate or someone joining your lab?

Answer: That's an interesting question; I was thinking about whether I should answer it or not. The reason is that I am not too worried about students' qualifications; I feel any student who has a master's degree has a unique set of skills. For a Ph.D. in bioinformatics, ideally a student should have knowledge of both biology and computers, but that is rarely possible, so we prefer a student with a biology or computer background. If you are an M.Sc. holder, then you should qualify in the national exam for a fellowship to get admission at our institute. I have shown examples in the past where students who had been thrown out by some PIs, because those students were not good enough for them, performed excellently in my group. In my view, achieving high performance from highly talented students is not a challenge; the major challenge is to train an average student to achieve high performance. If any student comes to me and wants to do a Ph.D., for me that's a challenge. So, I work as a team with students rather than as a guide. I learn from students as much as they learn from me, so it is an exchange of knowledge in both directions. This is the reason we have been contributing to diverse areas successfully over the years. Rather than being a single source of knowledge, we prefer multiple sources of knowledge, learning in a networked fashion. For me, training a student is a service to society; if we train our students, then they will further contribute to society. This chain is

important for the growth of society and science. Frankly speaking, I am not interested in my career; I already got the job of a Government official in 1986. So, during my whole life, I just worked to provide service to the community. The service is in four or five forms. First is, basically, to train the manpower, I am providing training, whatever knowledge we are gaining in our group, we are also giving it to others, so the competition will also be there. It's not like that we set some expertise in our group and it is not available outside, so, whatever expertise we have developed in our group, we have shared it. Second thing, for me, my students are not just to do research for me, they are future of science, I try my best to make them future researcher. In the last 20 years, more than 30 students have completed their Ph.D. in my group as well as there is a number of students working on projects, without any internal fighting. One of the major reasons for the success of our group is that we learn how to work in a team; working in a team is our strength. Question: At last, I want to ask, what is your opinion about bioinformatics? Do you think it is just making the castle in the air or this is just prediction-based or simulationbased, and nothing more we can do with it? Answer: For me, bioinformatics is to extract/mine knowledge from biological data to provide service to the community. In simple, words we are providing an interface between user

and knowledge generated by experimental data. Sometimes, there is a misconception that with bioinformatics we can predict anything, which is not, I consider that bioinformatics will help you to prioritize things. Let's say if you are a biologist, you want to work on a problem, how you will plan your experiment so you can optimize your cost and time. For example, you want to identify epitopes in a protein of 200 amino acids of length 9 amino acids; there are nearly 192 possible combinations, which will take a lot of time and money. Alternatively, one can predict potential 10 epitopes in the lab, maybe all 10 will not be epitopes but at least 7 or 8 will be actual epitopes. This way we can save cost as well as time to perform experimental validation. In simple words, experimental research and bioinformatics are complementary to each other. Recently, we combined bioinformatics and experimental approach to discover drug delivery peptides. In this project, first, we developed highly accurate methods for predicting cell penetrating peptides then we scanned the whole SwissProt database to predict best cellpenetrating peptides in protein. We synthesized these predicted peptides and tested in wet-lab. It was observed that some of the peptides have better efficacy than any existing cell penetrating peptides discovered in the world. In contrast, our counterpart’s biologists are not able to discover these peptides over the years, which we were able to discover in the last 5 years. So, bioinformatics has a lot of power, we

Bioinformatics Review | 14


demonstrated that if you combine experimental science and this theoretical science, you can do better. This is the reason our papers are heavily cited by the scientific community. These papers are not being read by only bioinformaticists, they are also utilized by the biologists. Unfortunately, in India, we don't respect each other's fields, if you are a biologist, they would say, what is it they’re in bioinformatics and viceversa. That's why they are not collaborating, they are not taking the full advantage of each other. I believe in a few years, they will believe more on the bioinformaticians and they will utilize the knowledge for their own experimental work.



TOOLS

How to create a pangenome of isolated genome sequences using Roary and Prokka?


Roary is a pangenome pipeline that calculates the pangenome of a set of related prokaryotic isolates [1]. It takes annotated assemblies in gff3 format, as generated by Prokka [2], and produces the pangenome. The working methodology has been explained in our previous article. In this article, we will learn how to create the pangenome of a few isolated genome sequences using Roary [1] and Prokka [2].

Input for Roary

1. Genome sequences in the form of gff3 files.

Downloading the genome sequences

First, download the genome sequences you need, either manually or by using the ncbi-genome-download package, which provides several scripts for downloading genome sequences from the NCBI FTP servers. To install this package, open a terminal (Ctrl + Alt + T on Ubuntu) and type the following command:

$ pip install ncbi-genome-download
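Before going further, it may help to confirm that the command-line tools used in this tutorial are actually installed. The following is only a convenience sketch: the function name missing_tools is made up for illustration, and the tool names are simply the executables as they are invoked in this article.

```shell
# Print the name of every listed tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the tool can be located
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used in this tutorial; no output means all three were found.
missing_tools ncbi-genome-download prokka roary
```

If a name is printed, install that tool before continuing.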

After installing this package, you can download genome assemblies as required, such as fasta sequences of all bacteria, viral genomes, RefSeq genome sequences in GenBank format, fungal genomes, and so on. (Remember: if you plan to download gff3 files, you also need to download the GenBank files with the nucleotide sequences, because gff3 files on the NCBI website contain annotation only.) I will download bacterial sequences in fasta format using the following command (the example that follows is shown with only a few of these sequences):

$ ncbi-genome-download --format fasta bacteria
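The Prokka command used later in this article is run on an uncompressed .fna file. As a minimal sketch, assuming the download step left gzip-compressed files named *_genomic.fna.gz in nested per-assembly subfolders (check your own download folder, as the layout may differ between versions), the files could be gathered and decompressed as follows. The function name gather_fasta and the folder names in the usage comment are hypothetical.

```shell
# Collect compressed FASTA files from a nested download tree into one folder
# and decompress them there. Assumes files end in "_genomic.fna.gz".
gather_fasta() {
    src_dir=$1    # top folder written by the download step (assumed layout)
    dest_dir=$2   # working folder for annotation, e.g. "example"
    mkdir -p "$dest_dir"
    find "$src_dir" -type f -name '*_genomic.fna.gz' | while IFS= read -r gz; do
        # Keep the original file name, minus the .gz suffix
        out="$dest_dir/$(basename "$gz" .gz)"
        gunzip -c "$gz" > "$out"
    done
}

# Hypothetical usage:
#   gather_fasta ./refseq ./example
```

The compressed originals are left untouched, so the step can be repeated safely.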

Annotating the genome sequences

Go into the Roary directory, create a new folder (let's name it example), and save the downloaded sequences there. After downloading, you will see many fasta files in this folder. Now annotate them to determine the attributes and locations of the genes they contain, and to obtain the gff3 files that are used as input for Roary. This can be easily done with Prokka [2]. Open the terminal and type the following commands:

$ cd Downloads/Roary/example/
$ prokka --kingdom Bacteria --outdir prokka_GCA_000006285 --genus Salmonella --locustag GCA_000006285 GCA_000006285.2_ASM828v3_genomic.fna
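When there are many genomes, the same Prokka command can be wrapped in a loop so that each assembly gets its own output directory and locus tag. This is a sketch under stated assumptions, not a definitive recipe: the function name annotate_all is hypothetical, the input files are assumed to follow the *_genomic.fna naming seen in the command above, and the accession is taken as everything before the first dot in the file name.

```shell
# Annotate every *_genomic.fna file in a folder with Prokka, using one
# output directory and one locus tag per assembly.
annotate_all() {
    fasta_dir=$1   # folder holding the .fna files
    genus=$2       # genus name passed on to Prokka
    for fna in "$fasta_dir"/*_genomic.fna; do
        [ -e "$fna" ] || continue     # skip if the pattern matched nothing
        name=$(basename "$fna")
        # GCA_000006285.2_ASM828v3_genomic.fna -> GCA_000006285
        acc=${name%%.*}
        prokka --kingdom Bacteria \
               --outdir "$fasta_dir/prokka_$acc" \
               --genus "$genus" \
               --locustag "$acc" \
               "$fna"
    done
}

# Hypothetical usage, run from the folder that contains "example":
#   annotate_all ./example Salmonella
```

Only the options already used in the single-genome command appear here; any further options would need to be added by hand.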

You can add further details, such as organism information (genus, species, etc.). Make sure you annotate all the genome sequences you are working with, and remember to change the output directory name, locus tag, and assembly version accordingly. After running this command, a new directory will be created for each sequence, containing 12 files with different extensions, including the gff3 file.

Creating pangenome/Running Roary

We now have the gff3 files of the genome sequences in their respective directories. Copy the gff3 file from each directory into another directory (let's say, gff_all). After that, open the terminal again and type the following command to run Roary:

$ roary -f ./tutorial -e -n -v ./gff_all/*.gff
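The copy step that precedes the roary command can also be scripted. One point to watch: unless a prefix is set, Prokka names its output files after the run date, so gff files from genomes annotated on the same day may share a name and overwrite each other when copied into one folder. The sketch below, with the hypothetical function name collect_gff, assumes output directories named prokka_<accession> as in the annotation command and renames each copy after its source directory.

```shell
# Copy the .gff file from each prokka_* directory into one folder,
# naming each copy after its source directory so names cannot collide.
collect_gff() {
    base_dir=$1   # folder holding the prokka_* output directories
    dest_dir=$2   # target folder, e.g. gff_all
    mkdir -p "$dest_dir"
    for dir in "$base_dir"/prokka_*/; do
        [ -d "$dir" ] || continue
        label=$(basename "$dir")
        label=${label#prokka_}        # prokka_GCA_000006285 -> GCA_000006285
        for gff in "$dir"*.gff; do
            [ -e "$gff" ] || continue
            cp "$gff" "$dest_dir/$label.gff"
        done
    done
}

# Hypothetical usage:
#   collect_gff ./example ./gff_all
```

Roary can then be pointed at ./gff_all/*.gff exactly as in the command above.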

At this stage, Roary will extract all the coding sequences, translate them into protein sequences, and generate preclusters. After that, Roary will look for paralogs using blastp [3] and will create clusters using MCL [4]. Finally, it will take every isolate and order them according to the presence/absence of orthologs. This will take time, depending on the number of sequences (gff3 files) you are using. If you want to create a pangenome without the core alignment, use the following command:

$ roary -f ./tutorial -v ./gff_all/*.gff

If you want to change the percentage identity of blastp (going below 90% is not advised), use the following command:

$ roary -f ./tutorial -i 90 -v ./gff_all/*.gff

These commands will create a new directory called tutorial (the name given in the command), where all the result files can be found. You can see the summary statistics in the file named 'summary_statistics.txt'; it will look like this:

Core genes      (99% <= strains <= 100%)    2031
Softcore genes  (95% <= strains < 99%)      0
Shell genes     (15% <= strains < 95%)      2497
Cloud genes     (0% <= strains < 15%)       0
Total genes     (0% <= strains <= 100%)     4528
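A quick sanity check on summary_statistics.txt is that the core, softcore, shell, and cloud counts add up to the total; with the example figures, 2031 + 0 + 2497 + 0 = 4528. The sketch below is one way to automate that check. The function name check_summary is hypothetical, and it assumes the gene count is the last whitespace-separated field on each line.

```shell
# Check that the gene categories in a Roary summary file add up to the total.
# Assumes the count is the last whitespace-separated field on every line.
check_summary() {
    awk '
        /^Total genes/ { total = $NF; next }   # remember the reported total
        /genes/        { sum += $NF }          # add up every other category
        END {
            printf "categories=%d total=%d\n", sum, total
            exit (sum == total ? 0 : 1)        # non-zero status on mismatch
        }
    ' "$1"
}

# Hypothetical usage:
#   check_summary ./tutorial/summary_statistics.txt
```

A non-zero exit status would suggest that the file was truncated or edited.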

Visualizing results

You will also find other output files, such as 'gene_presence_absence.csv' and 'accessory_binary_genes.fa.newick'. The 'roary_plots.py' script (written by Marco Galardini), which is present inside the directory named contrib in the main Roary directory, is used to visualize the results. Open the terminal, go into the tutorial directory (where all the result files are present), and type the following:

$ cd tutorial
$ /home/user/Downloads/roary/contrib/roary_plots/roary_plots.py accessory_binary_genes.fa.newick gene_presence_absence.csv

Three png files will be added to the same tutorial directory: pangenome_frequency.png (Fig. 1), pangenome_matrix.png (Fig. 2), and pangenome_pie.png (Fig. 3), as shown below.




Fig. 1 shows the number of genes present in each genome sequence.

Fig. 2 shows gene clusters.

Fig. 3 is a pie chart showing the different kinds of genes present in the genome sequences.

Additionally, you can visualize the Newick file in phylogeny software such as MEGA for further analysis.

This article demonstrated the creation of a pangenome of isolated genome sequences using Roary. In case of any query, please write to us at info@bioinformaticsreview.com or tariq@bioinformaticsreview.com.

References

1. Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T., ... & Parkhill, J. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22), 3691-3693.
2. Seemann, T. (2014). Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), 2068-2069.
3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410.
4. van Dongen, S. (2000). Graph clustering by flow simulation. PhD thesis, University of Utrecht.



