SEPTEMBER 2019 VOL 5 ISSUE 9
“I believe there are no
questions that science can't answer about a physical universe.” -
Stephen Hawking
How to cluster peptide/protein sequences using cd-hit software?
How to search motif pattern in FASTA sequences using Perl hash?
Contents
September 2019
Editorial
Bioinformatics Programming
    How to search motif pattern in FASTA sequences using Perl hash?
Software
    How to cluster peptide/protein sequences using cd-hit software?
Tools
    How to blast against a particular set of local sequences (local database)?
FOUNDER
TARIQ ABDULLAH

EDITORIAL
EXECUTIVE EDITOR: TARIQ ABDULLAH
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: FOZAIL AHMAD ALTAF ABDUL KALAM MANISH KUMAR MISHRA SANJAY KUMAR PRAKASH JHA NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to info@bioinformaticsreview.com. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91. 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA trust. Published in India.
EDITORIAL
Should predatory journals be eliminated completely from the research community?
Muniba Faiza
Founding Editor
The rapid emergence of predatory journals is a new problem in scientific research. These journals send attractive advertising emails to authors to solicit money and, in return, provide no proper service, not even a peer review of the submitted article. As a result, they publish studies indiscriminately, with the sole purpose of extracting money from authors. This is lowering the quality of research in the scientific field.

Many of us are aware of Beall's list [1], created by Jeffrey Beall, an academic librarian at the University of Colorado in Denver. As mentioned in one of our previous articles, it lists predatory journals and publishers, and it is a valuable tool for researchers who want to recognize them. The list has grown continuously over time. However, Beall's list has a few shortcomings for which some researchers and publishers have criticized it. These include the weak methodology used to classify a journal as predatory, and some journals pointed out that Beall added newly started journals without contacting them or discussing their publishing policies.

Now, various scientists are objecting to predatory journals [2], and some of them have urged publishing companies to establish standards so that predatory journals can be easily differentiated from genuine scientific ones [3]. These journals are causing great harm to the scientific community. Some predatory journals even display a fake impact factor on their website, higher than that of genuine scientific journals, just to attract authors. This only increases the number of publications in the scientific literature without any significant contribution. The major problems and harms caused by these journals include a lack of significant information, scarce and irrelevant data, damaged reputations, and a lack of knowledge and quality control.

Young researchers need to think about the harmful effects of publishing in predatory journals. What is the use of research that cannot benefit the scientific community? They must also be aware of the damage to their own reputation.

Journal Citation Reports (JCR) provides a list of journals with an official impact factor, but most new researchers are not aware of this. The question that arises is whether predatory journals should be eliminated completely, or whether scientific journals and publishers should strengthen the open-access concept and contribute significantly to the scientific community. Further, there is a strong demand for a new system to identify predatory journals and the articles published in them.

References
1. Butler, D. (2013). Investigating journals: The dark side of publishing. Nature News, 495(7442), 433.
2. Richtig, G., Berger, M., Lange-Asschenfeldt, B., Aberer, W., & Richtig, E. (2018). Problems and challenges of predatory journals. Journal of the European Academy of Dermatology and Venereology, 32(9), 1441-1449.
3. Strielkowski, W. (2017). Predatory journals: Beall's List is missed. Nature, 544(7651), 416.

Letters and responses: info@bioinformaticsreview.com
BIOINFORMATICS PROGRAMMING
How to search motif pattern in FASTA sequences using Perl hash?
“A simple Perl script to search for motif patterns in a large FASTA file with multiple sequences.”
Here is a simple Perl script to search for motif patterns in a large FASTA file with multiple sequences.
Suppose your multi-FASTA file is "input.fa", in which you want to search for the motif patterns.

Case-I Search a pre-defined motif pattern.

    use strict 'vars';
    use warnings;

    # Motif to search for, written as a Perl regular expression
    my $regex = "motif_pattern";

    # Hash of header => sequence for every record in the file
    my %sequences = %{ read_fasta_file('input.fa') };

    # Redirect everything printed below to output.fa
    open( STDOUT, ">", "output.fa" ) or die $!;

    foreach my $header ( keys %sequences ) {
        if ( $sequences{$header} =~ /$regex/ ) {
            print $header, "\n";
            print $sequences{$header}, "\n";
        }
    }

    # Read a multi-FASTA file into a hash keyed by header line
    sub read_fasta_file {
        my $filename       = shift;
        my $current_header = '';
        my %sequences;
        open( my $fh, "<", $filename ) or die $!;
        while ( my $line = <$fh> ) {
            chomp $line;
            if ( $line =~ /^(>.*)$/ ) {
                $current_header = $1;
            }
            elsif ( $line !~ /^\s*$/ ) {    # skip blank lines
                $sequences{$current_header} .= $line;
            }
        }
        close($fh) or die $!;
        return \%sequences;
    }
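The placeholder "motif_pattern" stands for any regular expression. As an illustration only, the sketch below (written in Python, whose character-class syntax is the same as Perl's for this case) uses the regex form of the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} and reports where it matches. The peptide and the helper name find_motif are made up for the example.

```python
import re

# Regex form of the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}:
# N, then any residue except P, then S or T, then any residue except P.
MOTIF = "N[^P][ST][^P]"


def find_motif(sequence, pattern=MOTIF):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, sequence)]


# Hypothetical peptide used only for illustration.
peptide = "MKNGSAPLNPTQWNVTR"
for start, hit in find_motif(peptide):
    print(start, hit)
```

The same pattern string can be assigned to $regex in the Perl script.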
Case-II Search a user input motif pattern.

    use strict 'vars';
    use warnings;

    # Ask for the motif before STDOUT is redirected to the output file
    print "Enter a motif pattern to search: ";
    my $regex = <STDIN>;
    chomp $regex;    # drop the trailing newline from the typed pattern

    my %sequences = %{ read_fasta_file('input.fa') };

    # Redirect everything printed below to output.fa
    open( STDOUT, ">", "output.fa" ) or die $!;

    foreach my $header ( keys %sequences ) {
        if ( $sequences{$header} =~ /$regex/ ) {
            print $header, "\n";
            print $sequences{$header}, "\n";
        }
    }

    # Read a multi-FASTA file into a hash keyed by header line
    sub read_fasta_file {
        my $filename       = shift;
        my $current_header = '';
        my %sequences;
        open( my $fh, "<", $filename ) or die $!;
        while ( my $line = <$fh> ) {
            chomp $line;
            if ( $line =~ /^(>.*)$/ ) {
                $current_header = $1;
            }
            elsif ( $line !~ /^\s*$/ ) {    # skip blank lines
                $sequences{$current_header} .= $line;
            }
        }
        close($fh) or die $!;
        return \%sequences;
    }

Save this script with a .pl extension and run it as perl script.pl in a terminal (Linux) or in the command prompt (Windows).
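For comparison, the hash-based approach can be sketched in Python, where a dict plays the role of the Perl hash. This is an illustrative sketch rather than a replacement for the script: the function names read_fasta and search_motif and the sample records are assumptions, and the FASTA text is read from an in-memory string so that the example runs without input.fa.

```python
import io
import re


def read_fasta(handle):
    """Map each header line (including '>') to its concatenated sequence."""
    sequences = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line
    return sequences


def search_motif(sequences, pattern):
    """Keep only the records whose sequence contains the motif."""
    regex = re.compile(pattern)
    return {h: s for h, s in sequences.items() if regex.search(s)}


# Hypothetical two-record multi-FASTA input.
sample = io.StringIO(">seq1\nMKTAYIAK\nQRQISFVK\n\n>seq2\nGGGGSGGGG\n")
records = read_fasta(sample)
for header, seq in search_motif(records, "AKQR").items():
    print(header)
    print(seq)
```

In this sample the motif AKQR spans a line break in seq1, which is why the sequence lines are joined before matching.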
SOFTWARE
How to cluster peptide/protein sequences using cd-hit software?
“Cd-hit is used for sequence-based clustering, grouping sequences at a cut-off provided as an input. It uses a greedy incremental clustering algorithm and finds a representative sequence for each cluster.”
Cd-hit is one of the most widely used programs to cluster biological sequences [1]. It helps remove redundant sequences and so improves the results of downstream sequence analyses. Cd-hit performs sequence-based clustering, grouping sequences at a cut-off provided as an input. It uses a greedy incremental clustering algorithm and finds a representative sequence for each cluster. In this article, we will learn how to cluster a set of protein sequences using cd-hit software.

The cd-hit package has many programs for clustering different kinds of sequences. For example, the cd-hit program clusters peptide sequences, cd-hit-est clusters nucleotide sequences, and cd-hit-2d and cd-hit-est-2d compare two peptide or two nucleotide databases, respectively [1]. In this tutorial, we use the cd-hit program to cluster a group of peptide sequences. The complete cd-hit package can be downloaded from the cd-hit website.

Prepare input file
The input file consists of all the peptide or protein sequences in FASTA format. There is no need to format the FASTA headers of the sequences; the software manages them on its own.

Basic commands
    $ cd-hit -i input.fasta -o db100 -c 1.00 -n 5 -M 2000
where,
    -i = input
    -o = output
    -c = cut-off (sequence identity threshold)
    -n = word size:
         n=5 for thresholds 0.7 ~ 1.0
         n=4 for thresholds 0.6 ~ 0.7
         n=3 for thresholds 0.5 ~ 0.6
         n=2 for thresholds 0.4 ~ 0.5
    -M = maximum available memory (in MB)

To cluster the sequences at a 97% similarity cut-off
    $ cd-hit -i input.fasta -o db97 -c 0.97 -n 5 -M 2000

Output
The output of cd-hit provides two different files:
1. A FASTA file of the representative sequences of all the clusters.
2. A text file (.clstr) listing all the clusters, in which the representative sequence of each cluster is marked with a '*' at the end of its line.

There are many other options that you can define on the command line, including -G to use global sequence identity, -t to set tolerance for redundancy, -l to set length of throw_away_sequences, and -d to adjust the description of sequences in the .clstr output file. You can read about the command-line options either in the user guide provided at the cd-hit website (http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf) or by entering the help command ($ cd-hit --help).
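The .clstr file is plain text and easy to post-process. The Python sketch below assumes the usual layout of that file: a ">Cluster N" line followed by one line per member, with the sequence name between ">" and "..." and a trailing "*" on the representative. The function name parse_clstr and the sample records are hypothetical.

```python
import io
import re

# Member lines look like "0\t120aa, >seq1... *" or "1\t118aa, >seq2... at 98.31%"
MEMBER = re.compile(r">(.+?)\.\.\.\s*(\*|at\s.+)$")


def parse_clstr(handle):
    """Return {cluster_id: {"representative": name, "members": [names]}}."""
    clusters = {}
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER.search(line)
        if match and current is not None:
            name, status = match.groups()
            clusters[current]["members"].append(name)
            if status == "*":
                clusters[current]["representative"] = name
    return clusters


# Hypothetical .clstr content for illustration.
SAMPLE = (
    ">Cluster 0\n"
    "0\t120aa, >seq1... *\n"
    "1\t118aa, >seq2... at 98.31%\n"
    ">Cluster 1\n"
    "0\t95aa, >seq3... *\n"
)
for cid, info in parse_clstr(io.StringIO(SAMPLE)).items():
    print(cid, info["representative"], len(info["members"]))
```

A summary like this makes it easy to check how many sequences each representative stands for before using the reduced FASTA file.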
References
1. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.
TOOLS
How to blast against a particular set of local sequences (local database)?
“The command-line package of NCBI-Blast offers several useful features. These features include making a BLAST database of a set of nucleotide or protein sequences, blast a query sequence against them or all-against-all blast.”
BLAST [1,2] is a local alignment tool widely used as a preliminary step for the identification of gene or protein functions. The command-line package of NCBI-BLAST offers several useful features. These include making a BLAST database from a set of nucleotide or protein sequences, running a query sequence against that database, and running an all-against-all BLAST. In this article, these commands are explained. The NCBI-BLAST+ package [3] is freely accessible and can be downloaded from the NCBI website. Both Linux and Windows packages are available.

To BLAST a single query sequence or multiple sequences against a set of local sequences, a BLAST database must first be built from those sequences. To make one, open a terminal and type the following commands.

1. Making BLAST database of local sequences
The input file must consist of sequences in FASTA format.
    $ makeblastdb -in input.fasta -parse_seqids -dbtype prot -out blastdb
Here, -parse_seqids is used because it may later help in parsing the sequence ids of the given sequences for further analyses. -in refers to the input file, -dbtype can be protein (prot) or nucleotide (nucl), and -out is the name of the BLAST database to be created. If your input file is in another directory, provide the complete path.

2. BLAST a single sequence against the local database
    $ blastp -db blastdb -query seq.fasta -outfmt 0 -out result.txt -num_threads 4
where, -db is the BLAST database created in the previous step, -query is a file containing the query sequence in FASTA format, -outfmt is the output format, which can be set in several ways, and -num_threads is the number of CPUs to be used during the search. In the case of nucleotide sequences, use blastn or any other appropriate BLAST executable.
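The pairwise report produced by -outfmt 0 is meant for reading. When the hits need further processing, BLAST+ can instead write tab-separated output with -outfmt 6, whose twelve default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The Python sketch below parses such a file; the function name parse_tabular, the e-value cut-off and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_tabular(handle, max_evalue=1e-5, skip_self=False):
    """Yield hits as dicts, keeping those with evalue <= max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # ignore blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if skip_self and hit["qseqid"] == hit["sseqid"]:
            continue  # a sequence matching itself, as in all-against-all runs
        if hit["evalue"] <= max_evalue:
            yield hit


# Hypothetical rows for illustration.
SAMPLE = (
    "seq1\tseq1\t100.000\t120\t0\t0\t1\t120\t1\t120\t1e-80\t240\n"
    "seq1\tseq2\t85.000\t100\t15\t0\t5\t104\t3\t102\t2e-40\t150\n"
    "seq1\tseq3\t30.000\t50\t35\t0\t10\t59\t20\t69\t0.5\t25.4\n"
)
for hit in parse_tabular(io.StringIO(SAMPLE), skip_self=True):
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

The skip_self switch is useful when the query sequences are also in the database, because each sequence then reports itself as its own best hit.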
3. All against all
To BLAST local sequences against the local database created from the same input sequences, the input sequences are used as the query file in FASTA format.
    $ blastp -db blastdb -query input.fasta -outfmt 0 -out result.txt -num_threads 4
As you can see in the above command, the database is the same local database created in the first step, and the query is the set of input sequences from which that database was created. If you want to use the Windows version, run the same commands by providing the path to the executables. The installation tutorial will be explained in an upcoming article.

References
1. Altschul, S. F. (2001). BLAST algorithm. eLS.
2. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
3. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC bioinformatics, 10(1), 421.
Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and never miss out on any of your favorite topics. Log on to https://www.bioinformaticsreview.com