How to cluster peptide/protein sequences using cd-hit software?
Cd-hit is one of the most widely used programs to cluster biological sequences [1]. It helps in removing redundant sequences, which gives better results in later sequence analyses. Cd-hit performs sequence-based clustering at an identity cut-off provided as an input. It uses a greedy incremental clustering algorithm and reports a representative sequence for each cluster. In this article, we will learn how to cluster a set of protein sequences using cd-hit software.

The cd-hit package has several programs for clustering different kinds of sequences. For example, the cd-hit program clusters peptide sequences and cd-hit-est clusters nucleotide sequences. The package can also compare two databases, using cd-hit-2d for peptide databases and cd-hit-est-2d for nucleotide databases [1]. In this tutorial, we use the cd-hit program to cluster a group of peptide sequences. The complete cd-hit package can be downloaded from the cd-hit website.

Prepare input file

The input file consists of all the peptide or protein sequences in FASTA format. There is no need to reformat the FASTA headers of the sequences. The software handles them on its own.

Basic commands

$ cd-hit -i input.fasta -o db100 -c 1.00 -n 5 -M 2000

where,
-i = input file
-o = output file name
-c = sequence identity cut-off
-n = word size:
     n=5 for thresholds 0.7 ~ 1.0
     n=4 for thresholds 0.6 ~ 0.7
     n=3 for thresholds 0.5 ~ 0.6
     n=2 for thresholds 0.4 ~ 0.5
-M = maximum available memory (in MB)

To cluster the sequences at a 97% identity cut-off:

$ cd-hit -i input.fasta -o db97 -c 0.97 -n 5 -M 2000

Output

The output of cd-hit consists of two files:
1. A FASTA file containing the representative sequence of each cluster.
2. A text file (.clstr) listing all the clusters, in which the representative sequence of each cluster is marked with a '*' at the end of its line.
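The cluster listing can be read back into a script for later analysis. Below is a minimal Python sketch, assuming the usual .clstr layout of a ">Cluster N" line followed by one line per member. The function name parse_clstr and the sample text are made up for illustration and are not part of cd-hit.

```python
import re

# One member line of a .clstr file, for example:
#   "0\t120aa, >seqA... *"            (representative, marked with '*')
#   "1\t118aa, >seqB... at 98.31%"    (other member of the cluster)
MEMBER = re.compile(r"^\d+\s+\d+(?:aa|nt), >(.+?)\.\.\. (\*|at .+)$")


def parse_clstr(text):
    """Map each representative id to the ids of all members of its cluster."""
    clusters = []  # one [representative, members] pair per cluster
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters.append([None, []])
            continue
        match = MEMBER.match(line)
        if match is None or not clusters:
            raise ValueError(f"unexpected line in .clstr file: {raw!r}")
        seq_id, status = match.groups()
        clusters[-1][1].append(seq_id)
        if status == "*":
            clusters[-1][0] = seq_id
    return {rep: members for rep, members in clusters}


# Hypothetical sample in the .clstr layout (not real cd-hit output).
sample = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 98.31%\n"
    ">Cluster 1\n"
    "0\t95aa, >seqC... *\n"
)

for rep, members in parse_clstr(sample).items():
    print(rep, len(members), ",".join(members))
```

The ids are taken exactly as they appear in the .clstr file, so they may be shortened versions of the original FASTA headers.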
There are many other options that you can define on the command line, including -G to use global sequence identity, -t to set the tolerance for redundancy, -l to set the length of throw_away_sequences, and -d to adjust the length of the sequence description in the .clstr output file. You can read about the command-line options either in the user guide provided at the cd-hit website (http://www.bioinformatics.org/cd-hit/cd-hit-user-guide.pdf) or by entering the help command ($ cd-hit -h).
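The word-size table given with the basic command can be turned into a small helper, so that -n always matches the chosen -c. This is only a sketch: the function names are hypothetical, and treating a boundary value such as 0.7 as belonging to the higher range is an assumption, since the ranges in the table overlap at their ends.

```python
import shlex


def word_size(cutoff):
    """Pick the cd-hit word size (-n) for an identity cut-off (-c)."""
    if not 0.4 <= cutoff <= 1.0:
        raise ValueError("cut-off must lie between 0.4 and 1.0")
    if cutoff >= 0.7:  # assumption: boundary values go to the higher range
        return 5
    if cutoff >= 0.6:
        return 4
    if cutoff >= 0.5:
        return 3
    return 2


def cd_hit_command(fasta, output, cutoff, memory_mb=2000):
    """Build the cd-hit argument list used in this tutorial."""
    return [
        "cd-hit",
        "-i", fasta,
        "-o", output,
        "-c", str(cutoff),
        "-n", str(word_size(cutoff)),
        "-M", str(memory_mb),
    ]


# Print the command line for a 97% cut-off without running it.
print(shlex.join(cd_hit_command("input.fasta", "db97", 0.97)))
```

The resulting list can be passed to subprocess.run once cd-hit is installed and available on the PATH.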
References
1. Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658-1659.
How to blast against a particular set of local sequences (local database)?
BLAST [1,2] is a local alignment tool widely used as a preliminary step for the identification of gene or protein functions. The command-line package of NCBI-Blast offers several useful features. These include making a BLAST database from a set of nucleotide or protein sequences, blasting a query sequence against that database, and running an all-against-all blast. In this article, these commands are explained.

The NCBI-Blast+ package [3] is freely accessible and can be downloaded from the NCBI website. Both Linux and Windows packages are available.

A BLAST database made up of the local sequences is required in order to blast a single query sequence or multiple sequences against them. To make a BLAST database, open a terminal and type the following commands.

1. Making BLAST database of local sequences

The input file must consist of sequences in FASTA format.

$ makeblastdb -in input.fasta -parse_seqids -dbtype prot -out blastdb

Here, -in refers to the input file, -dbtype is the sequence type (prot for protein or nucl for nucleotide), and -out is the name of the BLAST database to be created. -parse_seqids is used because it may later help in parsing the sequence ids of the given sequences for further analyses. If your input file is in another directory, provide its complete path.

2. BLAST a single sequence against the local database

$ blastp -db blastdb -query seq.fasta -outfmt 0 -out result.txt -num_threads 4

where, -db is the BLAST database created in the previous step, -query is a file containing the query sequence in FASTA format, -outfmt is the output format, which can be defined in several ways (0 is the default pairwise format), and -num_threads refers to the number of CPU threads to be used during the search. In the case of nucleotide sequences, use blastn or any other appropriate blast executable.
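The two steps above can also be chained from a script. The following Python sketch assumes that makeblastdb and blastp are installed and on the PATH. The function names are hypothetical, and the file names are the ones used in this article.

```python
import shlex
import subprocess


def makeblastdb_command(fasta, db_name, dbtype="prot"):
    """Argument list for building a local BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta,
        "-parse_seqids",
        "-dbtype", dbtype,
        "-out", db_name,
    ]


def blastp_command(db_name, query, out_file, threads=4):
    """Argument list for searching the local database with blastp."""
    return [
        "blastp",
        "-db", db_name,
        "-query", query,
        "-outfmt", "0",
        "-out", out_file,
        "-num_threads", str(threads),
    ]


def run_search(fasta, query, db_name="blastdb", out_file="result.txt"):
    """Build the database, then search it; stops if either step fails."""
    steps = (
        makeblastdb_command(fasta, db_name),
        blastp_command(db_name, query, out_file),
    )
    for command in steps:
        print("running:", shlex.join(command))
        subprocess.run(command, check=True)
```

Calling run_search("input.fasta", "seq.fasta") would then run both commands in order. Nothing is executed until the function is called.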
3. All-against-all BLAST
To BLAST local sequences against the local database created from the same input sequences, the input sequences are used as a query file in FASTA format.
$ blastp -db blastdb -query input.fasta -outfmt 0 -out result.txt -num_threads 4
As you can see in the above command, the database is the same local database created in the first step, and the query is the set of input sequences from which that database was created.
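In an all-against-all search every sequence also hits itself, so these self-hits usually have to be dropped before further analysis. The sketch below assumes the search was run with -outfmt 6 (tab-separated output with the twelve default columns) instead of the -outfmt 0 used above. The function name and the sample rows are made up for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def non_self_hits(tabular_text):
    """Return (query, subject, percent identity, e-value) for hits between different sequences."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(COLUMNS, row))
        if record["qseqid"] == record["sseqid"]:
            continue  # a sequence matching itself
        hits.append((
            record["qseqid"],
            record["sseqid"],
            float(record["pident"]),
            float(record["evalue"]),
        ))
    return hits


# Hypothetical rows in -outfmt 6 layout (not real BLAST results).
sample = (
    "seqA\tseqA\t100.000\t120\t0\t0\t1\t120\t1\t120\t1e-80\t240\n"
    "seqA\tseqB\t98.305\t118\t2\t0\t1\t118\t1\t118\t2e-75\t225\n"
)

for hit in non_self_hits(sample):
    print(hit)
```

Ids are compared as plain strings, so the query and subject ids must be written identically in the output for a self-hit to be recognised.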
If you want to use the Windows version, then run the same commands by providing the path to the executables. The installation tutorial will be explained in the upcoming article.
References
1. Altschul, S. F. (2001). BLAST algorithm. eLS.
2. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
3. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: architecture and applications. BMC bioinformatics, 10(1), 421.