EXPLORING DATABASE AND ANALYZING PROTEIN SEQUENCE
COURSE: GEB-207
DEPARTMENT OF GENETIC ENGINEERING AND BIOTECHNOLOGY, UNIVERSITY OF DHAKA
AUBHISHEK ZAMAN
ROLL:O8
4/26/2009
Contents
Chapter 1: About Bioinformatics
1.1 General
1.2 Resources
1.2.1 Gateways
1.2.2 Database
1.2.3 Software or Tools
1.3 Application and Importance
1.4 Project Aim
Chapter 2: Working with Protein Sequences
2.1 General
2.2 Fetching protein sequence from Database
2.2.1 Database
2.2.2 Method
2.2.3 Result
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general, physical, chemical properties of a protein sequence
2.3.1.1 Software/Tools
2.3.1.2 Method
2.3.1.3 Result
2.3.2 Searching Database for similar sequences
2.3.2.1 Software/Tools
2.3.2.2 Methods
2.3.2.3 Result
2.3.3 Sequence Alignment Study
2.3.3.1 Pair-wise alignment
2.3.3.1.1 Software/Programs
2.3.3.1.2 Methods
2.3.3.1.3 Results
2.3.3.2 Multiple Sequence Alignment
2.3.3.2.1 Software/Programs
2.3.3.2.2 Methods
2.3.3.2.3 Results
2.3.4 Phylogenetic tree construction
2.3.4.1 Software/Tools
2.3.4.2 Methods
2.3.4.3 Result
2.3.5 Secondary Structure Prediction
2.3.5.1 Software/Tools
2.3.5.2 Methods
2.3.5.3 Result
Chapter 3: Discussion
3.1 General
3.2 Exploring Database
3.3 Analyzing Protein Sequences
3.4 Conclusion
List of abbreviations
ABBREVIATION: ELABORATION
BLAST: Basic Local Alignment Search Tool
DDBJ: DNA Data Bank of Japan
EBI: European Bioinformatics Institute
EMBL: European Molecular Biology Laboratory
ExPASy: Expert Protein Analysis System
HNN: Hierarchical Neural Network
NCBI: National Center for Biotechnology Information
NCI: National Cancer Institute
NIH: National Institutes of Health
NLM: United States National Library of Medicine
PDB: Protein Data Bank
PSI-BLAST: Position-Specific Iterated BLAST
PSIPRED: PSI-BLAST based secondary structure prediction
SIB: Swiss Institute of Bioinformatics
URL: Uniform Resource Locator
List of figures
Figure no.: Name of figure
1.1: Bioinformatics, an interdisciplinary subject
1.2: Submission and updates between three databases
1.3: Use of informatics in drug designing
1.4: The catalytic mechanism of chymotrypsin
1.5: The overview of the project
2.1: The flow of data from primary data sources into component databases of the Universal Protein Resource
2.2: FASTA format result of P00766
2.3: Graphical presentation of BLASTp results
2.4: Graphical presentation of PSI-BLAST search result
2.5: Graphical representation of pair-wise alignment
2.6: Algorithm of a software performing multiple sequence alignment
2.7: Multiple Sequence Alignment (MSA)
2.8: Multiple Sequence Alignment (MSA) Jalview results
2.9: Newick presentation
2.10: Phylogenetic tree (cladogram) from homologous sequences of P00766
2.11: Phylogenetic tree (phylogram) from homologous sequences of P00766
2.12: Phylogenetic tree by Jalview
2.13: The graphical presentation of HNN
2.14: Secondary structure by HNN
2.15: Secondary structure by PSIPRED
List of Tables
Table no.: Name of table
1.1: Tools at EBI
1.2: Available tools at Bioinformatics Group - University College London
1.3: Primary Sequence Databases
1.4: Meta-bases
1.5: Software used in the project
2.1: Pair-wise alignment results for retrieved sequences to identify similarities
3.1: Different uses of BLAST programs
List of Web Addresses:
• http://www.ncbi.nlm.nih.gov
• http://www.ebi.ac.uk
• http://bioinf.cs.ucl.ac.uk/psipred/
• http://www.expasy.org
• http://www.pdb.org/
Reference Source:
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. A.D. Baxevanis and B.F.F. Ouellette
• Discovering Genomics, Proteomics and Bioinformatics. A.M. Campbell and L.J. Heyer
• Post-genome Informatics. Minoru Kanehisa
• Bioinformatics: Sequence and Genome Analysis. D.W. Mount
• Bioinformatics for Dummies. J.-M. Claverie and C. Notredame
• www.wikipedia.org
Chapter 1: About Bioinformatics
Bioinformatics is an interdisciplinary subject. It may be described as a blend of the biological and computational sciences. Bioinformatics involves storing, retrieving and manipulating biological data using computational techniques.
[Figure 1.1: Bioinformatics, an interdisciplinary subject. The diagram places bioinformatics at the intersection of computer science, biology, mathematics and statistics.]
1.1 General
Biological data are flooding in at an unprecedented rate. For example, as of August 2000 the GenBank repository of nucleic acid sequences contained 8,214,000 entries and the SWISS-PROT database of protein sequences contained 88,166. On average, the amount of information stored in these databases is doubling every 15 months. Bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying informatics techniques (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications. "Bio" stands for life and "informatics" comes from the word information, so bioinformatics refers to the science that deals with the information that comes from living systems.
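The doubling figure quoted above can be turned into a rough projection. The following is a minimal sketch, assuming purely exponential growth with a constant 15-month doubling time; the function name `projected_entries` is illustrative, and the result is an extrapolation rather than a measured database size.

```python
def projected_entries(initial, months, doubling_months=15.0):
    """Project a database size, assuming it doubles every `doubling_months` months."""
    return initial * 2 ** (months / doubling_months)


# GenBank held 8,214,000 entries in August 2000 (figure quoted in the text).
genbank_aug_2000 = 8_214_000

# Two doubling periods (30 months) give four times the starting size under this model.
print(round(projected_entries(genbank_aug_2000, 30)))
```

Real databases do not grow at a perfectly constant rate, so a projection of this kind is only useful for a sense of scale.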
The National Center for Biotechnology Information (NCBI) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."
The terms bioinformatics and computational biology are often used interchangeably. However, bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data. Important sub-disciplines within bioinformatics and computational biology include:
• the development and implementation of tools that enable efficient access to, and use and management of, various types of information
• the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences
Storing, retrieving and manipulating biological data in a meaningful way to interpret biological systems is the prime objective of bioinformatics. In the initial phase, the data produced by thousands of research teams all over the world are collected and organized in databases specialized for particular subjects. GDB (Genome Database), SWISS-PROT, GenBank and PDB (Protein Data Bank) are some well-known examples. As the information kept growing in size and complexity, the need for specialized tools with diverse algorithmic approaches grew too. This led to the use of specialized software such as BLAST, ClustalW, BioEdit, SCRATCH and Swiss-PdbViewer for better manipulation and sorting of data.
1.2. The Resources
The resources of bioinformatics consist of gateways, databases and software.
[Figure 1.2: Data flow for new submissions and updates between three databases. Submissions and updates are exchanged among GenBank (NCBI, searched through Entrez), EMBL (EBI, searched through SRS) and DDBJ (CIB, searched through getentry).]
1.2.1 Gateway
A gateway in Information Technology (IT) can be thought of as an open door through which a user collects specialized information. A gateway can be reached at a specific Uniform Resource Locator (URL).
There are several gateways for software and databases that offer access to many of the sites in bioinformatics. The gateways and databases are listed below:
National Center for Biotechnology Information (NCBI)
Web site: http://www.ncbi.nlm.nih.gov
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI houses genome sequencing data in GenBank and an index of biomedical research articles in PubMed Central and PubMed, as well as other information relevant to biotechnology. In addition to GenBank, NCBI provides Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), the Unique Human Gene Sequence Collection, a Gene Map of the Human Genome and a Taxonomy Browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. All these databases are linked through a single search and retrieval system called Entrez, which cross-references and integrates these resources.
European Bioinformatics Institute (EBI)
Web site: http://www.ebi.ac.uk
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures. The mission of the EBI is to provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress, and to contribute to the advancement of molecular biology and genome research through basic investigator-driven research in bioinformatics.
Table 1.1: Tools at EBI
Tool: Description
Align: Pairwise global and local alignment tool (EMBOSS).
ClustalW: Multiple sequence alignments.
CpG Plot/CpGreport: CpG island finder and plotting tool (EMBOSS).
GeneMark: Gene prediction service.
Genetic Code Viewer: Review of genetic code differences.
Wise2: Compares a protein sequence or a protein profile HMM to a DNA sequence.
Mutation Checker: Sequence validation.
Pepstats/Pepwindow/Pepinfo: Programs for basic protein sequence analysis (EMBOSS).
Promoterwise: Compares two DNA sequences allowing for inversions and translocations, ideal for promoters.
Reverse Translator: Reverse complement checker.
SAPS: Statistics on protein sequences.
Transeq: DNA sequence translation tool (EMBOSS).
ExPASy Molecular Biology Server - Expert Protein Analysis System, Swiss Institute of Bioinformatics
Web site: http://www.expasy.org
ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss Institute of Bioinformatics (SIB) that analyzes protein sequences and structures and two-dimensional gel electrophoresis (2-D PAGE). The server functions in collaboration with the European Bioinformatics Institute. ExPASy also produces the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its computer-annotated supplement, UniProtKB/TrEMBL.
Bioinformatics Group - University College London
Web site: http://www.bioinf.cs.ucl.ac.uk
The Bioinformatics Group was originally founded as the Joint Research Council funded Bioinformatics Unit within the Department of Computer Science at University College London. The group's main aim is to develop and apply state-of-the-art mathematical and computer science techniques to problems now arising in the life sciences, particularly those now appearing in the post-genomic era. The available tools and software are:
Table 1.2: Available tools at Bioinformatics Group - University College London
Protein Structure Prediction
• Threading (THREADER)
• Ab initio folding simulations
• Secondary structure prediction (PSIPRED)
• Protein disorder prediction (DISOPRED)
• Protein domain prediction (DomPred)
Protein Sequence Analysis
• Amino acid substitution matrices
• Hidden Markov Models (collaboration with N. Goldman, Cambridge, & J. Thorne, NCSU)
Genome Analysis
• Genomic Threading Database (GTD)
• Genomic fold recognition (GenTHREADER)
• Genome annotation using software agents
Protein Structure Classification
• Comparison of structure classifications (CATH/SCOP/FSSP)
• CATH (collaboration with J. Thornton & C. Orengo, UCL Biochemistry)
Transmembrane Protein Modelling
• MEMSAT
• Folding In Lipid Membranes (FILM)
Biological Applications of Datamining and Machine Learning Techniques
• Information extraction for biological research (BioRat)
1.2.2 Databases
A database on the internet actually consists of a database management system (DBMS) with two interfaces: one for the user to query and enter data, and another for management on the host computer. A database is a compilation of entities described by a defined set of attributes.
A database (or data base) is a collection of data organised so that its contents can easily be accessed, managed, and modified by a computer. It is also called a data bank. The most prevalent type of database is the relational database, which organizes the data in tables; multiple relations can be mathematically defined between the rows and columns of each table to yield the desired information. An object-oriented database stores data in the form of objects, which are organized in hierarchical classes that may inherit properties from classes higher in the tree structure.
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof), respectively. Sequences are represented in shorthand, using single-letter designations.
There are two main functions of biological databases:
1. To make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article (genome sequences, for example).
2. To make biological data available in computer-readable form. Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.
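The single-letter shorthand mentioned above can be illustrated with a short sketch. It assumes the standard one-letter codes for the 20 common amino acids; the helper name `to_one_letter` is illustrative and not part of any tool used in this project.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names into a one-letter string."""
    return "".join(THREE_TO_ONE[name] for name in residues)


# The first five residues of the P00766 sequence used later in this report.
print(to_one_letter(["Cys", "Gly", "Val", "Pro", "Ala"]))  # CGVPA
```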
Databases for bioinformatics can be grouped as:
• Primary and added-value databases
• Sequence vs. organism databases
• "Federated" databases: global computer networks … WWW
Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and protein structures, and DNA and protein expression profiles. Secondary or derived databases are so called because they contain the results of analysis on the primary resources, including information on sequence patterns or motifs, variants and mutations, and evolutionary relationships. Information from the literature is contained in bibliographic databases, such as Medline. The following table lists widely used databases for analyzing DNA and protein sequences.
Table 1.3: Primary Sequence Databases
Nucleic Acid Databases
• NCBI (National Center for Biotechnology Information) - GenBank: http://www.ncbi.nlm.nih.gov/
• EBI (European Bioinformatics Institute) - EMBL: http://www.ebi.ac.uk/
• DISC - DNA Information and Stock Center, Japan: http://www.dna.affrc.go.jp/
Protein Databases
• NCBI - GenPept: http://www.ncbi.nlm.nih.gov/
• ExPASy - SwissProt and TrEMBL: http://www.expasy.ch/
• EBI (European Bioinformatics Institute) - SwissProt, TrEMBL, PIR: http://www.ebi.ac.uk/
• DISC - DNA Information and Stock Center, Japan: http://www.dna.affrc.go.jp/
Meta-databases: A meta-database can be considered a database of databases, rather than any one integration project or technology. Meta-databases collect data from different sources and usually make them available in a new and more convenient form, or with an emphasis on a particular disease or organism.
Table 1.4: Meta-bases
Name: Web Site
• Entrez (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov
• euGenes (Indiana University): http://eugenes.org
• GeneCards (Weizmann Inst.): http://www.genecards.org
• SOURCE (Stanford University): http://genome-www4.stanford.edu/cgibin/SMD/source/sourceSearch
• mGen, containing four of the world's biggest databases (GenBank, RefSeq, EMBL and DDBJ), for easy and simple program-friendly gene extraction: http://www.cyberindian.com/bioperl/index.html
• Bioinformatic Harvester (Karlsruhe Institute of Technology), integrating 26 major protein/gene resources: http://harvester.fzk.de
• MetaBase (KOBIC), a user-contributed database of biological databases: http://BioDatabase.Org
1.2.3. Software/Tools
Software tools are computer programs for sequence analysis, database construction and management, the study of evolutionary relations, structural analysis and pathway analysis. The software tools are integrated into databases.
The Bioinformatics Toolbox offers computational molecular biologists and other research scientists an open and extensible environment in which to explore ideas, prototype new algorithms, and build applications in drug research, genetic engineering, and other genomics and proteomics projects. These tools range from a collection of standalone tools with a common data format under a single standalone or web-based interface, to integrative and extensible bioinformatics workflow development environments. The important software programs used in our project are given in the following table:
Table 1.5: Software used in the project
Name of the software: Application and purpose: Source
• ProtParam: Predicts physicochemical properties from sequence: http://www.expasy.org/
• BLAST: Finds regions of local similarity between sequences: http://www.ncbi.nlm.nih.gov/
• ClustalW: Multiple sequence alignment tool: http://www.ebi.ac.uk/
• PSIPRED: Secondary structure prediction tool: http://bioinf.cs.ucl.ac.uk/psipred/
• Hierarchical Neural Network: Secondary structure prediction tool: http://www.expasy.org/
1.3 Application of Bioinformatics
Bioinformatics is being used in the following fields:
Gene expression study
Many expression studies have so far focused on devising methods to cluster genes by similarities in expression profiles, in order to determine which proteins are expressed together under different cellular conditions. Briefly, the most common methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical methods were originally derived from algorithms used to construct phylogenetic trees, and they group genes in a bottom-up fashion: genes with the most similar expression profiles are clustered first, and those with more diverse profiles are included iteratively. In contrast, the self-organising map and K-means methods employ a top-down approach in which the user pre-defines the number of clusters for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively until they are optimally clustered.
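The bottom-up grouping described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance between expression profiles and average linkage between clusters (both are our choices, not stated in the text); the gene names and expression values are invented for the example.

```python
from itertools import combinations
from math import dist


def agglomerate(profiles, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    `profiles` maps a gene name to its expression profile (a list of numbers).
    """
    clusters = [[gene] for gene in profiles]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of two clusters.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical expression profiles (three conditions per gene).
toy = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [9.0, 8.0, 7.0],
    "geneD": [9.2, 7.9, 7.1],
}
print(agglomerate(toy, 2))
```

The most similar profiles (geneA with geneB, geneC with geneD) are merged first, which is the behaviour the paragraph attributes to hierarchical methods.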
Drug development
One of the earliest medical applications of bioinformatics has been in aiding rational drug design. Figure 1.3 outlines the commonly cited approach, taking the MLH1 gene product as an example drug target. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer (126). Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein. At present all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms, and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise to have fewer side effects than many of today's medicines.
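The translation step mentioned above (nucleotide sequence to probable amino acid sequence) can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a single reading frame starting at the first base; the function name `translate` and the example DNA string are hypothetical, and real translation software also handles ambiguous bases, other frames and alternative codes.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA (or RNA) string in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))  # MAW
```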
Figure 1.3: Use of informatics in drug designing.
Pharmacogenomics
Clinical medicine will become more personalized with the development of the field of pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA. Today, doctors have to use trial and error to find the best drug to treat a particular patient, as patients with the same clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to analyse a patient's genetic profile and prescribe the best available drug therapy and dosage from the beginning.
Gene therapy
Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's defective genes. Currently, this field is in its infancy, with clinical trials ongoing for many different types of cancer and other diseases.
Detection of Antibiotic-resistant pathogens
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial infection among hospital patients. They have discovered a region made up of a number of antibiotic-resistance genes that may transform the bacterium from a harmless gut bacterium into a menacing invader. The discovery of this region could provide a useful marker for detecting pathogenic strains and help to control the spread of infection in wards.
Agriculture
Bioinformatics tools can be used to analyse the sequenced genomes of plants and animals and elucidate the functions of different genes. This specific genetic knowledge could then be used to produce nutrient-rich, drought-, disease- and insect-resistant plants and to improve the quality of livestock, making them healthier, more disease resistant and more productive.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means that the amount of insecticide being used can be reduced, and hence the nutritional quality of the crops is increased.
Improved nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of vitamin A, iron and other micronutrients. Scientists have also inserted a gene from yeast into tomato; the result is a plant whose fruit stays longer on the vine and has an extended shelf life.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for practical applications in industry and government-funded environmental remediation. These microorganisms thrive in water temperatures above the boiling point and therefore may provide the DOE, the Department of Defense, and private companies with heat-stable enzymes suitable for use in industrial processes.
Microbial genome applications
Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the environment, our bodies, the air, food and water. Traditionally, a variety of microbial properties have been applied in the baking, brewing and food industries. The arrival of complete genome sequences, and their potential to provide greater insight into the microbial world and its capacities, could have broad and far-reaching implications for environmental, health, energy and industrial applications.
Waste management
Deinococcus radiodurans is known as the world's toughest bacterium and is the most radiation-resistant organism known. Scientists are interested in this organism because of its potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals. Microbial Genome Program (MGP) scientists are determining the DNA sequence of C. crescentus, one of the organisms responsible for sewage treatment.
Maintenance of climatic balance
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.
Evolutionary Studies
The sequencing of genomes from all three domains of life (eukaryota, bacteria and archaea) means that evolutionary studies can be performed in a quest to determine the tree of life and the last universal common ancestor.
Forensic studies
Bioinformatics has created a great opportunity to ease forensic experiments. It promises high accuracy in identifying the right culprit in forensic investigations.
Forensic analysis of microbes
Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis that was used in the summer of 2001 terrorist attack in Florida from closely related anthrax strains.
Bioweapon creation
Scientists have recently built the poliomyelitis virus by entirely artificial means, using genomic data available on the internet and materials from a mail-order chemical supply.
1.4 Project Aim
The aim of our project was to get introduced to the field of bioinformatics. More specifically, the targets were:
• To become familiar with biological databases and the tools available to analyze the information in such databases.
• To find the sequence of the protein and study its physicochemical properties.
• To align similar proteins and generate phylogenetic trees to examine evolutionary relationships.
• To cluster protein sequences into families of related sequences and develop protein models.
• To use methods that predict structure and/or function, and to resolve the secondary structure.
A well-known protein, chymotrypsin (Swiss-Prot accession P00766), was studied as the model protein in the project. Chymotrypsin is a proteolytic enzyme that catalyzes the hydrolysis of peptide bonds of proteins in the small intestine. It is selective for peptide bonds with aromatic or large hydrophobic side chains (Tyr, Trp, Phe) on the carboxyl side of the bond. Chymotrypsin also catalyzes the hydrolysis of ester bonds. It is termed a serine protease because it has a reactive serine residue in its active site. Three amino acid residues have been found to play the key role in catalysis: Ser195, His57 and Asp102. Together these residues are termed the "catalytic triad". Although far apart in the primary structure, protein folding brings these residues close together and into the correct orientation in the tertiary structure. Chymotrypsin was the first serine protease to be discovered. Its crystal structure was first resolved by David Blow in 1967. This discovery provided a key understanding of the catalytic mechanism of a great variety of enzymes. The mechanism of chymotrypsin action is illustrated in Figure 1.4.
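The specificity described above (cleavage on the carboxyl side of Tyr, Trp and Phe) can be expressed as a simple search over a sequence. This is a deliberately simplified sketch: it ignores secondary preferences and any effect of neighbouring residues, and the function name `chymotrypsin_sites` and the example peptide are hypothetical.

```python
def chymotrypsin_sites(sequence, residues="FWY"):
    """Return 1-based positions of residues after which cleavage may occur.

    A C-terminal residue is skipped because no peptide bond follows it.
    """
    sequence = sequence.upper()
    return [
        position
        for position, amino_acid in enumerate(sequence, start=1)
        if amino_acid in residues and position < len(sequence)
    ]


# Hypothetical peptide: Phe at position 2, Trp at 4, Tyr at the C-terminus.
print(chymotrypsin_sites("AFGWKY"))  # [2, 4]
```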
Figure 1.4: The mechanism of action of chymotrypsin
Using bioinformatics tools, we performed a number of tasks concerning chymotrypsin, such as:
• Retrieving the sequence of the protein.
• Determining the physico-chemical properties from the sequence.
• Performing a BLAST search to find similar sequences.
• Pair-wise and multiple sequence alignment of chymotrypsin with various other protein sequences.
• Constructing a phylogenetic tree to determine the evolutionary relationships among the proteins chosen for multiple sequence alignment.
The overview of the project is shown in the following flow chart:
[Figure 1.5: The overview of the project. Flow chart: sequence database browsing or manual input gives a protein sequence file, which feeds protein sequence analysis. The analysis branches into primary structure (physico-chemical properties), searching databases for similar sequences, sequence comparison (pair-wise alignment giving identity and similarity, and multiple sequence alignment leading to phylogenetic tree construction) and secondary structure prediction.]
Chapter 2: Working with Protein Sequences
2.1 General
With the availability of hundreds of complete genome sequences from both prokaryotes and eukaryotes, efforts are now focused on the identification and functional analysis of the proteins encoded by these genomes. This urgency has resulted in a burst of fresh information linked to proteomics, and with it came the need for protein sequence databases.
[Figure 2.1: The flow of data from primary data sources into the component databases of the Universal Protein Resource. Primary sources shown: DDBJ/EMBL/GenBank, WGS, Ensembl, RefSeq, VEGA, PDB, FlyBase, WormBase, patent data and submitted peptide data. Component databases shown: UniProt Archive, UniProt Knowledgebase (Swiss-Prot + TrEMBL), UniProt NREF 100, NREF 90 and NREF 50, proteome sets and IPI.]
2.2 Fetching protein sequence from Database
2.2.1 Database
We searched the protein database accessible through the NCBI gateway. It is the NIH protein sequence database, an annotated collection of all publicly available protein sequences. The complete release notes for the current version of the protein database are available on the NCBI FTP site. A new release is made every two months.
2.2.2 Method
1. The search for the desired sequence was started from the NCBI home page (http://www.ncbi.nlm.nih.gov).
2. "Protein" was chosen in the "Search" box and the chymotrypsin sequence was searched for.
3. P00766 was selected from the list and clicked.
4. The information available on the page was read carefully.
5. "FASTA" was selected from Display.
6. The amino acid sequence in FASTA format was saved.
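The manual steps above used the NCBI web pages. As a hedged alternative, not the method used in this project, the same record could be requested through NCBI's Entrez Programming Utilities (E-utilities). The sketch below only builds the EFetch request URL and does not contact the network; the helper name `efetch_fasta_url` is illustrative.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="protein"):
    """Build an EFetch URL that asks for one record in plain-text FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_fasta_url("P00766"))
```

The resulting URL can be opened in a browser or passed to `urllib.request.urlopen` to download the FASTA text.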
[Flow chart of the method: NCBI home page (http://www.ncbi.nlm.nih.gov), search "Protein" for chymotrypsin, P00766 is selected, GenPept format, FASTA selected from Display, sequence saved.]
2.2.3 Results
The sequence was retrieved and saved in Microsoft Word format for further use.
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
Figure 2.2: FASTA format result of P00766
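A saved FASTA record like the one above can be read back into a single sequence string with a few lines of code. This is a minimal sketch, assuming the usual FASTA convention of a ">" header line followed by sequence lines; the function name `read_fasta` is illustrative.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# The record retrieved for P00766, as shown in Figure 2.2.
fasta_text = """>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
"""
sequence = read_fasta(fasta_text)["gi|117615|sp|P00766|"]
print(len(sequence))
```

The printed length can be compared with the "Number of amino acids" value that ProtParam reports for the same sequence.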
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general physicochemical properties of a protein sequence
Proteins are condensation polymers of amino acid residues. However, the linear organisation of residues by itself does not reveal much about protein structure or function. It is the 3D or tertiary native structure (quaternary in the case of a multisubunit protein) that describes a protein best. Though primary structure analysis is not a good method for functional and structural analysis of the protein, it can provide some valuable information regarding the protein's behaviour in solution, its molecular weight, its extinction coefficient and so on. Thus the general physicochemical properties can be a good indicator for understanding protein activities on a broader scale.
Page | 24
2.3.1.1 Software
ProtParam (web link: www.expasy.org)
ProtParam is a tool that computes various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL, or for a user-entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
The following parameters are reported:
Molecular weight
Number of residues
Average residue weight
Charge
Iso-electric point
For each physico-chemical class of amino acid: number, molar percent
Probability of protein expression in E. coli inclusion bodies
Extinction coefficient at 1 mg/ml (A280)
Molar extinction coefficient (A280)
2.3.1.2 Method
1. The address of the ExPASy server of the Swiss Institute of Bioinformatics (http://www.expasy.org) was entered in the address bar of Internet Explorer.
2. The “Toolbox” option was clicked.
3. The “Sequence Analysis” option was chosen.
4. From the list, the “ProtParam” option was clicked.
5. The sequence of P00766 (Swiss-Prot accession number) was pasted into the sequence box.
6. The “Run” command button was clicked.
7. The results obtained were saved in a Microsoft Word document.
[Flowchart: ExPASy home page (www.expasy.org) → Toolbox → Sequence Analysis → ProtParam selected → sequence of P00766 pasted → Compute parameters → Result → Save]
2.3.1.3 Results
ProtParam
User-provided sequence:

        10         20         30         40         50         60
CGVPAIQPVL SGLSRIVNGE EAVPGSWPWQ VSLQDKTGFH FCGGSLINEN WVVTAAHCGV

        70         80         90        100        110        120
TTSDVVVAGE FDQGSSSEKI QKLKIAKVFK NSKYNSLTIN NDITLLKLST AASFSQTVSA

       130        140        150        160        170        180
VCLPSASDDF AAGTTCVTTG WGLTRYTNAN TPDRLQQASL PLLSNTNCKK YWGTKIKDAM

       190        200        210        220        230        240
ICAGASGVSS CMGDSGGPLV CKKNGAWTLV GIVSWGSSTC STSTPGVYAR VTALVNWVQQ

TLAAN

References and documentation are available. Please note the modified algorithm for extinction coefficient.

Number of amino acids: 245
Molecular weight: 25666.1
Theoretical pI: 8.52

Amino acid composition:
Ala (A)   22    9.0%
Arg (R)    4    1.6%
Asn (N)   14    5.7%
Asp (D)    9    3.7%
Cys (C)   10    4.1%
Gln (Q)   10    4.1%
Glu (E)    5    2.0%
Gly (G)   23    9.4%
His (H)    2    0.8%
Ile (I)   10    4.1%
Leu (L)   19    7.8%
Lys (K)   14    5.7%
Met (M)    2    0.8%
Phe (F)    6    2.4%
Pro (P)    9    3.7%
Ser (S)   28   11.4%
Thr (T)   23    9.4%
Trp (W)    8    3.3%
Tyr (Y)    4    1.6%
Val (V)   23    9.4%
Pyl (O)    0    0.0%
Sec (U)    0    0.0%
(B)        0    0.0%
(Z)        0    0.0%
(X)        0    0.0%

Total number of negatively charged residues (Asp + Glu): 14
Total number of positively charged residues (Arg + Lys): 18

Atomic composition:
Carbon    C   1127
Hydrogen  H   1783
Nitrogen  N    307
Oxygen    O    353
Sulfur    S     12

Formula: C1127H1783N307O353S12
Total number of atoms: 3582

Extinction coefficients:
Extinction coefficients are in units of M-1 cm-1, at 280 nm measured in water.
Ext. coefficient 50585
Abs 0.1% (=1 g/l) 1.971, assuming ALL Cys residues appear as half cystines
Ext. coefficient 49960
Abs 0.1% (=1 g/l) 1.947, assuming NO Cys residues appear as half cystines

Estimated half-life:
The N-terminal of the sequence considered is C (Cys).
The estimated half-life is:
1.2 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).

Instability index:
The instability index (II) is computed to be 15.27
This classifies the protein as stable.

Aliphatic index: 82.37
Grand average of hydropathicity (GRAVY): 0.051
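Several of the reported values can be reproduced from the sequence alone. The sketch below is an illustration, not ProtParam's own code. It assumes the commonly documented molar extinction values at 280 nm (Trp 5500, Tyr 1490, cystine 125 M-1 cm-1) and the aliphatic index formula X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)) with X in mole percent. The function names are hypothetical, and the molecular weight is taken from the report above rather than recomputed.

```python
from collections import Counter

# Sequence of P00766 as submitted to ProtParam.
SEQ = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)
REPORTED_MW = 25666.1  # molecular weight copied from the ProtParam report


def extinction_coefficient(seq, cystines=True):
    """Molar extinction coefficient at 280 nm from Trp, Tyr and cystine counts."""
    counts = Counter(seq)
    value = 5500 * counts["W"] + 1490 * counts["Y"]
    if cystines:
        value += 125 * (counts["C"] // 2)  # every two Cys form one cystine
    return value


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    counts = Counter(seq)

    def mole_percent(aa):
        return 100.0 * counts[aa] / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def charged_residues(seq):
    """Return (negative, positive) counts: Asp + Glu and Arg + Lys."""
    counts = Counter(seq)
    return counts["D"] + counts["E"], counts["R"] + counts["K"]


ext_all = extinction_coefficient(SEQ, cystines=True)
ext_none = extinction_coefficient(SEQ, cystines=False)
print(len(SEQ), ext_all, ext_none)
print(round(ext_all / REPORTED_MW, 3), round(aliphatic_index(SEQ), 2))
print(charged_residues(SEQ))
```

With 8 Trp, 4 Tyr and 10 Cys, the first formula gives 8 x 5500 + 4 x 1490 + 5 x 125 = 50585, which matches the reported coefficient for the all-cystine case.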
2.3.2 Searching database for similar sequences
2.3.2.1 Software tools
Standard Protein-Protein BLAST (BLASTp)
BLASTp is the NCBI BLAST program for comparing a protein query sequence against a protein database. The original BLAST program was developed at NCBI. BLASTp takes protein sequences in FASTA format, GenBank accession numbers or GI numbers and compares them against the NCBI protein databases. It is used both for identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. However, when the similarity spans the whole sequence, the alignment that blastp reports extends over the full length and is effectively a global alignment, which is the preferred result for protein identification purposes. It can be used from the NCBI website.
Position-Specific Iterated BLAST (PSI-BLAST)
PSI-BLAST uses an iterative search in which the sequences found in one round of searching are used to build a score model (profile) for the next round. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (third, etc.) BLAST search, and the results of each iteration are used to refine the profile. This iterative searching strategy results in increased sensitivity. It can be used from the NCBI website.
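The idea of a position-specific profile can be illustrated with a much simplified sketch. It is not the PSI-BLAST implementation: it assumes gap-free segments of equal length, a uniform background frequency of 1/20 and a flat pseudocount, whereas the real program also weights sequences and uses substitution-matrix-derived pseudocounts. The function names are hypothetical. The example segments surround the conserved GDSGGP motif and are copied from serine protease sequences listed in this report (P00766, trypsinogen, elastase, hepsin and collagenase).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_segments, pseudocount=1.0):
    """Build a log-odds profile: one dict of residue scores per column."""
    length = len(aligned_segments[0])
    if any(len(seg) != length for seg in aligned_segments):
        raise ValueError("all segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    total = len(aligned_segments) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seg[column] for seg in aligned_segments)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(frequency / background)
        profile.append(scores)
    return profile


def score_segment(profile, segment):
    """Sum the column scores of an equal-length segment against the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


segments = [
    "CMGDSGGPLVC",  # P00766 (chymotrypsinogen A)
    "CQGDSGGPVVC",  # trypsinogen (1S0Q)
    "CQGDSGGPLHC",  # elastase (3EST)
    "CQGDSGGPFVC",  # hepsin
    "CFGDSGGPFVL",  # collagenase (1HYL)
]
profile = build_profile(segments)
print(round(profile[3]["D"], 2))  # fully conserved Asp column
print(round(profile[1]["Q"], 2))  # variable column (M, Q or F)
print(round(score_segment(profile, "CQGDSGGPLVC"), 2))
```

In this sketch the fully conserved column scores higher than the variable one, and residues never observed in a column receive a slightly negative score, which mirrors the behaviour described above in a simplified form.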
2.3.2.2 Method
BLASTp
1. The NCBI home page was reached using the address http://www.ncbi.nih.gov/.
2. The BLAST option was selected.
3. From the Protein section, the “Protein-protein BLAST (blastp)” option was selected.
4. In the “Search” box, the sequence of 1GCT was given as input.
5. The “nr” option was selected from the “Choose database” option.
6. “BLOSUM62” was selected in the “Matrix” box.
7. The “BLAST” command button was clicked.
8. The result obtained was saved in a Microsoft Word document.
[Flowchart: NCBI home page → BLAST → Standard protein-protein BLAST → sequence of 1GCT pasted in Search window → BLOSUM62 selected from MATRIX option → BLAST run → Format clicked → result saved]
PSI-BLAST
1. Starting from the NCBI home page, “BLAST” search was selected and the options on that page were examined.
2. PSI-BLAST was chosen from “Protein BLAST”.
3. The protein (1GCT) sequence, saved previously in FASTA format, was pasted into the “Search” window.
4. From the MATRIX options, “BLOSUM62” was selected.
5. “PSI-BLAST” was chosen.
6. “BLAST” was run.
7. The result was saved.
[Flowchart: NCBI home page → BLAST → PSI-BLAST → sequence pasted in “Search” window → “BLOSUM62” selected from MATRIX option → “BLAST” run → “Format for PSI-BLAST” chosen → result saved]
2.3.2.3 Results
2.3.2.3.1 BLASTp results
Figure 2.3: Graphical presentation of BLASTp results
More such results ...
Alignments
> ref|XP_608091.3| PREDICTED: chymotrypsinogen B1 [Bos taurus]
Length=300
GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus] (10 or fewer PubMed links)
Score = 496 bits (1278), Expect = 5e-139, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  56   CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  115

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  116  TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  175

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  176  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  235

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  236  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  295

Query  241  TLAAN  245
            TLAAN
Sbjct  296  TLAAN  300
> sp|P00766.1|CTRA_BOVIN RecName: Full=Chymotrypsinogen A; Contains: RecName: Full=Chymotrypsin A chain A; Contains: RecName: Full=Chymotrypsin A chain B; Contains: RecName: Full=Chymotrypsin A chain C pdb|2CGA|A Chain A, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis And Refinement Of A New Crystal Form At 1.8 Angstroms Resolution pdb|2CGA|B Chain B, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis And Refinement Of A New Crystal Form At 1.8 Angstroms Resolution 30 more sequence titles pdb|1ACB|E Chain E, Crystal And Molecular Structure Of The Bovine AlphaChymotrypsin-Eglin C Complex At 2.0 Angstroms Resolution pdb|1CGI|E Chain E, Three-Dimensional Structure Of The Complexes Between Bovine ChymotrypsinogenA And Two Recombinant Variants Of Human Pancreatic Secretory Trypsin Inhibitor (Kazal-Type) pdb|1CGJ|E Chain E, Three-Dimensional Structure Of The Complexes Between Bovine ChymotrypsinogenA And Two Recombinant Variants Of Human Pancreatic Secretory Trypsin Inhibitor (Kazal-Type) pdb|1EX3|A
Chain A, Crystal Structure Of Bovine Chymotrypsinogen A (Tetragonal)
pdb|1GL1|A Chain A, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria pdb|1GL1|B Chain B, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria pdb|1GL1|C Chain C, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria pdb|1GL0|E Chain E, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-D2v, An Inhibitor From The Insect Locusta Migratoria pdb|1K2I|1 Chain 1, Crystal Structure Of Gamma-Chymotrypsin In Complex With 7- Hydroxycoumarin pdb|1P2M|A Chain A, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2M|C Chain C, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2N|A Chain A, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2N|C Chain C, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2O|A Chain A, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2O|C Chain C, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2Q|A Chain A, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1P2Q|C Chain C, Structural Consequences Of Accommodation Of Four NonCognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin pdb|1OXG|A Chain A, Crystal Structure Of A Complex Formed Between Organic Solvent Treated Bovine Alpha-Chymotrypsin And Its Autocatalytically Produced Highly Potent 14-Residue Peptide At 2.2 Resolution pdb|1T7C|A Chain A, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T7C|C Chain C, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8L|A Chain A, Crystal Structure Of The P1 Met Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8L|C Chain C, Crystal Structure Of The P1 Met Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8M|A Chain A, Crystal Structure Of The P1 His Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8M|C Chain C, Crystal Structure Of The P1 His Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8N|A Chain A, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8N|C Chain C, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8O|A Chain A, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine Chymotrypsin Complex pdb|1T8O|C Chain C, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine Chymotrypsin Complex pdb|1CHG|A Chain A, Chymotrypsinogen,2.5 Angstroms Crystal Structure, Comparison With Alpha-Chymotrypsin,And Implications For Zymogen Activation pdb|1GCD|A Chain A, Refined Crystal Structure Of "aged" And "non-Aged" Organophosphoryl Conjugates Of Gamma-Chymotrypsin Length=245 GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus] (10 or fewer PubMed links) Score = 495 bits (1274), Expect = 2e-138, Method: 
Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240

Query  241  TLAAN  245
            TLAAN
Sbjct  241  TLAAN  245
> pdb|1GCT|A Chain A, Is Gamma-Chymotrypsin A Tetrapeptide Acyl-Enzyme Adduct Of Gamma-Chymotrypsin?
pdb|2GCT|A Chain A, Structure Of Gamma-Chymotrypsin In The Range Ph 2.0 To Ph 10.5 Suggests That Gamma-Chymotrypsin Is A Covalent Acyl-Enzyme Adduct At Low Ph
pdb|1GHB|E Chain E, A Second Active Site In Chymotrypsin? The X-Ray Crystal Structure Of N-Acetyl-D-Tryptophan Bound To Gamma-Chymotrypsin
pdb|2GMT|A Chain A, Three-Dimensional Structure Of Chymotrypsin Inactivated With (2s) N-Acetyl-L-Alanyl-L-Phenylalanyl-Chloroethyl Ketone: Implications For The Mechanism Of Inactivation Of Serine Proteases By Chloroketones
pdb|3GCH|A Chain A, Chemistry Of Caged Enzymes. Binding Of Photoreversible Cinnamates To Chymotrypsin
Length=245
Score = 486 bits (1251), Expect = 6e-136, Method: Compositional matrix adjust.
Identities = 241/245 (98%), Positives = 241/245 (98%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGL  IVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  1    CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRY  ANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  121  VCLPSASDDFAAGTTCVTTGWGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240

Query  241  TLAAN  245
            TLAAN
Sbjct  241  TLAAN  245
More such results ...
2.3.2.3.2 PSI-BLAST results (after 3 iterations)
Figure 2.4: Graphical presentation of PSI-BLAST search result
More similar results ...
Alignments
>emb|CAG00821.1| unnamed protein product [Tetraodon nigroviridis]
Length=263
Score = 437 bits (1125), Expect = 3e-121, Method: Composition-based stats.
Identities = 165/245 (67%), Positives = 194/245 (79%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVP I PV++G SRIVNGEEAVP SWPWQVSLQ+ TGFHFCGGSLINENWVVTAAHC V
Sbjct  19   CGVPGIPPVITGYSRIVNGEEAVPHSWPWQVSLQEYTGFHFCGGSLINENWVVTAAHCNV  78

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS  V+ GE D+ S++E IQ +++ +VFK+  YNS TINNDITL+KL++ A  +  VS
Sbjct  79   RTSHRVILGEHDRSSNNENIQVMQVGQVFKHPNYNSYTINNDITLIKLASPAQLNIRVSP  138

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VC+   SD F  G  CVT+GWGLTRY   +TP RLQQ +LPLL+N  C+K+WG+KI D M
Sbjct  139  VCVAETSDVFPGGMKCVTSGWGLTRYNAPDTPPRLQQVALPLLTNEECRKHWGSKITDLM  198

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            +CAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS  CS S+PGVYARVT L  W+ Q
Sbjct  199  VCAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGFCSVSSPGVYARVTMLRAWMDQ  258

Query  241  TLAAN  245
             +AAN
Sbjct  259  IIAAN  263
>gb|AAT45254.1| chymotrypsinogen 2-like protein [Sparus aurata]
gb|ABE68638.1| chymotrypsinogen II precursor [Sparus aurata]
Length=264
Score = 435 bits (1120), Expect = 1e-120, Method: Composition-based stats.
Identities = 165/245 (67%), Positives = 192/245 (78%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CG PAI PV++G SRIVNGEEAVP SWPWQVSLQD TGFHFCGGSLINENWVVTAAHC V
Sbjct  20   CGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENWVVTAAHCNV  79

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS  V+ GE D+ S++E IQ +K+ KVFK+  YN  TINNDI L+KL++ A     VS
Sbjct  80   RTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSP  139

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VC+   +D+F  G  CVT+GWGLTRY   +TP  LQQASLPLL+N  C++YWG+KI + M
Sbjct  140  VCVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLM  199

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS TC+ + PGVYARVT L  W+ Q
Sbjct  200  ICAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQ  259

Query  241  TLAAN  245
             +AAN
Sbjct  260  IIAAN  264
>ref|XP_536782.2| PREDICTED: similar to chymotrypsinogen B1 isoform 1 [Canis familiaris]
Length=264
GENE ID: 479650 CTRB1 | chymotrypsinogen B1 [Canis lupus familiaris]
Score = 432 bits (1112), Expect = 1e-119, Method: Composition-based stats.
Identities = 188/245 (76%), Positives = 211/245 (86%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAI+PVL+GLSRIVNGE+AVPGSWPWQVSLQD TGFHFCGGSLI+E+WVVTAAHCGV
Sbjct  20   CGVPAIEPVLNGLSRIVNGEDAVPGSWPWQVSLQDSTGFHFCGGSLISEDWVVTAAHCGV  79

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS +VVAGEFDQ SS E IQ LKIA+VFKN K+N  T+ NDITLLKL+T A FS+TVS
Sbjct  80   RTSHLVVAGEFDQSSSEENIQVLKIAEVFKNPKFNMFTVRNDITLLKLATPARFSETVSP  139

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLP A+D+F  G  CVTTGWG T+Y    TPD+LQQA+LPLLSN  CKK+WG+KI D M
Sbjct  140  VCLPQATDEFPPGLMCVTTGWGRTKYNANKTPDKLQQAALPLLSNAECKKFWGSKITDVM  199

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVC+K+GAWTLVGIVSWGS TCSTS P VY+RVT L+ WVQ+
Sbjct  200  ICAGASGVSSCMGDSGGPLVCQKDGAWTLVGIVSWGSGTCSTSVPAVYSRVTELIPWVQE  259

Query  241  TLAAN  245
             LAAN
Sbjct  260  ILAAN  264
Similar search results ...
Sequences retrieved from PSI-BLAST results in FASTA format:
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>pdb|1S0Q|TRYPSINOGEN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3PTN|TRYPSIN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|5PTP|HYDROLASE
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDXGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3EST|ELASTASE
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNNGTEQ
YVGVQKIVVHPYWNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQ
LAQTLQQAYLPTVDYAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVS
RLGCNVTRKPTVFTRVSAYISWINNVIASN
>pdb|1ZHR|FactorXI
IVGGTASVRGEWPWQVTLHTTSPTQRHLCGGSIIGNQWILTAAHCFYGVESPKILRVYSGILNQAEIAED
TSFFGVQEIIIHDQYKMAESGYDIALLKLETTVNYADSQRPISLPSKGDRNVIYTDCWVTGWGYRKLRDK
IQNTLQKAKIPLVTNEECQKRYRGHKITHKMICAGYREGGKDACKGDSGGPLSCKHNEVWHLVGITSWGE
GCAQRERPGVYTNVVEYVDWILEKTQAV
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
>gi|899286|Hepsin
TSGFFCVDEGRLPHTQRLLEVISVCDCPRGRFLAAICQDCGRRKLPVDRIVGGRDTSLGRWPWQVSLRYD
GAHLCGGSLLSGDWVLTAAHCFPERNRVLSRWRVFAGAVAQASPHGLQLGVQAVVYHGGYLPFRDPNSEE
NSNDIALVHLSSPLPLTEYIQPVCLPAAGQALVDGKICTVTGWGNTQYYGQQAGVLQEARVPIISNDVCN
GADFYGNQIKPKMFCAGYPEGGIDACQGDSGGPFVCEDSISRTPRWRLCGIVSWGTGCALAQKPGVYTKV
SDFREWIFQAIKTHSEASGMVTQL
>pdb|1SPJ|KALLIKREIN
IVGGWECEQHSQPWQAALYHFSTFQCGGILVHRQWVLTAAHCISDNYQLWLGRHNLFDDENTAQFVHVSE
SFPHPGFNMSLLENHTRQADEDYSHDLMLLRLTEPADTITDAVKVVELPTEEPEVGSTCLASGWGSIEPE
NFSFPDDLQCVDLKILPNDECKKAHVQKVTDFMLCVGHLEGGKDTCVGDSGGPLMCDGVLQGVTSWGYVP
CGTPNKPSVAVRVLSYVKWIEDTIAENS
>pdb|1HCG|FACTOR X
IVGGQECKDGECPWQALLINEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHE
VEVVIKHNRFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGRQS
TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTYFVTGIVSWGEGCA
RKGKYGIYTKVTAFLKWIDRSMKTRGLPKAK
>pdb|1HYL|COLLAGENASE
IINGYEAYTGLFPYQAGLDITLQDQRRVWCGGSLIDNKWILTAAHCVHDAVSVVVYLGSAVQYEGEAVVN
SERIISHSMFNPDTYLNDVALIKIPHVEYTDNIQPIRLPSGEELNNKFENIWATVSGWGQSNTDTVILQY
TYNLVIDNDRCAQEYPPGIIVESTICGDTSDGKSPCFGDSGGPFVLSDKNLLIGVVSFVSGAGCESGKPV
GFSRVTSYMDWIQQNTGIKF
>gi|Cold-Adaption Enzymes [Salmon]
IVGGYECKAYSQAHQVSLNSGYHFCGGSLVNENWVVSAAHCYKSRVEVRLGEHNIKVTEGSEQFISSSRV
IRHPNYSSYNIDNDIMLIKLSKPATLNTYVQPVALPTSCAPAGTMCTVSGWGNTMSSTADSDKLQCLNIP
ILSYSDCNDSYPGMITNAMFCAGYLEGGKDSCQGDSGGPVVCNGELQGVVSWGYGCAEPGNPGVYAKVCI
FSDWLTSTMASY
2.3.3 Sequence Alignment Study
2.3.3.1 Pair-wise alignment
Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual character patterns that are in the same order in the sequences. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and nonidentical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence. In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are said to be similar.
There are two types of sequence alignment, global and local. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.
Pairwise alignment is the process by which a pair of sequences are compared with one another by a sequence alignment technique, either global or local. It can also be performed using a dot plot.
LGPSSKQTGKGS-SRIWDN     Global alignment
LN-ITKSAGKGAIMRLGDA

-------TGKG--------     Local alignment
-------AGKG--------
Distinction between global and local alignments of two sequences.
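The difference between the two strategies can be made concrete with a small dynamic-programming sketch. It computes only the optimal score, using the Needleman-Wunsch recurrence for the global case and the Smith-Waterman recurrence for the local case. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not a real substitution matrix such as BLOSUM62, and the function name is hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Return the optimal alignment score of sequences a and b.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: gaps at the sequence ends are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,  # align two residues
                       table[i - 1][j] + gap,       # gap in b
                       table[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# The two sequences of the figure above, with gap characters removed.
first = "LGPSSKQTGKGSSRIWDN"
second = "LNITKSAGKGAIMRLGDA"
print("global:", alignment_score(first, second))
print("local: ", alignment_score(first, second, local=True))
```

The only differences between the two modes are the initialisation of the first row and column and the floor of zero in the local case, which lets an alignment start and end at any position.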
2.3.3.1.1 Software/Program
BLAST 2 Sequences
This tool produces the alignment of two given sequences using the BLAST engine for local alignment. While the standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g., different isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the BLAST algorithm for pair-wise DNA-DNA or protein-protein sequence comparison. Its results report the similarities and identities between the subject protein and the query protein. It also gives a graphical representation of the alignment.
2.3.3.1.2 Methods
1. Starting from the NCBI home page, “BLAST” search was selected.
2. “Align two sequences (bl2seq)” was chosen from Special databases.
3. “blastp” was chosen from Program; along with it, “BLOSUM62” was automatically selected in the Matrix options.
4. The query sequence was pasted from the saved file into the first window.
5. The subject sequence was pasted from file into the second window.
6. “Align” was clicked.
7. The results were saved.
[Flowchart: NCBI home page → BLAST → Align two sequences (bl2seq) → “blastp” chosen from Program → query and subject sequences pasted in separate windows → Align → result saved]
2.3.3.1.3 Result
Pair-wise alignment results were obtained separately for each sequence. Among them, the result for P00766 and the cold adaptation enzyme is given below.
Figure 2.5: Graphical representation of pair-wise alignment
Table 2.1: Pair-wise alignment results for retrieved sequences to identify similarities
Sequence 01 in every attempt: Sq = P00766, S = Bos taurus, L = 245

Attempt  Sequence 02 (Sq)            Source (S)      L    Score            Expect   Identities     Positives      Gaps
01       Cold adaptation enzyme      salmon          231  164 bits (414)   1e-38    97/231 (41%)   137/231 (59%)  12/231 (5%)
02       pdb 4CHA                    -               245  494 bits (1271)  5e-138   241/245 (98%)  241/245 (98%)  0/245 (0%)
03       1GCT (chymotrypsin)         -               245  485 bits (1249)  2e-135   241/245 (98%)  241/245 (98%)  4/245 (1%)
04       CTRB2 protein               human           263  417 bits (1072)  6e-115   199/245 (81%)  215/245 (87%)  0/245 (0%)
05       CTRL protein                -               269  294 bits (752)   7e-78    132/246 (53%)  174/246 (70%)  1/246 (0%)
06       Ela3 protein                mouse           255  197 bits (502)   7e-49    111/253 (43%)  153/253 (60%)  16/253 (6%)
07       chymotrypsin-like protein   Sparus aurata   -    353 bits (907)   8e-96    165/245 (67%)  192/245 (78%)  0/245 (0%)
08       -                           zebrafish       261  347 bits (890)   7e-94    166/247 (67%)  197/247 (79%)  2/247 (0%)
09       trypsinogen                 -               223  175 bits (444)   4e-42    98/232 (42%)   140/232 (60%)  11/232 (4%)
10       3PTN (trypsin)              -               223  175 bits (444)   4e-42    98/232 (42%)   140/232 (60%)  11/232 (4%)
11       PRSS2                       Bos taurus      247  179 bits (454)   3e-43    104/233 (44%)  139/233 (59%)  4/233 (2%)
12       tryptase                    human           267  166 bits (428)   2e-39    92/237 (38%)   126/237 (53%)  14/237 (5%)
13       beta tryptase               Gorilla         275  164 bits (416)   7e-39    91/237 (38%)   127/237 (53%)  14/237 (5%)
14       try14                       Macaca mulatta  247  171 bits (434)   6e-41    102/233 (43%)  134/233 (57%)  11/233 (4%)
15       Hydrolase                   -               223  173 bits (439)   1e-41    97/232 (41%)   139/232 (59%)  11/232 (4%)
16       Elastase                    -               240  162 bits (409)   4e-38    95/241 (39%)   137/241 (56%)  12/241 (4%)
17       Factor XI                   human           238  169 bits (428)   3e-40    93/238 (39%)   125/238 (52%)  10/238 (4%)
18       plasminogen                 human           247  171 bits (432)   1e-40    95/253 (37%)   137/253 (54%)  17/253 (6%)
19       protease                    mast cell       274  162 bits (411)   3e-38    90/239 (37%)   127/239 (53%)  14/239 (5%)
20       Hepsin                      human           304  159 bits (403)   2e-37    95/249 (38%)   130/249 (52%)  23/243 (9%)
21       Kallikrein                  human           238  132 bits (331)   5e-29    83/245 (33%)   123/245 (50%)  23/245 (9%)
22       Factor X                    human           241  120 bits (302)   1e-25    72/237 (30%)   117/237 (49%)  15/237 (6%)
23       collagenase                 human           230  98 bits (244)    6e-19    74/235 (31%)   117/235 (49%)  21/235 (8%)

Here, Sq = sequence, S = source, L = length; - = not given.
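The identities and gaps columns can be recomputed from any gapped alignment. The sketch below is a hypothetical helper, not part of BLAST. It assumes the two aligned rows have equal length and use "-" as the gap character. Positives are omitted because they require a substitution matrix. Percentages are truncated rather than rounded, which is an assumption consistent with most entries in Table 2.1 (for example, 97/231 is listed as 41%).

```python
def percent(count, total):
    """Integer percentage, truncated (assumed to match the report style)."""
    return 100 * count // total


def alignment_stats(query_row, subject_row, gap_char="-"):
    """Count identities and gap columns in two aligned rows of equal length."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    length = len(query_row)
    identities = 0
    gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == gap_char or s == gap_char:
            gaps += 1          # a column with a gap in either row
        elif q == s:
            identities += 1    # identical residues in both rows
    return {
        "length": length,
        "identities": identities,
        "identity_percent": percent(identities, length),
        "gaps": gaps,
        "gap_percent": percent(gaps, length),
    }


# A short stretch of P00766 aligned against the 1GCT sequence,
# which lacks the two residues after SGL.
stats = alignment_stats("SGLSRIVNG", "SGL--IVNG")
print(stats)
print(percent(97, 231), percent(241, 245), percent(132, 246))
```

Applied to full-length alignment rows, the same counts give the Identities and Gaps fractions in the form reported by BLAST 2 Sequences.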
2.3.3.2 Multiple Sequence Alignment
One of the major contributions of molecular biology to evolutionary analysis is the discovery that the DNA sequences of different organisms are often related. Similar genes are conserved across widely divergent species, often performing a similar or even identical function, and at other times mutating or rearranging to perform an altered function through the forces of natural selection. Thus, many genes are represented in highly conserved forms across organisms. Through simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to alteration may be analyzed. Because the potential for learning about the structure and function of molecules by multiple sequence alignment (MSA) is so great, computational methods for it have received a great deal of attention. In MSA, sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as described in Section 2.3.3.1 for the alignment of two sequences. Computationally, MSA presents several difficult challenges. First, finding an optimal alignment of more than two sequences that includes matches, mismatches, and gaps, and that takes into account the degree of variation in all of the sequences at the same time, is very difficult. The dynamic programming algorithm used for optimal alignment of pairs of sequences can be extended to three sequences, but for more than three sequences only a small number of relatively short sequences can be analyzed.
Thus, approximate methods are used, including (1) a progressive global alignment of the sequences, starting with an alignment of the most alike sequences and then building the alignment by adding more sequences, (2) iterative methods that make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result, (3) alignments based on locally conserved patterns found in the same order in the sequences, and (4) statistical methods and probabilistic models of the sequences. A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in a column of an MSA. Finally, the placement and scoring of gaps in the various sequences of an MSA presents an additional challenge.
The MSA of a set of sequences may also be viewed as an evolutionary history of the sequences. If the sequences in the MSA align very well, they are likely to be recently derived from a common ancestor sequence. Conversely, a group of poorly aligned sequences shares a more complex and distant evolutionary relationship. The task of aligning a set of sequences, some more closely and others less closely related, is identical to that of discovering the evolutionary relationships among the sequences. As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies considerably with sequence similarity. On the one hand, if the amount of sequence variation is minimal, it is quite straightforward to align the sequences, even without the assistance of a computer program. On the other hand, if the amount of sequence variation is great, it may be very difficult to find an optimal alignment of the sequences because so many combinations of substitutions, insertions, and deletions, each predicting a different alignment, are possible.
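One common way to obtain the cumulative column score mentioned above is the sum-of-pairs score, in which every pair of characters in a column is scored and the pair scores are added. The sketch below is only an illustration: the scoring values (match 1, mismatch 0, gap -1) and the toy alignment are assumptions, and a real program would use a substitution matrix such as BLOSUM instead.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score one pair of characters from an alignment column ('-' is a gap)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment: list[str]) -> list[int]:
    """Return the sum-of-pairs score of every column of an alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*alignment):
        scores.append(sum(pair_score(a, b) for a, b in combinations(column, 2)))
    return scores


if __name__ == "__main__":
    toy = ["GDSGGP", "GDSGGP", "GDAG-P"]  # hypothetical aligned fragments
    print(sum_of_pairs(toy))
```

Fully conserved columns receive the highest score, while the column containing a gap is penalised, which is the behaviour an MSA program tries to optimise over the whole alignment.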
Figure 2.6: Algorithm of a software performing multiple sequence alignment
2.3.3.2.1 Software/Tools
CLUSTALW
CLUSTALW is a general-purpose, fully automated multiple sequence alignment program for DNA and protein sequences. It produces biologically meaningful multiple sequence alignments of divergent sequences and returns the best match over the total length of the input sequences. The program follows these steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree by the neighbor-joining method.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Evolutionary relationships can also be viewed as cladograms or phylograms. The sequence alignment is performed as a global alignment.
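The first step listed above, pairwise alignment of all sequences, can be sketched with a plain Needleman-Wunsch global alignment score. This is a simplified illustration, not CLUSTALW's own implementation: the scoring values (match 1, mismatch -1, gap -2) and the toy sequences are assumptions, and CLUSTALW itself uses substitution matrices and separate gap opening and extension penalties.

```python
from itertools import combinations


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global (Needleman-Wunsch) alignment score of two sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Score every pair of sequences once."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}


def closest_pair(seqs: dict[str, str]) -> tuple[str, str]:
    """Pair with the highest score, i.e. the pair a progressive method aligns first."""
    scores = pairwise_scores(seqs)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    toy = {"x": "GDSGGP", "y": "GDSGGP", "z": "GDAGAP"}  # hypothetical fragments
    print(pairwise_scores(toy))
    print(closest_pair(toy))
```

The table of pairwise scores is what the second step turns into a guide tree, and the most similar pair is the natural starting point of the progressive alignment.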
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially developed as a visualization tool for the Pfam CORBA server and client at the EBI, but it is available as a general-purpose alignment editor and is used widely in a variety of web pages (e.g. the EBI ClustalW server and the Pfam protein domain database). Jalview is also a phylogenetic tree drawing program. Phylogenetic relationships are patterns of shared common history between biological replicators.
2.3.3.2.2 Method
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen from the drop-down menu.
3. From the tools available, “CLUSTALW” was selected for multiple sequence alignment.
4. “Full” was chosen from the alignment option.
5. “Blosum” was chosen from Matrix.
6. “Input” was selected from the output order.
7. The sequences similar to our query protein sequence were pasted in FASTA format into the given window from a file.
8. The program was run.
9. “Show Colour” was clicked.
10. The result was saved.
Flowchart: EBI home page → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Alignment: Fast; Matrix: Blosum; Output order: Input) → Sequences pasted → Run → Show Colour → Result saved
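Because the sequences had to be pasted in FASTA format, it can help to generate that text from a list of names and sequences rather than typing it by hand. The following is a minimal sketch; the function name to_fasta, the record name and the 70-character line width are assumptions chosen for illustration.

```python
import textwrap


def to_fasta(records: list[tuple[str, str]], width: int = 70) -> str:
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in records:
        cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
        lines.append(">" + header.strip())
        lines.extend(textwrap.wrap(cleaned, width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical short record; real input would be the full protein sequences.
    print(to_fasta([("toy_fragment", "gdsg gplv ck")], width=4), end="")
```

The resulting text can be pasted into the input window of the alignment tool as a single block.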
2.3.3.2.3 Results
Clustal results are most informative when the sequences that introduce long initial gaps are omitted, because the multiple sequence alignment performed here is a global alignment. After omitting the sequences that caused too many gaps when matched to the p00766 sequence, we had 14 meaningful sequences overall. They are:
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
CLUSTAL W results
Figure 2.7: Multiple Sequence Alignment (MSA)
JalView result
Figure 2.8: Multiple Sequence Alignment (MSA) JalView results
Similar results were obtained when the output order parameter was set to “aligned” instead of “input”.
2.3.4 Phylogenetic Tree Construction
2.3.4.1 Software/Tools
CLUSTALW
Multiple sequence comparisons help highlight weak sequence similarity and shed light on structure, function, or origin. The most widely used programs for global multiple sequence alignment are from the Clustal series. CLUSTALW and CLUSTALX are progressive alignment programs that follow these steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree by the neighbor-joining method.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
CLUSTALW is used to align DNA or protein sequences in order to elucidate their relatedness as well as their evolutionary origin.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. The initial pairwise alignments are calculated using an enhanced dynamic programming algorithm, and the genetic distances used to create the phylogenetic tree are calculated by dividing the total number of mismatched positions by the total number of matched positions. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or “tree graph distances”).
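The distance calculation described above can be sketched for one pair of aligned rows. The first function follows the wording of the text (mismatched positions divided by matched positions); the second is the fraction of differing sites among all compared positions, which is another common convention and is shown only for comparison. Ignoring columns that contain a gap, and the toy rows, are assumptions of this sketch.

```python
def compare_columns(a: str, b: str) -> tuple[int, int]:
    """Count (matches, mismatches) over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gapped columns are skipped
        if x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches


def ratio_distance(a: str, b: str) -> float:
    """Mismatched positions divided by matched positions, as worded in the text."""
    matches, mismatches = compare_columns(a, b)
    if matches == 0:
        raise ValueError("no matched positions to divide by")
    return mismatches / matches


def p_distance(a: str, b: str) -> float:
    """Mismatched positions divided by all compared (ungapped) positions."""
    matches, mismatches = compare_columns(a, b)
    if matches + mismatches == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / (matches + mismatches)


if __name__ == "__main__":
    row1, row2 = "GDSGGP-V", "GDAGGPLV"  # hypothetical aligned rows
    print(ratio_distance(row1, row2), p_distance(row1, row2))
```

Computing such a distance for every pair of rows in the alignment gives the distance matrix from which the tree is built.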
Web link: http://www.ebi.ac.uk
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially developed as a visualization tool for the Pfam CORBA server and client at the EBI, but it is available as a general-purpose alignment editor and is used widely in a variety of web pages (e.g. the EBI ClustalW server and the Pfam protein domain database). Jalview is also a phylogenetic tree drawing program. Phylogenetic relationships are patterns of shared common history between biological replicators.
Web link: http://www.ebi.ac.uk/jalview
2.3.4.2 Methods
Using CLUSTALW
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the drop-down menu “CLUSTALW” was selected.
4. The following parameters were selected from Output and Phylogenetic tree:
• TREE TYPE: nj
• CORRECT DISTANCE: on
• IGNORE GAPS: on
5. The multiple sequence alignment result previously obtained from CLUSTALW was pasted.
6. The program was then run for phylogenetic tree construction.
7. To view the phylogenetic tree, “Show as Phylogram Tree” was clicked.
8. The resulting phylogenetic tree was saved.
Flowchart: EBI home page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Tree type: nj; Correct distance: on; Ignore gaps: on) → Sequences of MSA pasted → Run → Show as Phylogram Tree → Result saved
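The “nj” tree type and the “correct distance” option can be illustrated with two small functions. The sketch assumes that the correction for protein distances is Kimura's formula, K = -ln(1 - D - D²/5), and uses the standard neighbor-joining Q criterion to pick the first pair of sequences to join. The five-taxon distance matrix is a textbook-style toy example, not data from this project.

```python
import math


def kimura_protein(d: float) -> float:
    """Kimura (1983) correction of an observed protein distance d."""
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0:
        raise ValueError("observed distance too large for this correction")
    return -math.log(inner)


def nj_pair_to_join(dist: list[list[float]]) -> tuple[int, int]:
    """Return the indices (i, j) that minimise the neighbor-joining Q criterion."""
    n = len(dist)
    if n < 2:
        raise ValueError("at least two taxa are needed")
    row_sums = [sum(row) for row in dist]
    best, best_q = (0, 1), math.inf
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best


if __name__ == "__main__":
    toy = [  # hypothetical symmetric distance matrix for five taxa
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pair_to_join(toy))
    print(round(kimura_protein(0.5), 4))
```

Note that neighbor joining does not simply join the pair with the smallest distance: in the toy matrix the closest pair is (3, 4), yet the Q criterion selects (0, 1) first.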
Using JalView
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the tools available, “CLUSTALW” was selected for multiple sequence alignment.
4. The parameters chosen were: “Full” for Alignment, “Blosum” for Matrix and “Input” for Output Order.
5. The sequences were pasted in the given window.
6. The program was run.
7. “JalView” was clicked from Results of search.
8. “Neighbour joining tree using JalView” was chosen from Calculate.
9. The phylogram tree was saved.
Flowchart: EBI home page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Alignment: “Fast”; Matrix: “Blosum”; Output order: “Input”) → Sequences pasted → Run → “JalView” from Results → Calculate → Neighbour joining tree using PID → Phylogram tree saved
2.3.4.3 Results
Newick file for phylogenetic tree construction
Figure 2.9: Newick presentation
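For readers unfamiliar with the Newick format, it writes a tree as nested parentheses, with an optional branch length after a colon for each node. The sketch below extracts the leaf names from such a string. The example tree, including its grouping and branch lengths, is entirely hypothetical and is not the tree obtained in this project.

```python
import re


def newick_leaves(newick: str) -> list[str]:
    """Return the leaf names of a Newick tree in left-to-right order."""
    # A leaf label directly follows '(' or ',' and runs up to ':', ',', ')' or ';'.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


if __name__ == "__main__":
    # Hypothetical tree: names borrowed from the sequence list, numbers invented.
    tree = "((P00766:0.01,4CHA:0.01):0.20,(PRSS2:0.15,try14:0.12):0.25,plasminogen:0.40);"
    print(newick_leaves(tree))
```

Each pair of parentheses corresponds to one internal node of the cladogram or phylogram drawn from the file.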
Cladogram
Fig 2.10: Phylogenetic tree (cladogram) from homologous sequences of p00766
Phylogram
Fig 2.11: Phylogenetic tree (phylogram) from homologous sequences of p00766
Phylogenetic tree using JalView
Figure 2.12: Phylogenetic tree by JalView
2.3.5 Secondary Structure Prediction
2.3.5.1 Software/Tools
The secondary structure of a protein depends on its primary sequence. Several software tools can be used to predict secondary structure. Some of them are described below:
2.3.5.1.1 PSI-Pred
PSIPRED is a software tool provided by University College London (UCL). It is widely used to predict secondary structure from sequence. The PSIPRED protein structure prediction server allows a user to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction by e-mail. PSIPRED is a simple and reliable secondary structure prediction method incorporating two feed-forward neural networks that analyse the output obtained from PSI-BLAST (Position-Specific Iterated BLAST). It is a highly accurate method for protein secondary structure prediction.
2.3.5.1.2 Neural Network
A neural network (NN) is a special type of problem-solving algorithm based on the parallel architecture of complex animal neuronal organization. It should not be confused with the hidden Markov model, a separate probabilistic approach that is also used in sequence analysis. A neural network simulates the human learning process by mimicking the networked organization of neurons and synapses. A single neuron, in the computational scheme, is a node in a directed graph, with one or more entering connections designated as inputs and a single leaving connection called the output. To form a network, several neurons are assembled and the outputs of some are connected to the inputs of others. Some nodes have connections that provide input to the entire network; some deliver output from the network to the outside world; and others, which do not interact directly with the outside, form the “hidden” layers.
Fig 2.13: The graphical presentation of HNN
Applying this to the interpretation of genotypic information, neural networks are trained using a large database of input (genotype and treatment) data and output (drug response) data. The model is then tested on a test set of input and output data to see how accurate it is.
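The description of a neuron as a node with several inputs and a single output can be made concrete with a very small feed-forward network. This is a generic illustration only: the sigmoid activation, the layer sizes and all weights are assumptions, and the sketch does not reproduce the networks used by PSIPRED or HNN.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """One node: weighted sum of its inputs followed by the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(inputs, hidden_layer, output_neuron) -> float:
    """Pass the inputs through one hidden layer and a single output node.

    hidden_layer is a list of (weights, bias) pairs; output_neuron is one pair.
    """
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out)


if __name__ == "__main__":
    # Hypothetical weights; a trained network would learn these from data.
    hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
    output_neuron = ([1.2, -0.7], 0.05)
    print(forward([1.0, 0.0], hidden_layer, output_neuron))
```

Training consists of adjusting the weights and biases until the outputs agree with the known answers in the training set, after which the network is checked on the separate test set.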
2.3.5.1.3 Hierarchical Neural Network (HNN)
Hierarchical neural networks consist of multiple neural networks connected in the form of an acyclic graph. Tree-structured neural architectures are a special type of hierarchical neural network. The networks within the graph can be single neurons or complex neural architectures such as multilayer perceptrons or radial basis function networks. Decision trees, hierarchical self-organizing maps, hierarchies of experts, and hierarchical or tree-based classifiers are typical applications of hierarchical neural networks.
2.3.5.2 Methods
Using PSIPRED:
1. Starting with the Bioinformatics Unit page (http://bioinf.cs.ucl.ac.uk), “Secondary structure prediction (PSIPRED)” was selected from “Protein Structure Prediction.” 24
2. The sequence of 1GCT was pasted in FASTA format.
3. An e-mail address was entered.
4. “Predict” was clicked.
5. The results were received by e-mail and then saved.
Flowchart: Bioinformatics Unit (http://bioinf.cs.ucl.ac.uk) → Server accessed → Secondary structure prediction (PSIPRED) → Sequence of 1GCT submitted → Sequence pasted in FASTA format → Predict → Results of PSIPRED received in e-mail → Results saved
Using HNN
1. By entering the link www.expasy.org, the ExPASy Proteomics Server was accessed.
2. From Tools and software packages, “Secondary and tertiary structure prediction” was selected.
3. HNN was chosen from Secondary structure prediction.
4. The 4CHA sequence was pasted in the window.
5. The sequence was submitted to run the program.
6. The result was saved.
Flowchart: www.expasy.org → Secondary and tertiary structure prediction → HNN → Sequence pasted → Submit → Result saved
2.3.5.3 Results
Hierarchical Neural Network result
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
ccccchchhhhchheeeccccccccccceeeecccccceeeccccccccheeeehhhcccccceeeeeec
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
cccccchhhhhhhhhhhhhhcccccceeeccceeeeeecccccccceeeeeecccccccccccceeeeec
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
ccceeecccccccchhhcccccccccccchcchhhhhhhhhhhccccccccccccccceeeecccceeee
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
eeeecccccccccccchhhhhhhhhhhhhhhhccc
Sequence length : 245
HNN :
Alpha helix (Hh) : 56 is 22.86%
310 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 22.45%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 134 is 54.69%
Ambigous states (?) : 0 is 0.00%
Other states : 0 is 0.00%
Figure 2.14: Secondary structure by HNN
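The summary lines of the HNN output (count and percentage per state) are simple tallies over the prediction string, in which h, e and c stand for helix, extended strand and coil. A minimal sketch of that tally follows; the function name and the short test string are assumptions for illustration.

```python
from collections import Counter


def composition(prediction: str) -> dict[str, tuple[int, float]]:
    """Count each predicted state and give its percentage of the sequence."""
    states = [s for s in prediction.lower() if s.isalpha()]
    counts = Counter(states)
    total = len(states)
    return {s: (n, round(100.0 * n / total, 2)) for s, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical short prediction string, not the project's result.
    print(composition("cchhhhee"))
```

The same arithmetic explains the reported figures: 56 helix residues out of 245 is 22.86%, 55 strand residues is 22.45% and 134 coil residues is 54.69%.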
Page | 63
PSIPRED Result
On Tue, 20 Jan 2009 09:03:36 GMT, "Apache" <psipred@cs.ucl.ac.uk> said:
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence
PSIPRED HFORMAT (PSIPRED V2.6 by David Jones)
Conf: 998765777799997599999999987868999938997897079940998998841313
Pred: CCCCCCCCCCCCCCCEECCEECCCCCCCCEEEEEECCCCEEEEEEEECCCEEEECHHHCC
  AA: CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
              10        20        30        40        50        60
Conf: 787579999626553799748997889998999888888785199997888757697801
Pred: CCCCEEEEEEEECCCCCCCCEEEEEEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCEEC
  AA: TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
              70        80        90       100       110       120
Conf: 687998776999898999828854678999988635999787299999987088799887
Pred: CCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCEEEEEEEEECCHHHHHHHCCCCCCCCE
  AA: VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
             130       140       150       160       170       180
Conf: 974899833677888994777119989999999860688879988499887997999999
Pred: EECCCCCCCCCCCCCCCEEEECCCCEEEEEEEEEEECCCCCCCCCEEEEEHHHHHHHHHH
  AA: ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
             190       200       210       220       230       240
Conf: 98559
Pred: HHHCC
  AA: TLAAN
Calculate PostScript, PDF and JPEG graphical output for this result using: http://bioinf3.cs.ucl.ac.uk/cgi-bin/psipred/gra/nph-view2.cgi?id=0eef479d8c802aad.psi
Figure 2.15: Secondary structure by PSI-Pred.
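To compare the position of predicted helices and strands between tools, it is convenient to turn a “Pred” string into a list of segments with start and end positions. The sketch below does this by run-length encoding; the function names and the short example string are assumptions for illustration.

```python
from itertools import groupby


def segments(prediction: str) -> list[tuple[str, int, int]]:
    """Run-length encode a prediction as (state, start, end), 1-based and inclusive."""
    result = []
    position = 1
    for state, run in groupby(prediction):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


def regions(prediction: str, state: str) -> list[tuple[int, int]]:
    """Start and end positions of every run of one state, e.g. 'E' for strand."""
    return [(start, end) for st, start, end in segments(prediction) if st == state]


if __name__ == "__main__":
    # Hypothetical short prediction; the Pred lines above can be joined and passed in.
    print(segments("CCEEEHH"))
    print(regions("CCEEECCEE", "E"))
```

Joining the five “Pred” lines of the PSIPRED output into one string and passing it to regions with the state H or E lists where the predicted helices and strands lie along the sequence.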
Chapter 3: Discussion
3.1 General
The genomic era is characterised by an enormous expansion in the amount of biological information available in the field of molecular biology. The greatest challenge for the molecular biology community is to make sense of these data and to find meaningful ways of exploiting them in practical genomics and proteomics. The obvious answer was to use computers to store, retrieve and manipulate the data to produce meaningful information. From the central dogma of life we know that information passes from genome to proteome through transcription and translation. Transcription is the process of copying DNA into mRNA, and translation is the process of decoding mRNA into protein.
DNA → RNA → PROTEIN
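The flow of information from DNA to RNA to protein can be sketched in a few lines. The code assumes the standard genetic code and takes the coding strand of the DNA as input; the DNA string in the example is hypothetical and chosen only so that the result is short.

```python
BASES = "TCAG"
# Standard genetic code; amino acids are listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """mRNA with the same sequence as the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGTGTGGTGTTTAA"  # hypothetical coding strand
    mrna = transcribe(dna)
    print(mrna, translate(mrna))
```

Real genes also involve promoters, introns and untranslated regions, which this sketch deliberately ignores.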
Bioinformatics tools are mainly being developed to address three central problems:
1. Determination of protein sequence from DNA sequence
2. Determination of protein structure from its primary sequence
3. Determination of protein function from its 3D structure
A database management system (DBMS) is software that organises information into separate entities, each linked to its corresponding attributes. Such software handles the storage, retrieval and manipulation of data far better than manual data storage and management. The aim of our project was to become familiar with basic bioinformatics tools. The protein we chose was chymotrypsin, one of the most thoroughly studied enzymes. The enzyme breaks polypeptides into smaller fragments and therefore belongs to the protease group of proteins. We retrieved homologous protein sequences from the database and performed pairwise and multiple sequence alignment to understand the evolutionary relationships among the sequences. An evolutionary tree was built on the basis of the sequence homology, identity and similarity present in the protein sequences.
3.2 Exploring Database
ENTREZ is a combined database and search engine composed of:
1. PubMed: biomedical literature database
2. PubMed Central: free full-text journal articles
3. Journals: detailed information about the journals
4. MeSH: detailed information about NLM's controlled vocabulary
5. Nucleotide: sequence database (GenBank)
6. Protein: sequence database
7. Genome: whole genome sequence database
8. Structure: 3D macromolecular structures
9. Taxonomy: organisms in GenBank
10. SNP: single nucleotide polymorphisms
11. Gene: gene-centered information
12. Books: online books
13. OMIM: Online Mendelian Inheritance in Man
14. Site search: NCBI web and FTP sites
15. UniGene: gene-oriented clusters of transcript sequences
16. CDD: conserved protein domain database
17. 3D Domains: domains from ENTREZ Structure
18. UniSTS: markers and mapping data
19. PopSet: population study datasets
20. GEO: expression and molecular abundance profiles
21. GEO datasets: experimental sets of GEO data
We used the Protein database of the NCBI gateway to retrieve the chymotrypsin sequence. We also went through other features of the database and explored various kinds of information about the protein sequence. GenBank records are kept in flat-file format, which is composed of three sets of information:
1. Header
2. Features
3. Sequence
The Header part is composed of the following information:
• Locus
• Description
• Accession no.
• GI no.
• Version
• Source (organism)
• Organism (in detail)
• Reference (title, journal, author)
The Features part contains the following information:
• Source
• Gene
• RNA
Lastly, the Sequence part contains the protein sequence in flat-file format. For input to BLAST, however, we needed the sequence in FASTA format, which starts with a “>” sign followed by a short description, then a line break (“Enter”), and then the sequence without any spaces, in “Courier” font. Another important lesson of the project was becoming acquainted with the software of the database. We learned to use the different BLAST programs of the NCBI gateway. Their uses are listed below:
Length | Database | Purpose | BLAST program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to the query | Standard Protein BLAST (blastp)
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Conserved Domains | Find conserved domains in the query | CD-search (RPS-BLAST)
15 residues or longer | Conserved Domains | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
15 residues or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches
Table 3.1: Different uses of BLAST programs.
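The FASTA layout described before Table 3.1 (a “>” header line followed by one or more sequence lines) is simple enough to read with a few lines of code. The following is a minimal sketch; the function name parse_fasta and the toy record are assumptions for illustration.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {header: sequence}; the header is the text after '>'."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += "".join(line.split())  # remove stray spaces
    return records


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    print(parse_fasta(">toy_fragment\nGDSG\nGPLV\n"))
```

Checking a file this way before submitting it to BLAST catches the most common formatting mistakes, such as a missing header line or spaces inside the sequence.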
Apart from these, we also used other useful tools such as CLUSTALW, available through the EBI gateway, for multiple sequence alignment.
3.3 Analyzing Protein Sequence
The protein sequence was 245 residues long and contained 28 serine residues. Chymotrypsin is classified as a serine protease because a serine residue (Serine 195) acts in its catalytic site.
We learned how to search the database for homologous sequences using BLAST tools such as BLASTp and PSI-BLAST.
We learned about the general physicochemical properties of the protein using the ProtParam software. Only the raw sequence was pasted, not the FASTA format. The software gave us an approximate pI value, based on the total number of negatively charged residues (Asp + Glu: 14) and the total number of positively charged residues (Arg + Lys: 18), and a molecular weight, which corresponds approximately to 110 multiplied by the total number of residues. Other information, such as the extinction coefficient, could also have been predicted.
We performed pairwise sequence alignment with homologous sequences obtained by PSI-BLAST. By doing so we identified conserved regions in those protein sequences. The query sequence matched the sequence of the cold-adaptation enzyme of salmon very closely over large conserved regions. This helped us identify the location of the enzyme's binding or active sites, as those sequences remain more or less conserved over evolutionary time. We learned that the catalytic triad responsible for enzymatic activity is composed of Serine 195, Histidine 57 and Aspartate 102. A closer look at the mechanism of the enzyme's reaction revealed an initial covalent modification at the serine residue.
By performing multiple sequence alignment we understood the evolutionary relationships more specifically and were able to represent them as a phylogenetic tree.
The percentage and position of alpha helices and beta sheets were predicted using different secondary structure prediction tools, namely PSIPRED and the Hierarchical Neural Network (HNN). This gave us an idea of the secondary structure of the protein, which included more beta-pleated structure (around 45%) and less alpha-helical structure (around 14%).
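The residue counts and the 110-per-residue mass estimate mentioned above are easy to reproduce for any sequence. The sketch below is a rough stand-in, not the ProtParam implementation: it only counts residues and applies the rule of thumb, and the short test sequence is hypothetical.

```python
from collections import Counter

AVERAGE_RESIDUE_MASS = 110  # daltons per residue, the rule of thumb used in the text


def basic_properties(sequence: str) -> dict[str, int]:
    """Count charged and serine residues and estimate the molecular mass."""
    seq = "".join(sequence.split()).upper()
    counts = Counter(seq)
    return {
        "length": len(seq),
        "negative (Asp + Glu)": counts["D"] + counts["E"],
        "positive (Arg + Lys)": counts["R"] + counts["K"],
        "serine": counts["S"],
        "approx_mass_da": AVERAGE_RESIDUE_MASS * len(seq),
    }


if __name__ == "__main__":
    # Hypothetical fragment; the full protein sequence can be passed in the same way.
    print(basic_properties("DEKRS S"))
```

For a protein of 245 residues the rule of thumb gives 245 × 110 = 26,950 Da, which is only an estimate; a composition-based calculation gives a more exact value.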
3.4 Conclusion
As the field of molecular biology advances, thousands of new proteins are being discovered. Sequencing unknown proteins and determining their structures therefore remain a crucial necessity for researchers. By studying the structure of a known protein, this elementary project has prepared us to work with unknown proteins and to infer their functions in more advanced research.