EXPLORING DATABASE AND ANALYZING PROTEIN SEQUENCE by Aubhishek Zaman

EXPLORING DATABASE AND ANALYZING PROTEIN SEQUENCE COURSE: GEB-207 DEPARTMENT OF GENETIC ENGINEERING AND BIOTECHNOLOGY, UNIVERSITY OF DHAKA.

AUBHISHEK ZAMAN ROLL: 08 4/26/2009

Contents

Chapter 1: About Bioinformatics
1.1 General
1.2 Resources
1.2.1 Gateways
1.2.2 Database
1.2.3 Software or Tools
1.3 Application and Importance
1.4 Project Aim

Chapter 2: Working with Protein Sequences
2.1 General
2.2 Fetching protein sequence from Database
2.2.1 Database
2.2.2 Method
2.2.3 Result
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general, physical, chemical properties of a Protein sequence
2.3.1.1 Software/Tools
2.3.1.2 Method
2.3.1.3 Result
2.3.2 Searching Database for similar sequences
2.3.2.1 Software/Tools
2.3.2.2 Methods
2.3.2.3 Result
2.3.3 Sequence Alignment Study
2.3.3.1 Pair wise alignment
2.3.3.1.1 Software/Programs
2.3.3.1.2 Methods
2.3.3.1.3 Results
2.3.3.2 Multiple Sequence Alignment
2.3.3.2.1 Software/Programs
2.3.3.2.2 Methods
2.3.3.2.3 Results
2.3.4 Phylogenetic tree construction
2.3.4.1 Software/Tools
2.3.4.2 Methods
2.3.4.3 Result
2.3.5 Secondary Structure Prediction
2.3.5.1 Software/Tools
2.3.5.2 Methods
2.3.5.3 Result

Chapter 3: Discussion
3.1 General
3.2 Exploring Database
3.3 Analyzing Protein Sequences
3.4 Conclusion

List of abbreviations

ABBREVIATION    ELABORATION
BLAST           Basic Local Alignment Search Tool
DDBJ            DNA Data Bank of Japan
EBI             European Bioinformatics Institute
EMBL            European Molecular Biology Laboratory
ExPASy          Expert Protein Analysis System
HNN             Hierarchical Neural Network
NCBI            National Center for Biotechnology Information
NCI             National Cancer Institute
NIH             National Institutes of Health
NLM             United States National Library of Medicine
PDB             Protein Data Bank
PSI-BLAST       Position-Specific Iterated BLAST
PSIPRED         PSI-BLAST-based secondary structure prediction
SIB             Swiss Institute of Bioinformatics
URL             Uniform Resource Locator

List of figures

Figure no.   Name of Figure
1.1          Bioinformatics; an interdisciplinary subject
1.2          Submission and updates between three databases
1.3          Use of informatics in drug designing
1.4          The catalytic mechanism of Chymotrypsin
1.5          The overview of the project
2.1          The flow of data from primary data sources into component databases of the Universal Protein Resource
2.2          FASTA format result of P00766
2.3          Graphical presentation of BLASTp results
2.4          Graphical presentation of PSI-BLAST search result
2.5          Graphical representation of pair-wise alignment
2.6          Algorithm of a software performing multiple sequence alignment
2.7          Multiple Sequence Alignment (MSA)
2.8          Multiple Sequence Alignment (MSA) Jalview results
2.9          Newick presentation
2.10         Phylogenetic tree (cladogram) from homologous sequences of P00766
2.11         Phylogenetic tree (phylogram) from homologous sequences of P00766
2.12         Phylogenetic tree by Jalview
2.13         The graphical presentation of HNN
2.14         Secondary structure by HNN
2.15         Secondary structure by PSIPRED

List of Tables

Table no.   Name of Table
1.1         Tools at EBI
1.2         Available tools at Bioinformatics Group - University College London
1.3         Primary Sequence Databases
1.4         Meta-bases
1.5         Software used in the project
2.1         Pair-wise alignment results for retrieved sequences to identify similarities
3.1         Different uses of BLAST programs

List of Web Addresses:
• http://www.ncbi.nlm.nih.gov
• http://www.ebi.ac.uk
• http://bioinf.cs.ucl.ac.uk/psipred/
• http://www.expasy.org
• http://www.pdb.org/

Reference Source:
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, B.F.F. Ouellette and A.D. Baxevanis
• Discovering Genomics, Proteomics and Bioinformatics, A.M. Campbell and L.J. Heyer
• Post-genome Informatics, Minoru Kanehisa
• Bioinformatics: Sequence and Genome Analysis, D.W. Mount
• Bioinformatics for Dummies, J.-M. Claverie and C. Notredame
• www.wikipedia.org

Chapter 1: About Bioinformatics


BIOINFORMATICS is an interdisciplinary subject. It may be described as a blend of biological and computational sciences. Bioinformatics involves the storage, retrieval and manipulation of biological data using computational techniques.

Figure 1.1: Bioinformatics; an interdisciplinary subject (drawing on Computer Science, Biology, Mathematics and Statistics)

1.1 General

Biological data are flooding in at an unprecedented rate. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries and the SWISS-PROT database of protein sequences contained 88,166. On average, the amount of information stored in these databases is doubling every 15 months. Bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying informatics techniques (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

Bio stands for life, and informatics comes from the word information. So, bioinformatics refers to the science that deals with the information that comes from living systems. However, bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
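As a rough illustration of what the 15-month doubling time quoted above implies, the growth can be sketched with a simple exponential model. The function name and the assumption of perfectly regular doubling are ours and only approximate how the databases really grew.

```python
def projected_entries(initial_entries, months_elapsed, doubling_months=15.0):
    """Project database size, assuming it doubles every `doubling_months` months."""
    return initial_entries * 2 ** (months_elapsed / doubling_months)


# GenBank held 8,214,000 entries in August 2000 (figure quoted in the text).
genbank_aug_2000 = 8_214_000
for months in (0, 15, 30, 60):
    print(months, round(projected_entries(genbank_aug_2000, months)))
```

Under this assumption the repository would grow sixteen-fold in five years (four doublings), which is why dedicated storage and retrieval tools became necessary.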

The National Center for Biotechnology Information (NCBI) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information." The terms bioinformatics and computational biology are often used interchangeably. Important sub-disciplines within bioinformatics and computational biology include:
• the development and implementation of tools that enable efficient access to, and use and management of, various types of information
• the development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences

Storing, retrieving and manipulating biological data in a meaningful way to interpret the biological system is the prime objective of bioinformatics. To do so, in the initial phase the data produced by thousands of research teams all over the world are collected and organized in databases specialized for particular subjects. GDB (Genome Database), SWISS-PROT, GenBank and PDB (Protein Data Bank) are some well-known examples. As the information kept growing in size and complexity, the need for specialized tools with diverse algorithmic approaches grew too. This resulted in the use of specialized software such as BLAST, CLUSTALW, BIOEDIT, SCRATCH and Swiss PDB Viewer for better manipulation and sorting of data.

1.2. The Resources

The resources of bioinformatics consist of gateways, databases and software.

Figure 1.2: Data flow for new submissions and updates between three databases (GenBank at NCBI, accessed through Entrez; EMBL at EBI, accessed through SRS; DDBJ at CIB, accessed through getentry)

1.2.1 Gateway

A gateway in Information Technology (IT) can be thought of as an open door through which a user collects specialized information. A gateway can be reached at a specific Uniform Resource Locator (URL).

There are several gateways for software and databases that offer access to many of the sites in bioinformatics. The gateways and databases are listed below:

National Center for Biotechnology Information (NCBI)
Web site: http://www.ncbi.nlm.nih.gov
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI houses genome sequencing data in GenBank and an index of biomedical research articles in PubMed Central and PubMed, as well as other information relevant to biotechnology. In addition to GenBank, NCBI provides Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), the Unique Human Gene Sequence Collection, a Gene Map of the Human genome and a Taxonomy Browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. All these databases are linked through a unique search and retrieval system called Entrez, which also includes cross-referenced information to integrate these resources.

European Bioinformatics Institute (EBI)
Web Site: http://www.ebi.ac.uk
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures. The mission of the EBI is to provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress and to contribute to the advancement in molecular biology and genome research through basic investigator-driven research in bioinformatics.

Table 1.1: Tools at EBI

Tool                         Description
Align                        Pairwise global and local alignment tool (EMBOSS).
ClustalW                     Multiple sequence alignments.
CpG Plot/CpGreport           CpG Island finder and plotting tool (EMBOSS).
GeneMark                     Gene prediction service.
Genetic Code Viewer          Review of genetic code differences.
Wise2                        Compares a protein sequence or a protein profile HMM to a DNA sequence.
Mutation Checker             Sequence validation.
Pepstats/Pepwindow/Pepinfo   EMBOSS programs for basic protein sequence analysis (EMBOSS).
Promoterwise                 Compares two DNA sequences allowing for inversions and translocations, ideal for promoters.
Reverse Translator           Reverse complement checker.
SAPS                         Statistics on protein sequences.
Transeq                      DNA sequence translation tool (EMBOSS).

ExPASy Molecular Biology Server - Expert Protein Analysis System, Swiss Institute of Bioinformatics
Web Site: http://www.expasy.org
ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss Institute of Bioinformatics (SIB) which analyzes protein sequences and structures and two-dimensional gel electrophoresis (2-D PAGE). The server functions in collaboration with the European Bioinformatics Institute. ExPASy also produces the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its computer-annotated supplement, UniProtKB/TrEMBL.

Bioinformatics Group - University College London Web Site: http://www.bioinf.cs.ucl.ac.uk

The Bioinformatics Group was originally founded as the Joint Research Council funded Bioinformatics Unit within the Department of Computer Science at University College London. The group's main aim is to develop and apply state-of-the-art mathematical and computer science techniques to problems now arising in the life sciences, particularly those now appearing in the post-genomic era. Available tools and software are:

Table 1.2: Available tools at Bioinformatics Group - University College London

Protein Structure Prediction
• Threading (THREADER)
• Ab initio folding simulations
• Secondary structure prediction (PSIPRED)
• Protein disorder prediction (DISOPRED)
• Protein domain prediction (DomPred)

Protein Sequence Analysis
• Amino acid substitution matrices
• Hidden Markov Models (collaboration with N. Goldman, Cambridge, & J. Thorne, NCSU)

Genome Analysis
• Genomic Threading Database (GTD)
• Genomic fold recognition (GenTHREADER)
• Genome annotation using software agents

Protein Structure Classification
• Comparison of structure classifications (CATH/SCOP/FSSP)
• CATH (collaboration with J. Thornton & C. Orengo, UCL Biochemistry)

Transmembrane Protein Modelling
• MEMSAT
• Folding In Lipid Membranes (FILM)

Biological Applications of Datamining and Machine Learning Techniques
• Information extraction for biological research (BioRat)

1.2.2 Databases

A database on the internet actually consists of a database management system (DBMS), which has two interfaces: one for the user to query and enter data, and another for management on the host computer. A database is a compilation of entities described by their defined attributes.

A database (or data base) is a collection of data organised so that its contents can easily be accessed, managed and modified by a computer. It is also called a data bank. The most prevalent type of database is the relational database, which organizes the data in tables; multiple relations can be mathematically defined between the rows and columns of each table to yield the desired information. An object-oriented database stores data in the form of objects which are organized in hierarchical classes that may inherit properties from classes higher in the tree structure.

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof), respectively. Sequences are represented in shorthand, using single-letter designations.

There are two main functions of biological databases:
1. To make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article (genome sequences, for example).
2. To make biological data available in computer-readable form. Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.

Databases for bioinformatics can be classified as:
• Primary and added-value databases
• Sequence vs organism databases
• 'Federated' databases: global computer networks … WWW
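To make the idea of a relational database of sequences more concrete, the following is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and the second record are hypothetical and chosen only for illustration. The stored sequence for P00766 is just the first 30 residues of the sequence retrieved later in this report. This is not how NCBI or Swiss-Prot are actually implemented.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, sequence TEXT)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        # Truncated fragment (first 30 residues) of the P00766 sequence.
        ("P00766", "Chymotrypsinogen A", "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQ"),
        # Entirely made-up record, present only to give the table a second row.
        ("X00001", "Hypothetical example protein", "MKTAYIAKQR"),
    ],
)


def fetch_sequence(connection, accession):
    """Return the stored sequence for an accession, or None if it is absent."""
    row = connection.execute(
        "SELECT sequence FROM protein WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None


print(fetch_sequence(conn, "P00766"))
```

Each row is one record (entity) and each column one attribute, so a query by accession number returns the single-letter sequence string in computer-readable form.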

Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and protein structures and DNA and protein expression profiles. Secondary or derived databases are so called because they contain the results of analysis on the primary resources, including information on sequence patterns or motifs, variants and mutations and evolutionary relationships. Information from the literature is contained in bibliographic databases, such as Medline. The following table lists widely used databases for analyzing DNA and protein sequences, together with the gateways that provide them.

Table 1.3: Primary Sequence Databases

Databases                Software tools                                                     Web Site
Nucleic Acid Databases   NCBI (National Center for Biotechnology Information) - GenBank     http://www.ncbi.nlm.nih.gov/
                         EBI (European Bioinformatics Institute) - EMBL                     http://www.ebi.ac.uk/
                         DISC - DNA Information and Stock Center, Japan                     http://www.dna.affrc.go.jp/
Protein Databases        NCBI - GenPept                                                     http://www.ncbi.nlm.nih.gov/
                         ExPASy - SwissProt and TrEMBL                                      http://www.expasy.ch/
                         EBI (European Bioinformatics Institute) - SwissProt, TrEMBL, PIR   http://www.ebi.ac.uk/
                         DISC - DNA Information and Stock Center, Japan                     http://www.dna.affrc.go.jp/

Meta-databases: A meta-database can be considered a database of databases, rather than any one integration project or technology. Meta-databases collect data from different sources and usually make them available in a new and more convenient form, or with an emphasis on a particular disease or organism.

Table 1.4: Meta-bases

Name                                                                                                                                           Web Site
Entrez (National Center for Biotechnology Information)                                                                                         http://www.ncbi.nlm.nih.gov
euGenes (Indiana University)                                                                                                                   http://eugenes.org
GeneCards (Weizmann Inst.)                                                                                                                     http://www.genecards.org
SOURCE (Stanford University)                                                                                                                   http://genome-www4.stanford.edu/cgi-bin/SMD/source/sourceSearch
mGen containing four of the world's biggest databases GenBank, Refseq, EMBL and DDBJ - easy and simple program friendly gene extraction        http://www.cyberindian.com/bioperl/index.html
Bioinformatic Harvester (Karlsruhe Institute of Technology) - Integrating 26 major protein/gene resources                                      http://harvester.fzk.de
MetaBase (KOBIC) - A user contributed database of biological databases                                                                         http://BioDatabase.Org

1.2.3. Software/Tools

Software tools are computer programs for sequence analysis, database construction and management, the study of evolutionary relationships, structural analysis and pathway analysis. Many software tools are integrated into databases.

The Bioinformatics Toolbox offers computational molecular biologists and other research scientists an open and extensible environment in which to explore ideas, prototype new algorithms, and build applications in drug research, genetic engineering, and other genomics and proteomics projects. These tools range from a collection of standalone tools with a common data format under a single, slick standalone or web-based interface, to integrative and extensible bioinformatics workflow development environments. The important software programs in bioinformatics that have been used in our project are given in the following table:

Table 1.5: Software used in the project

Name of the Software          Application and purpose                               Source
ProtParam                     Predict physicochemical properties from sequence      http://www.expasy.org/
BLAST                         Finds regions of local similarity between sequences   http://www.ncbi.nlm.nih.gov/
ClustalW                      Multiple sequence alignment tool                      http://www.ebi.ac.uk/
PSIPRED                       Secondary structure prediction tool                   http://bioinf.cs.ucl.ac.uk/psipred/
Hierarchical Neural Network   Secondary structure prediction tool                   http://www.expasy.org/

1.3 Application of Bioinformatics

Bioinformatics is being used in the following fields:

Gene expression study

Many expression studies have so far focused on devising methods to cluster genes by similarities in expression profiles. This is in order to determine the proteins that are expressed together under different cellular conditions. Briefly, the most common methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical methods originally derived from algorithms to construct phylogenetic trees, and group genes in a bottom-up fashion; genes with the most similar expression profiles are clustered first, and those with more diverse profiles are included iteratively. In contrast, the self-organising map and K-means methods employ a top-down approach in which the user pre-defines the number of clusters for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively until they are optimally clustered.
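The K-means idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any particular expression-analysis package. The gene names and expression values are invented, and for reproducibility the initial cluster centres are taken from the first k profiles rather than assigned randomly as the text describes.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Element-wise mean of a list of expression profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def k_means(profiles, k, max_iterations=100):
    """Cluster expression profiles into k groups; returns one label per gene."""
    # Deterministic start (an assumption): the first k profiles act as centres.
    centroids = [list(p) for p in profiles[:k]]
    labels = [None] * len(profiles)
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:  # no gene changed cluster: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = mean_profile(members)
    return labels


# Invented expression levels for four genes under three conditions.
expression = {
    "geneA": [1.0, 1.2, 0.9],
    "geneB": [5.0, 5.1, 4.8],
    "geneC": [1.1, 0.8, 1.0],
    "geneD": [4.9, 5.3, 5.0],
}
labels = k_means(list(expression.values()), k=2)
print(dict(zip(expression, labels)))
```

Genes with similar profiles (here geneA with geneC, and geneB with geneD) end up sharing a label, which is the sense in which co-expressed genes are grouped together.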

Drug development

One of the earliest medical applications of bioinformatics has been in aiding rational drug design. Figure 1.3 outlines the commonly cited approach, taking the MLH1 gene product as an example drug target. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer (126). Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

At present all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise to have fewer side effects than many of today's medicines.
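The translation step mentioned above (nucleotide sequence to probable amino acid sequence) can be sketched as follows. This is a minimal illustration using the standard genetic code and reading frame 1 only. The input DNA string is made up for the example, and real translation software also handles alternative frames, introns and non-standard codes.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up coding sequence: ATG TGT GGT GTT CCT GCT TAA
print(translate("ATGTGTGGTGTTCCTGCTTAA"))  # prints MCGVPA
```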

Figure 1.3: Use of informatics in drug designing.

Pharmacogenomics

Clinical medicine will become more personalized with the development of the field of pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population show adverse effects to a drug due to sequence variants in their DNA. Today, doctors have to use trial and error to find the best drug to treat a particular patient, as those with the same clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to analyse a patient's genetic profile and prescribe the best available drug therapy and dosage from the beginning.

Gene therapy

Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's defective genes. Currently, this field is in its infancy, with clinical trials ongoing for many different types of cancer and other diseases.

Detection of Antibiotic-resistant pathogens

Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial infection among hospital patients. They have discovered a region made up of a number of antibiotic-resistant genes that may transform the bacterium from a harmless gut bacterium to a menacing invader. The discovery of this region could provide a useful marker for detecting pathogenic strains and help to control the spread of infection in wards.

Agriculture

Bioinformatics tools can be used to analyse the sequenced genomes of plants and animals and elucidate the functions of different genes. This specific genetic knowledge could then be used to produce nutrient-rich, drought-, disease- and insect-resistant plants and to improve the quality of livestock, making them healthier, more disease resistant and more productive.

Insect resistance

Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means that the amount of insecticides being used can be reduced and hence the nutritional quality of the crops is increased.

Improved nutritional quality

Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron and other micronutrients. Scientists have also inserted a gene from yeast into tomato; the result is a plant whose fruit stays longer on the vine and has an extended shelf life.

Biotechnology

The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for practical applications in industry and government-funded environmental remediation. These microorganisms thrive in water temperatures above the boiling point and therefore may provide the DOE, the Department of Defence, and private companies with heat-stable enzymes suitable for use in industrial processes.

Microbial genome applications

Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the environment, our bodies, the air, food and water. Traditionally, a variety of microbial properties have been applied in the baking, brewing and food industries. The arrival of complete genome sequences and their potential to provide a greater insight into the microbial world and its capacities could have broad and far-reaching implications for environment, health, energy and industrial applications.

Waste management

Deinococcus radiodurans is known as the world's toughest bacterium and it is the most radiation-resistant organism known. Scientists are interested in this organism because of its potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals. Microbial Genome Program (MGP) scientists are determining the DNA sequence of Caulobacter crescentus, one of the organisms responsible for sewage treatment.

Maintenance of climatic balance

Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.

Evolutionary Studies

The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea, means that evolutionary studies can be performed in a quest to determine the tree of life and the last universal common ancestor.

Forensic studies

Bioinformatics has created a great opportunity to ease forensic work. It helps investigators identify the right culprit with a high degree of accuracy.

Forensic analysis of microbes

Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis that was used in the summer of 2001 terrorist attack in Florida from closely related anthrax strains.

Bioweapon creation

Scientists have recently built the poliomyelitis virus by entirely artificial means, using genomic data available on the internet and materials from a mail-order chemical supply.

1.4 Project Aim

The aim of our project was to get introduced to the field of bioinformatics. More specifically, the targets were:
• Becoming familiar with biological databases and the tools available to analyze the information in such databases.
• Finding the sequence of the protein and studying its physicochemical properties.
• Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.
• Clustering protein sequences into families of related sequences and developing protein models.
• Developing methods to predict the structure and/or function and resolve the secondary structure.

A well-known protein, Chymotrypsin (Swiss-Prot accession P00766), was studied as the model protein in the project. Chymotrypsin is a proteolytic enzyme. This enzyme catalyzes the hydrolysis of peptide bonds of proteins in the small intestine. It is selective for peptide bonds with aromatic or large hydrophobic side chains (Tyr, Trp, Phe) on the carboxyl side of the bond. Chymotrypsin also catalyzes the hydrolysis of ester bonds. It is termed a serine protease because it has a reactive serine residue in its active site. Three amino acid residues have been found to play the key role in catalysis: Ser195, His57 and Asp102. Together these residues are termed the "Catalytic Triad". Although far apart in the primary structure, protein folding brings these residues close together and into the correct orientation in the tertiary structure. Chymotrypsin was the first serine protease to be discovered. Its crystal structure was first resolved by David Blow in 1967. This discovery provided a key understanding of the catalytic mechanism of a great variety of enzymes. The mechanism of chymotrypsin action is illustrated below.
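The selectivity rule stated above (cleavage on the carboxyl side of Tyr, Trp and Phe) can be turned into a small sketch that predicts where a peptide would be cut. This is a deliberately simplified model of our own: it uses only the three residues named in the text and ignores the exceptions and secondary preferences of the real enzyme. The example peptide is invented.

```python
AROMATIC_RESIDUES = {"F", "W", "Y"}  # Phe, Trp, Tyr, as listed in the text


def chymotrypsin_sites(sequence):
    """Return 1-based positions of residues after which cleavage is predicted."""
    sequence = sequence.upper()
    # The last residue is skipped: there is no peptide bond after it.
    return [
        position
        for position, residue in enumerate(sequence[:-1], start=1)
        if residue in AROMATIC_RESIDUES
    ]


def digest(sequence):
    """Split a sequence into the fragments produced by the predicted cuts."""
    sequence = sequence.upper()
    sites = chymotrypsin_sites(sequence)
    starts = [0] + sites
    ends = sites + [len(sequence)]
    return [sequence[start:end] for start, end in zip(starts, ends)]


peptide = "AGFKLWSTYRA"  # invented example peptide
print(chymotrypsin_sites(peptide))  # prints [3, 6, 9]
print(digest(peptide))              # prints ['AGF', 'KLW', 'STY', 'RA']
```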

Figure 1.4: The Mechanism of Action of Chymotrypsin

Using bioinformatics tools we have performed a number of jobs concerned with Chymotrypsin, such as:
• Retrieving the sequence of the protein.
• Determining the physicochemical properties from the sequence.
• Performing a BLAST search to find similar sequences.
• Pair-wise and multiple sequence alignment of Chymotrypsin with various other protein sequences.
• Construction of a phylogenetic tree to determine the evolutionary relationship based on the proteins that were chosen for multiple sequence alignment.

The overview of the project is shown in the following flow chart:

Figure 1.5: The overview of the project (flow chart: sequence database browsing or manual input -> protein sequence file -> protein sequence analysis, covering primary structure (physico-chemical properties), secondary structure prediction and searching databases for similar sequences -> sequence comparison by pair-wise alignment (identity, similarity) and multiple sequence alignment -> phylogenetic tree construction)

Chapter 2: Working with Protein Sequences

2.1 General

With the availability of hundreds of complete genome sequences from both prokaryotes and eukaryotes, efforts are now focused on the identification and functional analysis of the proteins encoded by these genomes. This urgency has resulted in a burst of fresh information linked to proteomics, and with it came the need for protein sequence databases.

Figure 2.1: The flow of data from primary data sources into component databases of the Universal Protein Resource. (Primary sources shown: DDBJ/EMBL/GenBank, WGS, Ensembl, RefSeq, VEGA, PDB, FlyBase, WormBase, patent data and Sub/peptide data. Component databases shown: UniProt Archive, UniProt Knowledgebase (Swiss-Prot + TrEMBL), UniProt NREF 100, UniProt NREF 90, UniProt NREF 50, Proteome set and IPI.)

2.2 Fetching protein sequence from Database

2.2.1 Database

We searched the protein database incorporated with the NCBI gateway. It is the NIH protein sequence database, an annotated collection of all publicly available protein sequences. The complete release notes for the current version of the protein database are available on the NCBI ftp site. A new release is made every two months.

2.2.2 Methods
1. The search for the desired sequence was started from the NCBI home page (http://www.ncbi.nlm.nih.gov).
2. “Protein” was chosen in the “Search” box and the Chymotrypsin sequence was searched for.
3. P00766 was selected from the list and clicked.
4. The information available on the page was read carefully.
5. “FASTA” was selected from Display.
6. The amino acid sequence in FASTA format was saved.
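The manual steps above were carried out through the NCBI web pages. As a hedged alternative, the same record can in principle be requested programmatically through NCBI's Entrez E-utilities `efetch` service. The sketch below only builds the request URL and leaves the actual download commented out, since it needs network access. The function names are our own, and this was not the method used in the project.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession):
    """Build an E-utilities request for one protein record in FASTA format."""
    query = urlencode(
        {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"


def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(build_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_fasta_url("P00766"))
# fasta_text = fetch_fasta("P00766")  # uncomment to query NCBI over the network
```

The `db`, `id`, `rettype` and `retmode` parameters mirror the manual choices of the “Protein” database, the accession P00766 and the “FASTA” display format.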

Flow chart of the method: NCBI home page (http://www.ncbi.nlm.nih.gov) -> search ‘Protein’ for Chymotrypsin -> P00766 is selected -> GenPept format -> FASTA selected from Display -> sequence saved

2.2.3 Results

The sequence was retrieved and saved in Microsoft Word format for further use.

>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN

Figure 2.2: FASTA format result of p00766
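For readers who want to reuse the saved sequence in a program, the FASTA layout shown in Figure 2.2 (a header line starting with ">" followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch. The function name is our own, and dedicated libraries offer more complete parsers.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The record retrieved in this project, copied from the result above.
fasta_text = """>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
"""
header, sequence = parse_fasta(fasta_text)[0]
print(header, len(sequence))
```

Joining the sequence lines into one string is what later tools (ProtParam, BLAST, ClustalW) expect when the sequence is pasted into their input boxes.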

2.3 Analyzing Protein sequence

2.3.1 Understanding the general physicochemical properties of a protein sequence

Proteins are condensation polymers of amino acid residues. However, a linear arrangement of residues by itself does not reveal much about protein structure or function. It is the 3D or tertiary native structure (quaternary in the case of a multisubunit protein) that describes a protein best. Though primary structure analysis is not a good method for functional and structural analysis of a protein, it can provide valuable information regarding the protein's behaviour in solution, its molecular weight, extinction coefficient, etc. Thus general physicochemical properties can be a good indicator for understanding protein activities on a broader scale.

2.3.1.1 Software

ProtParam (web link: www.expasy.org)

ProtParam is a tool that computes various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL, or for a user-entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).

The following parameters are revealed by ProtParam: molecular weight; number of residues; average residue weight; charge; iso-electric point; for each physico-chemical class of amino acid, the number and molar percent; probability of protein expression in E. coli inclusion bodies; extinction coefficient at 1 mg/ml (A280); molar extinction coefficient (A280).
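Several of these parameters follow directly from counting residues. As an illustration only (this is not ProtParam's own code), the sketch below counts each amino acid in the P00766 sequence and derives the molar percentages and the totals of charged residues; the names `SEQ` and `composition` are assumptions made for this example.

```python
from collections import Counter

# P00766 sequence as retrieved in Section 2.2.
SEQ = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)


def composition(seq):
    """Map each residue to (count, molar percent) for the given sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: (counts[aa], 100.0 * counts[aa] / n) for aa in sorted(counts)}


comp = composition(SEQ)
negative = comp["D"][0] + comp["E"][0]  # Asp + Glu
positive = comp["R"][0] + comp["K"][0]  # Arg + Lys

for aa, (count, percent) in comp.items():
    print(f"{aa} {count:3d} {percent:5.1f}%")
print("negatively charged:", negative, "positively charged:", positive)
```

The counts can be checked against the amino acid composition listed in the ProtParam result in Section 2.3.1.3.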

2.3.1.2 Method

1. The address of the ExPASy server (http://www.expasy.org) was typed into the address bar of Internet Explorer.
2. The “Toolbox” option was clicked.
3. The “Sequence Analysis” option was chosen.
4. From the list, the “ProtParam” option was clicked.
5. The sequence of P00766 (Swiss-Prot accession number) was pasted into the sequence box.
6. The “Run” command button was clicked.
7. The results obtained were saved in a Microsoft Word document.

[Flowchart: www.expasy.org -> ExPASy home page -> Toolbox -> Sequence Analysis -> ProtParam was selected -> Sequence of P00766 was pasted -> Compute parameters -> Result -> Save]

2.3.1.3 Results

ProtParam

User-provided sequence:

        10         20         30         40         50         60
CGVPAIQPVL SGLSRIVNGE EAVPGSWPWQ VSLQDKTGFH FCGGSLINEN WVVTAAHCGV

        70         80         90        100        110        120
TTSDVVVAGE FDQGSSSEKI QKLKIAKVFK NSKYNSLTIN NDITLLKLST AASFSQTVSA

       130        140        150        160        170        180
VCLPSASDDF AAGTTCVTTG WGLTRYTNAN TPDRLQQASL PLLSNTNCKK YWGTKIKDAM

       190        200        210        220        230        240
ICAGASGVSS CMGDSGGPLV CKKNGAWTLV GIVSWGSSTC STSTPGVYAR VTALVNWVQQ

TLAAN

Number of amino acids: 245
Molecular weight: 25666.1
Theoretical pI: 8.52

Amino acid composition:
Ala (A)   22    9.0%
Arg (R)    4    1.6%
Asn (N)   14    5.7%
Asp (D)    9    3.7%
Cys (C)   10    4.1%
Gln (Q)   10    4.1%
Glu (E)    5    2.0%
Gly (G)   23    9.4%
His (H)    2    0.8%
Ile (I)   10    4.1%
Leu (L)   19    7.8%
Lys (K)   14    5.7%
Met (M)    2    0.8%
Phe (F)    6    2.4%
Pro (P)    9    3.7%
Ser (S)   28   11.4%
Thr (T)   23    9.4%
Trp (W)    8    3.3%
Tyr (Y)    4    1.6%
Val (V)   23    9.4%
Pyl (O)    0    0.0%
Sec (U)    0    0.0%
(B)        0    0.0%
(Z)        0    0.0%
(X)        0    0.0%

Total number of negatively charged residues (Asp + Glu): 14
Total number of positively charged residues (Arg + Lys): 18

Atomic composition:
Carbon    C   1127
Hydrogen  H   1783
Nitrogen  N    307
Oxygen    O    353
Sulfur    S     12

Formula: C1127H1783N307O353S12
Total number of atoms: 3582

Extinction coefficients:
Extinction coefficients are in units of M-1 cm-1, at 280 nm measured in water.
Ext. coefficient 50585, Abs 0.1% (=1 g/l) 1.971, assuming ALL Cys residues appear as half cystines
Ext. coefficient 49960, Abs 0.1% (=1 g/l) 1.947, assuming NO Cys residues appear as half cystines

Estimated half-life:
The N-terminal of the sequence considered is C (Cys).
The estimated half-life is: 1.2 hours (mammalian reticulocytes, in vitro).
                            >20 hours (yeast, in vivo).
                            >10 hours (Escherichia coli, in vivo).

Instability index:
The instability index (II) is computed to be 15.27
This classifies the protein as stable.

Aliphatic index: 82.37

Grand average of hydropathicity (GRAVY): 0.051
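Some of the reported values can be reproduced by hand from the residue counts. The sketch below is an illustration under stated assumptions rather than ProtParam's implementation: it assumes the commonly cited per-residue absorbances at 280 nm (Tyr 1490, Trp 5500, cystine 125 M-1 cm-1), the aliphatic index formula A + 2.9 V + 3.9 (I + L) in mole percent, and the Kyte-Doolittle hydropathy scale for GRAVY. All function names are hypothetical.

```python
SEQ = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)

# Kyte-Doolittle hydropathy values (assumed scale for GRAVY).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extinction_coefficient(seq, cystines=True):
    """Molar extinction coefficient at 280 nm from Tyr, Trp and cystine counts."""
    value = 1490 * seq.count("Y") + 5500 * seq.count("W")
    if cystines:
        value += 125 * (seq.count("C") // 2)  # every two Cys form one cystine
    return value


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    def pct(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


MOLECULAR_WEIGHT = 25666.1  # value taken from the ProtParam result above

print(extinction_coefficient(SEQ), extinction_coefficient(SEQ, cystines=False))
print(round(extinction_coefficient(SEQ) / MOLECULAR_WEIGHT, 3))
print(round(aliphatic_index(SEQ), 2), round(gravy(SEQ), 3))
```

Under these assumptions the numbers agree with the ProtParam output: dividing the extinction coefficient by the molecular weight gives the absorbance of a 0.1% (1 g/l) solution.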

2.3.2 Searching database for similar sequences

2.3.2.1 Software tools

Standard Protein-Protein BLAST (BLASTp)

BLASTp is the NCBI-BLAST program for comparing a protein query sequence to a protein database. The original BLAST program was developed at NCBI. It takes protein sequences in FASTA format, GenBank accession numbers or GI numbers and compares them against the NCBI protein databases. BLASTp is used both for identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. However, when sequence similarity spans the whole sequence, blastp will report a global alignment, which is the preferred result for protein identification purposes. It can be used from the NCBI website.

Position-Specific Iterated BLAST (PSI-BLAST)

PSI-BLAST uses an iterative search in which the sequences found in one round of searching are used to build a score model (profile) for the next round of searching. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search, and the results of each “iteration” are used to refine the profile. This iterative searching strategy results in increased sensitivity. It can be used from the NCBI website.
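The idea of a position-specific score model can be illustrated with a toy profile. The sketch below is a deliberately simplified assumption, not the PSI-BLAST algorithm itself: it uses a uniform background frequency and a flat pseudocount, whereas the real program weights sequences and uses more elaborate pseudocounts. The five 8-residue segments are taken from the active-site region of sequences retrieved in this project; `build_pssm` and `score_segment` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (simplification)
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores of an ungapped segment against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Ungapped segments from P00766, 3PTN, 1DDJ, 1HCG and 1ZHR (see the FASTA list).
segments = ["GDSGGPLV", "GDSGGPVV", "GDAGGPLV", "GDSGGPHV", "GDSGGPLS"]
pssm = build_pssm(segments)

print(round(pssm[0]["G"], 2))  # fully conserved column: high score
print(round(pssm[2]["A"], 2))  # residue seen once: score closer to zero
print(round(pssm[2]["W"], 2))  # residue never seen: slightly negative
```

In this toy model a fully conserved column scores higher than a variable one, which mirrors the behaviour described above for conserved and weakly conserved positions.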

2.3.2.2 Method

BLASTp

1. The home page of the NCBI was reached using the address http://www.ncbi.nih.gov/
2. The BLAST option was selected.
3. From the Protein portion, the “Protein-protein BLAST (blastp)” option was selected.
4. In the “Search” box, the sequence of 1GCT was given as input.
5. The “nr” option was selected from the “Choose database” option.
6. “BLOSUM 62” was selected in the “Matrix” box.
7. The “BLAST” command button was clicked.
8. The result obtained was saved in a Microsoft Word document.

[Flowchart: NCBI home page -> BLAST -> Standard protein-protein BLAST -> Sequence of 1GCT pasted in Search window -> BLOSUM62 selected from MATRIX option -> BLAST run -> Format clicked -> Result saved]

PSI-BLAST

1. Starting with NCBI, “BLAST” search was selected and the options on that page were examined.
2. PSI-BLAST was chosen from “Protein BLAST”.
3. The protein (1GCT) sequence, saved previously in FASTA format, was pasted into the “Search” window.
4. From the MATRIX options, “BLOSUM62” was selected.
5. “PSI-BLAST” was chosen.
6. “BLAST” was run.
7. The result was saved.

[Flowchart: NCBI home page -> BLAST -> PSI-BLAST -> Sequence pasted on “Search” window -> “BLOSUM62” selected from MATRIX option -> “Format for PSI-BLAST” chosen -> “BLAST” run -> Result saved]

2.3.2.3 Results

2.3.2.3.1 BLASTp results

Figure 2.3: Graphical presentation of BLASTp results

More such results...

Alignments

> ref|XP_608091.3| PREDICTED: chymotrypsinogen B1 [Bos taurus]
Length=300

GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus] (10 or fewer PubMed links)

Score = 496 bits (1278), Expect = 5e-139, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  56   CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  115

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  116  TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  175

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  176  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  235

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  236  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  295

Query  241  TLAAN  245
            TLAAN
Sbjct  296  TLAAN  300

> sp|P00766.1|CTRA_BOVIN RecName: Full=Chymotrypsinogen A; Contains: RecName: Full=Chymotrypsin A chain A; Contains: RecName: Full=Chymotrypsin A chain B; Contains: RecName: Full=Chymotrypsin A chain C
pdb|2CGA|A Chain A, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis And Refinement Of A New Crystal Form At 1.8 Angstroms Resolution
pdb|2CGA|B Chain B, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis And Refinement Of A New Crystal Form At 1.8 Angstroms Resolution
pdb|1ACB|E Chain E, Crystal And Molecular Structure Of The Bovine Alpha-Chymotrypsin-Eglin C Complex At 2.0 Angstroms Resolution
pdb|1CGI|E Chain E, Three-Dimensional Structure Of The Complexes Between Bovine Chymotrypsinogen A And Two Recombinant Variants Of Human Pancreatic Secretory Trypsin Inhibitor (Kazal-Type)
pdb|1CGJ|E Chain E, Three-Dimensional Structure Of The Complexes Between Bovine Chymotrypsinogen A And Two Recombinant Variants Of Human Pancreatic Secretory Trypsin Inhibitor (Kazal-Type)
pdb|1EX3|A Chain A, Crystal Structure Of Bovine Chymotrypsinogen A (Tetragonal)
pdb|1GL1|A Chain A, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria
pdb|1GL1|B Chain B, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria
pdb|1GL1|C Chain C, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-C, An Inhibitor From The Insect Locusta Migratoria
pdb|1GL0|E Chain E, Structure Of The Complex Between Bovine Alpha-Chymotrypsin And Pmp-D2v, An Inhibitor From The Insect Locusta Migratoria
pdb|1K2I|1 Chain 1, Crystal Structure Of Gamma-Chymotrypsin In Complex With 7-Hydroxycoumarin
pdb|1P2M|A Chain A, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2M|C Chain C, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2N|A Chain A, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2N|C Chain C, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2O|A Chain A, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2O|C Chain C, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2Q|A Chain A, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1P2Q|C Chain C, Structural Consequences Of Accommodation Of Four Non-Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin And Chymotrypsin
pdb|1OXG|A Chain A, Crystal Structure Of A Complex Formed Between Organic Solvent Treated Bovine Alpha-Chymotrypsin And Its Autocatalytically Produced Highly Potent 14-Residue Peptide At 2.2 Resolution
pdb|1T7C|A Chain A, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T7C|C Chain C, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8L|A Chain A, Crystal Structure Of The P1 Met Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8L|C Chain C, Crystal Structure Of The P1 Met Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8M|A Chain A, Crystal Structure Of The P1 His Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8M|C Chain C, Crystal Structure Of The P1 His Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8N|A Chain A, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8N|C Chain C, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8O|A Chain A, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1T8O|C Chain C, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine Chymotrypsin Complex
pdb|1CHG|A Chain A, Chymotrypsinogen, 2.5 Angstroms Crystal Structure, Comparison With Alpha-Chymotrypsin, And Implications For Zymogen Activation
pdb|1GCD|A Chain A, Refined Crystal Structure Of "aged" And "non-Aged" Organophosphoryl Conjugates Of Gamma-Chymotrypsin
Length=245

GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus] (10 or fewer PubMed links)

Score = 495 bits (1274), Expect = 2e-138, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240

Query  241  TLAAN  245
            TLAAN
Sbjct  241  TLAAN  245

> pdb|1GCT|A Chain A, Is Gamma-Chymotrypsin A Tetrapeptide Acyl-Enzyme Adduct Of Gamma-Chymotrypsin?
pdb|2GCT|A Chain A, Structure Of Gamma-Chymotrypsin In The Range Ph 2.0 To Ph 10.5 Suggests That Gamma-Chymotrypsin Is A Covalent Acyl-Enzyme Adduct At Low Ph
pdb|1GHB|E Chain E, A Second Active Site In Chymotrypsin? The X-Ray Crystal Structure Of N-Acetyl-D-Tryptophan Bound To Gamma-Chymotrypsin
pdb|2GMT|A Chain A, Three-Dimensional Structure Of Chymotrypsin Inactivated With (2s) N-Acetyl-L-Alanyl-L-Phenylalanyl-Chloroethyl Ketone: Implications For The Mechanism Of Inactivation Of Serine Proteases By Chloroketones
pdb|3GCH|A Chain A, Chemistry Of Caged Enzymes. Binding Of Photoreversible Cinnamates To Chymotrypsin
Length=245

Score = 486 bits (1251), Expect = 6e-136, Method: Compositional matrix adjust.
Identities = 241/245 (98%), Positives = 241/245 (98%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAIQPVLSGL  IVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct  1    CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
            TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLPSASDDFAAGTTCVTTGWGLTRY  ANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct  121  VCLPSASDDFAAGTTCVTTGWGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240

Query  241  TLAAN  245
            TLAAN
Sbjct  241  TLAAN  245

More such results

2.3.2.3.2 PSI-BLAST result (after 3 iterations)

Figure 2.4: Graphical presentation of PSI-BLAST search result

Similar more results...

Alignments

>emb|CAG00821.1| unnamed protein product [Tetraodon nigroviridis]
Length=263

Score = 437 bits (1125), Expect = 3e-121, Method: Composition-based stats.
Identities = 165/245 (67%), Positives = 194/245 (79%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVP I PV++G SRIVNGEEAVP SWPWQVSLQ+ TGFHFCGGSLINENWVVTAAHC V
Sbjct  19   CGVPGIPPVITGYSRIVNGEEAVPHSWPWQVSLQEYTGFHFCGGSLINENWVVTAAHCNV  78

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS  V+ GE D+ S++E IQ +++ +VFK+  YNS TINNDITL+KL++ A  +  VS
Sbjct  79   RTSHRVILGEHDRSSNNENIQVMQVGQVFKHPNYNSYTINNDITLIKLASPAQLNIRVSP  138

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VC+   SD F  G  CVT+GWGLTRY   +TP RLQQ +LPLL+N  C+K+WG+KI D M
Sbjct  139  VCVAETSDVFPGGMKCVTSGWGLTRYNAPDTPPRLQQVALPLLTNEECRKHWGSKITDLM  198

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            +CAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS  CS S+PGVYARVT L  W+ Q
Sbjct  199  VCAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGFCSVSSPGVYARVTMLRAWMDQ  258

Query  241  TLAAN  245
             +AAN
Sbjct  259  IIAAN  263

>gb|AAT45254.1| chymotrypsinogen 2-like protein [Sparus aurata]
gb|ABE68638.1| chymotrypsinogen II precursor [Sparus aurata]
Length=264

Score = 435 bits (1120), Expect = 1e-120, Method: Composition-based stats.
Identities = 165/245 (67%), Positives = 192/245 (78%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CG PAI PV++G SRIVNGEEAVP SWPWQVSLQD TGFHFCGGSLINENWVVTAAHC V
Sbjct  20   CGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENWVVTAAHCNV  79

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS  V+ GE D+ S++E IQ +K+ KVFK+  YN  TINNDI L+KL++ A     VS
Sbjct  80   RTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSP  139

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VC+   +D+F  G  CVT+GWGLTRY   +TP  LQQASLPLL+N  C++YWG+KI + M
Sbjct  140  VCVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLM  199

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS TC+ + PGVYARVT L  W+ Q
Sbjct  200  ICAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQ  259

Query  241  TLAAN  245
             +AAN
Sbjct  260  IIAAN  264

>ref|XP_536782.2| PREDICTED: similar to chymotrypsinogen B1 isoform 1 [Canis familiaris]
Length=264

GENE ID: 479650 CTRB1 | chymotrypsinogen B1 [Canis lupus familiaris]

Score = 432 bits (1112), Expect = 1e-119, Method: Composition-based stats.
Identities = 188/245 (76%), Positives = 211/245 (86%), Gaps = 0/245 (0%)

Query  1    CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV  60
            CGVPAI+PVL+GLSRIVNGE+AVPGSWPWQVSLQD TGFHFCGGSLI+E+WVVTAAHCGV
Sbjct  20   CGVPAIEPVLNGLSRIVNGEDAVPGSWPWQVSLQDSTGFHFCGGSLISEDWVVTAAHCGV  79

Query  61   TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA  120
             TS +VVAGEFDQ SS E IQ LKIA+VFKN K+N  T+ NDITLLKL+T A FS+TVS
Sbjct  80   RTSHLVVAGEFDQSSSEENIQVLKIAEVFKNPKFNMFTVRNDITLLKLATPARFSETVSP  139

Query  121  VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM  180
            VCLP A+D+F  G  CVTTGWG T+Y    TPD+LQQA+LPLLSN  CKK+WG+KI D M
Sbjct  140  VCLPQATDEFPPGLMCVTTGWGRTKYNANKTPDKLQQAALPLLSNAECKKFWGSKITDVM  199

Query  181  ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ  240
            ICAGASGVSSCMGDSGGPLVC+K+GAWTLVGIVSWGS TCSTS P VY+RVT L+ WVQ+
Sbjct  200  ICAGASGVSSCMGDSGGPLVCQKDGAWTLVGIVSWGSGTCSTSVPAVYSRVTELIPWVQE  259

Query  241  TLAAN  245
             LAAN
Sbjct  260  ILAAN  264

Similar search results...

Sequences retrieved from PSI-BLAST results in FASTA format:

>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>pdb|1S0Q|TRYPSINOGEN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3PTN|TRYPSIN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|5PTP|HYDROLASE
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDXGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3EST|ELASTASE
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNNGTEQ
YVGVQKIVVHPYWNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQ
LAQTLQQAYLPTVDYAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVS
RLGCNVTRKPTVFTRVSAYISWINNVIASN
>pdb|1ZHR|FactorXI
IVGGTASVRGEWPWQVTLHTTSPTQRHLCGGSIIGNQWILTAAHCFYGVESPKILRVYSGILNQAEIAED
TSFFGVQEIIIHDQYKMAESGYDIALLKLETTVNYADSQRPISLPSKGDRNVIYTDCWVTGWGYRKLRDK
IQNTLQKAKIPLVTNEECQKRYRGHKITHKMICAGYREGGKDACKGDSGGPLSCKHNEVWHLVGITSWGE
GCAQRERPGVYTNVVEYVDWILEKTQAV
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
>gi|899286|Hepsin
TSGFFCVDEGRLPHTQRLLEVISVCDCPRGRFLAAICQDCGRRKLPVDRIVGGRDTSLGRWPWQVSLRYD
GAHLCGGSLLSGDWVLTAAHCFPERNRVLSRWRVFAGAVAQASPHGLQLGVQAVVYHGGYLPFRDPNSEE
NSNDIALVHLSSPLPLTEYIQPVCLPAAGQALVDGKICTVTGWGNTQYYGQQAGVLQEARVPIISNDVCN
GADFYGNQIKPKMFCAGYPEGGIDACQGDSGGPFVCEDSISRTPRWRLCGIVSWGTGCALAQKPGVYTKV
SDFREWIFQAIKTHSEASGMVTQL
>pdb|1SPJ|KALLIKREIN
IVGGWECEQHSQPWQAALYHFSTFQCGGILVHRQWVLTAAHCISDNYQLWLGRHNLFDDENTAQFVHVSE
SFPHPGFNMSLLENHTRQADEDYSHDLMLLRLTEPADTITDAVKVVELPTEEPEVGSTCLASGWGSIEPE
NFSFPDDLQCVDLKILPNDECKKAHVQKVTDFMLCVGHLEGGKDTCVGDSGGPLMCDGVLQGVTSWGYVP
CGTPNKPSVAVRVLSYVKWIEDTIAENS
>pdb|1HCG|FACTOR X
IVGGQECKDGECPWQALLINEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHE
VEVVIKHNRFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGRQS
TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTYFVTGIVSWGEGCA
RKGKYGIYTKVTAFLKWIDRSMKTRGLPKAK
>pdb|1HYL|COLLAGENASE
IINGYEAYTGLFPYQAGLDITLQDQRRVWCGGSLIDNKWILTAAHCVHDAVSVVVYLGSAVQYEGEAVVN
SERIISHSMFNPDTYLNDVALIKIPHVEYTDNIQPIRLPSGEELNNKFENIWATVSGWGQSNTDTVILQY
TYNLVIDNDRCAQEYPPGIIVESTICGDTSDGKSPCFGDSGGPFVLSDKNLLIGVVSFVSGAGCESGKPV
GFSRVTSYMDWIQQNTGIKF
>gi|Cold-Adaption Enzymes [Salmon]
IVGGYECKAYSQAHQVSLNSGYHFCGGSLVNENWVVSAAHCYKSRVEVRLGEHNIKVTEGSEQFISSSRV
IRHPNYSSYNIDNDIMLIKLSKPATLNTYVQPVALPTSCAPAGTMCTVSGWGNTMSSTADSDKLQCLNIP
ILSYSDCNDSYPGMITNAMFCAGYLEGGKDSCQGDSGGPVVCNGELQGVVSWGYGCAEPGNPGVYAKVCI
FSDWLTSTMASY

2.3.3 Sequence Alignment Study

2.3.3.1 Pair-wise alignment

Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and nonidentical characters can either be placed in the same column as a mismatch or opposite a gap in the other sequence. In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are said to be similar.

There are two types of sequence alignment, global and local. In global alignment, an attempt is made to align the entire sequence, using as many characters as possible, up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. In local alignment, stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches or subalignments in the aligned sequences. Local alignments are more suitable for aligning sequences that are similar along some of their lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved region or domain.

Pairwise alignment is the process by which a pair of sequences are compared to one another by a sequence alignment technique, either global or local. It can also be examined with a dot plot.

Global alignment:
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA

Local alignment:
-------TGKG--------
-------AGKG--------

Distinction between global and local alignments of two sequences.
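The difference between the two alignment types can be made concrete with the classical dynamic programming recurrences. The sketch below is a minimal illustration, assuming a simple scoring scheme (match +2, mismatch -1, linear gap penalty -2) rather than the BLOSUM62 matrix used elsewhere in this report; the two test sequences are invented so that they share only the stretch TGKG, and the function names are hypothetical.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: leading gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are never allowed to drop below zero, so an alignment
            # can start and end anywhere inside the sequences.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


seq_a = "PPPTGKGQQQ"
seq_b = "WWWTGKGHHH"
print("global:", global_score(seq_a, seq_b))
print("local: ", local_score(seq_a, seq_b))
```

With these inputs the local score counts only the shared TGKG island (4 matches), while the global score is lowered by the six mismatched flanking positions that it is forced to include.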

2.3.3.1.1 Software/Program

BLAST 2 Sequences

This tool produces the alignment of two given sequences using the BLAST engine for local alignment. While the standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g., different isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the BLAST algorithm for pair-wise DNA-DNA or protein-protein sequence comparison. The results of BLAST 2 Sequences give information about the similarities and identities between the subject protein and the query protein. It also gives a graphical representation of the alignment.

2.3.3.1.2 Methods

1. Starting with NCBI, “BLAST” search was selected.
2. “Align two sequences (bl2seq)” was chosen from Special databases.
3. “blastp” was chosen from Program and, along with it, “BLOSUM62” was automatically selected in the Matrix options.
4. The query sequence was pasted from the saved file into the first window.
5. The subject sequence was pasted from file into the second window.
6. “Align” was clicked.
7. The results were saved.

[Flowchart: NCBI home page -> BLAST -> Align two sequences (bl2seq) -> “blastp” chosen from Program -> Query & Subject seq. pasted in separate window -> Align -> Result saved]

2.3.3.1.3 Result

Pair-wise alignment results were obtained separately for each of the sequences. Among those, one particular result, for P00766 and the cold adaptation enzyme, is given below.

Figure 2.5: Graphical representation of pair-wise alignment

Table 2.1: Pair-wise alignment results for retrieved sequences to identify similarities. Sequence 01 in every attempt is Sq(P00766), S=Bos taurus, L=245.

Attempt | Sequence 02 (Sq; S; L) | Score | Expect | Identities | Positives | Gaps
01 | Cold adaptation enzyme; salmon; 231 | 164 bits (414) | 1e-38 | 97/231 (41%) | 137/231 (59%) | 12/231 (5%)
02 | pdb[4CHA]; -; 245 | 494 bits (1271) | 5e-138 | 241/245 (98%) | - | 0/245 (0%)
03 | 1GCT (chymotrypsin); -; 245 | 485 bits (1249) | 2e-135 | 241/245 (98%) | 241/245 (98%) | 4/245 (1%)
04 | CTRB2 protein; human; 263 | 417 bits (1072) | 6e-115 | 199/245 (81%) | 215/245 (87%) | 0/245 (0%)
05 | CTRL protein; -; 269 | 294 bits (752) | 7e-78 | 132/246 (53%) | 174/246 (70%) | 1/246 (0%)
06 | Ela 3 protein; mouse; 255 | 197 bits (502) | 7e-49 | 111/253 (43%) | 153/253 (60%) | 16/253 (6%)
07 | Chymotrypsin-like protein; Sparus aurata; - | 353 bits (907) | 8e-96 | 165/245 (67%) | 192/245 (78%) | 0/245 (0%)
08 | -; zebrafish; 261 | 347 bits (890) | 7e-94 | 166/247 (67%) | 197/247 (79%) | 2/247 (0%)
09 | Trypsinogen; -; 223 | 175 bits (444) | 4e-42 | 98/232 (42%) | 140/232 (60%) | 11/232 (4%)
10 | 3PTN (trypsin); -; 223 | 175 bits (444) | 4e-42 | 98/232 (42%) | 140/232 (60%) | 11/232 (4%)
11 | PRSS2; Bos taurus; 247 | 179 bits (454) | 3e-43 | 104/233 (44%) | 139/233 (59%) | 4/233 (2%)
12 | Tryptase; human; 267 | 166 bits (428) | 2e-39 | 92/237 (38%) | 126/237 (53%) | 14/237 (5%)
13 | Beta tryptase; gorilla; 275 | 164 bits (416) | 7e-39 | 91/237 (38%) | 127/237 (53%) | 14/237 (5%)
14 | try14; Macaca mulatta; 247 | 171 bits (434) | 6e-41 | 102/233 (43%) | 134/233 (57%) | 11/233 (4%)
15 | Hydrolase; -; 223 | 173 bits (439) | 1e-41 | 97/232 (41%) | 139/232 (59%) | 11/232 (4%)
16 | Elastase; -; 240 | 162 bits (409) | 4e-38 | 95/241 (39%) | 137/241 (56%) | 12/241 (4%)
17 | Factor XI; human; 238 | 169 bits (428) | 3e-40 | 93/238 (39%) | 125/238 (52%) | 10/238 (4%)
18 | Plasminogen; human; 247 | 171 bits (432) | 1e-40 | 95/253 (37%) | 137/253 (54%) | 17/253 (6%)
19 | Protease; mast cell; 274 | 162 bits (411) | 3e-38 | 90/239 (37%) | 127/239 (53%) | 14/239 (5%)
20 | Hepsin; human; 304 | 159 bits (403) | 2e-37 | 95/249 (38%) | 130/249 (52%) | 23/243 (9%)
21 | Kallikrein; human; 238 | 132 bits (331) | 5e-29 | 83/245 (33%) | 123/245 (50%) | 23/245 (9%)
22 | Factor X; human; 241 | 120 bits (302) | 1e-25 | 72/237 (30%) | 117/237 (49%) | 15/237 (6%)
23 | Collagenase; human; 230 | 98 bits (244) | 6e-19 | 74/235 (31%) | 117/235 (49%) | 21/235 (8%)

Here, Sq = sequence, S = source, L = length; a dash marks a value that is not given in the table.

2.3.3.2 Multiple Sequence Alignment

One of the major contributions of molecular biology to evolutionary analysis is the discovery that the DNA sequences of different organisms are often related. Similar genes are conserved across widely divergent species, often performing a similar or even identical function, and at other times mutating or rearranging to perform an altered function through the forces of natural selection. Thus, many genes are represented in highly conserved forms in organisms. Through simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to alteration may be analyzed.

Because the potential for learning about the structure and function of molecules by multiple sequence alignment (MSA) is so great, computational methods have received a great deal of attention. In MSA, sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as described in Section 2.3.3.1 for the alignment of two sequences.

Computationally, MSA presents several difficult challenges. First, finding an optimal alignment of more than two sequences that includes matches, mismatches, and gaps, and that takes into account the degree of variation in all of the sequences at the same time, is very difficult. The dynamic programming algorithm used for optimal alignment of pairs of sequences can be extended to three sequences, but for more than three sequences only a small number of relatively short sequences may be analyzed.
Thus, approximate methods are used, including (1) a progressive global alignment of the sequences, starting with an alignment of the most alike sequences and then building an alignment by adding more sequences, (2) iterative methods that make an initial alignment of groups of sequences and then revise the alignment to achieve a more reasonable result, (3) alignments based on locally conserved patterns found in the same order in the sequences, and (4) use of statistical methods and probabilistic models of the sequences. A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in a column of an MSA. Finally, the placement and scoring of gaps in the various sequences of an MSA presents an additional challenge.

The MSA of a set of sequences may also be viewed as an evolutionary history of the sequences. If the sequences in the MSA align very well, they are likely to be recently derived from a common ancestor sequence. Conversely, a group of poorly aligned sequences shares a more complex and distant evolutionary relationship. The task of aligning a set of sequences, some more closely and others less closely related, is identical to that of discovering the evolutionary relationships among the sequences.

As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies considerably with sequence similarity. On the one hand, if the amount of sequence variation is minimal, it is quite straightforward to align the sequences, even without the assistance of a computer program. On the other hand, if the amount of sequence variation is great, it may be very difficult to find an optimal alignment of the sequences because so many combinations of substitutions, insertions, and deletions, each predicting a different alignment, are possible.
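One common way to obtain a cumulative score for a column of an MSA is the sum-of-pairs approach, in which every pair of characters in the column is scored and the pair scores are added. The sketch below is only an illustration of that idea: the match, mismatch and gap values are arbitrary assumptions chosen for readability, not the BLOSUM matrix or gap penalties used by any particular program, and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (toy values, assumed for illustration)."""
    if a == "-" and b == "-":
        return 0          # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap        # a residue aligned against a gap
    return match if a == b else mismatch


def column_score(column):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def alignment_score(rows):
    """Sum-of-pairs score of a whole alignment given as equal-length strings."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(column_score(col) for col in zip(*rows))


if __name__ == "__main__":
    toy_alignment = ["AC-", "AG-", "ACT"]
    print(alignment_score(toy_alignment))
```

With a real substitution matrix, `pair_score` would look up the residue pair in the matrix instead of using fixed match and mismatch values.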

Figure 2.6: Algorithm of a software performing multiple sequence alignment


2.3.3.2.1 Software/ Tools

CLUSTALW

CLUSTALW is a general purpose, fully automated multiple sequence alignment program for DNA or protein sequences. It produces biologically meaningful multiple sequence alignments of divergent sequences and returns the best match over the total length of the input sequences, be they protein or nucleic acid. The program proceeds in the following steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree using the neighbor-joining method.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Evolutionary relationships can also be viewed as cladograms or phylograms. The sequence alignment is performed as a global alignment.

JalView

Jalview is a multiple alignment editor written entirely in Java. It was initially intended as a visualization tool for the PFAM CORBA server and client at the EBI, but is available as a general purpose alignment editor. It is used widely in a variety of web pages (e.g. the EBI clustalw server and the PFAM protein domain database). Jalview is also a phylogenetic tree drawing program. Phylogenetic relationships are patterns of shared common history between biological replicators.

2.3.3.2.2 Method
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen from the drop down menu.
3. From the tools available, “CLUSTALW” was selected for multiple sequence alignment.
4. “Full” was chosen from the alignment option.
5. “Blosum” was chosen from Matrix.
6. “Input” was selected from the output order.
7. The sequences similar to our query protein sequence were pasted from file in FASTA format in the given window.
8. The program was run.
9. “Show Colour” was clicked.
10. The result was saved.

Flow chart: EBI Home page → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters: Alignment-Fast; Matrix-blosum; output order-Input → Sequences pasted → Run → Show Colour → Result saved

2.3.3.2.3 Results

Clustal results are most informative when sequences that introduce large initial gaps are omitted, because the multiple sequence alignment here is a global alignment process. After omitting the sequences that caused too many gaps when matched against the p00766 sequence, we had 14 meaningful sequences overall. They are:

>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS


CLUSTAL W results

Figure 2.7: Multiple Sequence Alignment(MSA)


JalView result

Figure 2.8: Multiple Sequence Alignment(MSA) Jalview results

Similar results were obtained when the output order parameter was set to “aligned” instead of “input”.

2.3.4 Phylogenetic tree Construction

2.3.4.1 Software/Tools

CLUSTALW


Multiple sequence comparisons help highlight weak sequence similarity and shed light on structure, function, or origin. The most widely used programs for global multiple sequence alignment are from the Clustal series of programs. CLUSTALW and CLUSTALX are progressive alignment programs that proceed in the following steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree using the neighbor-joining method.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
ClustalW is used to align DNA or protein sequences in order to elucidate their relatedness as well as their evolutionary origin.

CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. The initial pairwise alignments are calculated using an enhanced dynamic programming algorithm, and the genetic distances used to create the phylogenetic tree are calculated by dividing the total number of mismatched positions by the total number of matched positions. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or “tree graph distances”).
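The distance calculation described above can be sketched in a few lines. This is a minimal illustration of the stated ratio (mismatched positions divided by matched positions) applied to two already aligned sequences. Skipping gap columns is an assumption made here for simplicity, the function names are hypothetical, and the sketch is not a reproduction of the CLUSTALW source code.

```python
def pairwise_distance(aligned_a, aligned_b):
    """Mismatches divided by matches for two aligned sequences of equal length.

    Columns where either sequence has a gap ("-") are skipped (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    mismatches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        if a == b:
            matches += 1
        else:
            mismatches += 1
    if matches == 0:
        raise ValueError("no matched positions, distance is undefined")
    return mismatches / matches


def distance_table(aligned):
    """Distances for every pair in a dict mapping sequence name to aligned sequence."""
    names = sorted(aligned)
    return {
        (x, y): pairwise_distance(aligned[x], aligned[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


if __name__ == "__main__":
    toy = {"seqA": "CGVPAIQ", "seqB": "CGIPAIK", "seqC": "CG-PAIQ"}
    for pair, dist in distance_table(toy).items():
        print(pair, round(dist, 3))
```

A table of this kind is the input that a neighbor-joining step would use to build the guide tree.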

Web link: http://www.ebi.ac.uk

JalView

Jalview is a multiple alignment editor written entirely in Java. It was initially intended as a visualization tool for the Pfam CORBA server and client at the EBI, but is available as a general purpose alignment editor. It is used widely in a variety of web pages (e.g. the EBI clustalw server and the PFAM protein domain database). Jalview is also a phylogenetic tree drawing program. Phylogenetic relationships are patterns of shared common history between biological replicators.

Web link: http://www.ebi.ac.uk/jalview


2.3.4.2 Methods

Using CLUSTALW
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the drop down menu “CLUSTALW” was selected.
4. The following parameters were selected from Output and Phylogenetic tree:
• TREE TYPE: nj
• CORRECT DISTANCE: on
• IGNORE GAPS: on
5. The multiple sequence alignment result previously obtained from CLUSTALW was pasted.
6. The program was then run for phylogenetic tree construction.
7. To view the phylogenetic tree, “Show as Phylogram Tree” was clicked.
8. The resulting phylogenetic tree was saved.

Flow chart: EBI home page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Tree Type-nj; Correct distance-on; ignore gaps-on → Sequences of MSA pasted → Run → Show as Phylogram Tree → Result saved

Using JalView
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the tools available, “CLUSTALW” was selected for multiple sequence alignment.
4. The parameters chosen were: “Full” for Alignment, “Blosum” for Matrix and “Input” for Output Order.
5. The sequences were pasted in the given window.
6. The program was run.
7. “JalView” was clicked from Results of search.
8. “Neighbour joining tree using JalView” was chosen from Calculate.
9. The phylogram tree was saved.

Flow chart: EBI Home Page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters: Alignment-“Fast”, Matrix-“Blosum”, Output Order-“Input” → Sequences pasted → Run → “JalView” from Results → Calculate → Neighbour joining tree using PID → Phylogram Tree saved

2.3.4.3 Results

Newick file for Phylogenetic Tree construction

Figure 2.9: Newick presentation
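A Newick file stores a tree as nested parentheses, with each leaf written as a name optionally followed by a colon and a branch length. As a rough illustration of how such a file can be read, the sketch below pulls the leaf names out of a Newick string with a regular expression. The example tree and its branch lengths are made up for illustration and are not the tree obtained in this project; quoted names and comments, which the full format allows, are not handled.

```python
import re

# A leaf name directly follows "(" or "," and runs until the next
# structural character. Internal node labels follow ")" and are skipped.
LEAF_PATTERN = re.compile(r"[(,]\s*([^\s():,;]+)")


def leaf_names(newick):
    """Return the leaf names of a simple (unquoted) Newick string, in file order."""
    return LEAF_PATTERN.findall(newick)


if __name__ == "__main__":
    # Hypothetical tree, branch lengths invented for the example.
    example = "(P00766:0.01,4CHA:0.00,(CTRB2:0.10,CTRL:0.30):0.05);"
    print(leaf_names(example))
```

For real analyses a dedicated tree library is a safer choice, since it also keeps the branch lengths and the nesting.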

Cladogram

Fig 2.10: Phylogenetic Tree (cladogram) from homologous sequences of p00766

Phylogram

Fig 2.11: Phylogenetic Tree (Phylogram) from homologous sequences of p00766

Phylogenetic tree using JalView

Figure 2.12: Phylogenetic tree by JalView


2.3.5 Secondary Structure Prediction

2.3.5.1 Software/ Tools

A protein's secondary structure depends on its primary sequence. Several software tools can be used to predict secondary structure. Some of them are listed below:

2.3.5.1.1 PSI-Pred

PSIPRED is a software tool provided by University College London (UCL). It is widely used to predict secondary structure from sequence. The PSIPRED protein structure prediction server allows a user to submit a protein sequence, perform a prediction of their choice and receive the results of the prediction via e-mail. PSIPRED is a simple and reliable secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated BLAST). It is a highly accurate method for protein secondary structure prediction.

2.3.5.1.2 Neural Network

A Neural Network (NN) is a special type of problem solving algorithm based on the parallel architecture of complex animal neuronal organization. Hidden Markov Models are another class of statistical model used for similar sequence analysis problems. A neural network simulates the human learning process by mimicking the networked organization of neurons and synapses. A single neuron, in the computational scheme, is a node in a directed graph, with one or more entering connections designated as inputs and a single leaving connection called the output. To form a network, several neurons are assembled and the outputs of some are connected to the inputs of others. Some nodes have connections that provide input to the entire network; some deliver output information from the network to the outside world; and others, which do not interact directly with the outside, form the “hidden” layers.
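The description of a neuron as a node with several weighted inputs and a single output can be made concrete with a small sketch. The code below runs one forward pass through a network with one hidden layer. The sigmoid activation, the layer sizes and all weights are assumptions chosen for illustration; this is not the architecture or the trained parameters of PSIPRED or HNN.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, bias):
    """One node: weighted sum of the entering connections, then a single output."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


def forward(inputs, hidden_layer, output_layer):
    """Feed the inputs through a hidden layer, then through an output layer.

    Each layer is a list of (weights, bias) tuples, one tuple per neuron.
    """
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return [neuron(hidden, w, b) for w, b in output_layer]


if __name__ == "__main__":
    # Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
    hidden_layer = [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)]
    output_layer = [([1.2, -0.7], 0.05)]
    print(forward([1.0, 0.0], hidden_layer, output_layer))
```

Training consists of adjusting the weights and biases so that the outputs match known examples; that step is omitted here.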

Fig 2.13: The graphical presentation of HNN

Applying this to the interpretation of genotypic information, neural networks are trained using a large database of input (genotype and treatment) data and output (drug response) data. The model is then tested on a testing set of input and output data to see how accurate it is.

2.3.5.1.3 Hierarchical Neural Network (HNN)

Hierarchical neural networks consist of multiple neural networks connected in the form of an acyclic graph. Tree-structured neural architectures are a special type of hierarchical neural network. The networks within the graph can be single neurons or complex neural architectures such as multilayer perceptrons or radial basis function networks. Decision trees, hierarchical self-organizing maps, hierarchies of experts, and hierarchical or tree-based classifiers are typical applications for hierarchical neural networks.

2.3.5.2 Methods

Using PSIPRED:
1. Starting with the Bioinformatics Unit page (http://bioinf.cs.ucl.ac.uk), “Secondary structure prediction (PSIPRED)” was selected from “Protein Structure Prediction.”
2. The sequence of 1GCT was pasted in FASTA format.
3. Email address was entered.
4. Predict was clicked.
5. The results were obtained in email and then saved.

Flow chart: Bioinformatics Unit (http://bioinf.cs.ucl.ac.uk) → Server accessed → Secondary structure prediction (PSIPRED) → Sequence of 1GCT submitted → Sequence pasted in FASTA format → Predict → Results of PSIPRED received in email → Results saved

Using HNN
1. By entering the link www.expasy.org, the ExPASy Proteomics Server was accessed.
2. From Tools and software packages, “Secondary and tertiary structure prediction” was selected.
3. HNN was chosen from Secondary structure prediction.
4. 4CHA sequence was pasted in the window.
5. The sequence was submitted to run the program.
6. The result was saved.

Flow chart: www.expasy.org → Secondary and tertiary structure prediction → HNN → Sequence pasted → Submit → Result saved

2.3.5.3 Results

Hierarchical Neural Network result

        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
ccccchchhhhchheeeccccccccccceeeecccccceeeccccccccheeeehhhcccccceeeeeec
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
cccccchhhhhhhhhhhhhhcccccceeeccceeeeeecccccccceeeeeecccccccccccceeeeec
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
ccceeecccccccchhhcccccccccccchcchhhhhhhhhhhccccccccccccccceeeecccceeee
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
eeeecccccccccccchhhhhhhhhhhhhhhhccc

Sequence length : 245

HNN :
Alpha helix (Hh) : 56 is 22.86%
310 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 22.45%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 134 is 54.69%
Ambiguous states (?) : 0 is 0.00%
Other states : 0 is 0.00%

Figure 2.14: Secondary structure by HNN
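The percentages in the HNN summary are simply the count of each predicted state divided by the sequence length. A minimal sketch of that tally is given below, assuming the prediction is available as a string of the letters h, e and c as in the output above; the function name and the dictionary layout are hypothetical.

```python
from collections import Counter

# State letters as they appear in the HNN output shown above.
STATE_NAMES = {"h": "Alpha helix", "e": "Extended strand", "c": "Random coil"}


def summarize_states(prediction):
    """Return {state letter: (count, percent)} for a predicted state string."""
    states = prediction.lower()
    if not states:
        raise ValueError("empty prediction string")
    counts = Counter(states)
    total = len(states)
    return {
        letter: (counts.get(letter, 0), round(100.0 * counts.get(letter, 0) / total, 2))
        for letter in STATE_NAMES
    }


if __name__ == "__main__":
    toy_prediction = "cchhhheecc"
    for letter, (count, percent) in summarize_states(toy_prediction).items():
        print(f"{STATE_NAMES[letter]} ({letter}): {count} is {percent:.2f}%")
```

Applied to counts of 56, 55 and 134 out of 245 residues, the same arithmetic gives the 22.86%, 22.45% and 54.69% reported by HNN.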


PSIPRED Result

On Tue, 20 Jan 2009 09:03:36 GMT, "Apache" <psipred@cs.ucl.ac.uk> said:

PSIPRED PREDICTION RESULTS

Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence

PSIPRED HFORMAT (PSIPRED V2.6 by David Jones)

Conf: 998765777799997599999999987868999938997897079940998998841313
Pred: CCCCCCCCCCCCCCCEECCEECCCCCCCCEEEEEECCCCEEEEEEEECCCEEEECHHHCC
  AA: CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
              10        20        30        40        50        60

Conf: 787579999626553799748997889998999888888785199997888757697801
Pred: CCCCEEEEEEEECCCCCCCCEEEEEEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCEEC
  AA: TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
              70        80        90       100       110       120

Conf: 687998776999898999828854678999988635999787299999987088799887
Pred: CCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCEEEEEEEEECCHHHHHHHCCCCCCCCE
  AA: VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
             130       140       150       160       170       180

Conf: 974899833677888994777119989999999860688879988499887997999999
Pred: EECCCCCCCCCCCCCCCEEEECCCCEEEEEEEEEEECCCCCCCCCEEEEEHHHHHHHHHH
  AA: ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
             190       200       210       220       230       240

Conf: 98559
Pred: HHHCC
  AA: TLAAN

Calculate PostScript, PDF and JPEG graphical output for this result using: http://bioinf3.cs.ucl.ac.uk/cgi-bin/psipred/gra/nph-view2.cgi?id=0eef479d8c802aad.psi


Figure 2.15: Secondary structure by PSI-Pred.
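To read a PSIPRED prediction as a list of structural elements rather than a string of letters, the Pred lines can be joined and collapsed into runs of identical states. The sketch below shows one way to do this; it assumes the Pred lines have already been concatenated into a single string of H, E and C characters, and the function names are hypothetical.

```python
from itertools import groupby


def segments(pred):
    """Collapse a prediction string into (state, start, end) runs, 1-based inclusive."""
    result = []
    position = 1
    for state, run in groupby(pred):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


def segments_of(pred, state):
    """Only the runs of one state, for example "H" for helices."""
    return [seg for seg in segments(pred) if seg[0] == state]


if __name__ == "__main__":
    toy_pred = "CCHHHEEC"
    print(segments(toy_pred))
    print(segments_of(toy_pred, "H"))
```

The start and end positions use the same residue numbering as the ruler printed under each PSIPRED block.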


Chapter 3: Discussion


3.1 General

The genomic era is characterised by an enormous expansion in the amount of biological information available in the field of molecular biology. The greatest challenge for the molecular biology community is to make sense of these data and to find meaningful ways to exploit them in practical genomics and proteomics. The obvious answer was to use computers to store, retrieve and manipulate the data to produce meaningful information. From the central dogma of life we know that information passes from genome to proteome through transcription and translation. Transcription is the process of copying DNA into mRNA, and translation is the process of decoding mRNA into protein.

DNA → RNA → PROTEIN

Bioinformatics tools are mainly being developed targeting three central biological processes:

1. Determination of protein sequence from DNA sequence
2. Determination of protein structure from its primary sequence
3. Determination of protein function from its 3D structure

A database management system (DBMS) is software that organises a collection of information into separate entities with corresponding attributes linked to them. Such software handles data storage, retrieval and manipulation far better than manual data storage and management.

Our project aim was to get familiar with the basic bioinformatics tools. The protein specimen we took was chymotrypsin. It is one of the most well studied samples in enzyme study. The enzyme is responsible for breaking polypeptides into smaller fragments. By definition the enzyme belongs to the protease group of proteins. We extracted homologous protein sequences from the database and performed pairwise sequence alignment and multiple sequence alignment to understand the evolutionary relationship among the sequences. An evolutionary tree was built on the basis of the sequence homology, identity and similarity present in the protein sequences.

3.2 Exploring Database

ENTREZ is a combined database and search engine composed of:

1. PubMed: biomedical literature database
2. PubMed Central: free full text journal articles
3. Journals: detailed information about the journals
4. MeSH: detailed information about NLM's controlled vocabulary
5. Nucleotide sequence database (GenBank)
6. Protein sequence database
7. Genome: whole genome sequence database
8. Structure: 3D macromolecular structure
9. Taxonomy: organisms in GenBank
10. SNP: single nucleotide polymorphism
11. GENE: gene-centered information
12. Books: online books
13. OMIM: Online Mendelian Inheritance in Man
14. Site search: NCBI web and FTP sites
15. UniGene: gene oriented clusters of transcript sequences
16. CDD: conserved protein domain database
17. 3D domain: domain from ENTREZ structure
18. UniSTS: markers and mapping data
19. PopSet: population study datasets
20. GEO: expression and molecular abundance profiles
21. GEO datasets: experimental sets of GEO data.

We used the Protein database of the NCBI gateway to retrieve the chymotrypsin sequence. We also went through other features of the database and explored various information about the protein sequence. GenBank information is kept in flat file format, which is composed of three sets of information:

1. Header
2. Features
3. Sequence

The header part is composed of the following information:
• Locus
• Description
• Accession no.
• GI no.
• Version
• Source (organism)
• Organism (in detail)
• Reference (title, journal, author)

The features part contains the following information:
• source
• Gene
• RNA

Lastly, the sequence part contains the protein sequence in flat file format. For input to BLAST, however, we needed the sequence in FASTA format, which starts with a “>” sign followed by a short description, then a line break (“Enter”) and the sequence itself without any spaces, shown in “Courier” font. Another important outcome of the project was getting acquainted with the software of the database. We learned the use of the different BLAST programs of the NCBI gateway. Their uses are listed below:

Length | Database | Purpose | BLAST program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | Standard Protein BLAST (blastp)
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Conserved Domains | Find conserved domains in the query | CD-search (RPS-BLAST)
15 residues or longer | Conserved Domains | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
15 residues or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches

Table 3.1: Different uses of BLAST programs.
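The FASTA layout described in this section (a “>” line with a short description, followed by the sequence with no spaces) is simple enough to read and write with a few lines of code. The sketch below is a minimal reader and writer. The 70-character line width matches the sequence listings in this report but is otherwise an arbitrary assumption, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line.replace(" ", ""))  # the sequence must not contain spaces
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(description, sequence, width=70):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [">" + description]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_fasta("example peptide (made up)", "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQ", width=10)
    print(text)
    print(parse_fasta(text))
```

Text produced by `to_fasta` can be pasted directly into a sequence input window such as the BLAST query box.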

Apart from this, we also used other useful tools from the EBI gateway, such as CLUSTALW for multiple sequence alignment.

3.3 Analyzing Protein Sequence

The protein sequence was 245 residues long and contained 28 serine residues. It is classed as a serine protease because a serine residue (Serine 195) in its active site carries out catalysis.

We learned how to search the database for homologous sequences using BLAST tools like BLASTp and PSI-BLAST.


We learned about the general physicochemical properties of the protein by using the ProtParam software. Only the raw sequence was pasted, not the FASTA format. The software reported an approximate pI value together with the total number of negatively charged residues (Asp + Glu: 14) and the total number of positively charged residues (Arg + Lys: 18), as well as the molecular weight, which can be roughly estimated by multiplying the total residue number by 110. Other information, such as the extinction coefficient, could also have been predicted.

We performed pairwise sequence alignment with homologous sequences derived by PSI-BLAST. By doing so we identified conserved regions in those protein sequences. The query sequence shared stretches of conserved regions very closely with the sequence of a cold adaptation enzyme of salmon. This helped us identify the location of the enzyme's binding or active sites, as those sequences remain more or less conserved over the evolutionary period. We learned that the catalytic triad responsible for enzymatic activity is composed of Serine 195, Histidine 57 and Aspartate 102. An in-depth look at the mechanism of the enzyme's reaction revealed an initial covalent modification at the serine residue.

By performing multiple sequence alignment we understood the evolutionary relationship even more specifically and were able to present the relationship as a phylogenetic tree.

The percentage and position of alpha helices and beta sheets were predicted by using different tools for secondary structure prediction, namely PSIPRED and Hierarchical Neural Network (HNN). This gave us an idea of the secondary structure of the protein, which included more beta-pleated structure (around 45%) and less alpha helix structure (around 14%).
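The charged-residue counts and the rough molecular weight estimate mentioned for ProtParam can be reproduced with simple counting. The sketch below is only an approximation of that idea: it uses the 110 per residue average stated above rather than the individual residue masses, it does not compute pI, and the function names are hypothetical.

```python
def charge_counts(sequence):
    """Return (negatively charged, positively charged) residue counts.

    Negative: Asp (D) + Glu (E). Positive: Arg (R) + Lys (K).
    """
    seq = sequence.upper()
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


def rough_molecular_weight(sequence, average_residue_mass=110.0):
    """Rough mass estimate in daltons: residue count times an average residue mass."""
    return len(sequence) * average_residue_mass


if __name__ == "__main__":
    toy_peptide = "DEKRKA"  # made-up peptide for illustration
    print(charge_counts(toy_peptide))
    print(rough_molecular_weight(toy_peptide))
```

For a 245-residue protein the rough estimate comes to 245 × 110 = 26,950 Da.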

3.4 Conclusion

As the field of molecular biology advances, thousands of new proteins are being discovered, so sequencing unknown proteins and determining their structure remains a pressing necessity for researchers. By studying the structure of a known protein, this elementary project has prepared us to work with unknown proteins and to infer their functions during advanced research activities.
