Intelligent Computing in Bioinformatics: 10th International Conference, ICIC 2014, Taiyuan, China, August 3-6, 2014, Proceedings, 1st Edition, De-Shuang Huang
Predicting the Outer/Inner Beta Strands in Protein Beta Sheets Based on the Random Forest Algorithm
Li Tang1,2,*, Zheng Zhao1, Lei Zhang3,*, Tao Zhang3,**, and Shan Gao3,**
1 School of Computer Science and Technology, Tianjin University, Tianjin, P.R. China tangli0831@yeah.net
2 Information Science and Technology Department, Tianjin University of Finance and Economics, Tianjin, P.R. China
3 Key Lab. of Bioactive Materials, Ministry of Education and The College of Life Sciences, Nankai University, Tianjin, P.R. China {zhni,zhangtao,gao_shan}@nankai.edu.cn
Abstract. The beta sheet, one of the three common forms of regular secondary structure in proteins, plays an important role in protein function. The beta strands in a beta sheet can be classified as outer or inner strands. Because protein primary sequences carry the information that determines how strands are arranged in the beta sheet topology, we introduce an approach that uses the random forest algorithm to predict whether a beta strand is outer or inner. We describe each strand with nine features based on the hydrophobicity, the hydrophilicity, the side-chain mass and other properties of the beta strands. Among five machine learning methods, the random forest classifier reaches the best prediction accuracy, 89.45% with 10-fold cross-validation. This result demonstrates that there are significant differences between the outer and inner beta strands in beta sheets. The findings of this study can be used to arrange beta strands in a beta sheet without any prior structure information. They can also help to better understand the mechanisms of protein beta sheet formation.
Keywords: beta sheet, beta strand, protein secondary structure, random forest algorithm.
1 Introduction
Protein secondary structure is an important bridge for understanding a protein's three-dimensional structure from its amino acid sequence[1-3]. Investigating protein secondary structure helps in determining protein structure, as well as in designing new proteins[4]. The beta sheet (also β-pleated sheet) is one of the three common forms of regular secondary structure in proteins. Statistics from the PDB database[5] show that more than 75% of proteins with known structures contain beta sheets. When predicting the structure of these beta-sheet-containing proteins, assigning beta strands to a beta sheet can reduce the search space of ab initio methods[13, 45]. Moreover, beta sheets play important roles in protein function, particularly in the formation of the protein aggregates observed in many human diseases, notably Alzheimer's disease.
* These authors contributed equally to this paper.
Fig. 1. Illustration of beta strand pairs and configurations. Arrows show the amide (N) to carbonyl (C) direction of the beta strands. Hydrogen bonds are represented by dotted lines.
In a beta sheet, beta strands are paired by hydrogen bonds in a parallel or antiparallel arrangement (Fig. 1). A beta sheet forms a topology (Fig. 2b), which can be described by three components: the group of beta strands in the beta sheet, the order of these beta strands at the sequence level (Fig. 3), and the configuration of the beta strand pairs (parallel or antiparallel). The order of the beta strands in a beta sheet topology differs from their order at the sequence level (Fig. 2). As described in the Protein Data Bank Contents Guide[6], beta strands are listed and numbered according to their order at the sequence level. In this study, the first and the last strand in a beta sheet are defined as outer strands, whereas the other strands are defined as inner strands (Fig. 3).
Many studies have addressed the beta sheet topology at different levels. The mechanisms and rules of beta sheet formation have been investigated and simulated by theoretical and experimental methods[7-10]. Some efforts focus on the prediction of residue contact maps, which can be used to construct the beta sheet topology[11, 12]. Other researchers have predicted parallel/antiparallel beta strand pairs[13-15], based on the non-random distribution and pairing preferences of amino acids in parallel and antiparallel beta strand pairs[16-19]. Several methods use machine learning algorithms to predict the topology of certain kinds of beta sheets[2, 20, 21]. Although much progress has been made in some aspects of beta sheet studies, the mechanisms by which beta strands form beta sheets are not yet fully understood[7].
Fig. 2. (a) Seven strands of protein 1VJG in sequence order. (b) Beta-sheet topology of protein 1VJG. (c) Protein 1VJG rendered in Rasmol.
Because the two outer beta strands occupy the first and last positions in a beta sheet, we hypothesize that outer and inner strands have different conservation properties at the sequence level. In this study, we predict the outer/inner beta strands in beta sheets using the Random Forest algorithm (see Materials and Methods), extracting the features from the protein primary sequences.
Fig. 3. Illustration of the beta strand numbering in the beta sheets of protein 1VJG in the PDB file. The numbers 1 and 5, marked with circles, denote the outer beta strands, whereas the numbers marked with boxes denote the inner ones. Strand number 1, spanning residues 43 to 51, corresponds to the second strand in sequence order.
2 Materials and Methods
2.1 Datasets
The protein structure dataset we used comes from the PISCES database server, established by Wang et al.[22, 23]. To examine the classification accuracy precisely via cross-validation, an appropriate sequence identity cutoff is necessary to avoid redundancy and homology bias[24, 25]. PISCES combines PSI-BLAST and structure-based alignments to determine sequence identities, and produces lists of sequences from the Protein Data Bank (PDB) using a number of entry- and chain-specific criteria and mutual sequence identity according to the needs of the study. In our investigation, we use a non-redundant dataset (cullpdb_pc25_res2.0_R0.25_d090516_chains4260) with a sequence identity cutoff of 25%. The crystal structures have a resolution of 2.0 Å or better and an R-factor below 0.25. We import the set into the Sheet Pair Database[26] for easier data management and screening. Unsuitable samples, such as protein chains that contain non-standard amino acids or disordered regions[27, 28] and any patterns with a chain break or heteroatom, are excluded. We treat the outer beta strands as positive samples and the inner ones as negative samples (Fig. 3). The final dataset contains 1,205 proteins, with 11,424 outer beta strands and 13,285 inner ones.
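The positive/negative labelling rule can be sketched in a few lines of Python. This is only an illustration of the rule stated above (first and last strand of a sheet are outer); the function name `label_strands`, the use of per-sheet strand numbers as input, and the rejection of sheets with fewer than two strands are assumptions, not part of the original pipeline.

```python
def label_strands(strand_numbers):
    """Label the strands of one beta sheet.

    strand_numbers: the strand numbers within a single sheet (hypothetical input).
    Returns {strand_number: 1 or 0}, where 1 = outer (positive sample)
    and 0 = inner (negative sample).
    """
    if len(strand_numbers) < 2:
        # Assumption: a sheet must contain at least two strands.
        raise ValueError("a sheet needs at least two strands")
    first, last = min(strand_numbers), max(strand_numbers)
    return {n: int(n in (first, last)) for n in strand_numbers}


# A five-strand sheet: strands 1 and 5 are outer, 2 to 4 are inner.
labels = label_strands([1, 2, 3, 4, 5])
```

Under this rule a two-strand sheet contributes two positive samples and no negative ones.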
2.2 Feature Extraction
Protein folding is a cooperative process, driven mainly by the hydrophobic interaction[29]. The balance between hydrophobicity and hydrophilicity is a notable feature of the stability of protein structure[29, 30]. Previous studies also showed that amino acid hydrophobicity and molecular size are two important factors that cause differences in amino acid conservation[31]. In this study, we used nine features to describe a beta strand. Seven of the nine features are based on three physical and chemical properties of the amino acids: the hydrophobicity value (H1) from Tanford[32], the hydrophilicity value (H2) from Hopp and Woods[33], and the side-chain mass of the amino acid (M). The other two features come from the Pseudo-Amino Acid Composition (PseAAC), which was originally introduced by Chou for the prediction of protein subcellular localization and membrane protein type[34]. The PseAAC includes not only the main feature of amino acid composition but also sequence order information beyond amino acid composition. It can represent a protein sequence comprehensively, with additional sequence order effects reflected by a series of sequence correlation factors with different tiers of correlation.
A beta strand can be represented by a 9-dimensional numeric vector $X = \{x_1, x_2, x_3, \ldots, x_7, x_8, x_9\}$. Each value in the vector is calculated by the following formulas:
$H_1(R_i)$, $H_2(R_i)$ and $M(R_i)$ are the hydrophobicity value, hydrophilicity value, and side-chain mass of the amino acid $R_i$ after the standard conversion, respectively.
The element $x_8$ is the first-tier correlation factor defined in PseAAC[34]. Since a beta strand sequence is usually short, only the first-tier correlation factor is calculated; it reflects the sequence order correlation between all the most contiguous residues along a beta strand sequence[34]. Correspondingly, the element $x_9$ is the 21st component of PseAAC. In Eq. (9), $f_i$ is the normalized occurrence frequency of the 20 amino acids in the beta strand, and $w$ is the weight factor for the sequence order effect, set to the default value 0.05 in the current study[34]. In Eq. (8) and Eq. (9), $\Theta(R_i, R_{i+1})$ is calculated by the following equation as described in PseAAC:
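For reference, Chou's original PseAAC formulation[34] defines these quantities as written below. Mapping them onto Eqs. (8) to (10) of this section is an assumption based on the surrounding definitions; $L$ (the number of residues in the strand) and $\theta_1$ are symbols introduced here for illustration.

```latex
% Coupling between two adjacent residues (assumed to correspond to Eq. (10))
\Theta(R_i, R_{i+1}) = \frac{1}{3}\Big\{
    \big[H_1(R_{i+1}) - H_1(R_i)\big]^2
  + \big[H_2(R_{i+1}) - H_2(R_i)\big]^2
  + \big[M(R_{i+1}) - M(R_i)\big]^2 \Big\}

% First-tier correlation factor (assumed to correspond to x_8, Eq. (8))
x_8 = \theta_1 = \frac{1}{L-1} \sum_{i=1}^{L-1} \Theta(R_i, R_{i+1})

% 21st component of PseAAC (assumed to correspond to x_9, Eq. (9))
x_9 = \frac{w\,\theta_1}{\sum_{i=1}^{20} f_i + w\,\theta_1}
```

Because the $f_i$ are normalized frequencies, their sum is 1, so $x_9$ reduces to $w\theta_1 / (1 + w\theta_1)$.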
The standard conversion from $H_1^0(R_i)$ to $H_1(R_i)$ is described by formula (11) below. $H_1^0(R_i)$, $H_2^0(R_i)$ and $M^0(R_i)$ are the raw values of the amino acid hydrophobicity, hydrophilicity and side-chain mass.
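The two PseAAC-derived features can be sketched in Python as follows. The sketch assumes that the standard conversion is the z-score used in Chou's PseAAC formulation (subtract the mean over the amino acid table and divide by the population standard deviation) and that the coupling function averages squared property differences between adjacent residues. All function names are hypothetical, and the two-letter property table in the usage lines is a toy placeholder, not real hydrophobicity data.

```python
from statistics import mean, pstdev


def standard_conversion(raw):
    """Z-score a {amino_acid: raw_value} table over all of its entries."""
    mu = mean(raw.values())
    sd = pstdev(raw.values())  # population standard deviation
    return {aa: (value - mu) / sd for aa, value in raw.items()}


def theta(a, b, tables):
    """Mean squared difference of the converted properties of residues a and b."""
    return sum((t[b] - t[a]) ** 2 for t in tables) / len(tables)


def first_tier_factor(strand, tables):
    """Average theta over all adjacent residue pairs (candidate for x8)."""
    pairs = zip(strand, strand[1:])
    return sum(theta(a, b, tables) for a, b in pairs) / (len(strand) - 1)


def pseaac_21st_component(strand, tables, w=0.05):
    """w * theta1 / (1 + w * theta1); the normalized frequencies sum to 1 (candidate for x9)."""
    t1 = first_tier_factor(strand, tables)
    return w * t1 / (1.0 + w * t1)


# Toy placeholder table with two residues; a real run would pass three
# 20-entry tables (hydrophobicity, hydrophilicity, side-chain mass).
toy = standard_conversion({"A": 1.0, "V": 3.0})
x8 = first_tier_factor("AVA", [toy])
x9 = pseaac_21st_component("AVA", [toy])
```

Passing the property tables as arguments keeps the sketch independent of any particular published scale.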
2.3 The Random Forest Algorithm
Random Forest (RF), a machine learning algorithm, was originally introduced by Breiman[35]. RF generates decision trees by randomly sampling subspaces of the input features, and then makes the final decision by a majority vote among these trees. RF has good predictive performance even when the dataset is noisy[36]. As the number of decision trees increases, RF becomes less prone to overfitting and less dependent on the training data set[35]. Because of these characteristics, Random Forest has been applied successfully to many classification and prediction problems in various biological fields[37-41]. In this study, we used the Weka software package[42] to implement the RF classification of the outer/inner beta strands. RF in Weka developer version 3.6.2 takes three parameters: I, the number of trees constructed in the forest; K, the number of features considered when defining each node in a tree; and S, the random number seed. In this study, we used the default settings without model selection.
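The idea of random feature subspaces combined with majority voting can be illustrated with a deliberately simplified Python sketch. It is not the Weka implementation: each "tree" here is a one-level decision stump trained on a bootstrap sample, whereas a real random forest grows full trees and re-samples features at every node. The arguments `n_trees`, `n_features` and `seed` loosely mirror the I, K and S parameters; all names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, orientation) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if r[f] <= t else right) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left, right)
    _, f, t, left, right = best
    return lambda row: left if row[f] <= t else right


def train_forest(rows, labels, n_trees=10, n_features=1, seed=1):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]            # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)       # random subspace
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy data: two identical, perfectly separating binary features.
rows = [[0, 0]] * 4 + [[1, 1]] * 4
labels = [0] * 4 + [1] * 4
forest = train_forest(rows, labels, n_trees=15, n_features=2, seed=1)
```

An odd number of trees avoids ties in a two-class vote.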
2.4 Performance Measures
To assess the performance of the classifiers, we used the following measures: the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), the number of false negatives (FN), the sensitivity of positive examples (Sn+), the specificity of positive examples (Sp+), the sensitivity of negative examples (Sn-), the specificity of negative examples (Sp-), the accuracy (Ac) and the Matthews correlation coefficient (MCC), as described in [43].
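These measures can be computed from the four confusion-matrix counts, as in the Python sketch below. The accuracy and MCC formulas are standard. The Sn/Sp definitions follow a convention common in this literature (Sn+ = TP/(TP+FN), Sp+ = TP/(TP+FP), and symmetrically for the negative class) and are an assumption here, since the exact definitions are those of [43]. The function name `performance` is hypothetical.

```python
from math import sqrt


def performance(tp, fp, tn, fn):
    """Return the measures as a dict; assumes both classes occur and are predicted."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn+": tp / (tp + fn),   # assumed definition
        "Sp+": tp / (tp + fp),   # assumed definition
        "Sn-": tn / (tn + fp),   # assumed definition
        "Sp-": tn / (tn + fn),   # assumed definition
        "Ac": (tp + tn) / total,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Illustrative counts only, not results from this study.
m = performance(tp=40, fp=10, tn=45, fn=5)
```

MCC is often the more informative of the two summary measures when the classes are unbalanced, because it uses all four counts.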
3 Results
To compare the prediction performance of Random Forest (RF) with that of other algorithms, BayesNet, support vector machine (SVM), multilayer perceptron (MLP) and K-nearest-neighbor (IBk) were used to classify the outer/inner beta strands with the default parameter settings. The SVM algorithm was implemented with LibSVM 2.86[44], while the other three algorithms were implemented in Weka. A ten-fold cross-validation test was used to evaluate the accuracy of each prediction algorithm. The prediction of outer/inner beta strands in beta sheets reaches 89.45% Ac and 0.79 MCC using RF (see 2.3). Table 1 shows that RF reaches the best performance among the five algorithms. The prediction accuracy of RF is about 2% higher than that of the K-nearest-neighbor classifier, and significantly ahead of the Naive Bayes, SVM and MLP classifiers.
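The ten-fold cross-validation protocol can be sketched in Python as follows. This is a plain, unstratified split written for illustration; tools such as Weka may additionally balance the class proportions in each fold. The function name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Each sample is used for testing exactly once across the 10 folds.
splits = list(k_fold_indices(25, k=10))
```

The reported accuracy is then computed over the pooled predictions (or averaged over the folds), so that every strand is evaluated by a model that did not see it during training.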
Table 1. Prediction results for the outer/inner beta strands